Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-30 Thread Zygo Blaxell
On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > > > - is documenting rejection on request alignment grounds
> > > >   (i.e. EINVAL) in the man page sufficient for app
> > > >   developers to understand what is going on here?
> > > 
> > > I think so.  The manpage says: "The filesystem does not support
> > > reflinking the ranges of the given files", which (to my mind) covers
> > > this case of not supporting dedupe of EOF blocks.
> > 
> > Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> > this; however, on btrfs, not supporting dedupe of EOF blocks means small
> > files (one extent) cannot be deduped at all, because the EOF block holds
> > a reference to the entire dst extent.  If a dedupe app doesn't go all the
> > way to EOF on btrfs, then it should not attempt to dedupe any part of the
> > last extent of the file as the benefit would be zero or slightly negative.
> 
> That's a filesystem implementation issue, not an API or application
> issue.

The API and application issue remains even if btrfs is not considered.
btrfs is just the worst case outcome.  Other filesystems still have
fragmentation issues, and applications have efficiency-vs-capability
tradeoffs to make if they can't rely on dedupe-to-EOF being available.

Tools like 'cp --reflink=auto' work by trying the best case, then falling
back to a second choice if the first choice returns an error.  If the
second choice fails too, the surprising behavior can make inattentive
users lose data.

> > The app developer would need to be aware that such a restriction could
> > exist on some filesystems, and be able to distinguish this from other
> > cases that could lead to EINVAL.  Portable code would have to try a dedupe
> > up to EOF, then if that failed, round down and retry, and if that failed
> > too, the app would have to figure out which filesystem it's running on
> > to know what to do next.  Performance demands the app know what the FS
> > will do in advance, and avoid a whole class of behavior.
> 
> Nobody writes "portable" applications like that. 

As an app developer, and having studied other applications' revision
histories, and having followed IRC and mailing list conversations
involving other developers writing these applications, I can assure
you that is _exactly_ how portable applications get written around
the dedupe function.

Usually people start from experience with tools that use hardlinks to
implement dedupe, so the developer's mental model starts with deduping
entire files.  Their first attempt does this:

stat(fd, );
dedupe( ..., src_offset = 0, dst_offset = 0, length = st.st_size);

then subsequent revisions of their code cope with limits on length,
and then deal with EINVAL on odd lengths, because those are the problems
that are encountered as the code runs for the first time on an expanding
set of filesystems.  After that, they deal with implementation-specific
performance issues.

Other app developers start by ignoring incomplete blocks, then compare
their free-space-vs-time graphs with other dedupe apps on the same
filesystem, then either adapt to handle EOF properly, or just accept
being uncompetitive.

> They read the man
> page first, and work out what the common subset of functionality is
> and then code from that. 

> Man page says:
> 
> "Disk filesystems generally require the offset and length arguments
> to be aligned to the fundamental block size."

> IOWs, code compatible with starts with supporting the general case.
> i.e. a range rounded to filesystem block boundaries (it's already
> run fstat() on the files it wants to dedupe to find their size,
> yes?), hence ignoring the partial EOF block. Will just work on
> everything.

Will cause a significant time/space performance hit too.  EOFs are
everywhere, and they have a higher-than-average duplication rate
for their size.  If an application assumes EOF can't be deduped on
every filesystem, then it leaves a non-trivial amount of free space
unrecovered on filesystems that can dedupe EOF.  It also necessarily
increases fragmentation unless the filesystem implements file tails
(where it keeps fragmentation constant as the tail won't be stored
contiguously in any case).

> Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
> fstatvfs to determine what fs it's operating on and applies the
> necessary quirks. For btrfs it can extend the range to include the
> partial EOF block, and hence will handle the implementation quirks
> btrfs has with single extent dedupe.
> 
> Simple, reliable, and doesn't require any sort of flailing
> about with offsets and lengths to avoid unexpected EINVAL errors.

It would be better if the filesystem was given the EOF block by the
application if it is the 

Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Duncan
Chris Murphy posted on Thu, 30 Aug 2018 11:08:28 -0600 as excerpted:

>> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks
>> in prder to start in degraded mode
> 
> Good luck with this. The Btrfs archives are full of various limitations
> of Btrfs raid1. There is no automatic degraded mount for Btrfs. And if
> you persistently ask for degraded mount, you run the risk of other
> problems if there's merely a delayed discovery of one of the devices.
> Once a Btrfs volume is degraded, it does not automatically resume normal
> operation just because the formerly missing device becomes available.
> 
> So... this is flat out not suitable for use cases where you need
> unattended raid1 degraded boot.

Agreeing in general and adding some detail...

1) Are you intending to use an initr*?  I'm not sure the current status 
(I actually need to test again for myself), but at least in the past, 
booting a btrfs raid1 rootfs required an initr*, and I have and use one 
here, for that purpose alone (until switching to btrfs raid1 root, I went 
initr*-less, and would prefer that again, due to the complications of 
maintaining an initr*).

The base problem is that with raid1 (or other forms of multi-device 
btrfs, but it happens to be raid1 that's in question for both you and I) 
the filesystem needs multiple devices to complete the filesystem and the 
kernel's root= parameter takes only one.  When mounting after userspace 
is up, a btrfs device scan is normally run (often automatically by udev) 
before the mount, that lets btrfs in the kernel track what devices belong 
to what filesystems, so pointing to just one of the devices is enough 
because the kernel knows from that what filesystem is intended and can 
match up the others that go with it from the earlier scan.

Now there's a btrfs mount option, device=/dev/*, that can be provided 
more than once for additional devices, that can /normally/ be used to 
tell the kernel what specific devices to use, bypassing the need for 
btrfs device scan, and in /theory/, passing that like other mount options 
in the kernel commandline via rootflags= /should/ "just work".

But for reasons I as a btrfs user (not dev, and definitely not kernel or 
btrfs dev) don't fully understand, passing device= via rootflags= is, or 
at least was, broken, so properly mounting a multi-device btrfs required 
(and may still require) userspace, thus for a multi-device btrfs rootfs, 
an initr*.

So direct-booting to a multi-device btrfs rootfs didn't normally work.  
It would if you passed rootflags=degraded (at least with a two-device 
raid1 so the one device passed in root= contained one copy of 
everything), but then it was unclear if the additional device was 
successfully added to the raid1 later, or not.  And with no automatic 
sync and bringing back to undegraded status, it was a risk I didn't want 
to take.  So unfortunately, initr* it was!

But I originally tested that when I setup my own btrfs raid1 rootfs very 
long ago in kernel and btrfs terms, kernel 3.6 or so IIRC, and while I've 
not /seen/ anything definitive on-list to suggest rootflags=device= is 
unbroken now (I asked recently and got an affirmative reply, but I asked 
for clarification and I've not seen it, tho perhaps it's there and I've 
not read it yet), perhaps I missed it.  And I've not retested lately, tho 
I really should as while I asked I guess the only real way to know is to 
try it for myself, and it'd definitely be nice to be direct-booting 
without having to bother with an initr*, again.

2) As both Chris and I alluded to, unlike say mdraid, btrfs doesn't (yet) 
have an automatic mechanism to re-sync and "undegrade" after having been 
mounted degraded,rw.  A btrfs scrub can be run to re-sync raid1 chunks, 
but single chunks may have been added while in the degraded state as 
well, and those need a balance convert to raid1 mode, before the 
filesystem and data on it can be be considered reliably able to withstand 
device loss once again.

In fact, while the problem has been fixed now, for quite awhile if the 
filesystem was mounted degraded,rw, you often had exactly that one mount 
to fix the problem, as new chunks would be written in single mode and 
after that the filesystem would refuse to mount writable,degraded, and 
would only let you mount degraded,ro, which would let you get data off it 
but not let you fix the problem.  Word to the wise if you're planning on 
running stable-debian (which tend to be older) kernels, or even just 
trying to use them for recovery if you need to!  (The fix was to have a 
mount check if at least one copy of all chunks were available and allow rw 
mounting if so, instead of simply assuming that any single-mode chunks at 
all meant some wouldn't be available on a multi-device filesystem with a 
device missing, thus forcing read-only mode only mounting, as it used to 
do.)

3) If a btrfs raid1 is mounted degraded,rw with one device missing, then 
mounted again 

[PATCH 3/3] btrfs: qgroup: Remove deprecated feature support in btrfs_qgorup_inhert()

2018-08-30 Thread Qu Wenruo
Since btrfs_validate_inherit() will not allow features like copy
rfer/excl and limit set, remove these dead code.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 57 +--
 1 file changed, 1 insertion(+), 56 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 53daf73b0de9..b57577f45ff3 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2301,8 +2301,7 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
*trans, u64 srcid,
 
if (inherit) {
i_qgroups = (u64 *)(inherit + 1);
-   nums = inherit->num_qgroups + 2 * inherit->num_ref_copies +
-  2 * inherit->num_excl_copies;
+   nums = inherit->num_qgroups;
for (i = 0; i < nums; ++i) {
srcgroup = find_qgroup_rb(fs_info, *i_qgroups);
 
@@ -2354,23 +2353,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
*trans, u64 srcid,
goto unlock;
}
 
-   if (inherit && inherit->flags & BTRFS_QGROUP_INHERIT_SET_LIMITS) {
-   dstgroup->lim_flags = inherit->lim.flags;
-   dstgroup->max_rfer = inherit->lim.max_rfer;
-   dstgroup->max_excl = inherit->lim.max_excl;
-   dstgroup->rsv_rfer = inherit->lim.rsv_rfer;
-   dstgroup->rsv_excl = inherit->lim.rsv_excl;
-
-   ret = update_qgroup_limit_item(trans, dstgroup);
-   if (ret) {
-   fs_info->qgroup_flags |= 
BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT;
-   btrfs_info(fs_info,
-  "unable to update quota limit for %llu",
-  dstgroup->qgroupid);
-   goto unlock;
-   }
-   }
-
if (srcid) {
srcgroup = find_qgroup_rb(fs_info, srcid);
if (!srcgroup)
@@ -2413,43 +2395,6 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle 
*trans, u64 srcid,
++i_qgroups;
}
 
-   for (i = 0; i <  inherit->num_ref_copies; ++i, i_qgroups += 2) {
-   struct btrfs_qgroup *src;
-   struct btrfs_qgroup *dst;
-
-   if (!i_qgroups[0] || !i_qgroups[1])
-   continue;
-
-   src = find_qgroup_rb(fs_info, i_qgroups[0]);
-   dst = find_qgroup_rb(fs_info, i_qgroups[1]);
-
-   if (!src || !dst) {
-   ret = -EINVAL;
-   goto unlock;
-   }
-
-   dst->rfer = src->rfer - level_size;
-   dst->rfer_cmpr = src->rfer_cmpr - level_size;
-   }
-   for (i = 0; i <  inherit->num_excl_copies; ++i, i_qgroups += 2) {
-   struct btrfs_qgroup *src;
-   struct btrfs_qgroup *dst;
-
-   if (!i_qgroups[0] || !i_qgroups[1])
-   continue;
-
-   src = find_qgroup_rb(fs_info, i_qgroups[0]);
-   dst = find_qgroup_rb(fs_info, i_qgroups[1]);
-
-   if (!src || !dst) {
-   ret = -EINVAL;
-   goto unlock;
-   }
-
-   dst->excl = src->excl + level_size;
-   dst->excl_cmpr = src->excl_cmpr + level_size;
-   }
-
 unlock:
spin_unlock(_info->qgroup_lock);
 out:
-- 
2.18.0



[PATCH 2/3] btrfs: qgroup: Validate btrfs_qgroup_inherit structure before passing it to qgroup

2018-08-30 Thread Qu Wenruo
btrfs_qgroup_inherit structure doesn't goes through much validation
check.

Now do a comprehensive check for it, including:
1) inherit size
   Should not exceeding SZ_4K and its num_qgroups should not
   exceed its size passed in btrfs_ioctl_vol_args_v2.

2) flags
   Should not include any unknown flags
   (In fact, no flag is supported at all now)
   Btrfs-progs never has such ability to set flags for btrfs_qgroup_inherit.

3) limit
   Should not contain anything.
   Btrfs-progs never has such ability to set limit for btrfs_qgroup_inherit.

4) rfer/excl copy
   Deprecated feature.
   Btrfs-progs has such interface but never documented and we're already
   going to remove such ability.
   It's the easiest way to screw up qgroup numbers.

3) Qgroupid
   Comprehensive check is already in btrfs_qgroup_inherit(), here we
   only check if there is any obviously invalid qgroupid (0).

Coverity-id: 1021055
Reported-by: Nikolay Borisov 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ioctl.c   |  3 +++
 fs/btrfs/qgroup.c  | 39 ++
 fs/btrfs/qgroup.h  |  2 ++
 include/uapi/linux/btrfs.h | 17 -
 4 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 5db8680b40a9..4f5f453d5d07 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1820,6 +1820,9 @@ static noinline int btrfs_ioctl_snap_create_v2(struct 
file *file,
ret = PTR_ERR(inherit);
goto free_args;
}
+   ret = btrfs_validate_inherit(inherit, vol_args->size);
+   if (ret < 0)
+   goto free_args;
}
 
ret = btrfs_ioctl_snap_create_transid(file, vol_args->name,
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 4353bb69bb86..53daf73b0de9 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2232,6 +2232,45 @@ int btrfs_run_qgroups(struct btrfs_trans_handle *trans)
return ret;
 }
 
+/*
+ * To make sure the inherit passed in is valid
+ *
+ * Here we only check flags and rule out some no-longer supported features.
+ * And we only do very basis qgroupid check to ensure there is no obviously
+ * invalid qgroupid (0). Detailed qgroupid check will be done in
+ * btrfs_qgroup_inherit().
+ */
+int btrfs_validate_inherit(struct btrfs_qgroup_inherit *inherit,
+  u64 inherit_size)
+{
+   u64 i;
+
+   if (inherit->flags & ~BTRFS_QGROUP_INHERIT_FLAGS_SUPP)
+   return -ENOTTY;
+   /* Qgroup rfer/excl copy is deprecated */
+   if (inherit->num_excl_copies || inherit->num_ref_copies)
+   return -ENOTTY;
+
+   /* Since SET_LIMITS is never used, @lim should all be zeroed */
+   if (inherit->lim.max_excl || inherit->lim.max_rfer ||
+   inherit->lim.rsv_excl || inherit->lim.rsv_rfer ||
+   inherit->lim.flags)
+   return -ENOTTY;
+
+   /* Size check */
+   if (sizeof(u64) * inherit->num_qgroups + sizeof(*inherit) >
+   min_t(u64, BTRFS_QGROUP_INHERIT_MAX_SIZE, inherit_size))
+   return -EINVAL;
+
+
+   /* Qgroup 0/0 is not allowed */
+   for (i = 0; i < inherit->num_qgroups; i++) {
+   if (inherit->qgroups[i] == 0)
+   return -EINVAL;
+   }
+   return 0;
+}
+
 /*
  * Copy the accounting information between qgroups. This is necessary
  * when a snapshot or a subvolume is created. Throwing an error will
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 54b8bb282c0e..1bf9c584be70 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -241,6 +241,8 @@ int btrfs_qgroup_account_extent(struct btrfs_trans_handle 
*trans, u64 bytenr,
struct ulist *new_roots);
 int btrfs_qgroup_account_extents(struct btrfs_trans_handle *trans);
 int btrfs_run_qgroups(struct btrfs_trans_handle *trans);
+int btrfs_validate_inherit(struct btrfs_qgroup_inherit *inherit,
+  u64 inherit_size);
 int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans, u64 srcid,
 u64 objectid, struct btrfs_qgroup_inherit *inherit);
 void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 311edb65567c..5a5532a20019 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -74,21 +74,20 @@ struct btrfs_qgroup_limit {
__u64   rsv_excl;
 };
 
-/*
- * flags definition for qgroup inheritance
- *
- * Used by:
- * struct btrfs_qgroup_inherit.flags
- */
+/* flags definition for qgroup inheritance (DEPRECATED) */
 #define BTRFS_QGROUP_INHERIT_SET_LIMITS(1ULL << 0)
 
+/* No supported flags */
+#define BTRFS_QGROUP_INHERIT_FLAGS_SUPP (0)
+
 #define BTRFS_QGROUP_INHERIT_MAX_SIZE  (SZ_4K)
+
 struct btrfs_qgroup_inherit {
__u64   flags;
__u64   num_qgroups;
-   __u64   num_ref_copies;
-   __u64 

[PATCH 1/3] btrfs: Set qgroup inherit size limit to SZ_4K instead of page size

2018-08-30 Thread Qu Wenruo
Change btrfs_qgroup_inherit maximum size from PAGE_SIZE to SZ_4K to make
it consistent across different architectures.

Although in theory this could lead to incompatibility, but considering
how rare btrfs_qgroup_inherit is used, it's still not too late to change
it without impacting a large user base.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ioctl.c   | 2 +-
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 63600dc2ac4c..5db8680b40a9 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1811,7 +1811,7 @@ static noinline int btrfs_ioctl_snap_create_v2(struct 
file *file,
if (vol_args->flags & BTRFS_SUBVOL_RDONLY)
readonly = true;
if (vol_args->flags & BTRFS_SUBVOL_QGROUP_INHERIT) {
-   if (vol_args->size > PAGE_SIZE) {
+   if (vol_args->size > BTRFS_QGROUP_INHERIT_MAX_SIZE) {
ret = -EINVAL;
goto free_args;
}
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 5ca1d21fc4a7..311edb65567c 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -82,6 +82,7 @@ struct btrfs_qgroup_limit {
  */
 #define BTRFS_QGROUP_INHERIT_SET_LIMITS(1ULL << 0)
 
+#define BTRFS_QGROUP_INHERIT_MAX_SIZE  (SZ_4K)
 struct btrfs_qgroup_inherit {
__u64   flags;
__u64   num_qgroups;
-- 
2.18.0



[PATCH 0/3] btrfs: qgroup: Deprecate unused features for btrfs_qgroup_inherit()

2018-08-30 Thread Qu Wenruo
This patchset can be fetched from github:
https://github.com/adam900710/linux/tree/qgroup_inherit_check
Which is based on v4.19-rc1 tag.

This patchset will first set btrfs_qgroup_inherit structure size limit
from PAGE_SIZE to fixed SZ_4K.
I understand this normally will cause compatibility problem, but
considering how minor this feature is and no sane guy should use it for
over 100 qgroups, it should be fine in real world.

The 2nd patch introduce check function for btrfs_qgroup_inherit
structure and deprecates the following features:
1) limit set
   Never utilized by btrfs-progs from the beginning.

2) copy rfer/excl
   Although btrfs-progs provides support for it as a hidden,
   undocumented feature, it's the easiest way to screw up qgroup
   numbers.
   And we already have patches wondering around the ML to remove such
   support.

The last one will just cleanup the code for unsupported features.

Qu Wenruo (3):
  btrfs: Set qgroup inherit size limit to SZ_4K instead of page size
  btrfs: qgroup: Validate btrfs_qgroup_inherit structure before passing
it to qgroup
  btrfs: qgroup: Remove deprecated feature support in
btrfs_qgorup_inhert()

 fs/btrfs/ioctl.c   |  5 +-
 fs/btrfs/qgroup.c  | 96 --
 fs/btrfs/qgroup.h  |  2 +
 include/uapi/linux/btrfs.h | 18 +++
 4 files changed, 55 insertions(+), 66 deletions(-)

-- 
2.18.0



Re: fsck lowmem mode only: ERROR: errors found in fs roots

2018-08-30 Thread Su Yue

Thank for the report.

On 08/31/2018 12:47 AM, Christoph Anton Mitterer wrote:

Hey.

I've the following on a btrfs that's basically the system fs for my
notebook:


When booting from a USB stick with:
# uname -a
Linux heisenberg 4.17.0-3-amd64 #1 SMP Debian 4.17.17-1
(2018-08-18) x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.17

... a lowmem mode fsck gives no error:

# btrfs check --mode=lowmem /dev/mapper/system ; echo $?
Checking filesystem on /dev/mapper/system
UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c
checking extents
checking free space cache
checking fs roots
ERROR: errors found in fs roots
found 495910952960 bytes used, error(s) found
total csum bytes: 481840472
total tree bytes: 2388819968
total fs tree bytes: 1651097600
total extent tree bytes: 161841152
btree space waste bytes: 446707102
file data blocks allocated: 6651878428672
  referenced 542320984064
1

... while a normal mode fsck doesn't give one:

# btrfs check /dev/mapper/system ; echo $?
Checking filesystem on /dev/mapper/system
UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c
checking extents
checking free space cache
checking fs roots
checking only csum items (without verifying data)
checking root refs
found 495910952960 bytes used, no error found
total csum bytes: 481840472
total tree bytes: 2388819968
total fs tree bytes: 1651097600
total extent tree bytes: 161841152
btree space waste bytes: 446707102
file data blocks allocated: 6651878428672
  referenced 542320984064
0

There were no unusual kernel log messages.



Humm, I think it's a bug of lowmem mode.
After looking through releated codes, I can't tell the cause.

Can you please fetch btrfs-progs from my repo and run lowmem check
in readonly?
Repo: https://github.com/Damenly/btrfs-progs/tree/lowmem_debug
It's based on v4.17.1 plus additonal output for debug only.


Back in the normal system (no longer USB)... I spottet this:
Aug 30 18:31:29 heisenberg kernel: BTRFS info (device dm-0): the free
space cache file (22570598400) is invalid, skip it

but not sure whether it's related (probably not)... and I haven't seen
such a free space cache file issue (or any other btrfs errors) in a
long while (I usually watch my kernel log once after booting has
finished).
BTW, what's the mount option of USB?




Any ideas? Perhaps it's just yet another lowmem false positive...
anything I can help in debugging this?


Apart from this I haven't noticed any corruptions recently... just
about to make a full backup of the fs (or better said a rw snapshot of
the most of the data) with tar, so most data will be read soon at least
once... an I would probably notice any further errors that are
otherwise silent.


Don't worry, since normal check didn't report any error, it may be just
a false alert.

Su


Cheers,
Chris.








Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Chris Murphy
And also, I'll argue this might have been a btrfs-progs bug as well,
depending on what version was used and the command. Both mkfs and dev
add should not be able to add type code 0x05. At least libblkid
correctly shows that it's 1KiB in size, so really Btrfs should not
succeed at adding this device, it can't put any of the supers in the
correct location.

[chris@f28h ~]$ sudo fdisk -l /dev/loop0
Disk /dev/loop0: 1 GiB, 1073741824 bytes, 2097152 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x7e255cce

Device   Boot  StartEnd Sectors  Size Id Type
/dev/loop0p12048 206847  204800  100M 83 Linux
/dev/loop0p2  206848 411647  204800  100M 83 Linux
/dev/loop0p3  411648 616447  204800  100M 83 Linux
/dev/loop0p4  616448 821247  204800  100M  5 Extended
/dev/loop0p5  618496 821247  202752   99M 83 Linux

[chris@f28h ~]$ sudo kpartx -a /dev/loop0
[chris@f28h ~]$ lsblk
NAMEMAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
loop0 7:00 1G  0 loop
├─loop0p1   253:10   100M  0 part
├─loop0p2   253:20   100M  0 part
├─loop0p3   253:30   100M  0 part
├─loop0p4   253:40 1K  0 part
└─loop0p5   253:5099M  0 part

[chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4
btrfs-progs v4.17.1
See http://btrfs.wiki.kernel.org for more information.

probe of /dev/loop0p4 failed, cannot detect existing filesystem.
ERROR: use the -f option to force overwrite of /dev/loop0p4
[chris@f28h ~]$ sudo mkfs.btrfs /dev/loop0p4 -f
btrfs-progs v4.17.1
See http://btrfs.wiki.kernel.org for more information.

ERROR: mount check: cannot open /dev/loop0p4: No such file or directory
ERROR: cannot check mount status of /dev/loop0p4: No such file or directory
[chris@f28h ~]$


I guess that's a good sign in this case?


Chris Murphy


[PATCH 34/35] btrfs: wait on ordered extents on abort cleanup

2018-08-30 Thread Josef Bacik
If we flip read-only before we initiate writeback on all dirty pages for
ordered extents we've created then we'll have ordered extents left over
on umount, which results in all sorts of bad things happening.  Fix this
by making sure we wait on ordered extents if we have to do the aborted
transaction cleanup stuff.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 54fbdc944a3f..51b2a5bf25e5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4188,6 +4188,14 @@ static void btrfs_destroy_all_ordered_extents(struct 
btrfs_fs_info *fs_info)
spin_lock(_info->ordered_root_lock);
}
spin_unlock(_info->ordered_root_lock);
+
+   /*
+* We need this here because if we've been flipped read-only we won't
+* get sync() from the umount, so we need to make sure any ordered
+* extents that haven't had their dirty pages IO start writeout yet
+* actually get run and error out properly.
+*/
+   btrfs_wait_ordered_roots(fs_info, U64_MAX, 0, (u64)-1);
 }
 
 static int btrfs_destroy_delayed_refs(struct btrfs_transaction *trans,
-- 
2.14.3



[PATCH 11/35] btrfs: don't use global rsv for chunk allocation

2018-08-30 Thread Josef Bacik
We've done this forever because of the voodoo around knowing how much
space we have.  However we have better ways of doing this now, and on
normal file systems we'll easily have a global reserve of 512MiB, and
since metadata chunks are usually 1GiB that means we'll allocate
metadata chunks more readily.  Instead use the actual used amount when
determining if we need to allocate a chunk or not.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index df826f713034..783341e3653e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4366,21 +4366,12 @@ static inline u64 calc_global_rsv_need_space(struct 
btrfs_block_rsv *global)
 static int should_alloc_chunk(struct btrfs_fs_info *fs_info,
  struct btrfs_space_info *sinfo, int force)
 {
-   struct btrfs_block_rsv *global_rsv = _info->global_block_rsv;
u64 bytes_used = btrfs_space_info_used(sinfo, false);
u64 thresh;
 
if (force == CHUNK_ALLOC_FORCE)
return 1;
 
-   /*
-* We need to take into account the global rsv because for all intents
-* and purposes it's used space.  Don't worry about locking the
-* global_rsv, it doesn't change except when the transaction commits.
-*/
-   if (sinfo->flags & BTRFS_BLOCK_GROUP_METADATA)
-   bytes_used += calc_global_rsv_need_space(global_rsv);
-
/*
 * in limited mode, we want to have some free space up to
 * about 1% of the FS size.
-- 
2.14.3



[PATCH 17/35] btrfs: move the dio_sem higher up the callchain

2018-08-30 Thread Josef Bacik
We're getting a lockdep splat because we take the dio_sem under the
log_mutex.  What we really need is to protect fsync() from logging an
extent map for an extent we never waited on higher up, so just guard the
whole thing with dio_sem.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/file.c | 12 
 fs/btrfs/tree-log.c |  2 --
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 095f0bb86bb7..c07110edb9de 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2079,6 +2079,14 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
goto out;
 
inode_lock(inode);
+
+   /*
+* We take the dio_sem here because the tree log stuff can race with
+* lockless dio writes and get an extent map logged for an extent we
+* never waited on.  We need it this high up for lockdep reasons.
+*/
+   down_write(_I(inode)->dio_sem);
+
atomic_inc(>log_batch);
 
/*
@@ -2087,6 +2095,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 */
ret = btrfs_wait_ordered_range(inode, start, len);
if (ret) {
+   up_write(_I(inode)->dio_sem);
inode_unlock(inode);
goto out;
}
@@ -2110,6 +2119,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 * checked called fsync.
 */
ret = filemap_check_wb_err(inode->i_mapping, file->f_wb_err);
+   up_write(_I(inode)->dio_sem);
inode_unlock(inode);
goto out;
}
@@ -2128,6 +2138,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
trans = btrfs_start_transaction(root, 0);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
+   up_write(_I(inode)->dio_sem);
inode_unlock(inode);
goto out;
}
@@ -2149,6 +2160,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 * file again, but that will end up using the synchronization
 * inside btrfs_sync_log to keep things safe.
 */
+   up_write(_I(inode)->dio_sem);
inode_unlock(inode);
 
/*
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 1650dc44a5e3..66b7e059b765 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4374,7 +4374,6 @@ static int btrfs_log_changed_extents(struct 
btrfs_trans_handle *trans,
 
INIT_LIST_HEAD();
 
-   down_write(>dio_sem);
write_lock(>lock);
test_gen = root->fs_info->last_trans_committed;
logged_start = start;
@@ -4440,7 +4439,6 @@ static int btrfs_log_changed_extents(struct 
btrfs_trans_handle *trans,
}
WARN_ON(!list_empty());
write_unlock(>lock);
-   up_write(>dio_sem);
 
btrfs_release_path(path);
if (!ret)
-- 
2.14.3



[PATCH 31/35] btrfs: clear delayed_refs_rsv for dirty bg cleanup

2018-08-30 Thread Josef Bacik
We keep track of dirty bg's as a reservation in the delayed_refs_rsv, so
when we abort and we cleanup those dirty bgs we need to drop their
reservation so we don't have accounting issues and lots of scary
messages on umount.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index caaca8154a1a..54fbdc944a3f 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4412,6 +4412,7 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
*cur_trans,
 
spin_unlock(_trans->dirty_bgs_lock);
btrfs_put_block_group(cache);
+   btrfs_delayed_refs_rsv_release(fs_info, 1);
spin_lock(_trans->dirty_bgs_lock);
}
spin_unlock(_trans->dirty_bgs_lock);
-- 
2.14.3



[PATCH 13/35] btrfs: reset max_extent_size properly

2018-08-30 Thread Josef Bacik
If we use up our block group before allocating a new one we'll easily
get a max_extent_size that's set really really low, which will result in
a lot of fragmentation.  We need to make sure we're resetting the
max_extent_size when we add a new chunk or add new space.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 22e1f9f55f4f..f4e7caf37d6c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4565,6 +4565,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
*trans, u64 flags,
goto out;
} else {
ret = 1;
+   space_info->max_extent_size = 0;
}
 
space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
@@ -8064,11 +8065,17 @@ static int __btrfs_free_reserved_extent(struct 
btrfs_fs_info *fs_info,
if (pin)
pin_down_extent(fs_info, cache, start, len, 1);
else {
+   struct btrfs_space_info *space_info = cache->space_info;
+
if (btrfs_test_opt(fs_info, DISCARD))
ret = btrfs_discard_extent(fs_info, start, len, NULL,
BTRFS_CLEAR_OP_DISCARD);
btrfs_add_free_space(cache, start, len);
btrfs_free_reserved_bytes(cache, len, delalloc);
+
+   spin_lock(_info->lock);
+   space_info->max_extent_size = 0;
+   spin_unlock(_info->lock);
trace_btrfs_reserved_extent_free(fs_info, start, len);
}
 
-- 
2.14.3



[PATCH 18/35] btrfs: set max_extent_size properly

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

We can't use entry->bytes if our entry is a bitmap entry, we need to use
entry->max_extent_size in that case.  Fix up all the logic to make this
consistent.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/free-space-cache.c | 29 +++--
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index db93a5f035a0..53521027dd78 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1766,6 +1766,18 @@ static int search_bitmap(struct btrfs_free_space_ctl 
*ctl,
return -1;
 }
 
+static void set_max_extent_size(struct btrfs_free_space *entry,
+   u64 *max_extent_size)
+{
+   if (entry->bitmap) {
+   if (entry->max_extent_size > *max_extent_size)
+   *max_extent_size = entry->max_extent_size;
+   } else {
+   if (entry->bytes > *max_extent_size)
+   *max_extent_size = entry->bytes;
+   }
+}
+
 /* Cache the size of the max extent in bytes */
 static struct btrfs_free_space *
 find_free_space(struct btrfs_free_space_ctl *ctl, u64 *offset, u64 *bytes,
@@ -1787,8 +1799,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
*offset, u64 *bytes,
for (node = >offset_index; node; node = rb_next(node)) {
entry = rb_entry(node, struct btrfs_free_space, offset_index);
if (entry->bytes < *bytes) {
-   if (entry->bytes > *max_extent_size)
-   *max_extent_size = entry->bytes;
+   set_max_extent_size(entry, max_extent_size);
continue;
}
 
@@ -1806,8 +1817,7 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
*offset, u64 *bytes,
}
 
if (entry->bytes < *bytes + align_off) {
-   if (entry->bytes > *max_extent_size)
-   *max_extent_size = entry->bytes;
+   set_max_extent_size(entry, max_extent_size);
continue;
}
 
@@ -1819,8 +1829,8 @@ find_free_space(struct btrfs_free_space_ctl *ctl, u64 
*offset, u64 *bytes,
*offset = tmp;
*bytes = size;
return entry;
-   } else if (size > *max_extent_size) {
-   *max_extent_size = size;
+   } else {
+   set_max_extent_size(entry, max_extent_size);
}
continue;
}
@@ -2680,8 +2690,7 @@ static u64 btrfs_alloc_from_bitmap(struct 
btrfs_block_group_cache *block_group,
 
err = search_bitmap(ctl, entry, _start, _bytes, true);
if (err) {
-   if (search_bytes > *max_extent_size)
-   *max_extent_size = search_bytes;
+   set_max_extent_size(entry, max_extent_size);
return 0;
}
 
@@ -2718,8 +2727,8 @@ u64 btrfs_alloc_from_cluster(struct 
btrfs_block_group_cache *block_group,
 
entry = rb_entry(node, struct btrfs_free_space, offset_index);
while (1) {
-   if (entry->bytes < bytes && entry->bytes > *max_extent_size)
-   *max_extent_size = entry->bytes;
+   if (entry->bytes < bytes)
+   set_max_extent_size(entry, max_extent_size);
 
if (entry->bytes < bytes ||
(!entry->bitmap && entry->offset < min_start)) {
-- 
2.14.3



[PATCH 29/35] btrfs: just delete pending bgs if we are aborted

2018-08-30 Thread Josef Bacik
We still need to do all of the accounting cleanup for pending block
groups if we abort.  So set the ret to trans->aborted so if we aborted
the cleanup happens and everybody is happy.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 90f267f4dd0f..132a1157982c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10333,7 +10333,7 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans)
struct btrfs_root *extent_root = fs_info->extent_root;
struct btrfs_block_group_item item;
struct btrfs_key key;
-   int ret = 0;
+   int ret = trans->aborted;
bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
trans->can_flush_pending_bgs = false;
-- 
2.14.3



[PATCH 30/35] btrfs: cleanup pending bgs on transaction abort

2018-08-30 Thread Josef Bacik
We may abort the transaction during a commit and not have a chance to
run the pending bgs stuff, which will leave block groups on our list and
cause us accounting issues and leaked memory.  Fix this by running the
pending bgs when we cleanup a transaction.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 89d14f135837..0f39a0d302d3 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -2273,6 +2273,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_scrub_continue(fs_info);
 cleanup_transaction:
btrfs_trans_release_metadata(trans);
+   btrfs_create_pending_block_groups(trans);
btrfs_trans_release_chunk_metadata(trans);
trans->block_rsv = NULL;
btrfs_warn(fs_info, "Skipping commit of aborted transaction.");
-- 
2.14.3



[PATCH 12/35] btrfs: add ALLOC_CHUNK_FORCE to the flushing code

2018-08-30 Thread Josef Bacik
With my change to no longer take into account the global reserve for
metadata allocation chunks we have this side-effect for mixed block
group fs'es where we are no longer allocating enough chunks for the
data/metadata requirements.  To deal with this add a ALLOC_CHUNK_FORCE
step to the flushing state machine.  This will only get used if we've
already made a full loop through the flushing machinery and tried
committing the transaction.  If we have then we can try and force a
chunk allocation since we likely need it to make progress.  This
resolves the issues I was seeing with the mixed bg tests in xfstests
with my previous patch.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h | 3 ++-
 fs/btrfs/extent-tree.c   | 7 ++-
 include/trace/events/btrfs.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0a4e55703d48..791e287c2292 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2731,7 +2731,8 @@ enum btrfs_flush_state {
FLUSH_DELALLOC  =   5,
FLUSH_DELALLOC_WAIT =   6,
ALLOC_CHUNK =   7,
-   COMMIT_TRANS=   8,
+   ALLOC_CHUNK_FORCE   =   8,
+   COMMIT_TRANS=   9,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 783341e3653e..22e1f9f55f4f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4907,6 +4907,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
btrfs_end_transaction(trans);
break;
case ALLOC_CHUNK:
+   case ALLOC_CHUNK_FORCE:
trans = btrfs_join_transaction(root);
if (IS_ERR(trans)) {
ret = PTR_ERR(trans);
@@ -4914,7 +4915,9 @@ static void flush_space(struct btrfs_fs_info *fs_info,
}
ret = do_chunk_alloc(trans,
 btrfs_metadata_alloc_profile(fs_info),
-CHUNK_ALLOC_NO_FORCE);
+(state == ALLOC_CHUNK) ?
+CHUNK_ALLOC_NO_FORCE :
+CHUNK_ALLOC_FORCE);
btrfs_end_transaction(trans);
if (ret > 0 || ret == -ENOSPC)
ret = 0;
@@ -5060,6 +5063,8 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
}
}
spin_unlock(_info->lock);
+   if (flush_state == ALLOC_CHUNK_FORCE && !commit_cycles)
+   flush_state++;
} while (flush_state <= COMMIT_TRANS);
 }
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 7d205e50b09c..fdb23181b5b7 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1051,6 +1051,7 @@ TRACE_EVENT(btrfs_trigger_flush,
{ FLUSH_DELAYED_REFS_NR,"FLUSH_DELAYED_REFS_NR"},   
\
{ FLUSH_DELAYED_REFS,   "FLUSH_ELAYED_REFS"},   
\
{ ALLOC_CHUNK,  "ALLOC_CHUNK"}, 
\
+   { ALLOC_CHUNK_FORCE,"ALLOC_CHUNK_FORCE"},   
\
{ COMMIT_TRANS, "COMMIT_TRANS"})
 
 TRACE_EVENT(btrfs_flush_space,
-- 
2.14.3



[PATCH 27/35] btrfs: handle delayed ref head accounting cleanup in abort

2018-08-30 Thread Josef Bacik
We weren't doing any of the accounting cleanup when we aborted
transactions.  Fix this by making cleanup_ref_head_accounting global and
calling it from the abort code, this fixes the issue where our
accounting was all wrong after the fs aborts.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h   |  5 +
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/extent-tree.c | 13 ++---
 3 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 791e287c2292..67923b2030b8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -35,6 +35,7 @@
 struct btrfs_trans_handle;
 struct btrfs_transaction;
 struct btrfs_pending_snapshot;
+struct btrfs_delayed_ref_root;
 extern struct kmem_cache *btrfs_trans_handle_cachep;
 extern struct kmem_cache *btrfs_bit_radix_cachep;
 extern struct kmem_cache *btrfs_path_cachep;
@@ -2624,6 +2625,10 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
   unsigned long count);
 int btrfs_async_run_delayed_refs(struct btrfs_fs_info *fs_info,
 unsigned long count, u64 transid, int wait);
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head);
 int btrfs_lookup_data_extent(struct btrfs_fs_info *fs_info, u64 start, u64 
len);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 bytenr,
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 1d3f5731d616..caaca8154a1a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4240,6 +4240,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (pin_bytes)
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
btrfs_put_delayed_ref_head(head);
cond_resched();
spin_lock(_refs->lock);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 32579221d900..031d2b11ddee 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2466,12 +2466,11 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
*trans,
return ret ? ret : 1;
 }
 
-static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
-   struct btrfs_delayed_ref_head *head)
+void
+btrfs_cleanup_ref_head_accounting(struct btrfs_fs_info *fs_info,
+ struct btrfs_delayed_ref_root *delayed_refs,
+ struct btrfs_delayed_ref_head *head)
 {
-   struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_delayed_ref_root *delayed_refs =
-   >transaction->delayed_refs;
int nr_items = 1;
 
if (head->total_ref_mod < 0) {
@@ -2549,7 +2548,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(fs_info, delayed_refs, head);
 
trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
@@ -7191,7 +7190,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
-   cleanup_ref_head_accounting(trans, head);
+   btrfs_cleanup_ref_head_accounting(trans->fs_info, delayed_refs, head);
mutex_unlock(>mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 32/35] btrfs: only free reserved extent if we didn't insert it

2018-08-30 Thread Josef Bacik
When we insert the file extent once the ordered extent completes we free
the reserved extent reservation as it'll have been migrated to the
bytes_used counter.  However if we error out after this step we'll still
clear the reserved extent reservation, resulting in a negative
accounting of the reserved bytes for the block group and space info.
Fix this by only doing the free if we didn't successfully insert a file
extent for this extent.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 10455d0aa71c..3391f6a9fc77 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
bool truncated = false;
bool range_locked = false;
bool clear_new_delalloc_bytes = false;
+   bool clear_reserved_extent = true;
 
if (!test_bit(BTRFS_ORDERED_NOCOW, _extent->flags) &&
!test_bit(BTRFS_ORDERED_PREALLOC, _extent->flags) &&
@@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
logical_len, logical_len,
compress_type, 0, 0,
BTRFS_FILE_EXTENT_REG);
-   if (!ret)
+   if (!ret) {
+   clear_reserved_extent = false;
btrfs_release_delalloc_bytes(fs_info,
 ordered_extent->start,
 ordered_extent->disk_len);
+   }
}
unpin_extent_cache(_I(inode)->extent_tree,
   ordered_extent->file_offset, ordered_extent->len,
@@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
 * wrong we need to return the space for this ordered extent
 * back to the allocator.  We only free the extent in the
 * truncated case if we didn't write out the extent at all.
+*
+* If we made it past insert_reserved_file_extent before we
+* errored out then we don't need to do this as the accounting
+* has already been done.
 */
if ((ret || !logical_len) &&
+   clear_reserved_extent &&
!test_bit(BTRFS_ORDERED_NOCOW, _extent->flags) &&
!test_bit(BTRFS_ORDERED_PREALLOC, _extent->flags))
btrfs_free_reserved_extent(fs_info,
-- 
2.14.3



[PATCH 28/35] btrfs: call btrfs_create_pending_block_groups unconditionally

2018-08-30 Thread Josef Bacik
The first thing we do is loop through the list, this

if (!list_empty())
btrfs_create_pending_block_groups();

thing is just wasted space.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 3 +--
 fs/btrfs/transaction.c | 6 ++
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 031d2b11ddee..90f267f4dd0f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2970,8 +2970,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle 
*trans,
}
 
if (run_all) {
-   if (!list_empty(>new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
spin_lock(_refs->lock);
node = rb_first(_refs->href_root);
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 2bb19e2ded5e..89d14f135837 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -839,8 +839,7 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   if (!list_empty(>new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
btrfs_trans_release_chunk_metadata(trans);
 
@@ -1927,8 +1926,7 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
cur_trans->delayed_refs.flushing = 1;
smp_wmb();
 
-   if (!list_empty(>new_bgs))
-   btrfs_create_pending_block_groups(trans);
+   btrfs_create_pending_block_groups(trans);
 
if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) {
int run_it = 0;
-- 
2.14.3



[PATCH 35/35] MAINTAINERS: update my email address for btrfs

2018-08-30 Thread Josef Bacik
My work email is completely useless, switch it to my personal address so
I get emails on a account I actually pay attention to.

Signed-off-by: Josef Bacik 
---
 MAINTAINERS | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 32fbc6f732d4..7723dc958e99 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3095,7 +3095,7 @@ F:drivers/gpio/gpio-bt8xx.c
 
 BTRFS FILE SYSTEM
 M: Chris Mason 
-M: Josef Bacik 
+M: Josef Bacik 
 M: David Sterba 
 L: linux-btrfs@vger.kernel.org
 W: http://btrfs.wiki.kernel.org/
-- 
2.14.3



[PATCH 10/35] btrfs: fix truncate throttling

2018-08-30 Thread Josef Bacik
We have a bunch of magic to make sure we're throttling delayed refs when
truncating a file.  Now that we have a delayed refs rsv and a mechanism
for refilling that reserve simply use that instead of all of this magic.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c | 78 
 1 file changed, 16 insertions(+), 62 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 212fa71317d6..10455d0aa71c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4493,31 +4493,6 @@ static int btrfs_rmdir(struct inode *dir, struct dentry 
*dentry)
return err;
 }
 
-static int truncate_space_check(struct btrfs_trans_handle *trans,
-   struct btrfs_root *root,
-   u64 bytes_deleted)
-{
-   struct btrfs_fs_info *fs_info = root->fs_info;
-   int ret;
-
-   /*
-* This is only used to apply pressure to the enospc system, we don't
-* intend to use this reservation at all.
-*/
-   bytes_deleted = btrfs_csum_bytes_to_leaves(fs_info, bytes_deleted);
-   bytes_deleted *= fs_info->nodesize;
-   ret = btrfs_block_rsv_add(root, _info->trans_block_rsv,
- bytes_deleted, BTRFS_RESERVE_NO_FLUSH);
-   if (!ret) {
-   trace_btrfs_space_reservation(fs_info, "transaction",
- trans->transid,
- bytes_deleted, 1);
-   trans->bytes_reserved += bytes_deleted;
-   }
-   return ret;
-
-}
-
 /*
  * Return this if we need to call truncate_block for the last bit of the
  * truncate.
@@ -4562,7 +4537,6 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
u64 bytes_deleted = 0;
bool be_nice = false;
bool should_throttle = false;
-   bool should_end = false;
 
BUG_ON(new_size > 0 && min_type != BTRFS_EXTENT_DATA_KEY);
 
@@ -4775,15 +4749,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
btrfs_abort_transaction(trans, ret);
break;
}
-   if (btrfs_should_throttle_delayed_refs(trans, fs_info))
-   btrfs_async_run_delayed_refs(fs_info,
-   trans->delayed_ref_updates * 2,
-   trans->transid, 0);
if (be_nice) {
-   if (truncate_space_check(trans, root,
-extent_num_bytes)) {
-   should_end = true;
-   }
if (btrfs_should_throttle_delayed_refs(trans,
   fs_info))
should_throttle = true;
@@ -4795,7 +4761,7 @@ int btrfs_truncate_inode_items(struct btrfs_trans_handle 
*trans,
 
if (path->slots[0] == 0 ||
path->slots[0] != pending_del_slot ||
-   should_throttle || should_end) {
+   should_throttle) {
if (pending_del_nr) {
ret = btrfs_del_items(trans, root, path,
pending_del_slot,
@@ -4807,23 +4773,23 @@ int btrfs_truncate_inode_items(struct 
btrfs_trans_handle *trans,
pending_del_nr = 0;
}
btrfs_release_path(path);
-   if (should_throttle) {
-   unsigned long updates = 
trans->delayed_ref_updates;
-   if (updates) {
-   trans->delayed_ref_updates = 0;
-   ret = btrfs_run_delayed_refs(trans,
-  updates * 2);
-   if (ret)
-   break;
-   }
-   }
+
/*
-* if we failed to refill our space rsv, bail out
-* and let the transaction restart
+* We can generate a lot of delayed refs, so we need to
+* throttle every once and a while and make sure we're
+* adding enough space to keep up with the work we are
+* generating.  Since we hold a transaction here we can
+* only FLUSH_LIMIT, if this fails we just return EAGAIN
+* and let the normal space allocation stuff do it's
+* work.
 */
-   if (should_end) {
- 

[PATCH 08/35] btrfs: release metadata before running delayed refs

2018-08-30 Thread Josef Bacik
We want to release the unused reservation we have since it refills the
delayed refs reserve, which will make everything go smoother when
running the delayed refs if we're short on our reservation.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 99741254e27e..ebb0c0405598 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1915,6 +1915,9 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
return ret;
}
 
+   btrfs_trans_release_metadata(trans);
+   trans->block_rsv = NULL;
+
/* make a pass through all the delayed refs we have so far
 * any runnings procs may add more while we are here
 */
@@ -1924,9 +1927,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
return ret;
}
 
-   btrfs_trans_release_metadata(trans);
-   trans->block_rsv = NULL;
-
cur_trans = trans->transaction;
 
/*
-- 
2.14.3



[PATCH 33/35] btrfs: fix insert_reserved error handling

2018-08-30 Thread Josef Bacik
We were not handling the reserved byte accounting properly for data
references.  Metadata was fine, if it errored out the error paths would
free the bytes_reserved count and pin the extent, but it even missed one
of the error cases.  So instead move this handling up into
run_one_delayed_ref so we are sure that both cases are properly cleaned
up in case of a transaction abort.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 132a1157982c..fd9169f80de0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2405,6 +2405,9 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
   insert_reserved);
else
BUG();
+   if (ret && insert_reserved)
+   btrfs_pin_extent(trans->fs_info, node->bytenr,
+node->num_bytes, 1);
return ret;
 }
 
@@ -8227,21 +8230,14 @@ static int alloc_reserved_tree_block(struct 
btrfs_trans_handle *trans,
}
 
path = btrfs_alloc_path();
-   if (!path) {
-   btrfs_free_and_pin_reserved_extent(fs_info,
-  extent_key.objectid,
-  fs_info->nodesize);
+   if (!path)
return -ENOMEM;
-   }
 
path->leave_spinning = 1;
ret = btrfs_insert_empty_item(trans, fs_info->extent_root, path,
  _key, size);
if (ret) {
btrfs_free_path(path);
-   btrfs_free_and_pin_reserved_extent(fs_info,
-  extent_key.objectid,
-  fs_info->nodesize);
return ret;
}
 
-- 
2.14.3



[PATCH 26/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delete_ref_head

2018-08-30 Thread Josef Bacik
Instead of open coding this stuff use the helper instead.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 7 +--
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c72ab2ca7627..1d3f5731d616 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4232,12 +4232,7 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
if (head->must_insert_reserved)
pin_bytes = true;
btrfs_free_delayed_extent_op(head->extent_op);
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
-   atomic_dec(_refs->num_entries);
-   rb_erase(>href_node, _refs->href_root);
-   RB_CLEAR_NODE(>href_node);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(>lock);
spin_unlock(_refs->lock);
mutex_unlock(>mutex);
-- 
2.14.3



[PATCH 24/35] btrfs: pass delayed_refs_root to btrfs_delayed_ref_lock

2018-08-30 Thread Josef Bacik
We don't need the trans except to get the delayed_refs_root, so just
pass the delayed_refs_root into btrfs_delayed_ref_lock and call it a
day.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 5 +
 fs/btrfs/delayed-ref.h | 2 +-
 fs/btrfs/extent-tree.c | 2 +-
 3 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 96ce087747b2..87778645bf4a 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -197,12 +197,9 @@ find_ref_head(struct rb_root *root, u64 bytenr,
return NULL;
 }
 
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
   struct btrfs_delayed_ref_head *head)
 {
-   struct btrfs_delayed_ref_root *delayed_refs;
-
-   delayed_refs = >transaction->delayed_refs;
lockdep_assert_held(_refs->lock);
if (mutex_trylock(>mutex))
return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 7769177b489e..ee636d7a710a 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -255,7 +255,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle 
*trans,
 struct btrfs_delayed_ref_head *
 btrfs_find_delayed_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
u64 bytenr);
-int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
+int btrfs_delayed_ref_lock(struct btrfs_delayed_ref_root *delayed_refs,
   struct btrfs_delayed_ref_head *head);
 static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head 
*head)
 {
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index fc30ff96f0d6..32579221d900 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2591,7 +2591,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
 
/* grab the lock that says we are going to process
 * all the refs for this head */
-   ret = btrfs_delayed_ref_lock(trans, locked_ref);
+   ret = btrfs_delayed_ref_lock(delayed_refs, locked_ref);
spin_unlock(_refs->lock);
/*
 * we may have dropped the spin lock to get the head
-- 
2.14.3



[PATCH 25/35] btrfs: make btrfs_destroy_delayed_refs use btrfs_delayed_ref_lock

2018-08-30 Thread Josef Bacik
We have this open coded in btrfs_destroy_delayed_refs, use the helper
instead.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 11 ++-
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 11ea2ea7439e..c72ab2ca7627 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -4214,16 +4214,9 @@ static int btrfs_destroy_delayed_refs(struct 
btrfs_transaction *trans,
 
head = rb_entry(node, struct btrfs_delayed_ref_head,
href_node);
-   if (!mutex_trylock(>mutex)) {
-   refcount_inc(>refs);
-   spin_unlock(_refs->lock);
-
-   mutex_lock(>mutex);
-   mutex_unlock(>mutex);
-   btrfs_put_delayed_ref_head(head);
-   spin_lock(_refs->lock);
+   if (btrfs_delayed_ref_lock(delayed_refs, head))
continue;
-   }
+
spin_lock(>lock);
while ((n = rb_first(>ref_tree)) != NULL) {
ref = rb_entry(n, struct btrfs_delayed_ref_node,
-- 
2.14.3



[PATCH 21/35] btrfs: only run delayed refs if we're committing

2018-08-30 Thread Josef Bacik
I noticed in a giant dbench run that we spent a lot of time on lock
contention while running transaction commit.  This is because dbench
results in a lot of fsync()'s that do a btrfs_transaction_commit(), and
they all run the delayed refs first thing, so they all contend with
each other.  This leads to seconds of 0 throughput.  Change this to only
run the delayed refs if we're the ones committing the transaction.  This
makes the latency go away and we get no more lock contention.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/transaction.c | 24 +---
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index ebb0c0405598..2bb19e2ded5e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1918,15 +1918,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
btrfs_trans_release_metadata(trans);
trans->block_rsv = NULL;
 
-   /* make a pass through all the delayed refs we have so far
-* any runnings procs may add more while we are here
-*/
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
cur_trans = trans->transaction;
 
/*
@@ -1939,12 +1930,6 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
if (!list_empty(>new_bgs))
btrfs_create_pending_block_groups(trans);
 
-   ret = btrfs_run_delayed_refs(trans, 0);
-   if (ret) {
-   btrfs_end_transaction(trans);
-   return ret;
-   }
-
if (!test_bit(BTRFS_TRANS_DIRTY_BG_RUN, _trans->flags)) {
int run_it = 0;
 
@@ -2015,6 +2000,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle 
*trans)
spin_unlock(_info->trans_lock);
}
 
+   /*
+* We are now the only one in the commit area, we can run delayed refs
+* without hitting a bunch of lock contention from a lot of people
+* trying to commit the transaction at once.
+*/
+   ret = btrfs_run_delayed_refs(trans, 0);
+   if (ret)
+   goto cleanup_transaction;
+
extwriter_counter_dec(cur_trans, trans->type);
 
ret = btrfs_start_delalloc_flush(fs_info);
-- 
2.14.3



[PATCH 23/35] btrfs: assert on non-empty delayed iputs

2018-08-30 Thread Josef Bacik
I ran into an issue where there was some reference being held on an
inode that I couldn't track.  This assert wasn't triggered, but it at
least rules out we're doing something stupid.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/disk-io.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 0e42401756b8..11ea2ea7439e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3979,6 +3979,7 @@ void close_ctree(struct btrfs_fs_info *fs_info)
kthread_stop(fs_info->transaction_kthread);
kthread_stop(fs_info->cleaner_kthread);
 
+   ASSERT(list_empty(_info->delayed_iputs));
set_bit(BTRFS_FS_CLOSING_DONE, _info->flags);
 
btrfs_free_qgroup_config(fs_info);
-- 
2.14.3



[PATCH 19/35] btrfs: don't use ctl->free_space for max_extent_size

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

max_extent_size is supposed to be the largest contiguous range for the
space info, and ctl->free_space is the total free space in the block
group.  We need to keep track of these separately and _only_ use the
max_free_space if we don't have a max_extent_size, as that means our
original request was too large to search any of the block groups for and
therefore wouldn't have a max_extent_size set.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 664b867ae499..ca98c39308f6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7460,6 +7460,7 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
struct btrfs_block_group_cache *block_group = NULL;
u64 search_start = 0;
u64 max_extent_size = 0;
+   u64 max_free_space = 0;
u64 empty_cluster = 0;
struct btrfs_space_info *space_info;
int loop = 0;
@@ -7755,8 +7756,8 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
spin_lock(>tree_lock);
if (ctl->free_space <
num_bytes + empty_cluster + empty_size) {
-   if (ctl->free_space > max_extent_size)
-   max_extent_size = ctl->free_space;
+   max_free_space = max(max_free_space,
+ctl->free_space);
spin_unlock(>tree_lock);
goto loop;
}
@@ -7923,6 +7924,8 @@ static noinline int find_free_extent(struct btrfs_fs_info 
*fs_info,
}
 out:
if (ret == -ENOSPC) {
+   if (!max_extent_size)
+   max_extent_size = max_free_space;
spin_lock(_info->lock);
space_info->max_extent_size = max_extent_size;
spin_unlock(_info->lock);
-- 
2.14.3



[PATCH 22/35] btrfs: make sure we create all new bgs

2018-08-30 Thread Josef Bacik
We can actually allocate new chunks while we're creating our bg's, so
instead of doing list_for_each_safe, just do while (!list_empty()) so we
make sure to catch any new bg's that get added to the list.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index ca98c39308f6..fc30ff96f0d6 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10331,7 +10331,7 @@ int btrfs_read_block_groups(struct btrfs_fs_info *info)
 void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans)
 {
struct btrfs_fs_info *fs_info = trans->fs_info;
-   struct btrfs_block_group_cache *block_group, *tmp;
+   struct btrfs_block_group_cache *block_group;
struct btrfs_root *extent_root = fs_info->extent_root;
struct btrfs_block_group_item item;
struct btrfs_key key;
@@ -10339,7 +10339,10 @@ void btrfs_create_pending_block_groups(struct 
btrfs_trans_handle *trans)
bool can_flush_pending_bgs = trans->can_flush_pending_bgs;
 
trans->can_flush_pending_bgs = false;
-   list_for_each_entry_safe(block_group, tmp, >new_bgs, bg_list) {
+   while (!list_empty(>new_bgs)) {
+   block_group = list_first_entry(>new_bgs,
+  struct btrfs_block_group_cache,
+  bg_list);
if (ret)
goto next;
 
-- 
2.14.3



[PATCH 20/35] btrfs: reset max_extent_size on clear in a bitmap

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

We need to clear the max_extent_size when we clear bits from a bitmap
since it could have been from the range that contains the
max_extent_size.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/free-space-cache.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index 53521027dd78..7faca05e61ea 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -1683,6 +1683,8 @@ static inline void __bitmap_clear_bits(struct 
btrfs_free_space_ctl *ctl,
bitmap_clear(info->bitmap, start, count);
 
info->bytes -= bytes;
+   if (info->max_extent_size > ctl->unit)
+   info->max_extent_size = 0;
 }
 
 static void bitmap_clear_bits(struct btrfs_free_space_ctl *ctl,
-- 
2.14.3



[PATCH 14/35] btrfs: don't enospc all tickets on flush failure

2018-08-30 Thread Josef Bacik
With the introduction of the per-inode block_rsv it became possible to
have really really large reservation requests made because of data
fragmentation.  Since the ticket stuff assumed that we'd always have
relatively small reservation requests it just killed all tickets if we
were unable to satisfy the current request.  However this is generally
not the case anymore.  So fix this logic to instead see if we had a
ticket that we were able to give some reservation to, and if we were
continue the flushing loop again.  Likewise we make the tickets use the
space_info_add_old_bytes() method of returning what reservation they did
receive in hopes that it could satisfy reservations down the line.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f4e7caf37d6c..7c0e99e1f56c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4771,6 +4771,7 @@ static void shrink_delalloc(struct btrfs_fs_info 
*fs_info, u64 to_reclaim,
 }
 
 struct reserve_ticket {
+   u64 orig_bytes;
u64 bytes;
int error;
struct list_head list;
@@ -4993,7 +4994,7 @@ static inline int need_do_async_reclaim(struct 
btrfs_fs_info *fs_info,
!test_bit(BTRFS_FS_STATE_REMOUNTING, _info->fs_state));
 }
 
-static void wake_all_tickets(struct list_head *head)
+static bool wake_all_tickets(struct list_head *head)
 {
struct reserve_ticket *ticket;
 
@@ -5002,7 +5003,10 @@ static void wake_all_tickets(struct list_head *head)
list_del_init(>list);
ticket->error = -ENOSPC;
wake_up(>wait);
+   if (ticket->bytes != ticket->orig_bytes)
+   return true;
}
+   return false;
 }
 
 /*
@@ -5057,8 +5061,12 @@ static void btrfs_async_reclaim_metadata_space(struct 
work_struct *work)
if (flush_state > COMMIT_TRANS) {
commit_cycles++;
if (commit_cycles > 2) {
-   wake_all_tickets(_info->tickets);
-   space_info->flush = 0;
+   if (wake_all_tickets(_info->tickets)) {
+   flush_state = FLUSH_DELAYED_ITEMS_NR;
+   commit_cycles--;
+   } else {
+   space_info->flush = 0;
+   }
} else {
flush_state = FLUSH_DELAYED_ITEMS_NR;
}
@@ -5112,10 +5120,11 @@ static void priority_reclaim_metadata_space(struct 
btrfs_fs_info *fs_info,
 
 static int wait_reserve_ticket(struct btrfs_fs_info *fs_info,
   struct btrfs_space_info *space_info,
-  struct reserve_ticket *ticket, u64 orig_bytes)
+  struct reserve_ticket *ticket)
 
 {
DEFINE_WAIT(wait);
+   u64 reclaim_bytes = 0;
int ret = 0;
 
spin_lock(_info->lock);
@@ -5136,14 +5145,12 @@ static int wait_reserve_ticket(struct btrfs_fs_info 
*fs_info,
ret = ticket->error;
if (!list_empty(>list))
list_del_init(>list);
-   if (ticket->bytes && ticket->bytes < orig_bytes) {
-   u64 num_bytes = orig_bytes - ticket->bytes;
-   space_info->bytes_may_use -= num_bytes;
-   trace_btrfs_space_reservation(fs_info, "space_info",
- space_info->flags, num_bytes, 0);
-   }
+   if (ticket->bytes && ticket->bytes < ticket->orig_bytes)
+   reclaim_bytes = ticket->orig_bytes - ticket->bytes;
spin_unlock(_info->lock);
 
+   if (reclaim_bytes)
+   space_info_add_old_bytes(fs_info, space_info, reclaim_bytes);
return ret;
 }
 
@@ -5169,6 +5176,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 {
struct reserve_ticket ticket;
u64 used;
+   u64 reclaim_bytes = 0;
int ret = 0;
 
ASSERT(orig_bytes);
@@ -5204,6 +5212,7 @@ static int __reserve_metadata_bytes(struct btrfs_fs_info 
*fs_info,
 * the list and we will do our own flushing further down.
 */
if (ret && flush != BTRFS_RESERVE_NO_FLUSH) {
+   ticket.orig_bytes = orig_bytes;
ticket.bytes = orig_bytes;
ticket.error = 0;
init_waitqueue_head();
@@ -5244,25 +5253,21 @@ static int __reserve_metadata_bytes(struct 
btrfs_fs_info *fs_info,
return ret;
 
if (flush == BTRFS_RESERVE_FLUSH_ALL)
-   return wait_reserve_ticket(fs_info, space_info, ,
-  orig_bytes);
+   return wait_reserve_ticket(fs_info, 

[PATCH 15/35] btrfs: run delayed iputs before committing

2018-08-30 Thread Josef Bacik
We want to have a complete picture of any delayed inode updates before
we make the decision to commit or not, so make sure we run delayed iputs
before making the decision to commit or not.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 7c0e99e1f56c..064db7ebaf67 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4831,6 +4831,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
goto commit;
}
 
+   mutex_lock(_info->cleaner_delayed_iput_mutex);
+   btrfs_run_delayed_iputs(fs_info);
+   mutex_unlock(_info->cleaner_delayed_iput_mutex);
+
spin_lock(_rsv->lock);
reclaim_bytes += delayed_rsv->reserved;
spin_unlock(_rsv->lock);
-- 
2.14.3



[PATCH 16/35] btrfs: loop in inode_rsv_refill

2018-08-30 Thread Josef Bacik
With severe fragmentation we can end up with our inode rsv size being
huge during writeout, which would cause us to need to make very large
metadata reservations.  However we may not actually need that much once
writeout is complete.  So instead try to make our reservation, and if we
couldn't make it re-calculate our new reservation size and try again.
If our reservation size doesn't change between tries then we know we are
actually out of space and can error out.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 19 +--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 064db7ebaf67..664b867ae499 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5769,10 +5769,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
 {
struct btrfs_root *root = inode->root;
struct btrfs_block_rsv *block_rsv = >block_rsv;
-   u64 num_bytes = 0;
+   u64 num_bytes = 0, last = 0;
u64 qgroup_num_bytes = 0;
int ret = -ENOSPC;
 
+again:
spin_lock(_rsv->lock);
if (block_rsv->reserved < block_rsv->size)
num_bytes = block_rsv->size - block_rsv->reserved;
@@ -5797,8 +5798,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
*inode,
spin_lock(_rsv->lock);
block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
spin_unlock(_rsv->lock);
-   } else
+   } else {
btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
+
+   /*
+* If we are fragmented we can end up with a lot of outstanding
+* extents which will make our size be much larger than our
+* reserved amount.  If we happen to try to do a reservation
+* here that may result in us trying to do a pretty hefty
+* reservation, which we may not need once delalloc flushing
+* happens.  If this is the case try and do the reserve again.
+*/
+   if (flush == BTRFS_RESERVE_FLUSH_ALL && last != num_bytes) {
+   last = num_bytes;
+   goto again;
+   }
+   }
return ret;
 }
 
-- 
2.14.3



[PATCH 02/35] btrfs: add cleanup_ref_head_accounting helper

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

We were missing some quota cleanups in check_ref_cleanup, so break the
ref head accounting cleanup into a helper and call that from both
check_ref_cleanup and cleanup_ref_head.  This will hopefully ensure that
we don't screw up accounting in the future for other things that we add.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 67 +-
 1 file changed, 39 insertions(+), 28 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6799950fa057..4c9fd35bca07 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2461,6 +2461,41 @@ static int cleanup_extent_op(struct btrfs_trans_handle 
*trans,
return ret ? ret : 1;
 }
 
+static void cleanup_ref_head_accounting(struct btrfs_trans_handle *trans,
+   struct btrfs_delayed_ref_head *head)
+{
+   struct btrfs_fs_info *fs_info = trans->fs_info;
+   struct btrfs_delayed_ref_root *delayed_refs =
+   >transaction->delayed_refs;
+
+   if (head->total_ref_mod < 0) {
+   struct btrfs_space_info *space_info;
+   u64 flags;
+
+   if (head->is_data)
+   flags = BTRFS_BLOCK_GROUP_DATA;
+   else if (head->is_system)
+   flags = BTRFS_BLOCK_GROUP_SYSTEM;
+   else
+   flags = BTRFS_BLOCK_GROUP_METADATA;
+   space_info = __find_space_info(fs_info, flags);
+   ASSERT(space_info);
+   percpu_counter_add_batch(_info->total_bytes_pinned,
+  -head->num_bytes,
+  BTRFS_TOTAL_BYTES_PINNED_BATCH);
+
+   if (head->is_data) {
+   spin_lock(_refs->lock);
+   delayed_refs->pending_csums -= head->num_bytes;
+   spin_unlock(_refs->lock);
+   }
+   }
+
+   /* Also free its reserved qgroup space */
+   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
+ head->qgroup_reserved);
+}
+
 static int cleanup_ref_head(struct btrfs_trans_handle *trans,
struct btrfs_delayed_ref_head *head)
 {
@@ -2496,31 +2531,6 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(_refs->lock);
spin_unlock(>lock);
 
-   trace_run_delayed_ref_head(fs_info, head, 0);
-
-   if (head->total_ref_mod < 0) {
-   struct btrfs_space_info *space_info;
-   u64 flags;
-
-   if (head->is_data)
-   flags = BTRFS_BLOCK_GROUP_DATA;
-   else if (head->is_system)
-   flags = BTRFS_BLOCK_GROUP_SYSTEM;
-   else
-   flags = BTRFS_BLOCK_GROUP_METADATA;
-   space_info = __find_space_info(fs_info, flags);
-   ASSERT(space_info);
-   percpu_counter_add_batch(_info->total_bytes_pinned,
-  -head->num_bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH);
-
-   if (head->is_data) {
-   spin_lock(_refs->lock);
-   delayed_refs->pending_csums -= head->num_bytes;
-   spin_unlock(_refs->lock);
-   }
-   }
-
if (head->must_insert_reserved) {
btrfs_pin_extent(fs_info, head->bytenr,
 head->num_bytes, 1);
@@ -2530,9 +2540,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
}
}
 
-   /* Also free its reserved qgroup space */
-   btrfs_qgroup_free_delayed_ref(fs_info, head->qgroup_ref_root,
- head->qgroup_reserved);
+   cleanup_ref_head_accounting(trans, head);
+
+   trace_run_delayed_ref_head(fs_info, head, 0);
btrfs_delayed_ref_unlock(head);
btrfs_put_delayed_ref_head(head);
return 0;
@@ -6991,6 +7001,7 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (head->must_insert_reserved)
ret = 1;
 
+   cleanup_ref_head_accounting(trans, head);
mutex_unlock(>mutex);
btrfs_put_delayed_ref_head(head);
return ret;
-- 
2.14.3



[PATCH 09/35] btrfs: protect space cache inode alloc with nofs

2018-08-30 Thread Josef Bacik
If we're allocating a new space cache inode it's likely going to be
under a transaction handle, so we need to use memalloc_nofs_save() in
order to avoid deadlocks, and more importantly lockdep messages that
make xfstests fail.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/free-space-cache.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index c3888c113d81..db93a5f035a0 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -10,6 +10,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "free-space-cache.h"
 #include "transaction.h"
@@ -47,6 +48,7 @@ static struct inode *__lookup_free_space_inode(struct 
btrfs_root *root,
struct btrfs_free_space_header *header;
struct extent_buffer *leaf;
struct inode *inode = NULL;
+   unsigned nofs_flag;
int ret;
 
key.objectid = BTRFS_FREE_SPACE_OBJECTID;
@@ -68,7 +70,9 @@ static struct inode *__lookup_free_space_inode(struct 
btrfs_root *root,
btrfs_disk_key_to_cpu(, _key);
btrfs_release_path(path);
 
+   nofs_flag = memalloc_nofs_save();
inode = btrfs_iget(fs_info->sb, , root, NULL);
+   memalloc_nofs_restore(nofs_flag);
if (IS_ERR(inode))
return inode;
 
-- 
2.14.3



[PATCH 05/35] btrfs: introduce delayed_refs_rsv

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

Traditionally we've had voodoo in btrfs to account for the space that
delayed refs may take up by having a global_block_rsv.  This works most
of the time, except when it doesn't.  We've had issues reported and seen
in production where sometimes the global reserve is exhausted during
transaction commit before we can run all of our delayed refs, resulting
in an aborted transaction.  Because of this voodoo we have equally
dubious flushing semantics around throttling delayed refs which we often
get wrong.

So instead give them their own block_rsv.  This way we can always know
exactly how much outstanding space we need for delayed refs.  This
allows us to make sure we are constantly filling that reservation up
with space, and allows us to put more precise pressure on the enospc
system.  Instead of doing math to see if its a good time to throttle,
the normal enospc code will be invoked if we have a lot of delayed refs
pending, and they will be run via the normal flushing mechanism.

For now the delayed_refs_rsv will hold the reservations for the delayed
refs, the block group updates, and deleting csums.  We could have a
separate rsv for the block group updates, but the csum deletion stuff is
still handled via the delayed_refs so that will stay there.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/ctree.h |  24 +++-
 fs/btrfs/delayed-ref.c   |  28 -
 fs/btrfs/disk-io.c   |   3 +
 fs/btrfs/extent-tree.c   | 268 +++
 fs/btrfs/transaction.c   |  68 +--
 include/trace/events/btrfs.h |   2 +
 6 files changed, 294 insertions(+), 99 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 66f1d3895bca..0a4e55703d48 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -452,8 +452,9 @@ struct btrfs_space_info {
 #defineBTRFS_BLOCK_RSV_TRANS   3
 #defineBTRFS_BLOCK_RSV_CHUNK   4
 #defineBTRFS_BLOCK_RSV_DELOPS  5
-#defineBTRFS_BLOCK_RSV_EMPTY   6
-#defineBTRFS_BLOCK_RSV_TEMP7
+#define BTRFS_BLOCK_RSV_DELREFS6
+#defineBTRFS_BLOCK_RSV_EMPTY   7
+#defineBTRFS_BLOCK_RSV_TEMP8
 
 struct btrfs_block_rsv {
u64 size;
@@ -794,6 +795,8 @@ struct btrfs_fs_info {
struct btrfs_block_rsv chunk_block_rsv;
/* block reservation for delayed operations */
struct btrfs_block_rsv delayed_block_rsv;
+   /* block reservation for delayed refs */
+   struct btrfs_block_rsv delayed_refs_rsv;
 
struct btrfs_block_rsv empty_block_rsv;
 
@@ -2723,10 +2726,12 @@ enum btrfs_reserve_flush_enum {
 enum btrfs_flush_state {
FLUSH_DELAYED_ITEMS_NR  =   1,
FLUSH_DELAYED_ITEMS =   2,
-   FLUSH_DELALLOC  =   3,
-   FLUSH_DELALLOC_WAIT =   4,
-   ALLOC_CHUNK =   5,
-   COMMIT_TRANS=   6,
+   FLUSH_DELAYED_REFS_NR   =   3,
+   FLUSH_DELAYED_REFS  =   4,
+   FLUSH_DELALLOC  =   5,
+   FLUSH_DELALLOC_WAIT =   6,
+   ALLOC_CHUNK =   7,
+   COMMIT_TRANS=   8,
 };
 
 int btrfs_alloc_data_chunk_ondemand(struct btrfs_inode *inode, u64 bytes);
@@ -2777,6 +2782,13 @@ int btrfs_cond_migrate_bytes(struct btrfs_fs_info 
*fs_info,
 void btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 struct btrfs_block_rsv *block_rsv,
 u64 num_bytes);
+void btrfs_delayed_refs_rsv_release(struct btrfs_fs_info *fs_info, int nr);
+void btrfs_update_delayed_refs_rsv(struct btrfs_trans_handle *trans);
+int btrfs_refill_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+ enum btrfs_reserve_flush_enum flush);
+void btrfs_migrate_to_delayed_refs_rsv(struct btrfs_fs_info *fs_info,
+  struct btrfs_block_rsv *src,
+  u64 num_bytes);
 int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_dec_block_group_ro(struct btrfs_block_group_cache *cache);
 void btrfs_put_block_group_cache(struct btrfs_fs_info *info);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 27f7dd4e3d52..96ce087747b2 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -467,11 +467,14 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans,
  * existing and update must have the same bytenr
  */
 static noinline void
-update_existing_head_ref(struct btrfs_delayed_ref_root *delayed_refs,
+update_existing_head_ref(struct btrfs_trans_handle *trans,
 struct btrfs_delayed_ref_head *existing,
 struct btrfs_delayed_ref_head *update,
 int *old_ref_mod_ret)
 {
+   struct btrfs_delayed_ref_root *delayed_refs =
+   >transaction->delayed_refs;
+   struct 

[PATCH 01/35] btrfs: add btrfs_delete_ref_head helper

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

We do this dance in cleanup_ref_head and check_ref_cleanup, unify it
into a helper and cleanup the calling functions.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 14 ++
 fs/btrfs/delayed-ref.h |  3 ++-
 fs/btrfs/extent-tree.c | 24 
 3 files changed, 20 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 62ff545ba1f7..3a9e4ac21794 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -393,6 +393,20 @@ btrfs_select_ref_head(struct btrfs_trans_handle *trans)
return head;
 }
 
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head)
+{
+   lockdep_assert_held(_refs->lock);
+   lockdep_assert_held(>lock);
+
+   rb_erase(>href_node, _refs->href_root);
+   RB_CLEAR_NODE(>href_node);
+   atomic_dec(_refs->num_entries);
+   delayed_refs->num_heads--;
+   if (head->processing == 0)
+   delayed_refs->num_heads_ready--;
+}
+
 /*
  * Helper to insert the ref_node to the tail or merge with tail.
  *
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index d9f2a4ebd5db..7769177b489e 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -261,7 +261,8 @@ static inline void btrfs_delayed_ref_unlock(struct 
btrfs_delayed_ref_head *head)
 {
mutex_unlock(>mutex);
 }
-
+void btrfs_delete_ref_head(struct btrfs_delayed_ref_root *delayed_refs,
+  struct btrfs_delayed_ref_head *head);
 
 struct btrfs_delayed_ref_head *
 btrfs_select_ref_head(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f77226d8020a..6799950fa057 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2492,12 +2492,9 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
spin_unlock(_refs->lock);
return 1;
}
-   delayed_refs->num_heads--;
-   rb_erase(>href_node, _refs->href_root);
-   RB_CLEAR_NODE(>href_node);
-   spin_unlock(>lock);
+   btrfs_delete_ref_head(delayed_refs, head);
spin_unlock(_refs->lock);
-   atomic_dec(_refs->num_entries);
+   spin_unlock(>lock);
 
trace_run_delayed_ref_head(fs_info, head, 0);
 
@@ -6984,22 +6981,9 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!mutex_trylock(>mutex))
goto out;
 
-   /*
-* at this point we have a head with no other entries.  Go
-* ahead and process it.
-*/
-   rb_erase(>href_node, _refs->href_root);
-   RB_CLEAR_NODE(>href_node);
-   atomic_dec(_refs->num_entries);
-
-   /*
-* we don't take a ref on the node because we're removing it from the
-* tree, so we just steal the ref the tree was holding.
-*/
-   delayed_refs->num_heads--;
-   if (head->processing == 0)
-   delayed_refs->num_heads_ready--;
+   btrfs_delete_ref_head(delayed_refs, head);
head->processing = 0;
+
spin_unlock(>lock);
spin_unlock(_refs->lock);
 
-- 
2.14.3



[PATCH 00/35] My current patch queue

2018-08-30 Thread Josef Bacik
This is the current queue of things that I've been working on.  The main thing
these patches are doing is separating out the delayed refs reservations from the
global reserve into their own block rsv.  We have been consistently hitting
issues in production where we abort a transaction because we run out of the
global reserve either while running delayed refs or while updating dirty block
groups.  This is because the math around global reserves is made up bullshit
magic that has been tweaked more and more throughout the years.  The result is
something that is inconsistent across the board and sometimes wrong.  So instead
we need a way to know exactly how much space we need to keep around in order to
satisfy our outstanding delayed refs and our dirty block groups.

Since we don't know how many delayed refs we need at the start of any
modification we simply use the nr_items passed into btrfs_start_transaction() as
a guess for what we may need.  This has the side effect of putting more pressure
on the ENOSPC system, but it's pressure we can deal with more intelligently
because we always know how much space we have outstanding, instead of guessing
with weird global reserve math.

This works similar to every other reservation we have, we reserve the worst case
up front, and then at transaction end time we free up any space we didn't
actually use for delayed refs.

My performance tests show that we are bit faster now since we can do more
intelligent flushing and don't have to fall back on simply committing the
transaction in hopes that we have enough space for everything we need to do.

That leads me to the 2nd part of this pull, there's a bunch of fixes around
ENOSPC.  Because we are a bit faster now there were a bunch of things uncovered
in testing, but they seem to be all resolved now.

The final chunk of fixes are around transaction aborts.  There were a lot of
accounting bugs I was running into while running generic/435, so I fixed a bunch
of those up so now it runs cleanly.

I have been running these patches through xfstests on multiple machines for a
while, they are pretty solid and ready for wider testing and review.  Thanks,

Josef


[PATCH 06/35] btrfs: check if free bgs for commit

2018-08-30 Thread Josef Bacik
may_commit_transaction will skip committing the transaction if we don't
have enough pinned space or if we're trying to find space for a SYSTEM
chunk.  However if we have pending free block groups in this transaction
we still want to commit as we may be able to allocate a chunk to make
our reservation.  So instead of just returning ENOSPC, check if we have
free block groups pending, and if so commit the transaction to allow us
to use that free space.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 6e7f350754d2..80615a579b18 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4804,6 +4804,7 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
struct btrfs_trans_handle *trans;
u64 bytes;
u64 reclaim_bytes = 0;
+   bool do_commit = true;
 
trans = (struct btrfs_trans_handle *)current->journal_info;
if (trans)
@@ -4832,8 +4833,10 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 * See if there is some space in the delayed insertion reservation for
 * this reservation.
 */
-   if (space_info != delayed_rsv->space_info)
-   return -ENOSPC;
+   if (space_info != delayed_rsv->space_info) {
+   do_commit = false;
+   goto commit;
+   }
 
spin_lock(_rsv->lock);
reclaim_bytes += delayed_rsv->reserved;
@@ -4848,15 +4851,18 @@ static int may_commit_transaction(struct btrfs_fs_info 
*fs_info,
 
if (__percpu_counter_compare(_info->total_bytes_pinned,
   bytes,
-  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0) {
-   return -ENOSPC;
-   }
-
+  BTRFS_TOTAL_BYTES_PINNED_BATCH) < 0)
+   do_commit = false;
 commit:
trans = btrfs_join_transaction(fs_info->extent_root);
if (IS_ERR(trans))
return -ENOSPC;
 
+   if (!do_commit &&
+   !test_bit(BTRFS_TRANS_HAVE_FREE_BGS, >transaction->flags)) {
+   btrfs_end_transaction(trans);
+   return -ENOSPC;
+   }
return btrfs_commit_transaction(trans);
 }
 
-- 
2.14.3



[PATCH 04/35] btrfs: only track ref_heads in delayed_ref_updates

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

We use this number to figure out how many delayed refs to run, but
__btrfs_run_delayed_refs really only checks every time we need a new
delayed ref head, so we always run at least one ref head completely no
matter what the number of items on it.  So instead track only the ref
heads added by this trans handle and adjust the counting appropriately
in __btrfs_run_delayed_refs.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/delayed-ref.c | 3 ---
 fs/btrfs/extent-tree.c | 5 +
 2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 3a9e4ac21794..27f7dd4e3d52 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -234,8 +234,6 @@ static inline void drop_delayed_ref(struct 
btrfs_trans_handle *trans,
ref->in_tree = 0;
btrfs_put_delayed_ref(ref);
atomic_dec(_refs->num_entries);
-   if (trans->delayed_ref_updates)
-   trans->delayed_ref_updates--;
 }
 
 static bool merge_ref(struct btrfs_trans_handle *trans,
@@ -460,7 +458,6 @@ static int insert_delayed_ref(struct btrfs_trans_handle 
*trans,
if (ref->action == BTRFS_ADD_DELAYED_REF)
list_add_tail(>add_list, >ref_add_list);
atomic_inc(>num_entries);
-   trans->delayed_ref_updates++;
spin_unlock(>lock);
return ret;
 }
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 87c42a2c45b1..20531389a20a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2583,6 +2583,7 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
spin_unlock(_refs->lock);
break;
}
+   count++;
 
/* grab the lock that says we are going to process
 * all the refs for this head */
@@ -2596,7 +2597,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
 */
if (ret == -EAGAIN) {
locked_ref = NULL;
-   count++;
continue;
}
}
@@ -2624,7 +2624,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
unselect_delayed_ref_head(delayed_refs, locked_ref);
locked_ref = NULL;
cond_resched();
-   count++;
continue;
}
 
@@ -2642,7 +2641,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
return ret;
}
locked_ref = NULL;
-   count++;
continue;
}
 
@@ -2693,7 +2691,6 @@ static noinline int __btrfs_run_delayed_refs(struct 
btrfs_trans_handle *trans,
}
 
btrfs_put_delayed_ref(ref);
-   count++;
cond_resched();
}
 
-- 
2.14.3



[PATCH 03/35] btrfs: use cleanup_extent_op in check_ref_cleanup

2018-08-30 Thread Josef Bacik
From: Josef Bacik 

Unify the extent_op handling as well, just add a flag so we don't
actually run the extent op from check_ref_cleanup and instead return a
value so that we can skip cleaning up the ref head.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4c9fd35bca07..87c42a2c45b1 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2443,18 +2443,23 @@ static void unselect_delayed_ref_head(struct 
btrfs_delayed_ref_root *delayed_ref
 }
 
 static int cleanup_extent_op(struct btrfs_trans_handle *trans,
-struct btrfs_delayed_ref_head *head)
+struct btrfs_delayed_ref_head *head,
+bool run_extent_op)
 {
struct btrfs_delayed_extent_op *extent_op = head->extent_op;
int ret;
 
if (!extent_op)
return 0;
+
head->extent_op = NULL;
if (head->must_insert_reserved) {
btrfs_free_delayed_extent_op(extent_op);
return 0;
+   } else if (!run_extent_op) {
+   return 1;
}
+
spin_unlock(>lock);
ret = run_delayed_extent_op(trans, head, extent_op);
btrfs_free_delayed_extent_op(extent_op);
@@ -2506,7 +2511,7 @@ static int cleanup_ref_head(struct btrfs_trans_handle 
*trans,
 
delayed_refs = >transaction->delayed_refs;
 
-   ret = cleanup_extent_op(trans, head);
+   ret = cleanup_extent_op(trans, head, true);
if (ret < 0) {
unselect_delayed_ref_head(delayed_refs, head);
btrfs_debug(fs_info, "run_delayed_extent_op returned %d", ret);
@@ -6977,12 +6982,8 @@ static noinline int check_ref_cleanup(struct 
btrfs_trans_handle *trans,
if (!RB_EMPTY_ROOT(>ref_tree))
goto out;
 
-   if (head->extent_op) {
-   if (!head->must_insert_reserved)
-   goto out;
-   btrfs_free_delayed_extent_op(head->extent_op);
-   head->extent_op = NULL;
-   }
+   if (cleanup_extent_op(trans, head, false))
+   goto out;
 
/*
 * waiting for the lock here would deadlock.  If someone else has it
-- 
2.14.3



[PATCH 07/35] btrfs: dump block_rsv whe dumping space info

2018-08-30 Thread Josef Bacik
For enospc_debug having the block rsvs is super helpful to see if we've
done something wrong.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/extent-tree.c | 15 +++
 1 file changed, 15 insertions(+)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 80615a579b18..df826f713034 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7910,6 +7910,15 @@ static noinline int find_free_extent(struct 
btrfs_fs_info *fs_info,
return ret;
 }
 
+static void dump_block_rsv(struct btrfs_block_rsv *rsv)
+{
+   spin_lock(>lock);
+   printk(KERN_ERR "%d: size %llu reserved %llu\n",
+  rsv->type, (unsigned long long)rsv->size,
+  (unsigned long long)rsv->reserved);
+   spin_unlock(>lock);
+}
+
 static void dump_space_info(struct btrfs_fs_info *fs_info,
struct btrfs_space_info *info, u64 bytes,
int dump_block_groups)
@@ -7929,6 +7938,12 @@ static void dump_space_info(struct btrfs_fs_info 
*fs_info,
info->bytes_readonly);
spin_unlock(>lock);
 
+   dump_block_rsv(_info->global_block_rsv);
+   dump_block_rsv(_info->trans_block_rsv);
+   dump_block_rsv(_info->chunk_block_rsv);
+   dump_block_rsv(_info->delayed_block_rsv);
+   dump_block_rsv(_info->delayed_refs_rsv);
+
if (!dump_block_groups)
return;
 
-- 
2.14.3



Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Chris Murphy
On Thu, Aug 30, 2018 at 9:21 AM, Alberto Bursi  wrote:
>
> On 8/30/2018 11:13 AM, Pierre Couderc wrote:
>> Trying to install a RAID1 on a debian stretch, I made some mistake and
>> got this, after installing on disk1 and trying to add second disk :
>>
>>
>> root@server:~# fdisk -l
>> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 512 bytes
>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>> Disklabel type: dos
>> Disk identifier: 0x2a799300
>>
>> Device Boot StartEndSectors  Size Id Type
>> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
>>
>>
>> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
>> Units: sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 512 bytes
>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>> Disklabel type: dos
>> Disk identifier: 0x9770f6fa
>>
>> Device Boot StartEndSectors  Size Id Type
>> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended
>>
>>
>> And :
>>
>> root@server:~# btrfs fi show
>> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>> Total devices 2 FS bytes used 1.10GiB
>> devid1 size 1.82TiB used 4.02GiB path /dev/sda1
>> devid2 size 1.00KiB used 0.00B path /dev/sdb1
>>
>> ...
>>
>> My purpose is a simple RAID1 main fs, with bootable flag on the 2
>> disks in prder to start in degraded mode
>> How to get out ofr that...?
>>
>> Thnaks
>> PC
>
>
> sdb1 is an extended partition, you cannot format an extended partition.
>
> change sdb1 into primary partition or add a logical partition into it.

Ahh you're correct. There is special treatment of 0x05, it's a logical
container with the start address actually pointing to the address
where the EBR is. And that EBR's first record contains the actual real
extended partition information.

So this represents two bugs in the installer:
1. If there's only one partition on a drive, it should be primary by
default, not extended.
2. But if extended, it must point to an EBR, and the EBR must be
created at that location. Obviously since there is no /dev/sdb2, this
EBR is not present.




-- 
Chris Murphy


Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-08-30 Thread Austin S. Hemmelgarn

On 2018-08-30 13:13, Axel Burri wrote:

On 29/08/2018 21.02, Austin S. Hemmelgarn wrote:

On 2018-08-29 13:24, Axel Burri wrote:

This patch allows to build distinct binaries for specific btrfs
subcommands, e.g. "btrfs-subvolume-show" which would be identical to
"btrfs subvolume show".


Motivation:

While btrfs-progs offer the all-inclusive "btrfs" command, it gets
pretty cumbersome to restrict privileges to the subcommands [1].
Common approaches are to either setuid root for "/sbin/btrfs" (which
is not recommended at all), or to write sudo rules for each
subcommand.

Separating the subcommands into distinct binaries makes it easy to set
elevated privileges using capabilities(7) or setuid. A typical use
case where this is needed is when it comes to automated scripts,
e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh.

Let me start by saying I think this is a great idea to have as an
option, and that the motivation is a particularly good one.

I've posted my opinions on your two open questions below, but there's
two other comments I'd like to make:

* Is there some particular reason that this only includes the commands
it does, and _hard codes_ which ones it works with?  if we just do
everything instead of only the stuff we think needs certain
capabilities, then we can auto-generate the list of commands to be
processed based on function names in the C files, and it will
automatically pick up any newly added commands.  At the very least, it
could still parse through the C files and look for tags in the comments
for the functions to indicate which ones need to be processed this way.
Either case will make it significantly easier to add new commands, and
would also better justify the overhead of shipping all the files
pre-generated (because there would be much more involved in
pre-generating them).


It includes the commands that are required by btrbk. It was quite
painful to figure out the required capabilities (reading kernel code and
some trial and error involved), and I did not get around to include
other commands yet.
Yeah, I can imagine that it was not an easy task.  I've actually been 
thinking of writing a script to scan the kernel sources and assemble a 
summary of the permissions checks performed by each system call and 
ioctl so that stuff like this is a bit easier, but that's unfortunately 
way beyond my abilities right now (parsing C and building call graphs is 
not easy no matter what language you're doing it with).


I like your idea of adding some tags in the C files, I'll try to
implement this, and we'll see what it gets to.
Something embedded in the comments is likely to be the easiest option in 
terms of making sure it doesn't break the regular build.  Just the 
tagging in general would be useful as documentation though.


It would be kind of neat to have the list of capabilities needed for 
each one auto-generated from what it calls, but that's getting into some 
particularly complex territory that would likely require call graphs to 
properly implement.



* While not essential, it would be really neat to have the `btrfs`
command detect if an associated binary exists for whatever command was
just invoked, and automatically exec that (possibly with some
verification) instead of calling the command directly so that desired
permissions are enforced.  This would mitigate the need for users to
remember different command names depending on execution context.


Hmm this sounds a bit too magic for me, and would probably be more
confusing than useful. It would mean than running "btrfs" as user would
work when splitted commands are available, and would not work if not.
It would also mean scripts would not have to add special handling for 
the case of running as a non-root user and seeing if the split commands 
actually exist or not (and, for that matter, would not have to directly 
depend on having the split commands at all), and that users would not 
need to worry about how to call BTRFS based on who they were running as. 
 Realistically, I'd expect the same error to show if the binary isn't 
available as if it's not executable, so that it just becomes a case of 
'if you see this error, re-run the same thing as root and it should work'.





Description:

Patch 1 adds a template as well as a generator shell script for the
splitted subcommands.

Patch 2 adds the generated subcommand source files.

Patch 3-5 adds a "install-splitcmd-setcap" make target, with different
approaches (either hardcoded in Makefile, or more generically by
including "Makefile.install_setcap" generated by "splitcmd-gen.sh").


Open Questions:

1. "make install-splitcmd-setcap" installs the binaries with hardcoded
group "btrfs". This needs to be configurable (how?). Another approach
would be to not set the group at all, and leave this to the user or
distro packaging script.

Leave it to the user or distro.  It's likely to end up standardized on
the name 'btrfs', but it should be agnostic of that.


2. 

Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-08-30 Thread Axel Burri
On 29/08/2018 21.02, Austin S. Hemmelgarn wrote:
> On 2018-08-29 13:24, Axel Burri wrote:
>> This patch allows to build distinct binaries for specific btrfs
>> subcommands, e.g. "btrfs-subvolume-show" which would be identical to
>> "btrfs subvolume show".
>>
>>
>> Motivation:
>>
>> While btrfs-progs offer the all-inclusive "btrfs" command, it gets
>> pretty cumbersome to restrict privileges to the subcommands [1].
>> Common approaches are to either setuid root for "/sbin/btrfs" (which
>> is not recommended at all), or to write sudo rules for each
>> subcommand.
>>
>> Separating the subcommands into distinct binaries makes it easy to set
>> elevated privileges using capabilities(7) or setuid. A typical use
>> case where this is needed is when it comes to automated scripts,
>> e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh.
> Let me start by saying I think this is a great idea to have as an
> option, and that the motivation is a particularly good one.
> 
> I've posted my opinions on your two open questions below, but there's
> two other comments I'd like to make:
> 
> * Is there some particular reason that this only includes the commands
> it does, and _hard codes_ which ones it works with?  if we just do
> everything instead of only the stuff we think needs certain
> capabilities, then we can auto-generate the list of commands to be
> processed based on function names in the C files, and it will
> automatically pick up any newly added commands.  At the very least, it
> could still parse through the C files and look for tags in the comments
> for the functions to indicate which ones need to be processed this way.
> Either case will make it significantly easier to add new commands, and
> would also better justify the overhead of shipping all the files
> pre-generated (because there would be much more involved in
> pre-generating them).

It includes the commands that are required by btrbk. It was quite
painful to figure out the required capabilities (reading kernel code and
some trial and error involved), and I did not get around to include
other commands yet.

I like your idea of adding some tags in the C files, I'll try to
implement this, and we'll see what it gets to.

> * While not essential, it would be really neat to have the `btrfs`
> command detect if an associated binary exists for whatever command was
> just invoked, and automatically exec that (possibly with some
> verification) instead of calling the command directly so that desired
> permissions are enforced.  This would mitigate the need for users to
> remember different command names depending on execution context.

Hmm this sounds a bit too magic for me, and would probably be more
confusing than useful. It would mean than running "btrfs" as user would
work when splitted commands are available, and would not work if not.

>>
>>
>> Description:
>>
>> Patch 1 adds a template as well as a generator shell script for the
>> splitted subcommands.
>>
>> Patch 2 adds the generated subcommand source files.
>>
>> Patch 3-5 adds a "install-splitcmd-setcap" make target, with different
>> approaches (either hardcoded in Makefile, or more generically by
>> including "Makefile.install_setcap" generated by "splitcmd-gen.sh").
>>
>>
>> Open Questions:
>>
>> 1. "make install-splitcmd-setcap" installs the binaries with hardcoded
>> group "btrfs". This needs to be configurable (how?). Another approach
>> would be to not set the group at all, and leave this to the user or
>> distro packaging script.
> Leave it to the user or distro.  It's likely to end up standardized on
> the name 'btrfs', but it should be agnostic of that.
>>
>> 2. Instead of the "install-splitcmd-setcap" make target, we could
>> introduce a "configure --enable-splitted-subcommands" option, which
>> would simply add all splitcmd binaries to the "all" and "install"
>> targets without special treatment, and leave the setcap stuff to the
>> user or distro packaging script (at least in gentoo, this needs to be
>> specified using the "fcaps" eclass anyways [5]).
> A bit of a nitpick, but 'split' is the proper past tense of the word
> 'split', it's one of those exceptions that English has all over the
> place.  Even aside from that though, I think `separate` sounds more
> natural for the configure option, or better yet, just make it
> `--enable-fscaps` like most other packages do.
> 
> That aside, I think having a configure option is the best way to do
> this, it makes it very easy for distro build systems to handle it
> because this is what they're used to doing anyway.  It also makes it a
> bit easier on the user, because it just becomes `make` to build
> whichever version you want installed.


Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Chris Murphy
On Thu, Aug 30, 2018 at 3:13 AM, Pierre Couderc  wrote:
> Trying to install a RAID1 on a debian stretch, I made some mistake and got
> this, after installing on disk1 and trying to add second disk  :
>
>
> root@server:~# fdisk -l
> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x2a799300
>
> Device Boot StartEndSectors  Size Id Type
> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
>
>
> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x9770f6fa
>
> Device Boot StartEndSectors  Size Id Type
> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended


Extended partition type is not a problem if you're using GRUB as the
bootloader; other bootloaders may not like this. Strictly speaking the
type code 0x05 is incorrect, GRUB ignores type code, as does the
kernel. GRUB also ignores the active bit (boot flag).


>
>
> And :
>
> root@server:~# btrfs fi show
> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
> Total devices 2 FS bytes used 1.10GiB
> devid1 size 1.82TiB used 4.02GiB path /dev/sda1
> devid2 size 1.00KiB used 0.00B path /dev/sdb1

That's odd; and I know you've moved on from this problem but I would
have liked to see the super for /dev/sdb1 and also the installer log
for what commands were used for partitioning, including mkfs and
device add commands.

For what it's worth, 'btrfs dev add' formats the device being added,
it does not need to be formatted in advance, and also it resizes the
file system properly.



> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks in
> prder to start in degraded mode

Good luck with this. The Btrfs archives are full of various
limitations of Btrfs raid1. There is no automatic degraded mount for
Btrfs. And if you persistently ask for degraded mount, you run the
risk of other problems if there's merely a delayed discovery of one of
the devices. Once a Btrfs volume is degraded, it does not
automatically resume normal operation just because the formerly
missing device becomes available.

So... this is flat out not suitable for use cases where you need
unattended raid1 degraded boot.



-- 
Chris Murphy


fsck lowmem mode only: ERROR: errors found in fs roots

2018-08-30 Thread Christoph Anton Mitterer
Hey.

I've the following on a btrfs that's basically the system fs for my
notebook:


When booting from a USB stick with:
# uname -a
Linux heisenberg 4.17.0-3-amd64 #1 SMP Debian 4.17.17-1
(2018-08-18) x86_64 GNU/Linux

# btrfs --version
btrfs-progs v4.17

... a lowmem mode fsck gives no error:

# btrfs check --mode=lowmem /dev/mapper/system ; echo $?
Checking filesystem on /dev/mapper/system
UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c
checking extents
checking free space cache
checking fs roots
ERROR: errors found in fs roots
found 495910952960 bytes used, error(s) found
total csum bytes: 481840472
total tree bytes: 2388819968
total fs tree bytes: 1651097600
total extent tree bytes: 161841152
btree space waste bytes: 446707102
file data blocks allocated: 6651878428672
 referenced 542320984064
1

... while a normal mode fsck doesn't give one:

# btrfs check /dev/mapper/system ; echo $?
Checking filesystem on /dev/mapper/system
UUID: 6050ca10-e778-4d08-80e7-6d27b9c89b3c
checking extents
checking free space cache
checking fs roots
checking only csum items (without verifying data)
checking root refs
found 495910952960 bytes used, no error found
total csum bytes: 481840472
total tree bytes: 2388819968
total fs tree bytes: 1651097600
total extent tree bytes: 161841152
btree space waste bytes: 446707102
file data blocks allocated: 6651878428672
 referenced 542320984064
0

There were no unusual kernel log messages.


Back in the normal system (no longer USB)... I spottet this:
Aug 30 18:31:29 heisenberg kernel: BTRFS info (device dm-0): the free
space cache file (22570598400) is invalid, skip it

but not sure whether it's related (probably not)... and I haven't seen
such a free space cache file issue (or any other btrfs errors) in a
long while (I usually watch my kernel log once after booting has
finished).


Any ideas? Perhaps it's just yet another lowmem false positive...
anything I can help in debugging this?


Apart from this I haven't noticed any corruptions recently... just
about to make a full backup of the fs (or better said a rw snapshot of
the most of the data) with tar, so most data will be read soon at least
once... an I would probably notice any further errors that are
otherwise silent.


Cheers,
Chris.



Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Alberto Bursi

On 8/30/2018 11:13 AM, Pierre Couderc wrote:
> Trying to install a RAID1 on a debian stretch, I made some mistake and 
> got this, after installing on disk1 and trying to add second disk :
>
>
> root@server:~# fdisk -l
> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x2a799300
>
> Device Boot Start    End    Sectors  Size Id Type
> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
>
>
> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x9770f6fa
>
> Device Boot Start    End    Sectors  Size Id Type
> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended
>
>
> And :
>
> root@server:~# btrfs fi show
> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>     Total devices 2 FS bytes used 1.10GiB
>     devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
>     devid    2 size 1.00KiB used 0.00B path /dev/sdb1
>
> ...
>
> My purpose is a simple RAID1 main fs, with bootable flag on the 2 
> disks in prder to start in degraded mode
> How to get out ofr that...?
>
> Thnaks
> PC


sdb1 is an extended partition, you cannot format an extended partition.

change sdb1 into primary partition or add a logical partition into it.


-Alberto



Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Kai Stian Olstad
On Thursday, 30 August 2018 12:01:55 CEST Pierre Couderc wrote:
> 
> On 08/30/2018 11:35 AM, Qu Wenruo wrote:
> >
> > On 2018/8/30 下午5:13, Pierre Couderc wrote:
> >> Trying to install a RAID1 on a debian stretch, I made some mistake and
> >> got this, after installing on disk1 and trying to add second disk  :
> >>
> >>
> >> root@server:~# fdisk -l
> >> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> >> Units: sectors of 1 * 512 = 512 bytes
> >> Sector size (logical/physical): 512 bytes / 512 bytes
> >> I/O size (minimum/optimal): 512 bytes / 512 bytes
> >> Disklabel type: dos
> >> Disk identifier: 0x2a799300
> >>
> >> Device Boot StartEndSectors  Size Id Type
> >> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
> >>
> >>
> >> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> >> Units: sectors of 1 * 512 = 512 bytes
> >> Sector size (logical/physical): 512 bytes / 512 bytes
> >> I/O size (minimum/optimal): 512 bytes / 512 bytes
> >> Disklabel type: dos
> >> Disk identifier: 0x9770f6fa
> >>
> >> Device Boot StartEndSectors  Size Id Type
> >> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended
> >>
> >>
> >> And :
> >>
> >> root@server:~# btrfs fi show
> >> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
> >>  Total devices 2 FS bytes used 1.10GiB
> >>  devid1 size 1.82TiB used 4.02GiB path /dev/sda1
> >>  devid2 size 1.00KiB used 0.00B path /dev/sdb1

I think your problem is that sdb1 is a extended partition and not a primary one 
or you'll need to make a logical partition inside the extended partition and 
use that.


-- 
Kai Stian Olstad




Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Qu Wenruo


On 2018/8/30 下午6:01, Pierre Couderc wrote:
> 
> 
> On 08/30/2018 11:35 AM, Qu Wenruo wrote:
>>
>> On 2018/8/30 下午5:13, Pierre Couderc wrote:
>>> Trying to install a RAID1 on a debian stretch, I made some mistake and
>>> got this, after installing on disk1 and trying to add second disk  :
>>>
>>>
>>> root@server:~# fdisk -l
>>> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
>>> Units: sectors of 1 * 512 =12 bytes
>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>> Disklabel type: dos
>>> Disk identifier: 0x2a799300
>>>
>>> Device Boot Start    End    Sectors  Size Id Type
>>> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
>>>
>>>
>>> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
>>> Units: sectors of 1 * 512 =12 bytes
>>> Sector size (logical/physical): 512 bytes / 512 bytes
>>> I/O size (minimum/optimal): 512 bytes / 512 bytes
>>> Disklabel type: dos
>>> Disk identifier: 0x9770f6fa
>>>
>>> Device Boot Start    End    Sectors  Size Id Type
>>> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended
>>>
>>>
>>> And :
>>>
>>> root@server:~# btrfs fi show
>>> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>>>  Total devices 2 FS bytes used 1.10GiB
>>>  devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
>>>  devid    2 size 1.00KiB used 0.00B path /dev/sdb1
>>>
>>> ...
>>>
>>> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks
>>> in prder to start in degraded mode
>>> How to get out ofr that...?
>> The 2nd device is indeed strange.
>>
>> Considering how old packages Debian tends to deliver, it should be some
>> old btrfs-progs.
>>
>> You could just boot into the system and execute the following commands:
>>
>> # btrfs device remove 2 
> This works fine.
>>
>> Then add a new real device to the fs
>>
>> # btrfs device add  
>>
>>
> Thnk you, this :
> 
> btrfs device add /dev/sdb1 /
> 
>  seems to work but  gives me the same :
> 
> root@server:~# btrfs fi show
> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>     Total devices 2 FS bytes used 1.10GiB
>     devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
>     devid    2 size 1.00KiB used 0.00B path /dev/sdb1
> 
> So I need to "prepare" or format /dev/sdb before adding /dev/sdb1
> I have tried to format /dev/sdb and it works :

What's the output of lsblk command?

If it's 2T (1.8T) show in lsblk, then in above case you could fix it by
resize that device by:

# btrfs resize 2:max 

Thanks,
Qu


> root@server:~# btrfs fi show
> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>     Total devices 2 FS bytes used 1.10GiB
>     devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
>     devid    2 size 1.82TiB used 0.00B path /dev/sdb
> 
> but I have no partition table and no booot flag for degraded mode
> 
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Pierre Couderc




On 08/30/2018 11:35 AM, Qu Wenruo wrote:


On 2018/8/30 下午5:13, Pierre Couderc wrote:

Trying to install a RAID1 on a debian stretch, I made some mistake and
got this, after installing on disk1 and trying to add second disk  :


root@server:~# fdisk -l
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2a799300

Device Boot Start    End    Sectors  Size Id Type
/dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux


Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x9770f6fa

Device Boot Start    End    Sectors  Size Id Type
/dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended


And :

root@server:~# btrfs fi show
Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
     Total devices 2 FS bytes used 1.10GiB
     devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
     devid    2 size 1.00KiB used 0.00B path /dev/sdb1

...

My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks
in prder to start in degraded mode
How to get out ofr that...?

The 2nd device is indeed strange.

Considering how old packages Debian tends to deliver, it should be some
old btrfs-progs.

You could just boot into the system and execute the following commands:

# btrfs device remove 2 

This works fine.


Then add a new real device to the fs

# btrfs device add  



Thnk you, this :

btrfs device add /dev/sdb1 /

 seems to work but  gives me the same :

root@server:~# btrfs fi show
Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
    Total devices 2 FS bytes used 1.10GiB
    devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
    devid    2 size 1.00KiB used 0.00B path /dev/sdb1

So I need to "prepare" or format /dev/sdb before adding /dev/sdb1
I have tried to format /dev/sdb and it works :
root@server:~# btrfs fi show
Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
Total devices 2 FS bytes used 1.10GiB
devid1 size 1.82TiB used 4.02GiB path /dev/sda1
devid2 size 1.82TiB used 0.00B path /dev/sdb

but I have no partition table and no booot flag for degraded mode





Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Qu Wenruo


On 2018/8/30 下午5:13, Pierre Couderc wrote:
> Trying to install a RAID1 on a debian stretch, I made some mistake and
> got this, after installing on disk1 and trying to add second disk  :
> 
> 
> root@server:~# fdisk -l
> Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x2a799300
> 
> Device Boot Start    End    Sectors  Size Id Type
> /dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux
> 
> 
> Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
> Units: sectors of 1 * 512 = 512 bytes
> Sector size (logical/physical): 512 bytes / 512 bytes
> I/O size (minimum/optimal): 512 bytes / 512 bytes
> Disklabel type: dos
> Disk identifier: 0x9770f6fa
> 
> Device Boot Start    End    Sectors  Size Id Type
> /dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended
> 
> 
> And :
> 
> root@server:~# btrfs fi show
> Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
>     Total devices 2 FS bytes used 1.10GiB
>     devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
>     devid    2 size 1.00KiB used 0.00B path /dev/sdb1
> 
> ...
> 
> My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks
> in prder to start in degraded mode
> How to get out ofr that...?

The 2nd device is indeed strange.

Considering how old packages Debian tends to deliver, it should be some
old btrfs-progs.

You could just boot into the system and execute the following commands:

# btrfs device remove 2 

Then add a new real device to the fs

# btrfs device add  

Then convert the fs to RAID1

# btrfs balance start -dconvert=RAID1 -mconvert=RAID1 

Thanks,
Qu

> 
> Thnaks
> PC



signature.asc
Description: OpenPGP digital signature


Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Pierre Couderc
Trying to install a RAID1 on a debian stretch, I made some mistake and 
got this, after installing on disk1 and trying to add second disk  :



root@server:~# fdisk -l
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2a799300

Device Boot Start    End    Sectors  Size Id Type
/dev/sda1  * 2048 3907028991 3907026944  1.8T 83 Linux


Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x9770f6fa

Device Boot Start    End    Sectors  Size Id Type
/dev/sdb1  * 2048 3907029167 3907027120  1.8T  5 Extended


And :

root@server:~# btrfs fi show
Label: none  uuid: eed65d24-6501-4991-94bd-6c3baf2af1ed
    Total devices 2 FS bytes used 1.10GiB
    devid    1 size 1.82TiB used 4.02GiB path /dev/sda1
    devid    2 size 1.00KiB used 0.00B path /dev/sdb1

...

My purpose is a simple RAID1 main fs, with bootable flag on the 2 disks 
in prder to start in degraded mode

How to get out ofr that...?

Thnaks
PC


[PATCH] btrfs-progs: dump-tree: print invalid argument and strerror

2018-08-30 Thread Su Yue
Before this patch:
$ ls nothingness
ls: cannot access 'nothingness': No such file or directory
$ btrfs inspect-internal dump-tree nothingness
ERROR: not a block device or regular file: nothingness

The confusing error message makes users thinks that nonexistent
file is existed but in wrong type.

This patch let check_arg_type return -errno if realpath failed.
And print strerror if check_arg_type failed and the returned code
is negative. Like:

$ btrfs inspect-internal dump-tree nothingness
ERROR: invalid argument: nothingness: No such file or directory

Signed-off-by: Su Yue 
---
 cmds-inspect-dump-tree.c | 7 ++-
 utils.c  | 2 ++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index c8acd55a0c3a..f70a1c47d145 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -313,7 +313,12 @@ int cmd_inspect_dump_tree(int argc, char **argv)
 
ret = check_arg_type(argv[optind]);
if (ret != BTRFS_ARG_BLKDEV && ret != BTRFS_ARG_REG) {
-   error("not a block device or regular file: %s", argv[optind]);
+   if (ret < 0)
+   error("invalid argument: %s: %s", argv[optind],
+ strerror(-ret));
+   else
+   error("not a block device or regular file: %s",
+ argv[optind]);
goto out;
}
 
diff --git a/utils.c b/utils.c
index 1e275c668cd5..80eca77b1b20 100644
--- a/utils.c
+++ b/utils.c
@@ -502,6 +502,8 @@ int check_arg_type(const char *input)
return BTRFS_ARG_REG;
 
return BTRFS_ARG_UNKNOWN;
+   } else {
+   return -errno;
}
 
if (strlen(input) == (BTRFS_UUID_UNPARSED_SIZE - 1) &&
-- 
2.17.1





Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-08-30 Thread Dave Chinner
On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> On Mon, Aug 20, 2018 at 08:33:49AM -0700, Darrick J. Wong wrote:
> > On Mon, Aug 20, 2018 at 11:09:32AM +1000, Dave Chinner wrote:
> > >   - is documenting rejection on request alignment grounds
> > > (i.e. EINVAL) in the man page sufficient for app
> > > developers to understand what is going on here?
> > 
> > I think so.  The manpage says: "The filesystem does not support
> > reflinking the ranges of the given files", which (to my mind) covers
> > this case of not supporting dedupe of EOF blocks.
> 
> Older versions of btrfs dedupe (before v4.2 or so) used to do exactly
> this; however, on btrfs, not supporting dedupe of EOF blocks means small
> files (one extent) cannot be deduped at all, because the EOF block holds
> a reference to the entire dst extent.  If a dedupe app doesn't go all the
> way to EOF on btrfs, then it should not attempt to dedupe any part of the
> last extent of the file as the benefit would be zero or slightly negative.

That's a filesystem implementation issue, not an API or application
issue.

> The app developer would need to be aware that such a restriction could
> exist on some filesystems, and be able to distinguish this from other
> cases that could lead to EINVAL.  Portable code would have to try a dedupe
> up to EOF, then if that failed, round down and retry, and if that failed
> too, the app would have to figure out which filesystem it's running on
> to know what to do next.  Performance demands the app know what the FS
> will do in advance, and avoid a whole class of behavior.

Nobody writes "portable" applications like that. They read the man
page first, and work out what the common subset of functionality is
and then code from that. Man page says:

"Disk filesystems generally require the offset and length arguments
to be aligned to the fundamental block size."

IOWs, code compatible with starts with supporting the general case.
i.e. a range rounded to filesystem block boundaries (it's already
run fstat() on the files it wants to dedupe to find their size,
yes?), hence ignoring the partial EOF block. Will just work on
everything.

Code that then wants to optimise for btrfs/xfs/ocfs quirks runs
fstatvfs to determine what fs it's operating on and applies the
necessary quirks. For btrfs it can extend the range to include the
partial EOF block, and hence will handle the implementation quirks
btrfs has with single extent dedupe.

Simple, reliable, and doesn't require any sort of flailing
about with offsets and lengths to avoid unexpected EINVAL errors.

> btrfs dedupe reports success if the src extent is inline and the same
> size as the dst extent (i.e. file is smaller than one page).  No dedupe
> can occur in such cases--a clone results in a simple copy, so the best
> a dedupe could do would be a no-op.  Returning EINVAL there would break
> a few popular tools like "cp --reflink".  Returning OK but doing nothing
> seems to be the best option in that case.

Again, those are a filesystem implementation issues, not problems
with the API itself.

> > >   - should we just round down the EOF dedupe request to the
> > > block before EOF so dedupe still succeeds?
> > 
> > I've often wondered if the interface should (have) be(en) that we start
> > at src_off/dst_off and share as many common blocks as possible until we
> > find a mismatch, then tell userspace where we stopped... instead of like
> > now where we compare the entire extent and fail if any part of it
> > doesn't match.
> 
> The usefulness or harmfulness of that approach depends a lot on what
> the application expects the filesystem to do.
> 
> In btrfs, the dedupe operation acts on references to data, not the
> underlying data blocks.  If there are 1000 slightly overlapping references
> to a single contiguous range of data blocks in dst on disk, each dedupe
> operation acts on only one of those, leaving the other 999 untouched.
> If the app then submits 999 other dedupe requests, no references to the
> dst blocks remain and the underlying data blocks can be deleted.

Assuming your strawman is valid, if you have a thousand separate
references across the same set of data blocks on disk, then that data is
already heavily deduplicated.  Trying to optimise that further
seems misguided, way down the curve of diminishing returns.

> In a parallel universe (or a better filesystem, or a userspace emulation
  ^^^
> built out of dedupe and other ioctls), dedupe could work at the extent
> data (physical) level.  The app points at src and dst extent references
> (inode/offset/length tuples), and the filesystem figures out which
> physical blocks these point to, then adjusts all the references to the
> dst blocks at once,

That's XFS dedupe in a nutshell. And we aren't in a parallel
universe here. :P

> dealing with partial overlaps and snapshots and nodatacow
> and whatever other exotic features might be