Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Wed, Apr 04, 2018 at 11:31:33PM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> > On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> >> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> [...]
> >> Earlier you pointed out that the non-contiguous blocks written have
> >> an impact on performance. I am replying that the switch to a
> >> different BG happens at the disk-stripe boundary, so in any case the
> >> block is physically interrupted and switched to another disk.
> > 
> > The difference is that the write is switched to a different local address
> > on the disk.
> > 
> > It's not "another" disk if it's a different BG.  Recall in this plan
> > there is a full-width BG that is on _every_ disk, which means every
> > small-width BG shares a disk with the full-width BG.  Every extent tail
> > write requires a seek on a minimum of two disks in the array for raid5,
> > three disks for raid6.  A tail that is strip-width minus one will hit
> > N - 1 disks twice in an N-disk array.
> 
> Below I made a little simulation; my results tell me a different story:
> 
> Current BTRFS (w/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case A.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks (3 data + 2 parity)
> 
> Case A.2.2): extent size = 256kb (optimistic case: contiguous space available)
> 5 writes of 64kb spread on 5 disks (3 data + 2 parity)
> 2 reads of 64 kb spread on 2 disks (two old data of the stripe) [**]
> 3 writes of 64 kb spread on 3 disks (data + 2 parity)
> 
> Note that the two reads are contiguous to the 5 writes both in terms of
> space and time. The three writes are contiguous only in terms of space,
> but not in terms of time, because they can happen only after the 2
> reads and the consequent parity computations. So we should consider
> that between these two events some other disk activity happens; this
> means seeks between the 2 reads and the 3 writes.
> 
> 
> BTRFS with multiple BG (wo/write hole)
> 
> Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)
> 
> Case B.1): extent size = 192kb:
> 5 writes of 64kb spread on 5 disks
> 
> Case B.2): extent size = 256kb:
> 5 writes of 64kb spread on 5 disks in BG#1
> 3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)
> 
> So if I count correctly:
> - case B.1 vs A.1: these are equivalent
> - case B.2 vs A.2.2:
>   8 writes vs 8 writes
>   3 seeks vs 3 seeks
>   0 reads vs 2 reads
> 
> So to me it seems that the cost of doing a RMW cycle is worse than the
> cost of seeking to another BG.

Well, RMW cycles are dangerous, so being slow as well is just a second
reason never to do them.

> Anyway I am reaching the conclusion, also thanks to this discussion,
> that this is not enough. Even if we solved the problem of the
> "extent smaller than stripe" write, we would still face this issue
> again when part of a file is changed.
> In this case the file update breaks the old extent and creates
> three extents: the first part, the new part, the last part. Up to that
> point everything is OK. However, the "old" part of the file would be
> marked as free space, and reusing that space could require a RMW cycle.

You cannot use that free space within RAID stripes because it would
require RMW, and RMW causes write hole.  The space would have to be kept
unavailable until the rest of the RAID stripe was deleted.

OTOH, if you can solve that free space management problem, you don't
have to do anything else to solve write hole.  If you never RMW then
you never have the write hole in the first place.
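To make the write hole concrete, here is a toy sketch (illustrative Python, not btrfs code) of a raid5-style stripe where a crash between the data write and the parity write leaves stale parity, so a later reconstruction silently corrupts a strip that was never touched:

```python
# Toy RAID5 stripe: 2 data strips + 1 XOR parity strip, 4-byte strips.
def parity(strips):
    p = bytearray(len(strips[0]))
    for s in strips:
        for i, b in enumerate(s):
            p[i] ^= b
    return bytes(p)

d0, d1 = b"AAAA", b"BBBB"
p = parity([d0, d1])      # parity consistent with both data strips

# RMW update of d0: the new data strip reaches the disk...
d0 = b"CCCC"
# ...but we crash before the new parity is written, so `p` is stale.

# Later d1's disk fails; reconstruction from d0 and stale parity
# returns garbage instead of b"BBBB":
reconstructed = parity([d0, p])
assert reconstructed != b"BBBB"   # the untouched strip is now corrupt
```

Note that the corrupted strip is the one that was *not* being written, which is why RMW plus a crash can damage old, already-committed data.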

> I am concluding that the only two reliable solutions are
> a) variable stripe size (like ZFS does)
> or b) logging the RMW cycle of a stripe

Those are the only solutions that don't require a special process for
reclaiming unused space in RAID stripes.  If you have that, you have a
few more options; however, they all involve making a second copy of the
data at a later time (as opposed to option b, which makes a second
copy of the data during the original write).

a) also doesn't support nodatacow files (AFAIK ZFS doesn't have those)
and it would require defrag to get the inefficiently used space back.
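For comparison, (a) can be sketched as a hypothetical model of ZFS RAID-Z-style variable stripe width (not actual ZFS code; `stripes_for`, `PARITY`, and `MAX_DATA` are illustrative names). Every write gets its own stripe(s), so stripes are never shared and RMW never happens, at the cost of parity overhead on small writes:

```python
# Toy variable-stripe-width allocation for a raid6-like layout on a
# 5-disk array: each write carries its own parity, no stripe sharing.
PARITY = 2        # parity blocks per stripe (raid6-like)
MAX_DATA = 3      # at most 3 data disks per stripe on 5 disks

def stripes_for(nblocks):
    """Return (data, parity) block counts for a write of nblocks."""
    stripes = -(-nblocks // MAX_DATA)   # ceil division
    return nblocks, stripes * PARITY

assert stripes_for(1) == (1, 2)   # 200% parity overhead on a tiny write
assert stripes_for(6) == (6, 4)   # two full stripes, normal overhead
```

The asserts show the space inefficiency mentioned above: until a defrag repacks them, small writes pay full parity for partial stripes.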

b) is the best of the terrible options.  It minimizes the impact on the
rest of the filesystem since it can fix RMW inconsistency without having
to eliminate the RMW cases.  It doesn't require rewriting the allocator
nor does it require users to run defrag or balance periodically.
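A minimal sketch of (b), assuming a simple journal keyed by stripe id (the `StripeLog` class and its methods are hypothetical, not a proposed btrfs interface): the new stripe contents are made durable in a log before the in-place write, so a crashed RMW can be replayed to a consistent state.

```python
# Toy stripe-update log: journal the full new stripe before touching it,
# so a crashed RMW can be finished at mount time instead of leaving
# mismatched data and parity.
class StripeLog:
    def __init__(self):
        self.log = {}          # stripe_id -> pending new contents
        self.stripes = {}      # stripe_id -> committed contents

    def rmw(self, stripe_id, new_strips, crash_before_write=False):
        self.log[stripe_id] = new_strips      # 1. durable log write
        if crash_before_write:
            return                            # 2. crash: stripe untouched
        self.stripes[stripe_id] = new_strips  # 3. in-place stripe write
        del self.log[stripe_id]               # 4. retire the log entry

    def replay(self):
        # Mount-time recovery: finish any logged-but-unwritten updates.
        for sid, strips in self.log.items():
            self.stripes[sid] = strips
        self.log.clear()

r = StripeLog()
r.rmw(0, ["new-data", "new-parity"], crash_before_write=True)
r.replay()
assert r.stripes[0] == ["new-data", "new-parity"]
```

The cost is the double write mentioned elsewhere in this thread: every RMW stripe is written once to the log and once in place.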

> [**] Does someone know if the checksums are checked during this read?
> [...]
>  
> BR
> G.Baroncelli
> 
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5




Re: Status of RAID5/6

2018-04-04 Thread Goffredo Baroncelli
On 04/04/2018 08:01 AM, Zygo Blaxell wrote:
> On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
>> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
[...]
>> Earlier you pointed out that the non-contiguous blocks written have
>> an impact on performance. I am replying that the switch to a
>> different BG happens at the disk-stripe boundary, so in any case the
>> block is physically interrupted and switched to another disk.
> 
> The difference is that the write is switched to a different local address
> on the disk.
> 
> It's not "another" disk if it's a different BG.  Recall in this plan
> there is a full-width BG that is on _every_ disk, which means every
> small-width BG shares a disk with the full-width BG.  Every extent tail
> write requires a seek on a minimum of two disks in the array for raid5,
> three disks for raid6.  A tail that is strip-width minus one will hit
> N - 1 disks twice in an N-disk array.

Below I made a little simulation; my results tell me a different story:

Current BTRFS (w/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case A.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks (3 data + 2 parity)

Case A.2.2): extent size = 256kb (optimistic case: contiguous space available)
5 writes of 64kb spread on 5 disks (3 data + 2 parity)
2 reads of 64 kb spread on 2 disks (two old data of the stripe) [**]
3 writes of 64 kb spread on 3 disks (data + 2 parity)

Note that the two reads are contiguous to the 5 writes both in terms of
space and time. The three writes are contiguous only in terms of space,
but not in terms of time, because they can happen only after the 2 reads
and the consequent parity computations. So we should consider that
between these two events some other disk activity happens; this means
seeks between the 2 reads and the 3 writes.


BTRFS with multiple BG (wo/write hole)

Supposing 5 disk raid 6 and stripe size=64kb*3=192kb (disk stripe=64kb)

Case B.1): extent size = 192kb:
5 writes of 64kb spread on 5 disks

Case B.2): extent size = 256kb:
5 writes of 64kb spread on 5 disks in BG#1
3 writes of 64 kb spread on 3 disks in BG#2 (which requires 3 seeks)

So if I count correctly:
- case B.1 vs A.1: these are equivalent
- case B.2 vs A.2.2:
  8 writes vs 8 writes
  3 seeks vs 3 seeks
  0 reads vs 2 reads

So to me it seems that the cost of doing a RMW cycle is worse than the
cost of seeking to another BG.
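The counts above can be reproduced with a toy I/O model (illustrative only; `rmw_layout` and `multi_bg_layout` are hypothetical names, not btrfs code):

```python
# Toy I/O model for the cases above: 5-disk RAID6, 64 KiB disk stripes,
# so a full stripe is 3 data strips + 2 parity strips.
DATA, PARITY = 3, 2
STRIP = 64  # KiB

def rmw_layout(extent_kib):
    """Current btrfs: the tail of the extent triggers read-modify-write."""
    full, tail = divmod(extent_kib // STRIP, DATA)
    writes = full * (DATA + PARITY)
    reads = 0
    if tail:
        reads = DATA - tail            # old data strips needed for parity
        writes += tail + PARITY        # new data + rewritten parity
    return writes, reads

def multi_bg_layout(extent_kib):
    """Multiple-BG plan: the tail goes to a narrower BG, no reads."""
    full, tail = divmod(extent_kib // STRIP, DATA)
    writes = full * (DATA + PARITY)
    if tail:
        writes += tail + PARITY        # full stripe of a `tail`-data-disk BG
    return writes, 0

for size in (192, 256):                # cases A.1/B.1 and A.2.2/B.2
    print(size, rmw_layout(size), multi_bg_layout(size))
```

For a 256kb extent both layouts issue 8 writes, but only the RMW layout needs the 2 reads of old data strips before it can compute parity.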

Anyway I am reaching the conclusion, also thanks to this discussion, that
this is not enough. Even if we solved the problem of the "extent smaller
than stripe" write, we would still face this issue again when part of a
file is changed.
In this case the file update breaks the old extent and creates three
extents: the first part, the new part, the last part. Up to that point
everything is OK. However, the "old" part of the file would be marked as
free space, and reusing that space could require a RMW cycle.

I am concluding that the only two reliable solutions are
a) variable stripe size (like ZFS does)
or b) logging the RMW cycle of a stripe


[**] Does someone know if the checksums are checked during this read?
[...]
 
BR
G.Baroncelli


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs-progs: build: Do not use cp -a to install files

2018-04-04 Thread Peter Kjellerstedt
Using cp -a to install files will preserve the ownership of the
original files (if possible), which is typically not wanted. E.g., if
the files were built by a normal user, but are being installed by
root, then the installed files would maintain the UIDs/GIDs of the
user that built the files rather than being owned by root.

Signed-off-by: Peter Kjellerstedt 
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index 92cfe7b5..0e8bfd98 100644
--- a/Makefile
+++ b/Makefile
@@ -578,7 +578,7 @@ install: $(libs) $(progs_install) $(INSTALLDIRS)
$(LN_S) -f btrfs $(DESTDIR)$(bindir)/btrfsck
$(INSTALL) -m755 -d $(DESTDIR)$(libdir)
$(INSTALL) $(libs) $(DESTDIR)$(libdir)
-   cp -a $(lib_links) $(DESTDIR)$(libdir)
+   cp -d $(lib_links) $(DESTDIR)$(libdir)
$(INSTALL) -m755 -d $(DESTDIR)$(incdir)
$(INSTALL) -m644 $(headers) $(DESTDIR)$(incdir)
 ifneq ($(udevdir),)
-- 
2.12.0



[PATCH] btrfs: Drop delayed_refs argument from btrfs_check_delayed_seq

2018-04-04 Thread Nikolay Borisov
It's used to print its pointer in a debug statement but doesn't really
bring any useful information to the error message.

Signed-off-by: Nikolay Borisov 
---
 fs/btrfs/delayed-ref.c | 9 +++--
 fs/btrfs/delayed-ref.h | 4 +---
 fs/btrfs/extent-tree.c | 2 +-
 3 files changed, 5 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 2677257c149d..5ac2d7909782 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -336,9 +336,7 @@ void btrfs_merge_delayed_refs(struct btrfs_trans_handle *trans,
}
 }
 
-int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
-   struct btrfs_delayed_ref_root *delayed_refs,
-   u64 seq)
+int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, u64 seq)
 {
struct seq_list *elem;
int ret = 0;
@@ -349,10 +347,9 @@ int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
struct seq_list, list);
if (seq >= elem->seq) {
btrfs_debug(fs_info,
-   "holding back delayed_ref %#x.%x, lowest is %#x.%x (%p)",
+   "holding back delayed_ref %#x.%x, lowest is %#x.%x",
(u32)(seq >> 32), (u32)seq,
-   (u32)(elem->seq >> 32), (u32)elem->seq,
-   delayed_refs);
+   (u32)(elem->seq >> 32), (u32)elem->seq);
ret = 1;
}
}
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 9e3e5aff0937..cdf4a94ce5c1 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -280,9 +280,7 @@ static inline void btrfs_delayed_ref_unlock(struct btrfs_delayed_ref_head *head)
 struct btrfs_delayed_ref_head *
 btrfs_select_ref_head(struct btrfs_trans_handle *trans);
 
-int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info,
-   struct btrfs_delayed_ref_root *delayed_refs,
-   u64 seq);
+int btrfs_check_delayed_seq(struct btrfs_fs_info *fs_info, u64 seq);
 
 /*
  * helper functions to cast a node into its container
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d83d449e749a..80ed159b0991 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2718,7 +2718,7 @@ static noinline int __btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
ref = select_delayed_ref(locked_ref);
 
if (ref && ref->seq &&
-   btrfs_check_delayed_seq(fs_info, delayed_refs, ref->seq)) {
+   btrfs_check_delayed_seq(fs_info, ref->seq)) {
spin_unlock(&locked_ref->lock);
unselect_delayed_ref_head(delayed_refs, locked_ref);
locked_ref = NULL;
-- 
2.7.4



Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Tue, Apr 03, 2018 at 09:08:01PM -0600, Chris Murphy wrote:
> On Tue, Apr 3, 2018 at 11:03 AM, Goffredo Baroncelli wrote:
> > On 04/03/2018 02:31 AM, Zygo Blaxell wrote:
> >> On Mon, Apr 02, 2018 at 06:23:34PM -0400, Zygo Blaxell wrote:
> >>> On Mon, Apr 02, 2018 at 11:49:42AM -0400, Austin S. Hemmelgarn wrote:
>  On 2018-04-02 11:18, Goffredo Baroncelli wrote:
> > I thought that a possible solution is to create BG with different
> > number of data disks. E.g. supposing to have a raid 6 system with 6
> > disks, where 2 are parity disks; we should allocate 3 BG:
> > BG #1: 1 data disk, 2 parity disks
> > BG #2: 2 data disks, 2 parity disks
> > BG #3: 4 data disks, 2 parity disks
> >
> > For simplicity, the disk-stripe length is assumed = 4K.
> >
> > So if you have a write with a length of 4 KB, this should be placed
> > in BG#1; if you have a write with a length of 4*3KB, the first 8KB
> > should be placed in BG#2, then in BG#1.
> > This would avoid space wasting, even if the fragmentation will
> > increase (but does fragmentation matter with modern solid
> > state disks?).
> >>> I don't really see why this would increase fragmentation or waste space.
> >
> >> Oh, wait, yes I do.  If there's a write of 6 blocks, we would have
> >> to split an extent between BG #3 (the first 4 blocks) and BG #2 (the
> >> remaining 2 blocks).  It also flips the usual order of "determine size
> >> of extent, then allocate space for it" which might require major surgery
> >> on the btrfs allocator to implement.
> >
> > I have to point out that in any case the extent is physically interrupted 
> > at the disk-stripe size. Assuming disk-stripe=64KB, if you want to write 
> > 128KB, the first half is written in the first disk, the other in the 2nd 
> > disk.  If you want to write 96kb, the first 64 are written in the first 
> > disk, the last part in the 2nd, only on a different BG.
> > So yes there is a fragmentation from a logical point of view; from a 
> > physical point of view the data is spread on the disks in any case.
> >
> > In any case, you are right, we should gather some data, because the
> > performance impacts are not so clear.
> 
> They're pretty clear, and there's a lot written about small file size
> and parity raid performance being shit, no matter the implementation
> (md, ZFS, Btrfs, hardware maybe less so just because of all the
> caching and extra processing hardware that's dedicated to the task).

Pretty much everything goes fast if you put a faster non-volatile cache
in front of it.

> The linux-raid@ list is full of optimizations for this that are use
> case specific. One of those that often comes up is how badly suited
> raid56 are for e.g. mail servers, tons of small file reads and writes,
> and all the disk contention that comes up, and it's even worse when
> you lose a disk, and even if you're running raid 6 and lose two disks
> it's really god awful. It can be unexpectedly a disqualifying setup
> without prior testing in that condition: can your workload really be
> usable for two or three days in a double degraded state on that raid6?
> *shrug*
> 
> Parity raid is well suited for full stripe reads and writes, lots of
> sequential writes. Ergo a small file is anything less than a full
> stripe write. Of course, delayed allocation can end up making for more
> full stripe writes. But now you have more RMW which is the real
> performance killer, again no matter the raid.

RMW isn't necessary if you have properly configured COW on top.
ZFS doesn't do RMW at all.  OTOH for some workloads COW is a step in a
different wrong direction--the btrfs raid5 problems with nodatacow
files can be solved by stripe logging and nothing else.

Some equivalent of autodefrag that repacks your small RAID stripes
into bigger ones will burn 3x your write IOPS eventually--it just
lets you defer the inevitable until a hopefully more convenient time.
A continuously loaded server never has a more convenient time, so it
needs a different solution.

> > I am not worried abut having different BG; we have problem with these 
> > because we never developed tool to handle this issue properly (i.e. a 
> > daemon which starts a balance when needed). But I hope that this will be 
> > solved in future.
> >
> > In any case, all the solutions proposed have their trade-offs:
> >
> > - a) as is: write hole bug
> > - b) variable stripe size (like ZFS): big impact on how btrfs handle the 
> > extent. limited waste of space
> > - c) logging data before writing: we wrote the data two times in a short 
> > time window. Moreover the log area is written several order of magnitude 
> > more than the other area; there was some patch around
> > - d) rounding the writing to the stripe size: waste of space; simple to 
> > implement;
> > - e) different BG with different stripe size: limited waste of space; 
> > logical fragmentation.
> 
> I'd say for sure you're 

[PATCH RESEND 1/2] btrfs: Enhance btrfs_trim_fs function to handle error better

2018-04-04 Thread Qu Wenruo
Function btrfs_trim_fs() doesn't handle errors in a consistent way: if an
error happens when trimming existing block groups, it will skip the
remaining block groups and continue to trim unallocated space for each device.

And the return value will only reflect the final error from device
trimming.

This patch will fix such behavior by:

1) Recording the first error from block group or device trimming
   So the return value will also reflect any error found when trimming,
   making developers more aware of the problem.

2) Outputting a btrfs warning message for each trimming failure
   Any error in block group or device trimming will cause a btrfs warning
   kernel message.

3) Continuing trimming if we can
   If we fail to trim one block group or device, we can still try the
   next block group or device.

Such behavior avoids confusion for cases like failing to trim the
first block group and then trimming only unallocated space.

Reported-by: Chris Murphy 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 59 --
 1 file changed, 43 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index d83d449e749a..f3b088665b7a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10975,6 +10975,16 @@ static int btrfs_trim_free_extents(struct btrfs_device *device,
return ret;
 }
 
+/*
+ * Trim the whole fs, by:
+ * 1) Trimming free space in each block group
+ * 2) Trimming unallocated space in each device
+ *
+ * Will try to continue trimming even if we failed to trim one block group or
+ * device.
+ * The return value will be the error return value of the first error.
+ * Or 0 if nothing wrong happened.
+ */
 int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
 {
struct btrfs_block_group_cache *cache = NULL;
@@ -10985,6 +10995,8 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
u64 end;
u64 trimmed = 0;
u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
+   int bg_ret = 0;
+   int dev_ret = 0;
int ret = 0;
 
/*
@@ -10995,7 +11007,7 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
else
cache = btrfs_lookup_block_group(fs_info, range->start);
 
-   while (cache) {
+   for (; cache; cache = next_block_group(fs_info, cache)) {
if (cache->key.objectid >= (range->start + range->len)) {
btrfs_put_block_group(cache);
break;
@@ -11009,29 +11021,36 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
if (!block_group_cache_done(cache)) {
ret = cache_block_group(cache, 0);
if (ret) {
-   btrfs_put_block_group(cache);
-   break;
+   btrfs_warn_rl(fs_info,
+   "failed to cache block group %llu ret %d",
+  cache->key.objectid, ret);
+   if (!bg_ret)
+   bg_ret = ret;
+   continue;
}
ret = wait_block_group_cache_done(cache);
if (ret) {
-   btrfs_put_block_group(cache);
-   break;
+   btrfs_warn_rl(fs_info,
+   "failed to wait cache for block group %llu ret %d",
+  cache->key.objectid, ret);
+   if (!bg_ret)
+   bg_ret = ret;
+   continue;
}
}
-   ret = btrfs_trim_block_group(cache,
-   &group_trimmed,
-   start,
-   end,
-   range->minlen);
+   ret = btrfs_trim_block_group(cache, &group_trimmed,
+   start, end, range->minlen);
 
trimmed += group_trimmed;
if (ret) {
-   btrfs_put_block_group(cache);
-   break;
+   btrfs_warn_rl(fs_info,
+   "failed to trim block group %llu ret %d",
+  cache->key.objectid, ret);
+   if (!bg_ret)
+ 

[PATCH RESEND 2/2] btrfs: Ensure btrfs_trim_fs can trim the whole fs

2018-04-04 Thread Qu Wenruo
[BUG]
fstrim on some btrfs only trims the unallocated space, not trimming any
space in existing block groups.

[CAUSE]
Before fstrim_range is passed to btrfs_trim_fs(), it gets truncated to
the range [0, super->total_bytes).
So later btrfs_trim_fs() will only be able to trim block groups in the
range [0, super->total_bytes).

However for btrfs any bytenr aligned to the sector size is valid; since
btrfs uses its own logical address space, there is nothing limiting the
location where we put block groups.

For btrfs with routine balance, it's quite easy to relocate all block
groups so that their bytenrs start beyond super->total_bytes.

In that case, btrfs will not trim existing block groups.

[FIX]
Just remove the truncation in btrfs_ioctl_fitrim(), so btrfs_trim_fs()
can get the unmodified range, which is normally set to [0, U64_MAX].

Reported-by: Chris Murphy 
Fixes: f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in fitrim ioctl")
Cc:  # v4.0+
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 10 +-
 fs/btrfs/ioctl.c   | 11 +++
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f3b088665b7a..647691fc16e8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10994,19 +10994,11 @@ int btrfs_trim_fs(struct btrfs_fs_info *fs_info, struct fstrim_range *range)
u64 start;
u64 end;
u64 trimmed = 0;
-   u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
int bg_ret = 0;
int dev_ret = 0;
int ret = 0;
 
-   /*
-* try to trim all FS space, our block group may start from non-zero.
-*/
-   if (range->len == total_bytes)
-   cache = btrfs_lookup_first_block_group(fs_info, range->start);
-   else
-   cache = btrfs_lookup_block_group(fs_info, range->start);
-
+   cache = btrfs_lookup_first_block_group(fs_info, range->start);
for (; cache; cache = next_block_group(fs_info, cache)) {
if (cache->key.objectid >= (range->start + range->len)) {
btrfs_put_block_group(cache);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index ac85e07f567b..761fba8d8f75 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -364,7 +364,6 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg)
struct fstrim_range range;
u64 minlen = ULLONG_MAX;
u64 num_devices = 0;
-   u64 total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
int ret;
 
if (!capable(CAP_SYS_ADMIN))
@@ -388,11 +387,15 @@ static noinline int btrfs_ioctl_fitrim(struct file *file, void __user *arg)
return -EOPNOTSUPP;
if (copy_from_user(&range, arg, sizeof(range)))
return -EFAULT;
-   if (range.start > total_bytes ||
-   range.len < fs_info->sb->s_blocksize)
+
+   /*
+* NOTE: Don't truncate the range using super->total_bytes.
+* Bytenr of btrfs block group is in btrfs logical address space,
+* which can be any sector size aligned bytenr in [0, U64_MAX].
+*/
+   if (range.len < fs_info->sb->s_blocksize)
return -EINVAL;
 
-   range.len = min(range.len, total_bytes - range.start);
range.minlen = max(range.minlen, minlen);
ret = btrfs_trim_fs(fs_info, &range);
if (ret < 0)
-- 
2.16.3



Re: Status of RAID5/6

2018-04-04 Thread Zygo Blaxell
On Wed, Apr 04, 2018 at 07:15:54AM +0200, Goffredo Baroncelli wrote:
> On 04/04/2018 12:57 AM, Zygo Blaxell wrote:
> >> I have to point out that in any case the extent is physically
> >> interrupted at the disk-stripe size. Assuming disk-stripe=64KB, if
> >> you want to write 128KB, the first half is written in the first disk,
> >> the other in the 2nd disk.  If you want to write 96kb, the first 64
> >> are written in the first disk, the last part in the 2nd, only on a
> >> different BG.
> > The "only on a different BG" part implies something expensive, either
> > a seek or a new erase page depending on the hardware.  Without that,
> > nearby logical blocks are nearby physical blocks as well.
> 
> In any case it happens on a different disk

No it doesn't.  The small-BG could be on the same disk(s) as the big-BG.

> >> So yes there is a fragmentation from a logical point of view; from a
> >> physical point of view the data is spread on the disks in any case.
> 
> > What matters is the extent-tree point of view.  There is (currently)
> > no fragmentation there, even for RAID5/6.  The extent tree is unaware
> > of RAID5/6 (to its peril).
> 
> Earlier you pointed out that the non-contiguous blocks written have
> an impact on performance. I am replying that the switch to a
> different BG happens at the disk-stripe boundary, so in any case the
> block is physically interrupted and switched to another disk

The difference is that the write is switched to a different local address
on the disk.

It's not "another" disk if it's a different BG.  Recall in this plan
there is a full-width BG that is on _every_ disk, which means every
small-width BG shares a disk with the full-width BG.  Every extent tail
write requires a seek on a minimum of two disks in the array for raid5,
three disks for raid6.  A tail that is strip-width minus one will hit
N - 1 disks twice in an N-disk array.

> However yes: from an extent-tree point of view there will be an increase
> in the number of extents, because the end of the write is allocated to
> another BG (if the size is not stripe-aligned)
> 
> > If an application does a loop writing 68K then fsync(), the multiple-BG
> > solution adds two seeks to read every 68K.  That's expensive if sequential
> > read bandwidth is more scarce than free space.
> 
> Why do you talk about additional seeks? In any case (even without the
> additional BG) the read happens from another disk

See above:  not another disk, usually a different location on two or
more of the same disks.

> >> * c), d), e) are applied only for the tail of the extent, in case the
> >> size is less than the stripe size.
> > 
> > It's only necessary to split an extent if there are no other writes
> > in the same transaction that could be combined with the extent tail
> > into a single RAID stripe.  As long as everything in the RAID stripe
> > belongs to a single transaction, there is no write hole
> 
> Maybe a simpler optimization would be to close the transaction
> when the data reaches the stripe boundary... But I suspect that it is
> not so simple to implement.

Transactions exist in btrfs to batch up writes into big contiguous extents
already.  The trick is to _not_ do that when one transaction ends and
the next begins, i.e. leave a space at the end of the partially-filled
stripe so that the next transaction begins in an empty stripe.

This does mean that there will only be extra seeks during transaction
commit and fsync()--which were already very seeky to begin with.  It's
not necessary to write a partial stripe when there are other extents to
combine.

So there will be double the amount of seeking, but depending on the
workload, it could double a very small percentage of writes.
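A sketch of that idea (hypothetical `Allocator` with a toy 3-block data stripe; not btrfs's allocator): extents pack tightly within a transaction, and commit rounds the write cursor up to the next stripe boundary, so no RAID stripe is ever shared between two transactions.

```python
# Toy allocator: pack extents within a transaction, pad to a stripe
# boundary at commit so later transactions never RMW a shared stripe.
STRIPE = 3  # data blocks per RAID stripe (toy value)

class Allocator:
    def __init__(self):
        self.cursor = 0        # next free block
        self.extents = []      # (start block, length) of each allocation

    def alloc(self, nblocks):
        self.extents.append((self.cursor, nblocks))
        self.cursor += nblocks

    def commit(self):
        # Round the cursor up to a stripe boundary; the tail of the
        # current stripe is left empty rather than shared.
        self.cursor = -(-self.cursor // STRIPE) * STRIPE

a = Allocator()
a.alloc(4)      # first transaction: fills stripe 0 and 1 block of stripe 1
a.commit()      # cursor 4 -> 6: blocks 4..5 stay empty
a.alloc(2)      # next transaction starts in a fresh stripe
assert a.extents == [(0, 4), (6, 2)]
```

The empty tails are the space cost; they remain reclaimable later by repacking short extents into full stripes.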

> > Not for d.  Balance doesn't know how to get rid of unreachable blocks
> > in extents (it just moves the entire extent around) so after a balance
> > the writes would still be rounded up to the stripe size.  Balance would
> > never be able to free the rounded-up space.  That space would just be
> > gone until the file was overwritten, deleted, or defragged.
> 
> If balance is capable of moving the extent, why not place one near the
> other during a balance? The goal is not to limit the writing of
> the end of an extent, but to avoid writing the end of an extent without
> further data (e.g. the gap to the stripe has to be filled in the
> same transaction)

That's plan f (leave gaps in RAID stripes empty).  Balance will repack
short extents into RAID stripes nicely.

Plan d can't do that because plan d overallocates the extent so that
the extent fills the stripe (only some of the extent is used for data).
Small but important difference.

> BR
> G.Baroncelli
> 
> -- 
> gpg @keyserver.linux.it: Goffredo Baroncelli 
> Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

