Re: [PATCH 1/2] btrfs-progs: tests: Extend cli/003

2021-02-19 Thread David Sterba
On Mon, Jan 25, 2021 at 12:43:57PM +0200, Nikolay Borisov wrote:
> Add a test which ensures that when resize is tried on an image instead
> of a directory, an appropriate warning is produced and the command fails.
> 
> Signed-off-by: Nikolay Borisov 

Added to devel with some fixups, thanks.


Re: [PATCH] btrfs-progs: --init-extent-tree if extent tree is unreadable

2021-02-19 Thread David Sterba
On Wed, Aug 12, 2020 at 07:14:44PM +0200, David Sterba wrote:
> On Wed, Aug 12, 2020 at 09:29:18AM +0800, Qu Wenruo wrote:
> > > On 2020/7/28 10:12 AM, Daniel Xu wrote:
> > > This change can save the user an extra step of running `btrfs check
> > > --init-extent-tree ...` if the user was already trying to repair the
> > > filesystem.
> > 
> > This looks too aggressive to me.
> > 
> > Extent tree repair, not only --init-extent-tree, is not considered safe
> > overall.
> > 
> > In fact, we could hit cases with things like completely sane fs trees,
> > but corrupted extent and csum trees.
> > 
> > In that case, I'm not sure --init-extent-tree would solve the problem
> > or just worsen the situation.
> > 
> > Thus --init-extent-tree should only be triggered by users who know what
> > they are doing.
> > (In that case, I would call them developers other than users)
> 
> I have basically the same answer, just did not get to writing it.  I'll
> have another look after the merge window is over.
> 
> This touches on the higher level topic of what check should do; we can't
> trade safety for convenience. The extra step to specify the option on
> the command line can actually be the difference between repairing and
> not repairing.

To answer that, favoring safety over convenience here, so the option
needs to be specified manually if needed.


Re: error in backport of 'btrfs: fix possible free space tree corruption with online conversion'

2021-02-19 Thread Holger Hoffstätte

On 2021-02-19 04:17, Wang Yugui wrote:

> Hi, Josef Bacik
> 
> We noticed an error in 5.10.x backport of 'btrfs: fix possible free
> space tree corruption with online conversion'
> 
> It is wrong in 5.10.13, but right in 5.11.
> 
> 5.10.13
> @@ -146,6 +146,9 @@ enum {
> 	BTRFS_FS_STATE_DEV_REPLACING,
> 	/* The btrfs_fs_info created for self-tests */
> 	BTRFS_FS_STATE_DUMMY_FS_INFO,
> +
> +	/* Indicate that we can't trust the free space tree for caching yet */
> +	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
>   };
> 
> the usage sample of this enum:
> set_bit(BTRFS_FS_STATE_DUMMY_FS_INFO, &fs_info->fs_state);
> 
> 
> 5.11
> enum {
> ..
> 	/* Indicate that the discard workqueue can service discards. */
> 	BTRFS_FS_DISCARD_RUNNING,
> 
> 	/* Indicate that we need to cleanup space cache v1 */
> 	BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
> 
> 	/* Indicate that we can't trust the free space tree for caching yet */
> 	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> };
> 
> the usage sample of this enum:
> set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);



Out of curiosity I decided to check how this happened, but don't see it.
Here is the commit that went into 5.10.13 and it looks correct to me:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57

The patch that went into 5.10 looks identical to the original commit in 5.11.
What tree are you looking at?

-h


Re: [PATCH v2] btrfs-progs: filesystem-resize: make output more readable

2021-02-19 Thread David Sterba
On Sat, Jan 23, 2021 at 03:37:20PM +, Sidong Yang wrote:
> This patch makes the output of the filesystem-resize command more readable
> and gives detailed information for users. It provides more information
> about the filesystem, like below.
> 
> Before:
> Resize '/mnt' of '1:-1G'
> 
> After:
> Resize device id 1 (/dev/vdb) from 4.00GiB to 3.00GiB
> 
> Signed-off-by: Sidong Yang 
> ---
> v2:
>   - print more detailed error
>   - covers all the possibilities format provides
> ---
>  cmds/filesystem.c | 112 +-
>  1 file changed, 111 insertions(+), 1 deletion(-)
> 
> diff --git a/cmds/filesystem.c b/cmds/filesystem.c
> index ba2e5928..cec3f380 100644
> --- a/cmds/filesystem.c
> +++ b/cmds/filesystem.c
> @@ -28,6 +28,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -1074,6 +1075,109 @@ static const char * const cmd_filesystem_resize_usage[] = {
>   NULL
>  };
>  
> +static int check_resize_args(const char *amount, const char *path) {
> + struct btrfs_ioctl_fs_info_args fi_args;
> + struct btrfs_ioctl_dev_info_args *di_args = NULL;
> + int ret, i, devid = 0, dev_idx = -1;

devid should be u64

> + const char *res_str = NULL;
> + char *devstr = NULL, *sizestr = NULL;
> + u64 new_size = 0, old_size = 0;
> + int mod = 0;
> + char amount_dup[BTRFS_VOL_NAME_MAX];

Buffer of a fixed size ...

> +
> + ret = get_fs_info(path, &fi_args, &di_args);
> +
> + if (ret) {
> + error("unable to retrieve fs info");
> + return 1;
> + }
> +
> + if (!fi_args.num_devices) {
> + error("no devices found");
> + free(di_args);
> + return 1;
> + }
> +
> + strcpy(amount_dup, amount);

... and strcpy from a user specified buffer, this is from chapter 1 of
how buffer overflows in C work. Please use safe string copy.
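
For reference, a minimal sketch of the bounded-copy pattern (this is
essentially what v3 of the patch ends up doing with snprintf; names taken
from the code above):

	char amount_dup[BTRFS_VOL_NAME_MAX];
	int len;

	/* snprintf() never writes past the buffer and returns the length
	 * the full string would have needed, so a result >= the buffer
	 * size means the input was truncated and must be rejected. */
	len = snprintf(amount_dup, sizeof(amount_dup), "%s", amount);
	if (len < 0 || (size_t)len >= sizeof(amount_dup)) {
		error("resize argument too long: %s", amount);
		return 1;
	}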

> + devstr = strchr(amount_dup, ':');
> + if (devstr) {
> + sizestr = devstr + 1;
> + *devstr = '\0';
> + devstr = amount_dup;
> +
> + errno = 0;
> + devid = strtoull(devstr, NULL, 10);
> +
> + if (errno) {
> + error("failed to parse devid %s", devstr);
> + free(di_args);
> + return 1;
> + }
> + }
> +
> + dev_idx = -1;
> + for(i = 0; i < fi_args.num_devices; i++) {
> + if (di_args[i].devid == devid) {
> + dev_idx = i;
> + break;
> + }
> + }
> +
> + if (dev_idx < 0) {
> + error("cannot find devid : %d", devid);
> + free(di_args);
> + return 1;
> + }
> +
> + if (!strcmp(sizestr, "max")) {
> + res_str = "max";
> + }
> + else {
> + if (sizestr[0] == '-') {
> + mod = -1;
> + sizestr++;
> + } else if (sizestr[0] == '+') {
> + mod = 1;
> + sizestr++;
> + }
> + new_size = parse_size(sizestr);
> + if (!new_size) {
> + error("failed to parse size %s", sizestr);
> + free(di_args);
> + return 1;
> + }
> + old_size = di_args[dev_idx].total_bytes;
> +
> + if (mod < 0) {
> + if (new_size > old_size) {
> + error("current size is %s which is smaller than 
> %s",
> +   pretty_size_mode(old_size, UNITS_DEFAULT),
> +   pretty_size_mode(new_size, 
> UNITS_DEFAULT));
> + free(di_args);
> + return 1;
> + }
> + new_size = old_size - new_size;
> + } else if (mod > 0) {
> + if (new_size > ULLONG_MAX - old_size) {
> + error("increasing %s is out of range",
> +   pretty_size_mode(new_size, UNITS_DEFAULT));
> + free(di_args);
> + return 1;
> + }
> + new_size = old_size + new_size;

This all got me confused, old_size and new_size sound like absolute
numbers for the size but new_size is in fact used as the delta, or the
relative change.

Otherwise looks good.

> + }
> + new_size = round_down(new_size, fi_args.sectorsize);
> + res_str = pretty_size_mode(new_size, UNITS_DEFAULT);
> + }
> +
> + printf("Resize device id %d (%s) from %s to %s\n", devid, 
> di_args[dev_idx].path,
> + pretty_size_mode(di_args[dev_idx].total_bytes, UNITS_DEFAULT),
> + res_str);
> +
> + free(di_args);
> + return 0;
> +}
> +
>  static int cmd_filesystem_resize(const struct cmd_struct *cmd,
>  

Re: [PATCH btrfs-progs] btrfs-progs: do not fail when offset of a ROOT_ITEM is not -1

2021-02-19 Thread David Sterba
On Tue, Feb 09, 2021 at 06:34:06PM +0100, Marek Behún wrote:
> When the btrfs_read_fs_root() function is searching a ROOT_ITEM with
> location key offset other than -1, it currently fails via BUG_ON.
> 
> The offset can have other value than -1, though. This can happen for
> example if a subvolume is renamed:
> 
>   $ btrfs subvolume create X && sync
>   Create subvolume './X'
>   $ btrfs inspect-internal dump-tree /dev/root | grep -B 2 'name: X$'
> location key (270 ROOT_ITEM 18446744073709551615) type DIR
> transid 283 data_len 0 name_len 1
> name: X
>   $ mv X Y && sync
>   $ btrfs inspect-internal dump-tree /dev/root | grep -B 2 'name: Y$'
> location key (270 ROOT_ITEM 0) type DIR
> transid 285 data_len 0 name_len 1
> name: Y
> 
> As can be seen the offset changed from -1ULL to 0.
> 
> Do not fail in this case.
> 
> Signed-off-by: Marek Behún 
> Cc: David Sterba 
> Cc: Qu Wenruo 
> Cc: Tom Rini 

Added to devel, thanks.


Re: [PATCH 2/2] btrfs-progs: tests: check the result log for critical warnings

2021-02-19 Thread David Sterba
On Mon, Nov 09, 2020 at 01:39:52PM +0800, Qu Wenruo wrote:
> Introduce a new function, check_test_results(), for
> misc/fsck/convert/mkfs test cases.
> 
> This function is currently used to catch warning messages for subpage support,
> but can later be expanded for other usages.

That sounds very useful, thanks.

Added to devel.


Re: page->index limitation on 32bit system?

2021-02-19 Thread Matthew Wilcox
On Thu, Feb 18, 2021 at 01:27:09PM -0800, Erik Jensen wrote:
> On 2/18/21 4:15 AM, Matthew Wilcox wrote:
> 
> > On Thu, Feb 18, 2021 at 04:54:46PM +0800, Qu Wenruo wrote:
> > > Recently we got a strange bug report that, on 32bit systems like armv6
> > > or non-64bit x86, certain large btrfs filesystems can't be mounted.
> > > 
> > > It turns out that, since page->index is just unsigned long, on 32bit
> > > systems that can be just 32 bits.
> > > 
> > > And when a filesystem utilizes any page offset over 4T, page->index
> > > gets truncated, causing various problems.
> > 4TB?  I think you mean 16TB (4kB * 4GB)
> > 
> > Yes, this is a known limitation.  Some vendors have gone to the trouble
> > of introducing a new page_index_t.  I'm not convinced this is a problem
> > worth solving.  There are very few 32-bit systems with this much storage
> > on a single partition (everything should work fine if you take a 20TB
> > drive and partition it into two 10TB partitions).
> For what it's worth, I'm the reporter of the original bug. My use case is a
> custom NAS system. It runs on a 32-bit ARM processor, and has 5 8TB drives,
> which I'd like to use as a single, unified storage array. I chose btrfs for
> this project due to the filesystem-integrated snapshots and checksums.
> Currently, I'm working around this issue by exporting the raw drives using
> nbd and mounting them on a 64-bit system to access the filesystem, but this
> is very inconvenient, only allows one machine to access the filesystem at a
> time, and prevents running any tools that need access to the filesystem
> (such as backup and file sync utilities) on the NAS itself.
> 
> It sounds like this limitation would also prevent me from trying to use a
> different filesystem on top of software RAID, since in that case the logical
> filesystem would still be over 16TB.
> 
> > As usual, the best solution is for people to stop buying 32-bit systems.
> I purchased this device in 2018, so it's not exactly ancient. At the time,
> it was the only SBC I could find that was low power, used ECC RAM, had a
> crypto accelerator, and had multiple sata ports with port-multiplier
> support.

I'm sorry you bought unsupported hardware.

This limitation has been known since at least 2009:
https://lore.kernel.org/lkml/19041.4714.686158.130252@notabene.brown/

In the last decade, nobody's tried to fix it in mainline that I know of.
As I said, some vendors have tried to fix it in their NAS products,
but I don't know where to find that patch any more.

https://bootlin.com/blog/large-page-support-for-nas-systems-on-32-bit-arm/
might help you, but btrfs might still contain assumptions that will trip
you up.
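
To make the 16TB arithmetic from earlier in the thread concrete, a tiny
standalone sketch (assuming 4KiB pages and a 32-bit unsigned long index):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* With 4KiB pages, a 32-bit page->index covers at most
	 * 2^32 * 4096 bytes = 16 TiB. */
	uint64_t limit = (uint64_t)1 << 32 << 12;
	printf("limit: %llu bytes (16 TiB)\n", (unsigned long long)limit);

	/* Any offset beyond that silently truncates once the page index
	 * is stored in 32 bits: */
	uint64_t offset = 17ULL << 40;			/* 17 TiB */
	uint32_t index = (uint32_t)(offset >> 12);	/* truncated */
	printf("index for 17 TiB: %llu, truncated to: %u\n",
	       (unsigned long long)(offset >> 12), index);
	return 0;
}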


Re: [PATCH 0/3] btrfs-progs: rescue: Add create-control-device subcommand

2021-02-19 Thread David Sterba
On Thu, Oct 29, 2020 at 04:17:35PM -0700, Daniel Xu wrote:
> This patchset adds a new `btrfs rescue create-control-device` subcommand
> that acts as a convenient way to invoke:
> 
>   # mknod --mode=600 /dev/btrfs-control c 10 234
> 
> on systems that don't have `mknod` installed or when you're lazy.

Well, I don't think the part 'lazy' applies.

The whole thing with the control device is simpler than I originally
thought. On a system without the btrfs module loaded there's no
/dev/btrfs-control. This is correct because the device node is created
at load time or when btrfs_interface_init is called.

Creating just the device node makes no sense because there's nothing
handling it. Once module is loaded it appears and works as expected.

The only case where the rescue command makes sense is when the module is
loaded, the device node created and then manually deleted. This is possible
but highly unlikely. For that reason the rescue command still makes some
sense, but the reasoning needs to reflect how it's related to the module
status.

As this is a docs-only update, I'll fix that myself, no need to resend.
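
For completeness, a minimal C equivalent of the mknod invocation from the
cover letter (misc major 10, btrfs-control minor 234); as explained above,
the node only does anything once the module is loaded to handle it:

#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>

int main(void)
{
	/* Same effect as: mknod --mode=600 /dev/btrfs-control c 10 234 */
	if (mknod("/dev/btrfs-control", S_IFCHR | 0600, makedev(10, 234))) {
		perror("mknod /dev/btrfs-control");
		return 1;
	}
	return 0;
}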


Re: corrupt leaf, unexpected item end, unmountable

2021-02-19 Thread Daniel Dawson
On 2/18/21 9:03 PM, Chris Murphy wrote:
> Once everything else is figured out, you should consider converting
> metadata to raid1c3.

Got it.

> The new replacement is devid 0 during the replacement. The drive being
> replaced keeps its devid until the end, and then there's a switch,
> that device is removed, and the signature on the old drive is wiped.
> Sooo something is still wrong with the above because there's no
> devid 3, there's kernel and btrfs check messages saying devid 3 is
> missing.
>
> It doesn't seem likely that /dev/sdc3 is devid 3 because it can't be
> both missing and be the mounted dev node.

It seems I was unclear. I removed the old drive prior to the
replacement, hence degraded mode.

A while ago, I imaged the drives, to see what I could do without risk
(on another machine). Turns out I was able to mount the filesystem using
-o ro,nologreplay,degraded and copy almost all files. A small number
were unreadable/un-stat-able. Fortunately nothing critical, though the
OS may well be unusable.

(Also, in case you were wondering, memory testing has revealed no errors
so far.)

> If a tree log is damaged and prevents mount then, you need to make a
> calculation. You can try to mount with ro,nologreplay and freshen
> backups for anything you'd rather not lose - just in case things get
> worse. And then you can zero the log and see if that'll let you
> normally mount the device (i.e. rw and not degraded). But some of it
> will depend on what's wrong.

That doesn't work. It gives the same errors as when I tried to run
check, but repeated once each for extent tree and device tree. It just
can't get past this problem.

At this point, I think it's best to just reinstall with a fresh
filesystem, and not make the same mistakes. Thanks for the help, once again.



Re: error in backport of 'btrfs: fix possible free space tree corruption with online conversion'

2021-02-19 Thread Wang Yugui
Hi,

> On 2021-02-19 04:17, Wang Yugui wrote:
> > Hi, Josef Bacik
> >
> > We noticed an error in 5.10.x backport of 'btrfs: fix possible free
> > space tree corruption with online conversion'
> >
> > It is wrong in 5.10.13, but right in 5.11.
> >
> > 5.10.13
> > @@ -146,6 +146,9 @@ enum {
> > BTRFS_FS_STATE_DEV_REPLACING,
> > /* The btrfs_fs_info created for self-tests */
> > BTRFS_FS_STATE_DUMMY_FS_INFO,
> > +
> > +   /* Indicate that we can't trust the free space tree for caching yet */
> > +   BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> >   };
> >
> > the usage sample of this enum:
> > set_bit(BTRFS_FS_STATE_DUMMY_FS_INFO, &fs_info->fs_state);
> >
> >
> > 5.11
> > enum{
> > ..
> >  /* Indicate that the discard workqueue can service discards. */
> >  BTRFS_FS_DISCARD_RUNNING,
> >
> >  /* Indicate that we need to cleanup space cache v1 */
> >  BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
> >
> >  /* Indicate that we can't trust the free space tree for caching yet */
> >  BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> > };
> >
> > the usage sample of this enum:
> > set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
> > 
> Out of curiosity I decided to check how this happened, but don't see it.
> Here is the commit that went into 5.10.13 and it looks correct to me:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57

> The patch that went into 5.10 looks identical to the original commit in 5.11.
> What tree are you looking at?

the 5.10.y is the URL that you point out.
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57

but the right one for 5.11 is
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/btrfs?id=2f96e40212d435b328459ba6b3956395eed8fa9f

5.11:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/btrfs?id=2f96e40212d435b328459ba6b3956395eed8fa9f

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0225c5208f44c..47ca8edafb5e6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -564,6 +564,9 @@ enum {
 
/* Indicate that we need to cleanup space cache v1 */
BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
+
+   /* Indicate that we can't trust the free space tree for caching yet */
+   BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
 };
 
 /*

but 5.10.y:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e01545538e07f..30ea9780725ff 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -146,6 +146,9 @@ enum {
BTRFS_FS_STATE_DEV_REPLACING,
/* The btrfs_fs_info created for self-tests */
BTRFS_FS_STATE_DUMMY_FS_INFO,
+
+   /* Indicate that we can't trust the free space tree for caching yet */
+   BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
 };
 
 #define BTRFS_BACKREF_REV_MAX  256

Both the line(Line:146 vs Line:564) and the content are wrong.

Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/02/19




Re: [PATCH] btrfs-progs: receive: fix btrfs_mount_root substring bug

2021-02-19 Thread David Sterba
On Mon, Nov 16, 2020 at 04:58:20PM -0800, Boris Burkov wrote:
> The current mount detection code in btrfs receive is not quite perfect.
> For example, suppose /tmp is mounted as a tmpfs. In that case,
> btrfs receive /tmp2 will find /tmp as the longest mount that matches a
> prefix of /tmp2 and blow up because it is not a btrfs filesystem, even
> if /tmp2 is just a directory in / mounted as btrfs.
> 
> Fix this by replacing the substring check with a dirname recursion to
> only check the directories in the path of the dir, rather than every
> substring.
> 
> Add a new test for this case.
> 
> Signed-off-by: Boris Burkov 

Added to devel, thanks.
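
A standalone sketch of the dirname-recursion idea described above (an
illustration of the approach, not the actual btrfs-progs implementation):

#include <libgen.h>
#include <stdio.h>
#include <string.h>

/* Instead of matching every mount whose path is a string prefix of the
 * target (which lets /tmp shadow /tmp2), walk only the real ancestor
 * directories: /tmp2/sub, /tmp2, / for a target of /tmp2/sub. */
static void walk_candidate_mounts(const char *path)
{
	char buf[4096];
	char *p = buf;

	snprintf(buf, sizeof(buf), "%s", path);
	while (1) {
		printf("candidate mount point: %s\n", p);
		/* ... look up p in the mount table here ... */
		if (strcmp(p, "/") == 0)
			break;
		p = dirname(p);		/* step to the parent directory */
	}
}

int main(void)
{
	walk_candidate_mounts("/tmp2/sub");
	return 0;
}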


Re: page->index limitation on 32bit system?

2021-02-19 Thread Theodore Ts'o
On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
> So it means the 32bit archs are already 2nd tier targets for at least
> upstream linux kernel?

At least as far as btrfs is concerned, anyway

> Or would it be possible to make it an option to make the index u64?
> So guys who really wants large file support can enable it while most
> other 32bit guys can just keep the existing behavior?

I think if this is going to be done at all, it would need to be a
compile-time CONFIG option to make the index be 64-bits.  That's
because there are a huge number of low-end Android devices (retail
price ~$30 USD in India, for example --- this set of customers is
sometimes called "the next billion users" by some folks) that are
using 32-bit ARM systems.  And they will be using ext4 or f2fs, and it
would be massively unfortunate/unfair/etc. to impose that performance
penalty on them.

It sounds like what Willy is saying is that supporting a 64-bit page
index on 32-bit platforms is going to have a lot of downsides, and
not just the performance / memory overhead issue.  It's also a code
maintenance concern, and that tax would land on the mm developers.
And if it's not well-maintained, without regular testing, it's likely
to be heavily subject to bitrot.  (Although I suppose if we don't mind
doubling the number of configs that kernelci has to test, this could
be mitigated.)

In contrast, changing btrfs to not depend on a single address space
for all of its metadata might be a lot of work, but it's something
which lands on the btrfs developers, as opposed to another (perhaps
more central) kernel subsystem.  Managing this tradeoff is
something that is going to be between the mm developers and the btrfs
developers, but as someone who doesn't do any work on either of these
subsystems, it seems like a pretty obvious choice.

The final observation I'll make is that if we know which NAS box
vendor can (properly) support volumes > 16 TB, we can probably find
the 64-bit page index patch.  It'll probably be against a fairly old
kernel, so it might not be all _that_ helpful, but it might give folks a
bit of a head start.

I can tell you that the NAS box vendor that it _isn't_ is Synology.
Synology boxes use btrfs, and on 32-bit processors, they have a 16TB
volume size limit, and this is enforced by the Synology NAS
software[1].  However, Synology NAS boxes can support multiple
volumes; until today, I never understood why, since it seemed to be
unnecessary complexity, but I suspect the real answer was this was how
Synology handled storage array sizes > 16TB on their older systems.
(All of their new NAS boxes use 64-bit processors.)

[1] https://www.reddit.com/r/synology/comments/a62xrx/max_volume_size_of_16tb/

Cheers,

- Ted


Re: error in backport of 'btrfs: fix possible free space tree corruption with online conversion'

2021-02-19 Thread Holger Hoffstätte

On 2021-02-19 16:20, Wang Yugui wrote:

> Hi,
> 
> > On 2021-02-19 04:17, Wang Yugui wrote:
> > > Hi, Josef Bacik
> > > 
> > > We noticed an error in 5.10.x backport of 'btrfs: fix possible free
> > > space tree corruption with online conversion'
> > > 
> > > It is wrong in 5.10.13, but right in 5.11.
> > > 
> > > 5.10.13
> > > @@ -146,6 +146,9 @@ enum {
> > > 	BTRFS_FS_STATE_DEV_REPLACING,
> > > 	/* The btrfs_fs_info created for self-tests */
> > > 	BTRFS_FS_STATE_DUMMY_FS_INFO,
> > > +
> > > +	/* Indicate that we can't trust the free space tree for caching yet */
> > > +	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> > >   };
> > > 
> > > the usage sample of this enum:
> > > set_bit(BTRFS_FS_STATE_DUMMY_FS_INFO, &fs_info->fs_state);
> > > 
> > > 
> > > 5.11
> > > enum {
> > > ..
> > > 	/* Indicate that the discard workqueue can service discards. */
> > > 	BTRFS_FS_DISCARD_RUNNING,
> > > 
> > > 	/* Indicate that we need to cleanup space cache v1 */
> > > 	BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
> > > 
> > > 	/* Indicate that we can't trust the free space tree for caching yet */
> > > 	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> > > };
> > > 
> > > the usage sample of this enum:
> > > set_bit(BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED, &fs_info->flags);
> > 
> > Out of curiosity I decided to check how this happened, but don't see it.
> > Here is the commit that went into 5.10.13 and it looks correct to me:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57
> > 
> > The patch that went into 5.10 looks identical to the original commit in 5.11.
> > What tree are you looking at?
> 
> the 5.10.y is the URL that you point out.
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57
> 
> but the right one for 5.11 is
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/btrfs?id=2f96e40212d435b328459ba6b3956395eed8fa9f
> 
> 5.11:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/btrfs?id=2f96e40212d435b328459ba6b3956395eed8fa9f
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0225c5208f44c..47ca8edafb5e6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -564,6 +564,9 @@ enum {
>  
>  	/* Indicate that we need to cleanup space cache v1 */
>  	BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
> +
> +	/* Indicate that we can't trust the free space tree for caching yet */
> +	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
>  };
>  
>  /*
> 
> but 5.10.y:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index e01545538e07f..30ea9780725ff 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -146,6 +146,9 @@ enum {
> 	BTRFS_FS_STATE_DEV_REPLACING,
> 	/* The btrfs_fs_info created for self-tests */
> 	BTRFS_FS_STATE_DUMMY_FS_INFO,
> +
> +	/* Indicate that we can't trust the free space tree for caching yet */
> +	BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
>  };
>  
>  #define BTRFS_BACKREF_REV_MAX		256
> 
> Both the line(Line:146 vs Line:564) and the content are wrong.



Ahh.. now I understand, indeed the merge of BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED
went into the wrong enum. I misunderstood your original posting to mean that it
had somehow missed a chunk or used the wrong enum value in set_bit.

Anyway, good catch! I guess Dave needs to decide how to fix this, maybe
let Greg revert & re-apply properly.

Can anybody explain why git decided to do this?

-h


Re: btrfs-progs test error on fsck/012-leaf-corruption

2021-02-19 Thread David Sterba
On Thu, Feb 18, 2021 at 02:56:14AM +, Sidong Yang wrote:
> I found an error when running the unit test code in btrfs-progs.
> The fsck/012-leaf-corruption test corrupts a leaf and checks that it's recovered,
> but the test failed, with the dmesg below:
> 
> [   47.284095] BTRFS error (device loop5): device total_bytes should be at most 27660288 but found 67108864
> [   47.284207] BTRFS error (device loop5): failed to read chunk tree: -22
> [   47.286465] BTRFS error (device loop5): open_ctree failed
> 
> I'm using kernel version 5.11 and there is no error with older kernel versions.
> I traced the kernel code and found the code that prints the error message.
> When it tried to mount btrfs, the function read_one_dev() failed.
> I found that the code added by commit 3a160a9331112 causes this problem.
> Should the unit test in btrfs-progs be changed, or should the kernel code
> be patched?

The kernel check makes sense. The unit test fails because the image is
restored from a dump and not extended to the full size automatically.

After 'extract_image' the image is

-rw-r--r-- 1 root root 27660288 Feb 19 17:47 good.img.restored
-rw-r--r-- 1 root root   186392 Jul 27  2020 good.img.xz
-rwxr-xr-x 1 root root 2788 Feb 19 17:46 test.sh

but with a manual 'truncate -s 67108864 good.img.restored' the test
succeeds.

btrfs-image enlarges the file but it's probably taking the wrong size

2281 dev_size = key.offset + btrfs_dev_extent_length(path.nodes[0], dev_ext);
2282 btrfs_release_path(&path);
2283
2284 btrfs_set_stack_device_total_bytes(dev_item, dev_size);
2285 btrfs_set_stack_device_bytes_used(dev_item, mdres->alloced_chunks);
2286 ret = fstat(out_fd, &buf);
2287 if (ret < 0) {
2288 error("failed to stat result image: %m");
2289 return -errno;
2290 }
2291 if (S_ISREG(buf.st_mode)) {
2292 /* Don't forget to enlarge the real file */
2293 ret = ftruncate64(out_fd, dev_size);
2294 if (ret < 0) {
2295 error("failed to enlarge result image: %m");
2296 return -errno;
2297 }
2298 }

here it's the 'dev_size'. In the superblock dump, the sb.total_size and
sb.dev_item.total_size are both 67108864, which is the correct value.

The size as obtained from the device item in the device tree also matches the
right value

item 6 key (1 DEV_EXTENT 61210624) itemoff 3667 itemsize 48
dev extent chunk_tree 3
chunk_objectid 256 chunk_offset 61210624 length 5898240
chunk_tree_uuid b2834867-4e78-47ee-9877-94d4e39bda43

Which is the key.offset + length = 61210624 + 5898240 = 67108864.

But the code is not called in restore_metadump because of condition
"btrfs_super_num_devices(mdrestore.original_super) != 1"


Re: error in backport of 'btrfs: fix possible free space tree corruption with online conversion'

2021-02-19 Thread David Sterba
On Fri, Feb 19, 2021 at 11:20:51PM +0800, Wang Yugui wrote:
> > Out of curiosity I decided to check how this happened, but don't see it.
> > Here is the commit that went into 5.10.13 and it looks correct to me:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57
> 
> > The patch that went into 5.10 looks identical to the original commit in 
> > 5.11.
> > What tree are you looking at?
> 
> the 5.10.y is the URL that you point out.
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57
> 
> but the right one for 5.11 is
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/btrfs?id=2f96e40212d435b328459ba6b3956395eed8fa9f
> 
> 5.11:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/btrfs?id=2f96e40212d435b328459ba6b3956395eed8fa9f
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 0225c5208f44c..47ca8edafb5e6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -564,6 +564,9 @@ enum {
>  
>   /* Indicate that we need to cleanup space cache v1 */
>   BTRFS_FS_CLEANUP_SPACE_CACHE_V1,
> +
> + /* Indicate that we can't trust the free space tree for caching yet */
> + BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
>  };
>  
>  /*
> 
> but 5.10.y:
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.10.y&id=2175bf57dc9522c58d93dcd474758434a3f05c57
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index e01545538e07f..30ea9780725ff 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -146,6 +146,9 @@ enum {
>   BTRFS_FS_STATE_DEV_REPLACING,
>   /* The btrfs_fs_info created for self-tests */
>   BTRFS_FS_STATE_DUMMY_FS_INFO,
> +
> + /* Indicate that we can't trust the free space tree for caching yet */
> + BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
>  };
>  
>  #define BTRFS_BACKREF_REV_MAX256
> 
> Both the line(Line:146 vs Line:564) and the content are wrong.

You're right, good catch.

The wrong value corresponds to BTRFS_FS_QUOTA_ENABLE in the right enum
set, so this could collide. With quotas enabled, the on-line conversion
won't be possible as the free space tree would be considered untrusted.
The other way around: no quotas enabled by the user, but with the tree
conversion going on, there are a lot of checks for the bit being set, and
now the quota tree and other structures won't be initialized. This could
be problematic.

I'll send a fixup.
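
The mechanism is easy to reproduce in isolation; a minimal standalone
sketch (enumerator values invented for illustration, not the kernel's real
bit numbers):

#include <stdio.h>

/* Two independent enums each number their members from 0, so a flag
 * appended to the wrong enum silently reuses a value that another flag
 * already has in the enum it was meant for. */
enum fs_state { DEV_REPLACING, DUMMY_FS_INFO,
		UNTRUSTED_MISPLACED };			/* lands on 2 */
enum fs_flags { FLAG_A, FLAG_B, QUOTA_ENABLE };		/* 2 as well */

int main(void)
{
	/* set_bit(UNTRUSTED_MISPLACED, &fs_info->flags) would therefore
	 * flip the same bit as QUOTA_ENABLE. */
	printf("%d == %d -> bit collision\n",
	       UNTRUSTED_MISPLACED, QUOTA_ENABLE);
	return 0;
}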


[PATCH v3] btrfs-progs: filesystem-resize: make output more readable

2021-02-19 Thread Sidong Yang
This patch makes the output of the filesystem-resize command more readable
and gives detailed information for users. It provides more information
about the filesystem, like below.

Before:
Resize '/mnt' of '1:-1G'

After:
Resize device id 1 (/dev/vdb) from 4.00GiB to 3.00GiB

Signed-off-by: Sidong Yang 
---
v2:
  - print more detailed error
  - covers all the possibilities format provides
v3:
  - use snprintf than strcpy for safety
  - add diff variable for code readability
---
 cmds/filesystem.c | 119 +-
 1 file changed, 118 insertions(+), 1 deletion(-)

diff --git a/cmds/filesystem.c b/cmds/filesystem.c
index 0d23daf4..faa06a52 100644
--- a/cmds/filesystem.c
+++ b/cmds/filesystem.c
@@ -28,6 +28,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -1074,6 +1075,116 @@ static const char * const cmd_filesystem_resize_usage[] = {
NULL
 };
 
+static int check_resize_args(const char *amount, const char *path) {
+   struct btrfs_ioctl_fs_info_args fi_args;
+   struct btrfs_ioctl_dev_info_args *di_args = NULL;
+   int ret, i, dev_idx = -1;
+   u64 devid = 0;
+   const char *res_str = NULL;
+   char *devstr = NULL, *sizestr = NULL;
+   u64 new_size = 0, old_size = 0, diff = 0;
+   int mod = 0;
+   char amount_dup[BTRFS_VOL_NAME_MAX];
+
+   ret = get_fs_info(path, &fi_args, &di_args);
+
+   if (ret) {
+   error("unable to retrieve fs info");
+   return 1;
+   }
+
+   if (!fi_args.num_devices) {
+   error("no devices found");
+   free(di_args);
+   return 1;
+   }
+
+   ret = snprintf(amount_dup, BTRFS_VOL_NAME_MAX, "%s", amount);
+   if (strlen(amount) != ret) {
+   error("newsize argument is too long");
+   free(di_args);
+   return 1;
+   }
+
+   devstr = strchr(amount_dup, ':');
+   if (devstr) {
+   sizestr = devstr + 1;
+   *devstr = '\0';
+   devstr = amount_dup;
+
+   errno = 0;
+   devid = strtoull(devstr, NULL, 10);
+
+   if (errno) {
+   error("failed to parse devid %s", devstr);
+   free(di_args);
+   return 1;
+   }
+   }
+
+   dev_idx = -1;
+   for(i = 0; i < fi_args.num_devices; i++) {
+   if (di_args[i].devid == devid) {
+   dev_idx = i;
+   break;
+   }
+   }
+
+   if (dev_idx < 0) {
+   error("cannot find devid : %lld", devid);
+   free(di_args);
+   return 1;
+   }
+
+   if (!strcmp(sizestr, "max")) {
+   res_str = "max";
+   }
+   else {
+   if (sizestr[0] == '-') {
+   mod = -1;
+   sizestr++;
+   } else if (sizestr[0] == '+') {
+   mod = 1;
+   sizestr++;
+   }
+   diff = parse_size_from_string(sizestr);
+   if (!diff) {
+   error("failed to parse size %s", sizestr);
+   free(di_args);
+   return 1;
+   }
+   old_size = di_args[dev_idx].total_bytes;
+
+   if (mod < 0) {
+   if (diff > old_size) {
+   error("current size is %s which is smaller than 
%s",
+ pretty_size_mode(old_size, UNITS_DEFAULT),
+ pretty_size_mode(diff, UNITS_DEFAULT));
+   free(di_args);
+   return 1;
+   }
+   new_size = old_size - diff;
+   } else if (mod > 0) {
+   if (diff > ULLONG_MAX - old_size) {
+   error("increasing %s is out of range",
+ pretty_size_mode(diff, UNITS_DEFAULT));
+   free(di_args);
+   return 1;
+   }
+   new_size = old_size + diff;
+   }
+   new_size = round_down(new_size, fi_args.sectorsize);
+   res_str = pretty_size_mode(new_size, UNITS_DEFAULT);
+   }
+
+   printf("Resize device id %lld (%s) from %s to %s\n", devid, 
di_args[dev_idx].path,
+   pretty_size_mode(di_args[dev_idx].total_bytes, UNITS_DEFAULT),
+   res_str);
+
+   free(di_args);
+   return 0;
+}
+
 static int cmd_filesystem_resize(const struct cmd_struct *cmd,
 int argc, char **argv)
 {
@@ -1134,7 +1245,13 @@ static int cmd_filesystem_resize(const struct cmd_struct *cmd,
return 1;
}
 
-   printf("Resize '%s' of '%s'\n", path, amount);
+   ret = check_resize_args(amount, path);

Re: error in backport of 'btrfs: fix possible free space tree corruption with online conversion'

2021-02-19 Thread David Sterba
On Fri, Feb 19, 2021 at 05:12:12PM +0100, Holger Hoffstätte wrote:
> On 2021-02-19 16:20, Wang Yugui wrote:
> > --- a/fs/btrfs/ctree.h
> > +++ b/fs/btrfs/ctree.h
> > @@ -146,6 +146,9 @@ enum {
> > BTRFS_FS_STATE_DEV_REPLACING,
> > /* The btrfs_fs_info created for self-tests */
> > BTRFS_FS_STATE_DUMMY_FS_INFO,
> > +
> > +   /* Indicate that we can't trust the free space tree for caching yet */
> > +   BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
> >   };
> >   
> >   #define BTRFS_BACKREF_REV_MAX 256
> > 
> > Both the line(Line:146 vs Line:564) and the content are wrong.
> > 
> 
> Ahh..now I understand, indeed the merge of BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED 
> went into
> the wrong enum. I misunderstood your original posting to mean that it had 
> somehow missed
> a chunk or used the wrong enum value in set_bit.
> 
> Anyway, good catch! I guess Dave needs to decide how to fix this, maybe
> let Greg revert & re-apply properly.
> 
> Can anybody explain why git decided to do this?

Git finds that the patch does not apply and lets the user fix it.

I did git cherry-pick of 2f96e40212d435b3284 on v5.10.12 and got 2
conflicts:

first was in caching_thread around

if (btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE))

that got resolved correctly, and the second one in the enum, but the
conflict was suggested in the right enum (lines 559+), so all I had to
do was to remove unmatched context and the <<< >>> markers. It's
possible that the git version could affect that; mine is 2.29.2. Or the
stable team does not use git for the intermediate patches and quilt did
not get it right.

I doubt that the conflict resolution was done incorrectly by hand; the
enums are quite far apart, so it would not be just a trivial change (like
context fixups) of the kind that is in the scope of semi-automatic stable
backports.


Re: Large multi-device BTRFS array (usually) fails to mount on boot.

2021-02-19 Thread Joshua
February 3, 2021 3:16 PM, "Graham Cobb"  wrote:

> On 03/02/2021 21:54, jos...@mailmag.net wrote:
> 
>> Good Evening.
>> 
>> I have a large BTRFS array, (14 Drives, ~100 TB RAW) which has been having 
>> problems mounting on
>> boot without timing out. This causes the system to drop to emergency mode. I 
>> am then able to mount
>> the array in emergency mode and all data appears fine, but upon reboot it 
>> fails again.
>> 
>> I actually first had this problem around a year ago, and initially put 
>> considerable effort into
>> extending the timeout in systemd, as I believed that to be the problem. 
>> However, all the methods I
>> attempted did not work properly or caused the system to continue booting 
>> before the array was
>> mounted, causing all sorts of issues. Eventually, I was able to almost 
>> completely resolve it by
>> defragmenting the extent tree and subvolume tree for each subvolume. (btrfs 
>> fi defrag
>> /mountpoint/subvolume/) This seemed to reduce the time required to mount, 
>> and made it mount on boot
>> the majority of the time.
> 
> Not what you asked, but adding "x-systemd.mount-timeout=180s" to the
> mount options in /etc/fstab works reliably for me to extend the timeout.
> Of course, my largest filesystem is only 20TB, across only two devices
> (two lvm-over-LUKS, each on separate physical drives) but it has very
> heavy use of snapshot creation and deletion. I also run with commit=15
> as power is not too reliable here and losing power is the most frequent
> cause of a reboot.

Thanks for the suggestion, but I have not been able to get this method to work 
either.

Here's what my fstab looks like, let me know if this is not what you meant!

UUID={snip} / ext4  errors=remount-ro 0 0
UUID={snip} /mnt/data btrfs defaults,noatime,compress-force=zstd:2,x-systemd.mount-timeout=300s 0 0

However, the system still fails to mount in less than 5 minutes, and drops to 
emergency mode.
Upon checking the dmesg logs, it is clear the system is only waiting 120 seconds
before giving up on mounting and dropping to emergency mode.

--Joshua


Re: page->index limitation on 32bit system?

2021-02-19 Thread Matthew Wilcox
On Fri, Feb 19, 2021 at 02:22:01PM +, Matthew Wilcox wrote:
> In the last decade, nobody's tried to fix it in mainline that I know of.
> As I said, some vendors have tried to fix it in their NAS products,
> but I don't know where to find that patch any more.

Arnd found it for me.

https://sourceforge.net/projects/dsgpl/files/Synology%20NAS%20GPL%20Source/25426branch/alpine-source/linux-3.10.x-bsp.txz/download

They've done a perfect job of making the source available while making it
utterly dreadful to extract anything useful from.

 16084 files changed, 1322769 insertions(+), 285257 deletions(-)

It's full of gratuitous whitespace changes to files that definitely
aren't used (arch/alpha?  really?) and they've stripped out a lot of
comments that they didn't need to touch.

Forward porting a patch from 10 years ago wouldn't be easy, even if
they hadn't tried very hard to obfuscate their patch.  I don't think
this will be a fruitful line of inquiry.


Re: [PATCH v3] btrfs-progs: filesystem-resize: make output more readable

2021-02-19 Thread David Sterba
On Fri, Feb 19, 2021 at 05:18:18PM +, Sidong Yang wrote:
> This patch makes the output of the filesystem-resize command more readable
> and gives detailed information for users. It provides more information
> about the filesystem, like below.
> 
> Before:
> Resize '/mnt' of '1:-1G'
> 
> After:
> Resize device id 1 (/dev/vdb) from 4.00GiB to 3.00GiB
> 
> Signed-off-by: Sidong Yang 

Code-wise it looks good, but I tried a simple test and it does not work:

# truncate -s 4g image
# mkfs.btrfs image
# mount -o loop image mnt
# btrfs fi resize -1G mnt
ERROR: cannot find devid: 0

while running the same command with the installed system 'btrfs' resizes
the fs: "Resize '.' of '-1G'".


Re: [PATCH 5/5] btrfs: add allocator_hint mode

2021-02-19 Thread Goffredo Baroncelli

On 2/1/21 10:28 PM, Goffredo Baroncelli wrote:

From: Goffredo Baroncelli 

When this mode is enabled, the chunk allocation policy is modified as follows.

Each disk may have a different tag:
- BTRFS_DEV_ALLOCATION_PREFERRED_METADATA
- BTRFS_DEV_ALLOCATION_METADATA_ONLY
- BTRFS_DEV_ALLOCATION_DATA_ONLY
- BTRFS_DEV_ALLOCATION_PREFERRED_DATA (default)

Where:
- ALLOCATION_PREFERRED_X means that it is preferred to use this disk for the
X chunk type (the other type may be allowed when the space is low)
- ALLOCATION_X_ONLY means that it is used *only* for the X chunk type. This
also means that it is a preferred choice.

Each time the allocator allocates a chunk of type X, it first takes the disks
tagged as ALLOCATION_X_ONLY or ALLOCATION_PREFERRED_X; if the space is not
enough, it also uses the disks tagged as ALLOCATION_METADATA_ONLY; if the space
is still not enough, it also uses the other disks, with the exception of the one
marked as ALLOCATION_PREFERRED_Y, where Y is the other type of chunk (i.e. not X).

Signed-off-by: Goffredo Baroncelli 
---
  fs/btrfs/volumes.c | 81 +-
  fs/btrfs/volumes.h |  1 +
  2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 68b346c5465d..57ee3e2fdac0 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4806,13 +4806,18 @@ static int btrfs_add_system_chunk(struct btrfs_fs_info *fs_info,
  }
  
 /*
- * sort the devices in descending order by max_avail, total_avail
+ * sort the devices in descending order by alloc_hint,
+ * max_avail, total_avail
  */
  static int btrfs_cmp_device_info(const void *a, const void *b)
  {
const struct btrfs_device_info *di_a = a;
const struct btrfs_device_info *di_b = b;
  
+	if (di_a->alloc_hint > di_b->alloc_hint)
+		return -1;
+	if (di_a->alloc_hint < di_b->alloc_hint)
+		return 1;
if (di_a->max_avail > di_b->max_avail)
return -1;
if (di_a->max_avail < di_b->max_avail)
@@ -4939,6 +4944,15 @@ static int gather_device_info(struct btrfs_fs_devices *fs_devices,
int ndevs = 0;
u64 max_avail;
u64 dev_offset;
+   int hint;
+
+   static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
+   [BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
+   [BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
+   [BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 1,
+   [BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 2


Finally I found the reason for the wrong allocation. The last two values
are swapped: the priority starts from BTRFS_DEV_ALLOCATION_DATA_ONLY
and ends at BTRFS_DEV_ALLOCATION_METADATA_ONLY (see the corrected sketch
after the map below).

Ok, now I have to restart the tests :-)


+   /* the other values are set to 0 */
+   };
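
For clarity, the corrected table implied by the note above, with the
priorities ascending from DATA_ONLY to METADATA_ONLY so the last two
entries trade places (a sketch, not the resent patch):

	static const char alloc_hint_map[BTRFS_DEV_ALLOCATION_MASK_COUNT] = {
		[BTRFS_DEV_ALLOCATION_DATA_ONLY] = -1,
		[BTRFS_DEV_ALLOCATION_PREFERRED_DATA] = 0,
		[BTRFS_DEV_ALLOCATION_PREFERRED_METADATA] = 1,
		[BTRFS_DEV_ALLOCATION_METADATA_ONLY] = 2,
		/* the other values are set to 0 */
	};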
  
 	/*
 	 * in the first pass through the devices list, we gather information
@@ -4991,16 +5005,81 @@ static int gather_device_info(struct btrfs_fs_devices 
*fs_devices,
devices_info[ndevs].max_avail = max_avail;
devices_info[ndevs].total_avail = total_avail;
devices_info[ndevs].dev = device;
+
+   if (((ctl->type & BTRFS_BLOCK_GROUP_DATA) &&
+(ctl->type & BTRFS_BLOCK_GROUP_METADATA)) ||
+   info->allocation_hint_mode ==
+BTRFS_ALLOCATION_HINT_DISABLED) {
+   /*
+* if mixed bg or the allocator hint is
+* disable, set all the alloc_hint
+* fields to the same value, so the sorting
+* is not affected
+*/
+   devices_info[ndevs].alloc_hint = 0;
+   } else if(ctl->type & BTRFS_BLOCK_GROUP_DATA) {
+   hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+   /*
+* skip BTRFS_DEV_METADATA_ONLY disks
+*/
+   if (hint == BTRFS_DEV_ALLOCATION_METADATA_ONLY)
+   continue;
+   /*
+* if a data chunk must be allocated,
+* sort also by hint (data disk
+* higher priority)
+*/
+   devices_info[ndevs].alloc_hint = -alloc_hint_map[hint];
+   } else { /* BTRFS_BLOCK_GROUP_METADATA */
+   hint = device->type & BTRFS_DEV_ALLOCATION_MASK;
+
+   /*
+* skip BTRFS_DEV_DATA_ONLY disks
+*/
+   if (hint == BTRFS_DEV_ALLOCATION_DATA_ONLY)
+   continue;
+   /*
+* if a data chunk must be allocated,
+* sort also by hint (metadata hint
+* higher priority)
+   

Re: [PATCH] btrfs-progs: fix return code for failed replace start

2021-02-19 Thread David Sterba
On Wed, Sep 30, 2020 at 10:46:14AM +0800, Anand Jain wrote:
> When replace-starts with no-background and fails for the reason that
> a BTRFS_FS_EXCL_OP is in progress, we still return the value 0 and also
> leak the target device open, because in cmd_replace_start() we missed
> the goto leave_with_error for this error.
> 
> So the test case btrfs/064 in its seqres.full output reports...
> 
>   Replacing /dev/sdf with /dev/sdc
>   ERROR: /dev/sdc is mounted
> 
> instead of...
> 
>   Replacing /dev/sdc with /dev/sdf
>   ERROR: ioctl(DEV_REPLACE_START) '/mnt/scratch': 
> add/delete/balance/replace/resize operation in progress
> 
> for the failed replace attempts in the test case
> 
> Fix it by adding a goto leave_with_error for this error which also fixes
> the device open leak.
> 
> Signed-off-by: Anand Jain 

Added to devel, thanks.


Re: [PATCH] btrfs-progs: doc: snapshot -r and -i can be used together

2021-02-19 Thread David Sterba
On Mon, Jun 29, 2020 at 01:15:00PM +0200, chrysn wrote:
> This aligns the man page with the usage output of the tool.
> 
> Signed-off-by: Christian Amsüss 
> ---
> 
> This confused me a bit when I first read the man page (why could a
> read-only snapshot not be assigned to a qgroup), but experimentation and
> looking at --help indicate that the mutual exclusivity of those options
> in the man page was most likely due to an editing error when the option
> was introduced.

Sorry for the delay, the fix is now in devel. Thanks.


Re: ERROR: failed to read block groups: Input/output error

2021-02-19 Thread Zygo Blaxell
On Thu, Jan 14, 2021 at 01:09:40AM +0200, Dāvis Mosāns wrote:
> Hi,
> 
> I've 6x 3TB HDD RAID1 BTRFS filesystem where HBA card failed and
> caused some corruption.
> When I try to mount it I get
> $ mount /dev/sdt /mnt
> mount: /mnt/: wrong fs type, bad option, bad superblock on /dev/sdt,
> missing codepage or helper program, or other error
> $ dmesg | tail -n 9
> [  617.158962] BTRFS info (device sdt): disk space caching is enabled
> [  617.158965] BTRFS info (device sdt): has skinny extents
> [  617.756924] BTRFS info (device sdt): bdev /dev/sdl errs: wr 0, rd
> 0, flush 0, corrupt 473, gen 0
> [  617.756929] BTRFS info (device sdt): bdev /dev/sdj errs: wr 31626,
> rd 18765, flush 178, corrupt 5841, gen 0
> [  617.756933] BTRFS info (device sdt): bdev /dev/sdg errs: wr 6867,
> rd 2640, flush 178, corrupt 1066, gen 0

You have write errors on 2 disks, read errors on 3 disks, and raid1
tolerates only 1 disk failure, so successful recovery is unlikely.

> [  631.353725] BTRFS warning (device sdt): sdt checksum verify failed
> on 21057101103104 wanted 0x753cdd5f found 0x9c0ba035 level 0
> [  631.376024] BTRFS warning (device sdt): sdt checksum verify failed
> on 21057101103104 wanted 0x753cdd5f found 0xb908effa level 0

Both copies of this metadata block are corrupted, differently.

This is consistent with some kinds of HBA failure:  every outgoing block
from the host is potentially corrupted, usually silently.  Due to the HBA
failure, there is no indication of failure available to the filesystem
until after several corrupt blocks are written to disk.  By the time
failure is detected, damage is extensive, especially for metadata where
overwrites are frequent.

This is failure mode that you need backups to recover from (or mirror
disks on separate, non-failing HBA hardware).

> [  631.376038] BTRFS error (device sdt): failed to read block groups: -5
> [  631.422811] BTRFS error (device sdt): open_ctree failed
> 
> $ uname -r
> 5.9.14-arch1-1
> $ btrfs --version
> btrfs-progs v5.9
> $ btrfs check /dev/sdt
> Opening filesystem to check...
> checksum verify failed on 21057101103104 found 00B9 wanted 0075
> checksum verify failed on 21057101103104 found 009C wanted 0075
> checksum verify failed on 21057101103104 found 00B9 wanted 0075
> Csum didn't match
> ERROR: failed to read block groups: Input/output error
> ERROR: cannot open file system
> 
> $ btrfs filesystem show
> Label: 'RAID'  uuid: 8aef11a9-beb6-49ea-9b2d-7876611a39e5
> Total devices 6 FS bytes used 4.69TiB
> devid1 size 2.73TiB used 1.71TiB path /dev/sdt
> devid2 size 2.73TiB used 1.70TiB path /dev/sdl
> devid3 size 2.73TiB used 1.71TiB path /dev/sdj
> devid4 size 2.73TiB used 1.70TiB path /dev/sds
> devid5 size 2.73TiB used 1.69TiB path /dev/sdg
> devid6 size 2.73TiB used 1.69TiB path /dev/sdc
> 
> 
> My guess is that some drives dropped out while the kernel was still
> writing to the rest, thus causing inconsistency.
> There should be some way to find out which drives have the most
> up-to-date info and assume those are correct.

Neither available copy is correct, so the kernel's self-healing mechanism
doesn't work.  Thousands of pages are damaged, possibly only with minor
errors, but multiply a minor error by a thousand and it's no longer minor.

At this point it is a forensic recovery exercise.

> I tried to mount with
> $ mount -o ro,degraded,rescue=usebackuproot /dev/sdt /mnt
> but that didn't make any difference
> 
> So any idea how to fix this filesystem?

Before you can mount the filesystem read-write again, you would need to
rebuild the extent tree from the surviving pages of the subvol trees.
All other metadata pages on the filesystem must be scanned, any excess
reference items must be deleted, and any missing reference items must
be inserted.  Once the metadata references are correct, btrfs can
rebuild the free space maps, and then you can scrub and delete/replace
any damaged data files.

'btrfs check --repair' might work if only a handful of blocks are
corrupted (it takes a few short cuts and can repair minor damage)
but according to your dev stats you have thousands of corrupted blocks,
so the filesystem is probably beyond the capabilities of this tool.

'btrfs check --repair --init-extent-tree' is a brute-force operation that
will more or less rebuild the entire filesystem by scraping metadata
leaf pages off the disks.  This is your only hope here, and it's not a
good one.

Both methods are likely to fail in the presence of so much corruption
and they may take so long to run that mkfs + restore from backups could
be significantly faster.  Definitely extract any data from the filesystem
that you want to keep _before_ attempting any of these operations.

It might be possible to recover by manually inspecting the corrupted
metadata blocks and making guesses and adjustments, but that could take
even longer than check --repair if there are thousands of damaged pages.

> Thanks!
> 
> Best regards,
> Dāvis

Re: Btrfs progs release 5.10.1

2021-02-19 Thread David Sterba
On Tue, Feb 16, 2021 at 11:00:18AM +, Filipe Manana wrote:
> On Fri, Feb 5, 2021 at 11:33 AM David Sterba  wrote:
> >
> > Hi,
> >
> > btrfs-progs version 5.10.1 have been released.
> >
> > The static build got broken due to libmount added in 5.10, this works now. 
> > The
> > minimum libmount version is 2.24 that is not available on some LTS distros 
> > like
> > CentOS 7. The plan is to bring the support back, reimplementing some 
> > libmount
> > functionality and dropping the dependency again.
> >
> > Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
> > Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git
> >
> > Shortlog:
> >
> > David Sterba (6):
> >   btrfs-progs: build: fix linking with static libmount
> 
> Btw, this causes two fstests to fail:
> 
> $ ./check btrfs/100 btrfs/101
> FSTYP -- btrfs
> PLATFORM  -- Linux/x86_64 debian8 5.11.0-rc6-btrfs-next-80 #1 SMP
> PREEMPT Wed Feb 3 11:28:05 WET 2021
> MKFS_OPTIONS  -- /dev/sdc
> MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
> 
> btrfs/100 6s ... [failed, exit status 1]- output mismatch (see
> /home/fdmanana/git/hub/xfstests/results//btrfs/100.out.bad)
> --- tests/btrfs/100.out 2020-06-10 19:29:03.818519162 +0100
> +++ /home/fdmanana/git/hub/xfstests/results//btrfs/100.out.bad
> 2021-02-16 10:55:53.145343890 +
> @@ -2,10 +2,7 @@
>  Label: none  uuid: 
>   Total devices  FS bytes used 
>   devid  size  used  path SCRATCH_DEV
> - devid  size  used  path /dev/mapper/error-test
> + devid  size  used  path dm-0
> 
> -Label: none  uuid: 
> ...
> (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/100.out
> /home/fdmanana/git/hub/xfstests/results//btrfs/100.out.bad'  to see
> the entire diff)
> btrfs/101 8s ... [failed, exit status 1]- output mismatch (see
> /home/fdmanana/git/hub/xfstests/results//btrfs/101.out.bad)
> --- tests/btrfs/101.out 2020-06-10 19:29:03.818519162 +0100
> +++ /home/fdmanana/git/hub/xfstests/results//btrfs/101.out.bad
> 2021-02-16 10:55:58.105503554 +
> @@ -2,10 +2,7 @@
>  Label: none  uuid: 
>   Total devices  FS bytes used 
>   devid  size  used  path SCRATCH_DEV
> - devid  size  used  path /dev/mapper/error-test
> + devid  size  used  path dm-0
> 
> -Label: none  uuid: 
> ...
> (Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/101.out
> /home/fdmanana/git/hub/xfstests/results//btrfs/101.out.bad'  to see
> the entire diff)
> Ran: btrfs/100 btrfs/101
> Failures: btrfs/100 btrfs/101
> Failed 2 of 2 tests
> 
> 
> Is there any plan to fix this?

Yes, it's fixed in devel, the path canonicalization got accidentally
broken by my libmount workarounds.


Re: [RFC] btrfs-progs: format-output: remove newline in fmt_end text mode

2021-02-19 Thread David Sterba
On Tue, Feb 16, 2021 at 04:28:40PM +, Sidong Yang wrote:
> Remove a code that inserting new line in fmt_end() for text mode.
> Old code made a failure in fstest btrfs/006.
> 
> Signed-off-by: Sidong Yang 
> ---
> Hi, I've just read the mail that Filipe wrote about some fstest failures.
> I'm worried this patch may cause other problems, so I made it an RFC. Thanks.

I found the discussion under the device stats patch adding json, the
added line was known and "hopefully not causing problems", but the
fstests seem to notice.

I think we can fix that by removing the fmt_end newline, but we also need
to update how the fmt_print is done for the text output. I.e. for json
there are some strict rules for line continuations (",") but for the
textual output, ending each line with "\n" right away, without delaying
that to the next fmt_* call, should work.
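
A minimal sketch of the two line-termination strategies (function names
invented here; this is not btrfs-progs's actual fmt_* implementation):

#include <stdio.h>

static int json_first = 1;

/* JSON must defer the separator: a "," is only valid if another
 * key/value pair follows, so the newline is emitted on the next call. */
static void fmt_print_json(const char *key, const char *val)
{
	if (!json_first)
		printf(",\n");
	json_first = 0;
	printf("  \"%s\": \"%s\"", key, val);
}

/* Plain text has no continuation rules, so each line can be terminated
 * with "\n" right away, as suggested above. */
static void fmt_print_text(const char *key, const char *val)
{
	printf("%s: %s\n", key, val);
}

int main(void)
{
	printf("{\n");
	fmt_print_json("devid", "1");
	fmt_print_json("path", "/dev/vdb");
	printf("\n}\n");

	fmt_print_text("devid", "1");
	fmt_print_text("path", "/dev/vdb");
	return 0;
}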


Re: Large multi-device BTRFS array (usually) fails to mount on boot.

2021-02-19 Thread Graham Cobb


On 19/02/2021 17:42, Joshua wrote:
> February 3, 2021 3:16 PM, "Graham Cobb"  wrote:
> 
>> On 03/02/2021 21:54, jos...@mailmag.net wrote:
>>
>>> Good Evening.
>>>
>>> I have a large BTRFS array, (14 Drives, ~100 TB RAW) which has been having 
>>> problems mounting on
>>> boot without timing out. This causes the system to drop to emergency mode. 
>>> I am then able to mount
>>> the array in emergency mode and all data appears fine, but upon reboot it 
>>> fails again.
>>>
>>> I actually first had this problem around a year ago, and initially put 
>>> considerable effort into
>>> extending the timeout in systemd, as I believed that to be the problem. 
>>> However, all the methods I
>>> attempted did not work properly or caused the system to continue booting 
>>> before the array was
>>> mounted, causing all sorts of issues. Eventually, I was able to almost 
>>> completely resolve it by
>>> defragmenting the extent tree and subvolume tree for each subvolume. (btrfs 
>>> fi defrag
>>> /mountpoint/subvolume/) This seemed to reduce the time required to mount, 
>>> and made it mount on boot
>>> the majority of the time.
>>
>> Not what you asked, but adding "x-systemd.mount-timeout=180s" to the
>> mount options in /etc/fstab works reliably for me to extend the timeout.
>> Of course, my largest filesystem is only 20TB, across only two devices
>> (two lvm-over-LUKS, each on separate physical drives) but it has very
>> heavy use of snapshot creation and deletion. I also run with commit=15
>> as power is not too reliable here and losing power is the most frequent
>> cause of a reboot.
> 
> Thanks for the suggestion, but I have not been able to get this method to 
> work either.
> 
> Here's what my fstab looks like, let me know if this is not what you meant!
> 
> UUID={snip} / ext4  errors=remount-ro 0 0
> UUID={snip} /mnt/data btrfs 
> defaults,noatime,compress-force=zstd:2,x-systemd.mount-timeout=300s 0 0

Hmmm. The line from my fstab is:

LABEL=lvmdata   /mnt/data   btrfs
defaults,subvolid=0,noatime,nodiratime,compress=lzo,skip_balance,commit=15,space_cache=v2,x-systemd.mount-timeout=180s,nofail
  0   3

I note that I do have "nofail" in there, although it doesn't fail for me
so I assume it shouldn't make a difference.

I can't swear that the disk is currently taking longer to mount than the
systemd default (and I will not be in a position to reboot this system
any time soon to check). But I am quite sure this made a difference when
I added it.

Not sure why it isn't working for you, unless it is some systemd
problem. It isn't systemd giving up and dropping to emergency because of
some other startup problem that occurs before the mount is finished, is
it? I could believe systemd cancels any mounts in progress when that
happens.

Graham


Re: page->index limitation on 32bit system?

2021-02-19 Thread Qu Wenruo




On 2021/2/20 12:12 AM, Theodore Ts'o wrote:

On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:

So it means the 32bit archs are already 2nd tier targets for at least
upstream linux kernel?


At least as far as btrfs is concerned, anyway


I'm afraid that would be the case.

But I'm still interested in how other fses handle such a problem.

Don't they rely on page::index to handle their metadata?
Or do all other fses just not support allocating/deleting their AG/BG
dynamically, so they can reject the fs at mount time?

Or do they limit their metadata page::index to just inside each AG/BG?

Anyway, I'm afraid we have to reject the fs at both mount time and
runtime for now.




Or would it be possible to make it an option to make the index u64?
So guys who really want large file support can enable it while most
other 32bit guys can just keep the existing behavior?


I think if this is going to be done at all, it would need to be a
compile-time CONFIG option to make the index be 64-bits.  That's
because there are a huge number of low-end Android devices (retail
price ~$30 USD in India, for example --- this set of customers is
sometimes called "the next billion users" by some folks) that are
using 32-bit ARM systems.  And they will be using ext4 or f2fs, and it
would be massively unfortunate/unfair/etc. to impose that performance
penalty on them.

It sounds like what Willy is saying is that supporting a 64-bit page
index on 32-bit platforms is going to be have a lot of downsides, and
not just the performance / memory overhead issue.  It's also a code
maintenance concern, and that tax would land on the mm developers.
And if it's not well-maintained, without regular testing, it's likely
to be heavily subject to bitrot.  (Although I suppose if we don't mind
doubling the number of configs that kernelci has to test, this could
be mitigated.)

In contrast, changing btrfs to not depend on a single address space
for all of its metadata might be a lot of work, but it's something
which lands on the btrfs developers, as opposed to another (perhaps
more central) kernel subsystem.  Managing this tradeoff is
something that is going to be between the mm developers and the btrfs
developers, but as someone who doesn't do any work on either of these
subsystems, it seems like a pretty obvious choice.


Yeah, I totally understand that.

And it doesn't look worthwhile (or even possible) to create several
metadata inodes (address spaces, to be more specific) just to support
32bit systems.

The lack of test coverage would still be the same problem.

I don't see any active btrfs developer using a 32bit system to test,
even for ARM systems.

Even rejecting the fs is in fact much more complex than it sounds, and
may not get enough testing after the initial submission.


The final observation I'll make is that if we know which NAS box
vendor can (properly) support volumes > 16 TB, we can probably find
the 64-bit page index patch.  It'll probably be against a fairly old
kernel, so it might not be all _that_ helpful, but it might give folks a
bit of a head start.

I can tell you that the NAS box vendor that it _isn't_ is Synology.
Synology boxes use btrfs, and on 32-bit processors, they have a 16TB
volume size limit, and this is enforced by the Synology NAS
software[1].  However, Synology NAS boxes can support multiple
volumes; until today, I never understood why, since it seemed to be
unnecessary complexity, but I suspect the real answer was this was how
Synology handled storage array sizes > 16TB on their older systems.
(All of their new NAS boxes use 64-bit processors.)


BTW, even for Synology, 32bit systems can easily go beyond 16T in the
logical address space while the underlying fs is only 1T or even smaller.

They only need to run routine balance, and eventually they will go beyond
that 16T limit.
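
(Roughly: each full balance re-allocates every chunk above the current
top of the logical address space, so a box with ~1T of chunks would cross
the 16T line after on the order of

	16T / 1T = 16 full balances

even though the fs itself never grows.)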

Thanks,
Qu



[1] https://www.reddit.com/r/synology/comments/a62xrx/max_volume_size_of_16tb/

Cheers,

- Ted



Re: page->index limitation on 32bit system?

2021-02-19 Thread Qu Wenruo




On 2021/2/20 1:51 AM, Matthew Wilcox wrote:

On Fri, Feb 19, 2021 at 02:22:01PM +, Matthew Wilcox wrote:

In the last decade, nobody's tried to fix it in mainline that I know of.
As I said, some vendors have tried to fix it in their NAS products,
but I don't know where to find that patch any more.


Arnd found it for me.

https://sourceforge.net/projects/dsgpl/files/Synology%20NAS%20GPL%20Source/25426branch/alpine-source/linux-3.10.x-bsp.txz/download

They've done a perfect job of making the source available while making it
utterly dreadful to extract anything useful from.

  16084 files changed, 1322769 insertions(+), 285257 deletions(-)


Wow, I thought Red Hat was the only open-source vendor that sends out a
super big patch and makes everyone else's life miserable.
And I'm definitely wrong now.



It's full of gratuitous whitespace changes to files that definitely
aren't used (arch/alpha?  really?) and they've stripped out a lot of
comments that they didn't need to touch.

Forward porting a patch from 10 years ago wouldn't be easy, even if
they hadn't tried very hard to obfuscate their patch.  I don't think
this will be a fruitful line of inquiry.


Yeah, I believe it's not worth it now.

I'll make btrfs try its best to reject the fs instead.

Thanks,
Qu


Re: Large multi-device BTRFS array (usually) fails to mount on boot.

2021-02-19 Thread Joshua
February 19, 2021 2:45 PM, "Graham Cobb"  wrote:

> On 19/02/2021 17:42, Joshua wrote:
> 
>> February 3, 2021 3:16 PM, "Graham Cobb"  wrote:
>> 
>>> On 03/02/2021 21:54, jos...@mailmag.net wrote:
>> 
>> Good Evening.
>> 
>> I have a large BTRFS array, (14 Drives, ~100 TB RAW) which has been having 
>> problems mounting on
>> boot without timing out. This causes the system to drop to emergency mode. I 
>> am then able to mount
>> the array in emergency mode and all data appears fine, but upon reboot it 
>> fails again.
>> 
>> I actually first had this problem around a year ago, and initially put 
>> considerable effort into
>> extending the timeout in systemd, as I believed that to be the problem. 
>> However, all the methods I
>> attempted did not work properly or caused the system to continue booting 
>> before the array was
>> mounted, causing all sorts of issues. Eventually, I was able to almost 
>> completely resolve it by
>> defragmenting the extent tree and subvolume tree for each subvolume. (btrfs 
>> fi defrag
>> /mountpoint/subvolume/) This seemed to reduce the time required to mount, 
>> and made it mount on boot
>> the majority of the time.
>>> Not what you asked, but adding "x-systemd.mount-timeout=180s" to the
>>> mount options in /etc/fstab works reliably for me to extend the timeout.
>>> Of course, my largest filesystem is only 20TB, across only two devices
>>> (two lvm-over-LUKS, each on separate physical drives) but it has very
>>> heavy use of snapshot creation and deletion. I also run with commit=15
>>> as power is not too reliable here and losing power is the most frequent
>>> cause of a reboot.
>> 
>> Thanks for the suggestion, but I have not been able to get this method to 
>> work either.
>> 
>> Here's what my fstab looks like, let me know if this is not what you meant!
>> 
>> UUID={snip} / ext4 errors=remount-ro 0 0
>> UUID={snip} /mnt/data btrfs 
>> defaults,noatime,compress-force=zstd:2,x-systemd.mount-timeout=300s 0 0
> 
> Hmmm. The line from my fstab is:
> 
> LABEL=lvmdata /mnt/data btrfs
> defaults,subvolid=0,noatime,nodiratime,compress=lzo,skip_balance,commit=15,space_cache=v2,x-systemd.mount-timeout=180s,nofail
> 0 3

Not very important, but note that noatime implies nodiratime.  
https://lwn.net/Articles/245002/

> I note that I do have "nofail" in there, although it doesn't fail for me
> so I assume it shouldn't make a difference.

Ahh, I bet you're right, at least indirectly.

It appears nofail makes the system continue booting even if the mount was 
unsuccessful, which I'd rather avoid since some services do depend on this 
volume.  For example, some docker containers could misbehave if the path to the 
data they expect doesn't exist.

Not exactly the outcome I'd prefer (since services that depend on the 
mount existing would still be allowed to start), but it may work.


I'm really very unsure how nofail interacts with x-systemd.mount-timeout.  I 
would think it would increase the timeout period.  But that's not what I'm 
seeing.  Perhaps there's some other kind of internal systemd timeout, and it 
gives up and continues to boot after that runs out, but allows mount to 
continue for the time specified?  Seems kinda weird.

I'll give it a try and see what happens.  I'll try and remember to report back 
here if so.
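
For the record, the other thing I plan to try is pinning the timeouts
with a unit drop-in instead of (or in addition to) the fstab option.
Something like this, assuming systemd names the unit mnt-data.mount for
/mnt/data (untested on my end):

# /etc/systemd/system/mnt-data.mount.d/timeout.conf
[Unit]
JobTimeoutSec=300

[Mount]
TimeoutSec=300

TimeoutSec should be what x-systemd.mount-timeout sets anyway, but
JobTimeoutSec bounds the queued job separately, which might be the
"other kind of internal systemd timeout" I was guessing at above.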


> I can't swear that the disk is currently taking longer to mount than the
> systemd default (and I will not be in a position to reboot this system
> any time soon to check). But I am quite sure this made a difference when
> I added it.
> 
> Not sure why it isn't working for you, unless it is some systemd
> problem. It isn't systemd giving up and dropping to emergency because of
> some other startup problem that occurs before the mount is finished, is
> it? I could believe systemd cancels any mounts in progress when that
> happens.
> 
> Graham


Re: page->index limitation on 32bit system?

2021-02-19 Thread Matthew Wilcox
On Sat, Feb 20, 2021 at 07:10:14AM +0800, Qu Wenruo wrote:
> 
> 
> > On 2021/2/20 12:12 AM, Theodore Ts'o wrote:
> > On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:
> > > So it means the 32bit archs are already 2nd tier targets for at least
> > > upstream linux kernel?
> > 
> > At least as far as btrfs is concerned, anyway
> 
> I'm afraid that would be the case.

btrfs already treats 32-bit arches as second class citizens.
I found a1fbc6750e212c5675a4e48d7f51d44607eb8756 by code inspection,
so clearly it hasn't been tested in five years.  I wouldn't recommend
that anybody use btrfs with a 32-bit kernel.



5.11 free space tree remount warning

2021-02-19 Thread Chris Murphy
Hi,

systemd does remount ro at reboot/shutdown time, and if free space
tree exists, this is always logged:

[   27.476941] systemd-shutdown[1]: Unmounting file systems.
[   27.479756] [1601]: Remounting '/' read-only in with options
'seclabel,compress=zstd:1,space_cache=v2,subvolid=258,subvol=/root'.
[   27.489196] BTRFS info (device vda3): using free space tree
[   27.492009] BTRFS warning (device vda3): remount supports changing
free space tree only from ro to rw

Is there a way to better detect that this isn't an attempt to change
to v2? If there's no v1 present, it's not a change.
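
Something like this is the check I have in mind, assuming the existing
option/feature helpers work roughly the way I remember (untested, names
may be off):

/* hypothetical: does this remount actually ask for a cache change? */
static bool space_cache_change_requested(struct btrfs_fs_info *fs_info)
{
	bool v2_on_disk = btrfs_fs_compat_ro(fs_info, FREE_SPACE_TREE);
	bool v2_wanted = btrfs_test_opt(fs_info, FREE_SPACE_TREE);

	/* v2 present and requested again: nothing changes, skip the warning */
	return v2_wanted != v2_on_disk;
}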

-- 
Chris Murphy


[PATCH] btrfs: do more graceful error/warning for 32bit kernel

2021-02-19 Thread Qu Wenruo
Due to the page cache limit of 32bit systems, btrfs can't access metadata
at or beyond the 16T boundary correctly.
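
(The boundary comes directly from the page index width: a 32bit pgoff_t
can address

	2^32 pages * 4K/page = 2^44 bytes = 16T

which matches the MAX_LFS_FILESIZE check used below.)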

And unlike other fses, btrfs uses an internally mapped u64 address space
for all of its metadata, which makes this more tricky than for other fses.

Users can have a fs which doesn't have metadata beyond the 16T boundary
at mount time, but a later balance can cause btrfs to create metadata
beyond the 16T boundary.

And modifying the MM layer is unrealistic just for such a minor use
case.

To address such problem, this patch will introduce the following checks:

- Mount time rejection
  This will reject any fs which has a metadata chunk at or beyond the 16T
  boundary.

- Mount time early warning
  If there is any metadata chunk beyond the 10T boundary, we do an early
  warning and hope the end user will see it.

- Runtime extent buffer rejection
  If we're going to allocate an extent buffer at or beyond the 16T boundary,
  reject such a request with -EOVERFLOW.

- Runtime extent buffer early warning
  If an extent buffer beyond 10T is allocated, do an early
  warning.

The above error/warning messages will only be output once for each fs to
reduce dmesg flooding.

Reported-by: Erik Jensen 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h | 12 ++
 fs/btrfs/extent_io.c | 12 ++
 fs/btrfs/super.c | 24 
 fs/btrfs/volumes.c   | 54 ++--
 4 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 40ec3393d2a1..91536c3bd5d8 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -572,6 +572,12 @@ enum {
 
/* Indicate that we can't trust the free space tree for caching yet */
BTRFS_FS_FREE_SPACE_TREE_UNTRUSTED,
+
+#if BITS_PER_LONG == 32
+   /* Indicate if we have error/warn message outputted for 32bit system */
+   BTRFS_FS_32BIT_ERROR,
+   BTRFS_FS_32BIT_WARN,
+#endif
 };
 
 /*
@@ -3405,6 +3411,12 @@ static inline void assertfail(const char *expr, const char* file, int line) { }
 #define ASSERT(expr)   (void)(expr)
 #endif
 
+#if BITS_PER_LONG == 32
+#define BTRFS_32BIT_EARLY_WARN_THRESHOLD   (10ULL * 1024 * SZ_1G)
+void btrfs_warn_32bit_limit(struct btrfs_fs_info *fs_info);
+void btrfs_err_32bit_limit(struct btrfs_fs_info *fs_info);
+#endif
+
 /*
  * Get the correct offset inside the page of extent buffer.
  *
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 4dfb3ead1175..6af6714d49c1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5554,6 +5554,18 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
return ERR_PTR(-EINVAL);
}
 
+#if BITS_PER_LONG == 32
+   if (start >= MAX_LFS_FILESIZE) {
+   btrfs_err(fs_info,
+   "extent buffer %llu is beyond 32bit page cache limit",
+ start);
+   btrfs_err_32bit_limit(fs_info);
+   return ERR_PTR(-EOVERFLOW);
+   }
+   if (start >= BTRFS_32BIT_EARLY_WARN_THRESHOLD)
+   btrfs_warn_32bit_limit(fs_info);
+#endif
+
if (fs_info->sectorsize < PAGE_SIZE &&
offset_in_page(start) + len > PAGE_SIZE) {
btrfs_err(fs_info,
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f8435641b912..bd959fc664b5 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -252,6 +252,30 @@ void __cold btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...)
 }
 #endif
 
+#if BITS_PER_LONG == 32
+void __cold btrfs_warn_32bit_limit(struct btrfs_fs_info *fs_info)
+{
+   if (!test_and_set_bit(BTRFS_FS_32BIT_WARN, &fs_info->flags)) {
+   btrfs_warn(fs_info, "btrfs is reaching 32bit kernel limit.");
+   btrfs_warn(fs_info,
+"due to 32bit page cache limit, btrfs can't access metadata at or beyond 
16T.");
+   btrfs_warn(fs_info,
+  "please consider upgrade to 64bit kernel/hardware.");
+   }
+}
+
+void __cold btrfs_err_32bit_limit(struct btrfs_fs_info *fs_info)
+{
+   if (!test_and_set_bit(BTRFS_FS_32BIT_ERROR, &fs_info->flags)) {
+   btrfs_err(fs_info, "btrfs reached 32bit kernel limit.");
+   btrfs_err(fs_info,
+"due to 32bit page cache limit, btrfs can't access metadata at or beyond 
16T.");
+   btrfs_err(fs_info,
+  "please consider upgrade to 64bit kernel/hardware.");
+   }
+}
+#endif
+
 /*
  * We only mark the transaction aborted and then set the file system read-only.
  * This will prevent new transactions from starting or trying to join this
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index b8fab44394f5..5dc22daa684d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -6787,6 +6787,46 @@ static u64 calc_stripe_length(u64 type, u64 chunk_len, int num_stripes)
return div_u64(chunk_len, data_stripes);
 }
 
+#if BITS_PER_LONG == 32
+/*
+ * Due to page cache limit, btrfs can't access metadata at or 

Re: page->index limitation on 32bit system?

2021-02-19 Thread Erik Jensen

On 2/19/21 8:12 AM, Theodore Ts'o wrote:

On Fri, Feb 19, 2021 at 08:37:30AM +0800, Qu Wenruo wrote:

So it means the 32bit archs are already 2nd tier targets for at least
upstream linux kernel?


At least as far as btrfs is concerned, anyway


Or would it be possible to make it an option to make the index u64?
So guys who really want large file support can enable it while most
other 32bit guys can just keep the existing behavior?


I think if this is going to be done at all, it would need to be a
compile-time CONFIG option to make the index be 64-bits.  That's
because there are a huge number of low-end Android devices (retail
price ~$30 USD in India, for example --- this set of customers is
sometimes called "the next billion users" by some folks) that are
using 32-bit ARM systems.  And they will be using ext4 or f2fs, and it
would be massively unfortunate/unfair/etc. to impose that performance
penalty on them.


A CONFIG option would certainly work for my use case. I was also 
wondering (and I ask this as an end user with admittedly no knowledge 
whatsoever about how the page cache works) whether it might be possible 
to treat the top bit as a kind of "extended address" bit, with some kind 
of additional side table that handles indexes of more than 31 bits. That 
way, filesystems that are 8TB or less wouldn't lose any performance, 
while still supporting those larger than 16TB.


I assume the 4KiB entry size in the page cache is fundamental, and can't 
be, e.g., increased to 16KiB to allow addressing up to 64TiB of storage?



It sounds like what Willy is saying is that supporting a 64-bit page
index on 32-bit platforms is going to be have a lot of downsides, and
not just the performance / memory overhead issue.  It's also a code
maintenance concern, and that tax would land on the mm developers.
And if it's not well-maintained, without regular testing, it's likely
to be heavily subject to bitrot.  (Although I suppose if we don't mind
doubling the number of configs that kernelci has to test, this could
be mitigated.)

In contrast, changing btrfs to not depend on a single address space
for all of its metadata might be a lot of work, but it's something
which lands on the btrfs developers, as opposed to another (perhaps
more central) kernel subsystem.  Managing this tradeoff is
something that is going to be between the mm developers and the btrfs
developers, but as someone who doesn't do any work on either of these
subsystems, it seems like a pretty obvious choice.

The final observation I'll make is that if we know which NAS box
vendor can (properly) support volumes > 16 TB, we can probably find
the 64-bit page index patch.  It'll probably be against a fairly old
kernel, so it might not be all _that_ helpful, but it might give folks a
bit of a head start.

I can tell you that the NAS box vendor that it _isn't_ is Synology.
Synology boxes use btrfs, and on 32-bit processors, they have a 16TB
volume size limit, and this is enforced by the Synology NAS
software[1].  However, Synology NAS boxes can support multiple
volumes; until today, I never understood why, since it seemed to be
unnecessary complexity, but I suspect the real answer was this was how
Synology handled storage array sizes > 16TB on their older systems.
(All of their new NAS boxes use 64-bit processors.)

[1] https://www.reddit.com/r/synology/comments/a62xrx/max_volume_size_of_16tb/

Cheers,

- Ted



Re: "bad tree block start" when trying to mount on ARM

2021-02-19 Thread Erik Jensen
On Thu, Feb 18, 2021 at 12:59 AM Qu Wenruo  wrote:
> Just send a mail to the fs-devel mail list, titled "page->index
> limitation on 32bit system?".
>
> I guess your experience as a real world user would definitely bring more
> weight to the discussion.
>
> Thanks,
> Qu

Given that it sounds like the issue is the metadata address space, and
given that I surely don't actually have 16TiB of metadata on a 24TiB
file system (indeed, Metadata, RAID1: total=30.00GiB, used=28.91GiB),
is there any way I could compact the metadata offsets into the lower
16TiB of the virtual metadata inode? Perhaps that could be something
balance could be taught to do? (Obviously, the initial run of such a
balance would have to be performed using a 64-bit system.)

Perhaps, on 32-bit, btrfs itself or some monitoring tool could even
kick off such a metadata balance automatically when the offset hits
10TiB to hopefully avoid ever reaching 16TiB?


Re: "bad tree block start" when trying to mount on ARM

2021-02-19 Thread Qu Wenruo




On 2021/2/20 10:47 AM, Erik Jensen wrote:

On Thu, Feb 18, 2021 at 12:59 AM Qu Wenruo  wrote:

Just send a mail to the fs-devel mail list, titled "page->index
limitation on 32bit system?".

I guess your experience as a real world user would definitely bring more
weight to the discussion.

Thanks,
Qu


Given that it sounds like the issue is the metadata address space, and
given that I surely don't actually have 16TiB of metadata on a 24TiB
file system (indeed, Metadata, RAID1: total=30.00GiB, used=28.91GiB),
is there any way I could compact the metadata offsets into the lower
16TiB of the virtual metadata inode? Perhaps that could be something
balance could be taught to do? (Obviously, the initial run of such a
balance would have to be performed using a 64-bit system.)


Unfortunately, no.

Btrfs relies on increasing bytenr in the logical address space for
things like balance, thus we can't relocate chunks to smaller bytenr.



Perhaps, on 32-bit, btrfs itself or some monitoring tool could even
kick off such a metadata balance automatically when the offset hits
10TiB to hopefully avoid ever reaching 16TiB?


That would be worse, as each balanced block group can only go to a higher
bytenr, never a lower one, so it would actually make the problem arrive sooner.

Thanks,
Qu


Re: page->index limitation on 32bit system?

2021-02-19 Thread Matthew Wilcox
On Fri, Feb 19, 2021 at 06:20:43PM -0800, Erik Jensen wrote:
> I assume the 4KiB entry size in the page cache is fundamental, and can't be,
> e.g., increased to 16KiB to allow addressing up to 64TiB of storage?

The bootlin link I sent in the other email does exactly that.


Re: "bad tree block start" when trying to mount on ARM

2021-02-19 Thread Erik Jensen
On Fri, Feb 19, 2021 at 7:16 PM Qu Wenruo  wrote:
> On 2021/2/20 10:47 AM, Erik Jensen wrote:
> > Given that it sounds like the issue is the metadata address space, and
> > given that I surely don't actually have 16TiB of metadata on a 24TiB
> > file system (indeed, Metadata, RAID1: total=30.00GiB, used=28.91GiB),
> > is there any way I could compact the metadata offsets into the lower
> > 16TiB of the virtual metadata inode? Perhaps that could be something
> > balance could be taught to do? (Obviously, the initial run of such a
> > balance would have to be performed using a 64-bit system.)
>
> Unfortunately, no.
>
> Btrfs relies on increasing bytenr in the logical address space for
> things like balance, thus we can't relocate chunks to smaller bytenr.

That's… unfortunate. How much relies on the assumption that bytenr is monotonic?

Brainstorming some ideas, is compacting the address space something
that could be done offline? E.g., maybe some two-pass process: first
something balance-like that bumps all of the metadata up to a compact
region of address space, starting at a new 16TiB boundary, and then a
follow up pass that just strips the top bits off?

Or maybe once all of the bytenrs are brought within 16TiB of each
other by balance, btrfs could just keep track of an offset that needs
to be applied when mapping page cache indexes?
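
(Something like this is what I'm picturing; meta_index_base is made up
purely to illustrate the idea:)

/* hypothetical: rebase the logical bytenr before it becomes a page
 * cache index, so indexes stay below 2^32 as long as all metadata sits
 * within 16TiB of the balance-maintained base */
static pgoff_t eb_page_index(struct btrfs_fs_info *fs_info, u64 start)
{
	return (start - fs_info->meta_index_base) >> PAGE_SHIFT;
}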

Or maybe btrfs could use multiple virtual inodes on 32-bit systems,
one for each 16TiB block of address space with metadata in it? If this
were to ever grow to need more than a handful of virtual inodes, it
seems like a balance *would* actually help in this case by compacting
the metadata higher in the address space, allowing the virtual inodes
for lower in the address space to be dropped.
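
(Again just a sketch, with an invented meta_inodes[] array, to show what
I mean by one virtual inode per 16TiB slice:)

/* hypothetical: pick a per-slice virtual inode so each mapping's page
 * indexes fit in 32 bits; 16TiB = 2^44 bytes */
static struct address_space *meta_mapping(struct btrfs_fs_info *fs_info,
					  u64 start)
{
	unsigned int slice = start >> 44;

	return fs_info->meta_inodes[slice]->i_mapping;
}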

Or maybe btrfs could just not use the page cache for the metadata
inode once the offset exceeds 16TiB, and only cache at the block
layer? This would surely hurt performance, but at least the filesystem
could still be accessed.

Given that this issue appears to be not due to the size of the
filesystem, but merely how much I've used it, having the only solution
be to copy all of the data off, reformat the drives, and then restore
every time filesystem usage exceeds a certain threshold is… not very
satisfying.

Finally, I've never done kernel dev before, but I do have some C
experience, so if there is a solution that falls into the category of
seeming reasonable, likely to be accepted if implemented, but being
unlikely to get implemented given the low priority of supporting
32-bit systems, let me know and maybe I can carve out some time to
give it a try.


Re: "bad tree block start" when trying to mount on ARM

2021-02-19 Thread Qu Wenruo




On 2021/2/20 12:28 PM, Erik Jensen wrote:

On Fri, Feb 19, 2021 at 7:16 PM Qu Wenruo  wrote:

On 2021/2/20 10:47 AM, Erik Jensen wrote:

Given that it sounds like the issue is the metadata address space, and
given that I surely don't actually have 16TiB of metadata on a 24TiB
file system (indeed, Metadata, RAID1: total=30.00GiB, used=28.91GiB),
is there any way I could compact the metadata offsets into the lower
16TiB of the virtual metadata inode? Perhaps that could be something
balance could be taught to do? (Obviously, the initial run of such a
balance would have to be performed using a 64-bit system.)


Unfortunately, no.

Btrfs relies on increasing bytenr in the logical address space for
things like balance, thus we can't relocate chunks to smaller bytenr.


That's… unfortunate. How much relies on the assumption that bytenr is monotonic?


IIRC mostly balance itself.



Brainstorming some ideas, is compacting the address space something
that could be done offline? E.g., maybe some two-pass process: first
something balance-like that bumps all of the metadata up to a compact
region of address space, starting at a new 16TiB boundary, and then a
follow up pass that just strips the top bits off?


We need btrfs-progs support for off-line balancing.

I used to have this idea, but saw very limited use for it.

This would be the safest bet, but it needs a lot of work, albeit in user
space.



Or maybe once all of the bytenrs are brought within 16TiB of each
other by balance, btrfs could just keep track of an offset that needs
to be applied when mapping page cache indexes?


But further balance/new chunk allocation can still go beyond the limit.

This is the biggest problem, one other fses don't need to bother with:
we can dynamically allocate chunks while they can't.



Or maybe btrfs could use multiple virtual inodes on 32-bit systems,
one for each 16TiB block of address space with metadata in it? If this
were to ever grow to need more than a handful of virtual inodes, it
seems like a balance *would* actually help in this case by compacting
the metadata higher in the address space, allowing the virtual inodes
for lower in the address space to be dropped.


This may be a good idea.

But the problem of test coverage is always here.

We can spend tons of lines, but in the end it will not really be well
tested, as it's really hard to get test coverage on 32bit systems.


Or maybe btrfs could just not use the page cache for the metadata
inode once the offset exceeds 16TiB, and only cache at the block
layer? This would surely hurt performance, but at least the filesystem
could still be accessed.


I don't believe it's really possible, unless we override the XArray
thing provided by MM completely and implement a btrfs-only structure.

That's too costly.



Given that this issue appears to be not due to the size of the
filesystem, but merely how much I've used it, having the only solution
be to copy all of the data off, reformat the drives, and then restore
every time filesystem usage exceeds a certain threshold is… not very
satisfying.


Yeah, definitely not a good experience.



Finally, I've never done kernel dev before, but I do have some C
experience, so if there is a solution that falls into the category of
seeming reasonable, likely to be accepted if implemented, but being
unlikely to get implemented given the low priority of supporting
32-bit systems, let me know and maybe I can carve out some time to
give it a try.


BTW, if you want things like 64K page size while still keeping the 4K
sector size of your existing btrfs, then I guess you may be interested
in the recent subpage support.

It allows btrfs to mount a 4K sector size fs with a 64K page size.

Unfortunately it's still WIP, but it may fit your use case, as ARM
supports multiple page sizes (4K, 16K, 64K).
(Although we are only going to support 64K pages for now.)

Thanks,
Qu