Re: nfs subvolume access?

2021-03-10 Thread Hugo Mills
On Wed, Mar 10, 2021 at 08:46:20AM +0100, Ulli Horlacher wrote:
> When I try to access a btrfs filesystem via nfs, I get the error:
> 
> root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same 
> file system loop as '/nfs/tsmsrvj/fex'.
> 1
> root@tsmsrvi:~# 
> 
> 
> 
> On tsmsrvj I have in /etc/exports:
> 
> /data/fex   tsmsrvi(rw,async,no_subtree_check,no_root_squash)
> 
> This is a btrfs subvolume with snapshots:
> 
> root@tsmsrvj:~# btrfs subvolume list /data
> ID 257 gen 35 top level 5 path fex
> ID 270 gen 36 top level 257 path fex/spool
> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
> 
> root@tsmsrvj:~# find /data/fex | wc -l
> 489887
> root@tsmsrvj:~# 
> 
> What must I add to /etc/exports to enable subvolume access for the nfs
> client?
> 
> tsmsrvi and tsmsrvj (nfs client and server) both run Ubuntu 20.04 with
> btrfs-progs v5.4.1 

   I can't remember if this is why, but I've had to put a distinct
fsid field in each separate subvolume being exported:

/srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash

   It doesn't matter what value you use, as long as each one's
different.
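
   For the layout in your mail, an untested sketch of what that could
look like (the fsid values here are arbitrary, they only have to be
unique, and each snapshot would need its own entry too):

/data/fex        tsmsrvi(rw,async,fsid=0x101,no_subtree_check,no_root_squash)
/data/fex/spool  tsmsrvi(rw,async,fsid=0x102,no_subtree_check,no_root_squash)

   After editing /etc/exports, reload the export table with
"exportfs -ra".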

   Hugo.

-- 
Hugo Mills              | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk  | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/   |
PGP: E2AB1DE4           |


Re: nfs subvolume access?

2021-03-10 Thread Ulli Horlacher
On Wed 2021-03-10 (07:59), Hugo Mills wrote:

> > On tsmsrvj I have in /etc/exports:
> > 
> > /data/fex   tsmsrvi(rw,async,no_subtree_check,no_root_squash)
> > 
> > This is a btrfs subvolume with snapshots:
> > 
> > root@tsmsrvj:~# btrfs subvolume list /data
> > ID 257 gen 35 top level 5 path fex
> > ID 270 gen 36 top level 257 path fex/spool
> > ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
> > ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
> > ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
> > ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
> > 
> > root@tsmsrvj:~# find /data/fex | wc -l
> > 489887

>I can't remember if this is why, but I've had to put a distinct
> fsid field in each separate subvolume being exported:
> 
> /srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash

I must export EACH subvolume?!
The snapshots are generated automatically (via cron)!
I cannot add them to /etc/exports


-- 
Ullrich Horlacher          Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart     E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a            Tel: ++49-711-68565868
70569 Stuttgart (Germany)  WWW: http://www.tik.uni-stuttgart.de/
REF:<20210310075957.gg22...@savella.carfax.org.uk>


Re: nfs subvolume access?

2021-03-10 Thread Ulli Horlacher
On Wed 2021-03-10 (08:46), Ulli Horlacher wrote:
> When I try to access a btrfs filesystem via nfs, I get the error:
> 
> root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same 
> file system loop as '/nfs/tsmsrvj/fex'.
> 1

> tsmsrvi and tsmsrvj (nfs client and server) both run Ubuntu 20.04 with
> btrfs-progs v5.4.1 

On Ubuntu 18.04 this setup works without errors:

root@mutter:/backup/rsync# grep tandem /etc/exports 
/backup/rsync/tandem    176.9.135.138(rw,async,no_subtree_check,no_root_squash)

root@mutter:/backup/rsync# btrfs subvolume list /backup/rsync | grep tandem
ID 257 gen 62652 top level 5 path tandem
ID 5898 gen 62284 top level 257 path tandem/.snapshot/2021-03-01_0300.rsync
ID 5906 gen 62284 top level 257 path tandem/.snapshot/2021-03-02_0300.rsync
ID 5914 gen 62284 top level 257 path tandem/.snapshot/2021-03-03_0300.rsync
ID 5924 gen 62284 top level 257 path tandem/.snapshot/2021-03-04_0300.rsync
ID 5932 gen 62284 top level 257 path tandem/.snapshot/2021-03-05_0300.rsync
ID 5941 gen 62284 top level 257 path tandem/.snapshot/2021-03-06_0300.rsync
ID 5950 gen 62284 top level 257 path tandem/.snapshot/2021-03-07_0300.rsync
ID 5962 gen 62413 top level 257 path tandem/.snapshot/2021-03-08_0300.rsync
ID 5970 gen 62522 top level 257 path tandem/.snapshot/2021-03-09_0300.rsync
ID 5978 gen 62626 top level 257 path tandem/.snapshot/2021-03-10_0300.rsync

root@mutter:/backup/rsync# btrfs version
btrfs-progs v4.15.1

root@tandem:/backup# mount | grep backup
mutter:/backup/rsync/tandem on /backup type nfs 
(ro,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,soft,proto=tcp,timeo=600,retrans=1,sec=sys,mountaddr=176.9.68.251,mountvers=3,mountport=52943,mountproto=tcp,local_lock=none,addr=176.9.68.251)

root@tandem:/backup# ls -l .snapshot/
total 0
drwxr-xr-x 1 root root 392 Mar  1 03:00 2021-03-01_0300.rsync
drwxr-xr-x 1 root root 392 Mar  2 03:00 2021-03-02_0300.rsync
drwxr-xr-x 1 root root 392 Mar  3 03:00 2021-03-03_0300.rsync
drwxr-xr-x 1 root root 392 Mar  4 03:00 2021-03-04_0300.rsync
drwxr-xr-x 1 root root 392 Mar  5 03:00 2021-03-05_0300.rsync
drwxr-xr-x 1 root root 392 Mar  6 03:00 2021-03-06_0300.rsync
drwxr-xr-x 1 root root 392 Mar  7 03:00 2021-03-07_0300.rsync
drwxr-xr-x 1 root root 392 Mar  8 03:00 2021-03-08_0300.rsync
drwxr-xr-x 1 root root 392 Mar  9 03:00 2021-03-09_0300.rsync
drwxr-xr-x 1 root root 392 Mar 10 03:00 2021-03-10_0300.rsync

So, is it an issue with the newer btrfs version on Ubuntu 20.04?


-- 
Ullrich Horlacher          Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart     E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a            Tel: ++49-711-68565868
70569 Stuttgart (Germany)  WWW: http://www.tik.uni-stuttgart.de/
REF:<20210310074620.ga2...@tik.uni-stuttgart.de>


[PATCH v2 00/15] btrfs: support read-write for subpage metadata

2021-03-10 Thread Qu Wenruo
This patchset can be fetched from the following github repo, along with
the full subpage RW support:
https://github.com/adam900710/linux/tree/subpage

This patchset is for metadata read write support.

[FULL RW TEST]
Since the data write path is not included in this patchset, we can't
really test the patchset itself, but anyone can grab the patches from
the github repo and run the fstests generic tests.
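
For reference, one possible way to exercise this (an assumed workflow,
not something prescribed by this series) on a 64K page size machine:

  # build and boot a kernel from the branch above
  git clone -b subpage https://github.com/adam900710/linux.git
  # ... build and install that kernel ...
  # then, from an fstests checkout already configured for btrfs,
  # force a 4K sector size and run the generic quick group
  MKFS_OPTIONS="-s 4k" ./check -g quick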

There are some known issues:
- Very very rare random ASSERT() failure for data page::private
  It looks like we can lock a data page without page::private set for
  subpage.
  This problem seems to be caused by some set_page_extent_mapped()
  callers not holding the page lock, thus leaving a small window.
  Investigating.

- Defrag related test failure
  Since the current defrag code works per page, supporting subpage
  defrag needs some changes in the loop.
  Thus for now, defrag is disabled completely for subpage RW mount.

- No compression support yet
  There are at least 2 known bugs when forcing compression for subpage:
  * Some hard-coded PAGE_SIZE usage screws up the space reservation
  * Subpage ASSERT() triggered
    This is because some compression code unlocks locked_page by
    calling extent_clear_unlock_delalloc() with locked_page == NULL.
  So for now compression is also disabled.

[DIFFERENCE AGAINST REGULAR SECTORSIZE]
The metadata part in fact has more new code than the data part, as it
has some different behaviors compared to the regular sector size
handling:

- No more page locking
  Metadata read/write now relies on extent io tree locking rather than
  page locking.
  This allows behaviors like read-locking one eb while also trying to
  read-lock another eb in the same page.
  We can't rely on the page lock, as we now have multiple extent buffers
  in the same page.

- Page status update
  Now we use subpage wrappers to handle page status update.

- How to submit dirty extent buffers
  Instead of just grabbing extent buffer from page::private, we need to
  iterate all dirty extent buffers in the page and submit them.

[CHANGELOG]
v2:
- Rebased to latest misc-next
  No conflicts at all.

- Add a new sysfs interface to expose the supported RO/RW sectorsizes
  This will allow mkfs.btrfs to better detect an unmountable fs.

- Use a newer naming schema for each patch
  No more "extent_io:" or "inode:" prefixes.

- Move two pure cleanups into the series
  Patches 2~3, originally in the RW part.

- Fix one uninitialized variable
  Patch 6.

Qu Wenruo (15):
  btrfs: add sysfs interface for supported sectorsize
  btrfs: use min() to replace open-code in btrfs_invalidatepage()
  btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
  btrfs: introduce helpers for subpage dirty status
  btrfs: introduce helpers for subpage writeback status
  btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
metadata
  btrfs: support subpage metadata csum calculation at write time
  btrfs: make alloc_extent_buffer() check subpage dirty bitmap
  btrfs: make the page uptodate assert to be subpage compatible
  btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
  btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
compatible
  btrfs: introduce end_bio_subpage_eb_writepage() function
  btrfs: introduce write_one_subpage_eb() function
  btrfs: make lock_extent_buffer_for_io() to be subpage compatible
  btrfs: introduce submit_eb_subpage() to submit a subpage metadata page

 fs/btrfs/disk-io.c   | 143 +++
 fs/btrfs/extent_io.c | 420 ---
 fs/btrfs/inode.c |  14 +-
 fs/btrfs/subpage.c   |  73 
 fs/btrfs/subpage.h   |  17 ++
 fs/btrfs/sysfs.c |  34 
 6 files changed, 598 insertions(+), 103 deletions(-)

-- 
2.30.1



[PATCH v2 01/15] btrfs: add sysfs interface for supported sectorsize

2021-03-10 Thread Qu Wenruo
Add two extra sysfs interfaces, features/supported_ro_sectorsize and
features/supported_rw_sectorsize, to indicate subpage support.

Currently for supported_rw_sectorsize all architectures only have their
PAGE_SIZE listed.

For supported_ro_sectorsize, on systems with 64K page size, 4K
sectorsize is also supported.

This new sysfs interface will help mkfs.btrfs print a more accurate
warning.
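
As a rough illustration (not part of the patch itself; this assumes the
new attributes land in the same /sys/fs/btrfs/features/ directory as the
other static_feature attributes, and the values shown are for a 64K page
size system):

  $ cat /sys/fs/btrfs/features/supported_ro_sectorsize
  4096 65536
  $ cat /sys/fs/btrfs/features/supported_rw_sectorsize
  65536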

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/sysfs.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 6eb1c50fa98c..3ef419899472 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -360,11 +360,45 @@ static ssize_t supported_rescue_options_show(struct 
kobject *kobj,
 BTRFS_ATTR(static_feature, supported_rescue_options,
   supported_rescue_options_show);
 
+static ssize_t supported_ro_sectorsize_show(struct kobject *kobj,
+   struct kobj_attribute *a,
+   char *buf)
+{
+   ssize_t ret = 0;
+   int i = 0;
+
+   /* For 64K page size, 4K sector size is supported */
+   if (PAGE_SIZE == SZ_64K) {
+   ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%u", SZ_4K);
+   i++;
+   }
+   /* Other than above subpage, only support PAGE_SIZE as sectorsize yet */
+   ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%s%lu\n",
+(i ? " " : ""), PAGE_SIZE);
+   return ret;
+}
+BTRFS_ATTR(static_feature, supported_ro_sectorsize,
+  supported_ro_sectorsize_show);
+
+static ssize_t supported_rw_sectorsize_show(struct kobject *kobj,
+   struct kobj_attribute *a,
+   char *buf)
+{
+   ssize_t ret = 0;
+
+   /* Only PAGE_SIZE as sectorsize is supported */
+   ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n", PAGE_SIZE);
+   return ret;
+}
+BTRFS_ATTR(static_feature, supported_rw_sectorsize,
+  supported_rw_sectorsize_show);
 static struct attribute *btrfs_supported_static_feature_attrs[] = {
BTRFS_ATTR_PTR(static_feature, rmdir_subvol),
BTRFS_ATTR_PTR(static_feature, supported_checksums),
BTRFS_ATTR_PTR(static_feature, send_stream_version),
BTRFS_ATTR_PTR(static_feature, supported_rescue_options),
+   BTRFS_ATTR_PTR(static_feature, supported_ro_sectorsize),
+   BTRFS_ATTR_PTR(static_feature, supported_rw_sectorsize),
NULL
 };
 
-- 
2.30.1



[PATCH v2 08/15] btrfs: make alloc_extent_buffer() check subpage dirty bitmap

2021-03-10 Thread Qu Wenruo
In alloc_extent_buffer(), we make sure that the newly allocated page is
never dirty.

This is fine for the sector size == PAGE_SIZE case, but for subpage it's
possible that one extent buffer in the page is dirty and thus the whole
page is marked dirty, which could cause a false alert.

To support subpage, call btrfs_page_test_dirty() to handle both cases.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 82cc8b9ce744..796beb5a0b6b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5635,7 +5635,7 @@ struct extent_buffer *alloc_extent_buffer(struct 
btrfs_fs_info *fs_info,
btrfs_page_inc_eb_refs(fs_info, p);
spin_unlock(&mapping->private_lock);
 
-   WARN_ON(PageDirty(p));
+   WARN_ON(btrfs_page_test_dirty(fs_info, p, eb->start, eb->len));
eb->pages[i] = p;
if (!PageUptodate(p))
uptodate = 0;
-- 
2.30.1



[PATCH v2 03/15] btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()

2021-03-10 Thread Qu Wenruo
In btrfs_invalidatepage() we re-declare the @tree variable as a
btrfs_ordered_inode_tree.

Since it's only used to take the spinlock, we can grab the lock from the
inode directly and remove the unnecessary declaration completely.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2973cec05505..f99554f0bd48 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8404,15 +8404,11 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 * for the finish_ordered_io
 */
if (TestClearPagePrivate2(page)) {
-   struct btrfs_ordered_inode_tree *tree;
-
-   tree = &inode->ordered_tree;
-
-   spin_lock_irq(&tree->lock);
+   spin_lock_irq(&inode->ordered_tree.lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
ordered->truncated_len = min(ordered->truncated_len,
start - ordered->file_offset);
-   spin_unlock_irq(&tree->lock);
+   spin_unlock_irq(&inode->ordered_tree.lock);
 
if (btrfs_dec_test_ordered_pending(inode, &ordered,
   start,
-- 
2.30.1



[PATCH v2 07/15] btrfs: support subpage metadata csum calculation at write time

2021-03-10 Thread Qu Wenruo
Add a new helper, csum_dirty_subpage_buffers(), to iterate through all
dirty extent buffers in one bvec.

Also extract the code that calculates the csum for one extent buffer into
csum_one_extent_buffer(), so that both the existing csum_dirty_buffer()
and the new csum_dirty_subpage_buffers() can reuse the same routine.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 96 ++
 1 file changed, 72 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index d53df276923e..371502021a60 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -441,6 +441,74 @@ static int btree_read_extent_buffer_pages(struct 
extent_buffer *eb,
return ret;
 }
 
+static int csum_one_extent_buffer(struct extent_buffer *eb)
+{
+   struct btrfs_fs_info *fs_info = eb->fs_info;
+   u8 result[BTRFS_CSUM_SIZE];
+   int ret;
+
+   ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
+   offsetof(struct btrfs_header, fsid),
+   BTRFS_FSID_SIZE) == 0);
+   csum_tree_block(eb, result);
+
+   if (btrfs_header_level(eb))
+   ret = btrfs_check_node(eb);
+   else
+   ret = btrfs_check_leaf_full(eb);
+
+   if (ret < 0) {
+   btrfs_print_tree(eb, 0);
+   btrfs_err(fs_info,
+   "block=%llu write time tree block corruption detected",
+ eb->start);
+   WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+   return ret;
+   }
+   write_extent_buffer(eb, result, 0, fs_info->csum_size);
+
+   return 0;
+}
+
+/* Checksum all dirty extent buffers in one bio_vec. */
+static int csum_dirty_subpage_buffers(struct btrfs_fs_info *fs_info,
+ struct bio_vec *bvec)
+{
+   struct page *page = bvec->bv_page;
+   u64 bvec_start = page_offset(page) + bvec->bv_offset;
+   u64 cur;
+   int ret = 0;
+
+   for (cur = bvec_start; cur < bvec_start + bvec->bv_len;
+cur += fs_info->nodesize) {
+   struct extent_buffer *eb;
+   bool uptodate;
+
+   eb = find_extent_buffer(fs_info, cur);
+   uptodate = btrfs_subpage_test_uptodate(fs_info, page, cur,
+  fs_info->nodesize);
+
+   /* A dirty eb shouldn't disappear from buffer_radix */
+   if (WARN_ON(!eb))
+   return -EUCLEAN;
+
+   if (WARN_ON(cur != btrfs_header_bytenr(eb))) {
+   free_extent_buffer(eb);
+   return -EUCLEAN;
+   }
+   if (WARN_ON(!uptodate)) {
+   free_extent_buffer(eb);
+   return -EUCLEAN;
+   }
+
+   ret = csum_one_extent_buffer(eb);
+   free_extent_buffer(eb);
+   if (ret < 0)
+   return ret;
+   }
+   return ret;
+}
+
 /*
  * Checksum a dirty tree block before IO.  This has extra checks to make sure
  * we only fill in the checksum field in the first page of a multi-page block.
@@ -451,9 +519,10 @@ static int csum_dirty_buffer(struct btrfs_fs_info 
*fs_info, struct bio_vec *bvec
struct page *page = bvec->bv_page;
u64 start = page_offset(page);
u64 found_start;
-   u8 result[BTRFS_CSUM_SIZE];
struct extent_buffer *eb;
-   int ret;
+
+   if (fs_info->sectorsize < PAGE_SIZE)
+   return csum_dirty_subpage_buffers(fs_info, bvec);
 
eb = (struct extent_buffer *)page->private;
if (page != eb->pages[0])
@@ -475,28 +544,7 @@ static int csum_dirty_buffer(struct btrfs_fs_info 
*fs_info, struct bio_vec *bvec
if (WARN_ON(!PageUptodate(page)))
return -EUCLEAN;
 
-   ASSERT(memcmp_extent_buffer(eb, fs_info->fs_devices->metadata_uuid,
-   offsetof(struct btrfs_header, fsid),
-   BTRFS_FSID_SIZE) == 0);
-
-   csum_tree_block(eb, result);
-
-   if (btrfs_header_level(eb))
-   ret = btrfs_check_node(eb);
-   else
-   ret = btrfs_check_leaf_full(eb);
-
-   if (ret < 0) {
-   btrfs_print_tree(eb, 0);
-   btrfs_err(fs_info,
-   "block=%llu write time tree block corruption detected",
- eb->start);
-   WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
-   return ret;
-   }
-   write_extent_buffer(eb, result, 0, fs_info->csum_size);
-
-   return 0;
+   return csum_one_extent_buffer(eb);
 }
 
 static int check_tree_block_fsid(struct extent_buffer *eb)
-- 
2.30.1



[PATCH v2 06/15] btrfs: allow btree_set_page_dirty() to do more sanity check on subpage metadata

2021-03-10 Thread Qu Wenruo
For btree_set_page_dirty(), we should also do the extent buffer sanity
checks for subpage support.

Unlike the regular sector size case, since one page can contain multiple
extent buffers, we need to make sure there is at least one dirty extent
buffer in the page.

So this patch will iterate through btrfs_subpage::dirty_bitmap to get
the extent buffers, and check that every dirty extent buffer in the page
range has EXTENT_BUFFER_DIRTY set and proper refs.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 47 --
 1 file changed, 41 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 41b718cfea40..d53df276923e 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -42,6 +42,7 @@
 #include "discard.h"
 #include "space-info.h"
 #include "zoned.h"
+#include "subpage.h"
 
 #define BTRFS_SUPER_FLAG_SUPP  (BTRFS_HEADER_FLAG_WRITTEN |\
 BTRFS_HEADER_FLAG_RELOC |\
@@ -992,14 +993,48 @@ static void btree_invalidatepage(struct page *page, 
unsigned int offset,
 static int btree_set_page_dirty(struct page *page)
 {
 #ifdef DEBUG
+   struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+   struct btrfs_subpage *subpage;
struct extent_buffer *eb;
+   int cur_bit = 0;
+   u64 page_start = page_offset(page);
+
+   if (fs_info->sectorsize == PAGE_SIZE) {
+   BUG_ON(!PagePrivate(page));
+   eb = (struct extent_buffer *)page->private;
+   BUG_ON(!eb);
+   BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+   BUG_ON(!atomic_read(&eb->refs));
+   btrfs_assert_tree_locked(eb);
+   return __set_page_dirty_nobuffers(page);
+   }
+   ASSERT(PagePrivate(page) && page->private);
+   subpage = (struct btrfs_subpage *)page->private;
+
+   ASSERT(subpage->dirty_bitmap);
+   while (cur_bit < BTRFS_SUBPAGE_BITMAP_SIZE) {
+   unsigned long flags;
+   u64 cur;
+   u16 tmp = (1 << cur_bit);
+
+   spin_lock_irqsave(&subpage->lock, flags);
+   if (!(tmp & subpage->dirty_bitmap)) {
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   cur_bit++;
+   continue;
+   }
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   cur = page_start + cur_bit * fs_info->sectorsize;
 
-   BUG_ON(!PagePrivate(page));
-   eb = (struct extent_buffer *)page->private;
-   BUG_ON(!eb);
-   BUG_ON(!test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
-   BUG_ON(!atomic_read(&eb->refs));
-   btrfs_assert_tree_locked(eb);
+   eb = find_extent_buffer(fs_info, cur);
+   ASSERT(eb);
+   ASSERT(test_bit(EXTENT_BUFFER_DIRTY, &eb->bflags));
+   ASSERT(atomic_read(&eb->refs));
+   btrfs_assert_tree_locked(eb);
+   free_extent_buffer(eb);
+
+   cur_bit += (fs_info->nodesize >> fs_info->sectorsize_bits);
+   }
 #endif
return __set_page_dirty_nobuffers(page);
 }
-- 
2.30.1



[PATCH v2 09/15] btrfs: make the page uptodate assert to be subpage compatible

2021-03-10 Thread Qu Wenruo
There are quite a few assert checks on page uptodate in the extent
buffer write accessors.
They ensure the destination page is already uptodate.

This is fine for the regular sector size case, but not for the subpage
case, as for subpage we only mark the page uptodate if the page contains
no holes
and all its extent buffers are uptodate.

So instead of checking PageUptodate(), for subpage case we check the
uptodate bitmap of btrfs_subpage structure.

To make the check more elegant, introduce a helper,
assert_eb_page_uptodate() to do the check for both subpage and regular
sector size cases.

The following functions are involved:
- write_extent_buffer_chunk_tree_uuid()
- write_extent_buffer_fsid()
- write_extent_buffer()
- memzero_extent_buffer()
- copy_extent_buffer()
- extent_buffer_test_bit()
- extent_buffer_bitmap_set()
- extent_buffer_bitmap_clear()

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 796beb5a0b6b..208e603acf0c 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -6187,12 +6187,34 @@ int memcmp_extent_buffer(const struct extent_buffer 
*eb, const void *ptrv,
return ret;
 }
 
+/*
+ * A helper to ensure that the extent buffer is uptodate.
+ *
+ * For regular sector size == PAGE_SIZE case, check if @page is uptodate.
+ * For subpage case, check if the range covered by the eb has EXTENT_UPTODATE.
+ */
+static void assert_eb_page_uptodate(const struct extent_buffer *eb,
+   struct page *page)
+{
+   struct btrfs_fs_info *fs_info = eb->fs_info;
+
+   if (fs_info->sectorsize < PAGE_SIZE) {
+   bool uptodate;
+
+   uptodate = btrfs_subpage_test_uptodate(fs_info, page,
+   eb->start, eb->len);
+   WARN_ON(!uptodate);
+   } else {
+   WARN_ON(!PageUptodate(page));
+   }
+}
+
 void write_extent_buffer_chunk_tree_uuid(const struct extent_buffer *eb,
const void *srcv)
 {
char *kaddr;
 
-   WARN_ON(!PageUptodate(eb->pages[0]));
+   assert_eb_page_uptodate(eb, eb->pages[0]);
kaddr = page_address(eb->pages[0]) + get_eb_offset_in_page(eb, 0);
memcpy(kaddr + offsetof(struct btrfs_header, chunk_tree_uuid), srcv,
BTRFS_FSID_SIZE);
@@ -6202,7 +6224,7 @@ void write_extent_buffer_fsid(const struct extent_buffer 
*eb, const void *srcv)
 {
char *kaddr;
 
-   WARN_ON(!PageUptodate(eb->pages[0]));
+   assert_eb_page_uptodate(eb, eb->pages[0]);
kaddr = page_address(eb->pages[0]) + get_eb_offset_in_page(eb, 0);
memcpy(kaddr + offsetof(struct btrfs_header, fsid), srcv,
BTRFS_FSID_SIZE);
@@ -6227,7 +6249,7 @@ void write_extent_buffer(const struct extent_buffer *eb, 
const void *srcv,
 
while (len > 0) {
page = eb->pages[i];
-   WARN_ON(!PageUptodate(page));
+   assert_eb_page_uptodate(eb, page);
 
cur = min(len, PAGE_SIZE - offset);
kaddr = page_address(page);
@@ -6256,7 +6278,7 @@ void memzero_extent_buffer(const struct extent_buffer 
*eb, unsigned long start,
 
while (len > 0) {
page = eb->pages[i];
-   WARN_ON(!PageUptodate(page));
+   assert_eb_page_uptodate(eb, page);
 
cur = min(len, PAGE_SIZE - offset);
kaddr = page_address(page);
@@ -6314,7 +6336,7 @@ void copy_extent_buffer(const struct extent_buffer *dst,
 
while (len > 0) {
page = dst->pages[i];
-   WARN_ON(!PageUptodate(page));
+   assert_eb_page_uptodate(dst, page);
 
cur = min(len, (unsigned long)(PAGE_SIZE - offset));
 
@@ -6376,7 +6398,7 @@ int extent_buffer_test_bit(const struct extent_buffer 
*eb, unsigned long start,
 
eb_bitmap_offset(eb, start, nr, &i, &offset);
page = eb->pages[i];
-   WARN_ON(!PageUptodate(page));
+   assert_eb_page_uptodate(eb, page);
kaddr = page_address(page);
return 1U & (kaddr[offset] >> (nr & (BITS_PER_BYTE - 1)));
 }
@@ -6401,7 +6423,7 @@ void extent_buffer_bitmap_set(const struct extent_buffer 
*eb, unsigned long star
 
eb_bitmap_offset(eb, start, pos, &i, &offset);
page = eb->pages[i];
-   WARN_ON(!PageUptodate(page));
+   assert_eb_page_uptodate(eb, page);
kaddr = page_address(page);
 
while (len >= bits_to_set) {
@@ -6412,7 +6434,7 @@ void extent_buffer_bitmap_set(const struct extent_buffer 
*eb, unsigned long star
if (++offset >= PAGE_SIZE && len > 0) {
offset = 0;
page = eb->pages[++i];
-   WARN_ON(!PageUptodate(page));
+   assert_eb_page_uptodate(eb, page);
kaddr 

[PATCH v2 10/15] btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible

2021-03-10 Thread Qu Wenruo
For set_extent_buffer_dirty() to support subpage sized metadata, just
call btrfs_page_set_dirty() to handle both cases.

For clear_extent_buffer_dirty(), it needs to clear the page dirty if and
only if all extent buffers in the page range are no longer dirty.
Also do the same for page error.

This is pretty different from the existing clear_extent_buffer_dirty()
routine, so add a new helper function,
clear_subpage_extent_buffer_dirty(), to do this for subpage metadata.

Also, since the main part of the page dirty clearing code is still the
same, extract that into btree_clear_page_dirty() so that it can be
utilized for both cases.

But there is a special race between set_extent_buffer_dirty() and
clear_extent_buffer_dirty(), where we can wrongly clear the page dirty
bit.

[POSSIBLE RACE WINDOW]
For the race window between clear_subpage_extent_buffer_dirty() and
set_extent_buffer_dirty(), due to the fact that we can't call
clear_page_dirty_for_io() under subpage spin lock, we can race like
below:

   T1 (eb1 in the same page)|  T2 (eb2 in the same page)
 ---+--
 set_extent_buffer_dirty()  | clear_extent_buffer_dirty()
 |- was_dirty = false;  | |- clear_subpage_extent_buffer_dirty()
 |  ||- btrfs_clear_and_test_dirty()
 |  ||  Since eb2 is the last dirty page
 |  ||  we got:
 |  ||  last == true;
 |  ||
 |- btrfs_page_set_dirty()  ||
 |  We set the page dirty and   ||
 |  subpage dirty bitmap||
 |  ||- if (last)
 |  ||  Since we don't have subpage lock
 |  ||  hold, now @last is no longer
 |  ||  correct
 |  ||- btree_clear_page_dirty()
 |  |   Now PageDirty == false, even we
 |  |   have dirty_bitmap not zero.
 |- ASSERT(PageDirty());|
 CRASH

The solution here is to also lock the eb->pages[0] for subpage case of
set_extent_buffer_dirty(), to prevent racing with
clear_extent_buffer_dirty().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 65 
 1 file changed, 53 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 208e603acf0c..2d16d92107bc 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -5784,28 +5784,51 @@ void free_extent_buffer_stale(struct extent_buffer *eb)
release_extent_buffer(eb);
 }
 
+static void btree_clear_page_dirty(struct page *page)
+{
+   ASSERT(PageDirty(page));
+   ASSERT(PageLocked(page));
+   clear_page_dirty_for_io(page);
+   xa_lock_irq(&page->mapping->i_pages);
+   if (!PageDirty(page))
+   __xa_clear_mark(&page->mapping->i_pages,
+   page_index(page), PAGECACHE_TAG_DIRTY);
+   xa_unlock_irq(&page->mapping->i_pages);
+}
+
+static void clear_subpage_extent_buffer_dirty(const struct extent_buffer *eb)
+{
+   struct btrfs_fs_info *fs_info = eb->fs_info;
+   struct page *page = eb->pages[0];
+   bool last;
+
+   /* btree_clear_page_dirty() needs page locked */
+   lock_page(page);
+   last = btrfs_subpage_clear_and_test_dirty(fs_info, page, eb->start,
+ eb->len);
+   if (last)
+   btree_clear_page_dirty(page);
+   unlock_page(page);
+   WARN_ON(atomic_read(&eb->refs) == 0);
+}
+
 void clear_extent_buffer_dirty(const struct extent_buffer *eb)
 {
int i;
int num_pages;
struct page *page;
 
+   if (eb->fs_info->sectorsize < PAGE_SIZE)
+   return clear_subpage_extent_buffer_dirty(eb);
+
num_pages = num_extent_pages(eb);
 
for (i = 0; i < num_pages; i++) {
page = eb->pages[i];
if (!PageDirty(page))
continue;
-
lock_page(page);
-   WARN_ON(!PagePrivate(page));
-
-   clear_page_dirty_for_io(page);
-   xa_lock_irq(&page->mapping->i_pages);
-   if (!PageDirty(page))
-   __xa_clear_mark(&page->mapping->i_pages,
-   page_index(page), PAGECACHE_TAG_DIRTY);
-   xa_unlock_irq(&page->mapping->i_pages);
+   btree_clear_page_dirty(page);
ClearPageError(page);
unlock_page(page);
}
@@ -5826,10 +5849,28 @@ bool set_extent_buffer_dirty(struct extent_buffer *eb)
WARN_ON(atomic_read(&eb->refs) == 0);
WARN_ON(!test_bit(EXTENT_BUFFER_TREE_REF, &eb->bflags));
 
-   if (!was_dirty)
-   for (i = 0; i < num_pages; i++)
-   set_page_dirty(eb

[PATCH v2 05/15] btrfs: introduce helpers for subpage writeback status

2021-03-10 Thread Qu Wenruo
This patch introduces the following functions to handle btrfs subpage
writeback status:
- btrfs_subpage_set_writeback()
- btrfs_subpage_clear_writeback()
- btrfs_subpage_test_writeback()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_writeback()
- btrfs_page_clear_writeback()
- btrfs_page_test_writeback()
  Those helpers can handle both regular sector size and subpage without
  problem.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/subpage.c | 30 ++
 fs/btrfs/subpage.h |  2 ++
 2 files changed, 32 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index 183925902031..2a326d6385ed 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -260,6 +260,33 @@ void btrfs_subpage_clear_dirty(const struct btrfs_fs_info 
*fs_info,
clear_page_dirty_for_io(page);
 }
 
+void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+   unsigned long flags;
+
+   spin_lock_irqsave(&subpage->lock, flags);
+   subpage->writeback_bitmap |= tmp;
+   set_page_writeback(page);
+   spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
+void btrfs_subpage_clear_writeback(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+   unsigned long flags;
+
+   spin_lock_irqsave(&subpage->lock, flags);
+   subpage->writeback_bitmap &= ~tmp;
+   if (subpage->writeback_bitmap == 0)
+   end_page_writeback(page);
+   spin_unlock_irqrestore(&subpage->lock, flags);
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -281,6 +308,7 @@ bool btrfs_subpage_test_##name(const struct btrfs_fs_info 
*fs_info, \
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(writeback);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -319,3 +347,5 @@ IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, 
ClearPageUptodate,
 IMPLEMENT_BTRFS_PAGE_OPS(error, SetPageError, ClearPageError, PageError);
 IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
 PageDirty);
+IMPLEMENT_BTRFS_PAGE_OPS(writeback, set_page_writeback, end_page_writeback,
+   PageWriteback);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index adaece5ce294..fe43267e31f3 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -21,6 +21,7 @@ struct btrfs_subpage {
u16 uptodate_bitmap;
u16 error_bitmap;
u16 dirty_bitmap;
+   u16 writeback_bitmap;
union {
/*
 * Structures only used by metadata
@@ -89,6 +90,7 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info 
*fs_info,  \
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
 DECLARE_BTRFS_SUBPAGE_OPS(dirty);
+DECLARE_BTRFS_SUBPAGE_OPS(writeback);
 
 /*
  * Extra clear_and_test function for subpage dirty bitmap.
-- 
2.30.1



[PATCH v2 11/15] btrfs: make set_btree_ioerr() accept extent buffer and to be subpage compatible

2021-03-10 Thread Qu Wenruo
Currently set_btree_ioerr() only accepts a @page parameter and grabs the
extent buffer from page::private.

This works fine for the sector size == PAGE_SIZE case, but not for the
subpage case.

Add an extra parameter, @eb, for callers to pass the extent buffer to
this function, so that subpage code can reuse it.

Also add subpage special handling to update
btrfs_subpage::error_bitmap.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 2d16d92107bc..b6fbb512abfd 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3982,12 +3982,11 @@ static noinline_for_stack int 
lock_extent_buffer_for_io(struct extent_buffer *eb
return ret;
 }
 
-static void set_btree_ioerr(struct page *page)
+static void set_btree_ioerr(struct page *page, struct extent_buffer *eb)
 {
-   struct extent_buffer *eb = (struct extent_buffer *)page->private;
-   struct btrfs_fs_info *fs_info;
+   struct btrfs_fs_info *fs_info = eb->fs_info;
 
-   SetPageError(page);
+   btrfs_page_set_error(fs_info, page, eb->start, eb->len);
if (test_and_set_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags))
return;
 
@@ -3995,7 +3994,6 @@ static void set_btree_ioerr(struct page *page)
 * If we error out, we should add back the dirty_metadata_bytes
 * to make it consistent.
 */
-   fs_info = eb->fs_info;
percpu_counter_add_batch(&fs_info->dirty_metadata_bytes,
 eb->len, fs_info->dirty_metadata_batch);
 
@@ -4039,13 +4037,13 @@ static void set_btree_ioerr(struct page *page)
 */
switch (eb->log_index) {
case -1:
-   set_bit(BTRFS_FS_BTREE_ERR, &eb->fs_info->flags);
+   set_bit(BTRFS_FS_BTREE_ERR, &fs_info->flags);
break;
case 0:
-   set_bit(BTRFS_FS_LOG1_ERR, &eb->fs_info->flags);
+   set_bit(BTRFS_FS_LOG1_ERR, &fs_info->flags);
break;
case 1:
-   set_bit(BTRFS_FS_LOG2_ERR, &eb->fs_info->flags);
+   set_bit(BTRFS_FS_LOG2_ERR, &fs_info->flags);
break;
default:
BUG(); /* unexpected, logic error */
@@ -4070,7 +4068,7 @@ static void end_bio_extent_buffer_writepage(struct bio 
*bio)
if (bio->bi_status ||
test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
ClearPageUptodate(page);
-   set_btree_ioerr(page);
+   set_btree_ioerr(page, eb);
}
 
end_page_writeback(page);
@@ -4126,7 +4124,7 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
 end_bio_extent_buffer_writepage,
 0, 0, 0, false);
if (ret) {
-   set_btree_ioerr(p);
+   set_btree_ioerr(p, eb);
if (PageWriteback(p))
end_page_writeback(p);
if (atomic_sub_and_test(num_pages - i, &eb->io_pages))
-- 
2.30.1



[PATCH v2 02/15] btrfs: use min() to replace open-code in btrfs_invalidatepage()

2021-03-10 Thread Qu Wenruo
In btrfs_invalidatepage() we introduce a temporary variable, new_len, to
update ordered->truncated_len.

But we can use min() instead and drop the temporary variable
completely.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/inode.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 52dc5f52ea58..2973cec05505 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8405,15 +8405,13 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 */
if (TestClearPagePrivate2(page)) {
struct btrfs_ordered_inode_tree *tree;
-   u64 new_len;
 
tree = &inode->ordered_tree;
 
spin_lock_irq(&tree->lock);
set_bit(BTRFS_ORDERED_TRUNCATED, &ordered->flags);
-   new_len = start - ordered->file_offset;
-   if (new_len < ordered->truncated_len)
-   ordered->truncated_len = new_len;
+   ordered->truncated_len = min(ordered->truncated_len,
+   start - ordered->file_offset);
spin_unlock_irq(&tree->lock);
 
if (btrfs_dec_test_ordered_pending(inode, &ordered,
-- 
2.30.1



[PATCH v2 13/15] btrfs: introduce write_one_subpage_eb() function

2021-03-10 Thread Qu Wenruo
The new function, write_one_subpage_eb(), as a subroutine for subpage
metadata write, will handle the extent buffer bio submission.

The major differences between the new write_one_subpage_eb() and
write_one_eb() are:
- No page locking
  When entering write_one_subpage_eb() the page is no longer locked.
  We only lock the page for its status update, and unlock immediately.
  Now we completely rely on extent io tree locking.

- Extra bitmap update along with page status update
  Now page dirty and writeback status is controlled by
  btrfs_subpage::dirty_bitmap and btrfs_subpage::writeback_bitmap.
  They both follow the schema that if any sector is dirty/under
  writeback, then the full page gets marked dirty/under writeback.

- When to update the nr_written number
  Now we take a shortcut: if we have cleared the last dirty bit of the
  page, we update nr_written.
  This is not completely perfect, but should emulate the old behavior
  well enough.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 55 
 1 file changed, 55 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 74d59b292c9a..74525ebf2b83 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4166,6 +4166,58 @@ static void end_bio_extent_buffer_writepage(struct bio 
*bio)
bio_put(bio);
 }
 
+/*
+ * Unlike the work in write_one_eb(), we rely completely on extent locking.
+ * Page locking is only utilized minimally to keep the VM code happy.
+ *
+ * Callers should still call write_one_eb() rather than this function directly,
+ * as write_one_eb() has extra preparation before submitting the extent buffer.
+ */
+static int write_one_subpage_eb(struct extent_buffer *eb,
+   struct writeback_control *wbc,
+   struct extent_page_data *epd)
+{
+   struct btrfs_fs_info *fs_info = eb->fs_info;
+   struct page *page = eb->pages[0];
+   unsigned int write_flags = wbc_to_write_flags(wbc) | REQ_META;
+   bool no_dirty_ebs = false;
+   int ret;
+
+   /* clear_page_dirty_for_io() in subpage helper need page locked. */
+   lock_page(page);
+   btrfs_subpage_set_writeback(fs_info, page, eb->start, eb->len);
+
+   /* If we're the last dirty bit to update nr_written */
+   no_dirty_ebs = btrfs_subpage_clear_and_test_dirty(fs_info, page,
+ eb->start, eb->len);
+   if (no_dirty_ebs)
+   clear_page_dirty_for_io(page);
+
+   ret = submit_extent_page(REQ_OP_WRITE | write_flags, wbc, page,
+   eb->start, eb->len, eb->start - page_offset(page),
+   &epd->bio, end_bio_extent_buffer_writepage, 0, 0, 0,
+   false);
+   if (ret) {
+   btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+ eb->len);
+   set_btree_ioerr(page, eb);
+   unlock_page(page);
+
+   if (atomic_dec_and_test(&eb->io_pages))
+   end_extent_buffer_writeback(eb);
+   return -EIO;
+   }
+   unlock_page(page);
+   /*
+* Submission finishes without problem, if no range of the page is
+* dirty anymore, we have submitted a page.
+* Update the nr_written in wbc.
+*/
+   if (no_dirty_ebs)
+   update_nr_written(wbc, 1);
+   return ret;
+}
+
 static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
struct writeback_control *wbc,
struct extent_page_data *epd)
@@ -4197,6 +4249,9 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
memzero_extent_buffer(eb, start, end - start);
}
 
+   if (eb->fs_info->sectorsize < PAGE_SIZE)
+   return write_one_subpage_eb(eb, wbc, epd);
+
for (i = 0; i < num_pages; i++) {
struct page *p = eb->pages[i];
 
-- 
2.30.1



[PATCH v2 12/15] btrfs: introduce end_bio_subpage_eb_writepage() function

2021-03-10 Thread Qu Wenruo
The new function, end_bio_subpage_eb_writepage(), will handle the
metadata writeback endio.

The major differences involved are:
- How to grab the extent buffer
  Now page::private is a pointer to btrfs_subpage, so we can no longer
  grab the extent buffer directly.
  Thus we need to use the bv_offset to locate the extent buffer manually
  and iterate through the whole range.

- Use the btrfs_subpage_clear_writeback() helper
  This helper will handle the subpage writeback for us.

Since this function is executed under endio context, when grabbing
extent buffers it can't grab eb->refs_lock as that lock is not designed
to be grabbed under hardirq context.

So here introduce a helper, find_extent_buffer_nospinlock(), for such
situation, and convert find_extent_buffer() to use that helper.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 135 +--
 1 file changed, 106 insertions(+), 29 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index b6fbb512abfd..74d59b292c9a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4050,13 +4050,97 @@ static void set_btree_ioerr(struct page *page, struct 
extent_buffer *eb)
}
 }
 
+/*
+ * This is the endio specific version which won't touch any unsafe spinlock
+ * in endio context.
+ */
+static struct extent_buffer *find_extent_buffer_nospinlock(
+   struct btrfs_fs_info *fs_info, u64 start)
+{
+   struct extent_buffer *eb;
+
+   rcu_read_lock();
+   eb = radix_tree_lookup(&fs_info->buffer_radix,
+  start >> fs_info->sectorsize_bits);
+   if (eb && atomic_inc_not_zero(&eb->refs)) {
+   rcu_read_unlock();
+   return eb;
+   }
+   rcu_read_unlock();
+   return NULL;
+}
+/*
+ * The endio function for subpage extent buffer write.
+ *
+ * Unlike end_bio_extent_buffer_writepage(), we only call end_page_writeback()
+ * after all extent buffers in the page has finished their writeback.
+ */
+static void end_bio_subpage_eb_writepage(struct btrfs_fs_info *fs_info,
+struct bio *bio)
+{
+   struct bio_vec *bvec;
+   struct bvec_iter_all iter_all;
+
+   ASSERT(!bio_flagged(bio, BIO_CLONED));
+   bio_for_each_segment_all(bvec, bio, iter_all) {
+   struct page *page = bvec->bv_page;
+   u64 bvec_start = page_offset(page) + bvec->bv_offset;
+   u64 bvec_end = bvec_start + bvec->bv_len - 1;
+   u64 cur_bytenr = bvec_start;
+
+   ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+
+   /* Iterate through all extent buffers in the range */
+   while (cur_bytenr <= bvec_end) {
+   struct extent_buffer *eb;
+   int done;
+
+   /*
+* Here we can't use find_extent_buffer(), as it may
+* try to lock eb->refs_lock, which is not safe in endio
+* context.
+*/
+   eb = find_extent_buffer_nospinlock(fs_info, cur_bytenr);
+   ASSERT(eb);
+
+   cur_bytenr = eb->start + eb->len;
+
+   ASSERT(test_bit(EXTENT_BUFFER_WRITEBACK, &eb->bflags));
+   done = atomic_dec_and_test(&eb->io_pages);
+   ASSERT(done);
+
+   if (bio->bi_status ||
+   test_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags)) {
+   ClearPageUptodate(page);
+   set_btree_ioerr(page, eb);
+   }
+
+   btrfs_subpage_clear_writeback(fs_info, page, eb->start,
+ eb->len);
+   end_extent_buffer_writeback(eb);
+   /*
+* free_extent_buffer() will grab spinlock which is not
+* safe in endio context. Thus here we manually dec
+* the ref.
+*/
+   atomic_dec(&eb->refs);
+   }
+   }
+   bio_put(bio);
+}
+
 static void end_bio_extent_buffer_writepage(struct bio *bio)
 {
+   struct btrfs_fs_info *fs_info;
struct bio_vec *bvec;
struct extent_buffer *eb;
int done;
struct bvec_iter_all iter_all;
 
+   fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
+   if (fs_info->sectorsize < PAGE_SIZE)
+   return end_bio_subpage_eb_writepage(fs_info, bio);
+
ASSERT(!bio_flagged(bio, BIO_CLONED));
bio_for_each_segment_all(bvec, bio, iter_all) {
struct page *page = bvec->bv_page;
@@ -5437,36 +5521,29 @@ struct extent_buffer *find_extent_buffer(struct 
btrfs_fs_info *fs_info,
 {
struct extent_buffer *eb;
 
-   rcu_read_lock();
-   eb = r

[PATCH v2 04/15] btrfs: introduce helpers for subpage dirty status

2021-03-10 Thread Qu Wenruo
This patch introduces the following functions to handle btrfs subpage
dirty status:
- btrfs_subpage_set_dirty()
- btrfs_subpage_clear_dirty()
- btrfs_subpage_test_dirty()
  Those helpers can only be called when the range is ensured to be
  inside the page.

- btrfs_page_set_dirty()
- btrfs_page_clear_dirty()
- btrfs_page_test_dirty()
  Those helpers can handle both regular sector size and subpage without
  problem.
  Thus those would be used to replace PageDirty() related calls in
  later commits.

There is one special point to note here, just like set_page_dirty() and
clear_page_dirty_for_io(), btrfs_*page_set_dirty() and
btrfs_*page_clear_dirty() must be called with page locked.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/subpage.c | 43 +++
 fs/btrfs/subpage.h | 15 +++
 2 files changed, 58 insertions(+)

diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
index c69049e7daa9..183925902031 100644
--- a/fs/btrfs/subpage.c
+++ b/fs/btrfs/subpage.c
@@ -220,6 +220,46 @@ void btrfs_subpage_clear_error(const struct btrfs_fs_info 
*fs_info,
spin_unlock_irqrestore(&subpage->lock, flags);
 }
 
+void btrfs_subpage_set_dirty(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+   unsigned long flags;
+
+   spin_lock_irqsave(&subpage->lock, flags);
+   subpage->dirty_bitmap |= tmp;
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   set_page_dirty(page);
+}
+
+bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   struct btrfs_subpage *subpage = (struct btrfs_subpage *)page->private;
+   u16 tmp = btrfs_subpage_calc_bitmap(fs_info, page, start, len);
+   unsigned long flags;
+   bool last = false;
+
+
+   spin_lock_irqsave(&subpage->lock, flags);
+   subpage->dirty_bitmap &= ~tmp;
+   if (subpage->dirty_bitmap == 0)
+   last = true;
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   return last;
+}
+
+void btrfs_subpage_clear_dirty(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len)
+{
+   bool last;
+
+   last = btrfs_subpage_clear_and_test_dirty(fs_info, page, start, len);
+   if (last)
+   clear_page_dirty_for_io(page);
+}
+
 /*
  * Unlike set/clear which is dependent on each page status, for test all bits
  * are tested in the same way.
@@ -240,6 +280,7 @@ bool btrfs_subpage_test_##name(const struct btrfs_fs_info 
*fs_info, \
 }
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(uptodate);
 IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(error);
+IMPLEMENT_BTRFS_SUBPAGE_TEST_OP(dirty);
 
 /*
  * Note that, in selftests (extent-io-tests), we can have empty fs_info passed
@@ -276,3 +317,5 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info 
*fs_info,\
 IMPLEMENT_BTRFS_PAGE_OPS(uptodate, SetPageUptodate, ClearPageUptodate,
 PageUptodate);
 IMPLEMENT_BTRFS_PAGE_OPS(error, SetPageError, ClearPageError, PageError);
+IMPLEMENT_BTRFS_PAGE_OPS(dirty, set_page_dirty, clear_page_dirty_for_io,
+PageDirty);
diff --git a/fs/btrfs/subpage.h b/fs/btrfs/subpage.h
index b86a4881475d..adaece5ce294 100644
--- a/fs/btrfs/subpage.h
+++ b/fs/btrfs/subpage.h
@@ -20,6 +20,7 @@ struct btrfs_subpage {
spinlock_t lock;
u16 uptodate_bitmap;
u16 error_bitmap;
+   u16 dirty_bitmap;
union {
/*
 * Structures only used by metadata
@@ -87,5 +88,19 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info 
*fs_info, \
 
 DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
 DECLARE_BTRFS_SUBPAGE_OPS(error);
+DECLARE_BTRFS_SUBPAGE_OPS(dirty);
+
+/*
+ * Extra clear_and_test function for subpage dirty bitmap.
+ *
+ * Return true if we're the last bits in the dirty_bitmap and clear the
+ * dirty_bitmap.
+ * Return false otherwise.
+ *
+ * NOTE: Callers should manually clear page dirty for true case, as we have
+ * extra handling for tree blocks.
+ */
+bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
+   struct page *page, u64 start, u32 len);
 
 #endif
-- 
2.30.1



[PATCH v2 15/15] btrfs: introduce submit_eb_subpage() to submit a subpage metadata page

2021-03-10 Thread Qu Wenruo
The new function, submit_eb_subpage(), will submit all the dirty extent
buffers in the page.

The major difference between submit_eb_page() and submit_eb_subpage()
is:
- How to grab extent buffer
  Now we use find_extent_buffer_nospinlock() rather than using
  page::private.

All other different handling is already done in functions like
lock_extent_buffer_for_io() and write_one_eb().

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 95 
 1 file changed, 95 insertions(+)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 18730d3ab50f..7281ec72a86a 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4293,6 +4293,98 @@ static noinline_for_stack int write_one_eb(struct 
extent_buffer *eb,
return ret;
 }
 
+/*
+ * Submit one subpage btree page.
+ *
+ * The main difference between submit_eb_page() is:
+ * - Page locking
+ *   For subpage, we don't rely on page locking at all.
+ *
+ * - Flush write bio
+ *   We only flush bio if we may be unable to fit current extent buffers into
+ *   current bio.
+ *
+ * Return >=0 for the number of submitted extent buffers.
+ * Return <0 for fatal error.
+ */
+static int submit_eb_subpage(struct page *page,
+struct writeback_control *wbc,
+struct extent_page_data *epd)
+{
+   struct btrfs_fs_info *fs_info = btrfs_sb(page->mapping->host->i_sb);
+   int submitted = 0;
+   u64 page_start = page_offset(page);
+   int bit_start = 0;
+   int nbits = BTRFS_SUBPAGE_BITMAP_SIZE;
+   int sectors_per_node = fs_info->nodesize >> fs_info->sectorsize_bits;
+   int ret;
+
+   /* Lock and write each dirty extent buffers in the range */
+   while (bit_start < nbits) {
+   struct btrfs_subpage *subpage = (struct btrfs_subpage 
*)page->private;
+   struct extent_buffer *eb;
+   unsigned long flags;
+   u64 start;
+
+   /*
+* Take private lock to ensure the subpage won't be detached
+* halfway.
+*/
+   spin_lock(&page->mapping->private_lock);
+   if (!PagePrivate(page)) {
+   spin_unlock(&page->mapping->private_lock);
+   break;
+   }
+   spin_lock_irqsave(&subpage->lock, flags);
+   if (!((1 << bit_start) & subpage->dirty_bitmap)) {
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   spin_unlock(&page->mapping->private_lock);
+   bit_start++;
+   continue;
+   }
+
+   start = page_start + bit_start * fs_info->sectorsize;
+   bit_start += sectors_per_node;
+
+   /*
+* Here we just want to grab the eb without touching extra
+* spin locks. So here we call find_extent_buffer_nospinlock().
+*/
+   eb = find_extent_buffer_nospinlock(fs_info, start);
+   spin_unlock_irqrestore(&subpage->lock, flags);
+   spin_unlock(&page->mapping->private_lock);
+
+   /*
+* The eb has already reached 0 refs thus find_extent_buffer()
+* doesn't return it. We don't need to write back such eb
+* anyway.
+*/
+   if (!eb)
+   continue;
+
+   ret = lock_extent_buffer_for_io(eb, epd);
+   if (ret == 0) {
+   free_extent_buffer(eb);
+   continue;
+   }
+   if (ret < 0) {
+   free_extent_buffer(eb);
+   goto cleanup;
+   }
+   ret = write_one_eb(eb, wbc, epd);
+   free_extent_buffer(eb);
+   if (ret < 0)
+   goto cleanup;
+   submitted++;
+   }
+   return submitted;
+
+cleanup:
+   /* We hit error, end bio for the submitted extent buffers */
+   end_write_bio(epd, ret);
+   return ret;
+}
+
 /*
  * Submit all page(s) of one extent buffer.
  *
@@ -4325,6 +4417,9 @@ static int submit_eb_page(struct page *page, struct 
writeback_control *wbc,
if (!PagePrivate(page))
return 0;
 
+   if (btrfs_sb(page->mapping->host->i_sb)->sectorsize < PAGE_SIZE)
+   return submit_eb_subpage(page, wbc, epd);
+
spin_lock(&mapping->private_lock);
if (!PagePrivate(page)) {
spin_unlock(&mapping->private_lock);
-- 
2.30.1



[PATCH v2 14/15] btrfs: make lock_extent_buffer_for_io() to be subpage compatible

2021-03-10 Thread Qu Wenruo
For subpage metadata, we don't use page locking at all.
So just skip the page locking part for subpage.

The rest of the routine can be reused.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent_io.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 74525ebf2b83..18730d3ab50f 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3937,7 +3937,13 @@ static noinline_for_stack int 
lock_extent_buffer_for_io(struct extent_buffer *eb
 
btrfs_tree_unlock(eb);
 
-   if (!ret)
+   /*
+* Either we don't need to submit any tree block, or we're submitting
+* subpage.
+* Subpage metadata doesn't use page locking at all, so we can skip
+* the page locking.
+*/
+   if (!ret || fs_info->sectorsize < PAGE_SIZE)
return ret;
 
num_pages = num_extent_pages(eb);
-- 
2.30.1



Re: [PATCH] btrfs-progs: output sectorsize related warning message into stdout

2021-03-10 Thread David Sterba
On Wed, Mar 10, 2021 at 08:18:16AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/3/9 9:33 PM, David Sterba wrote:
> > On Tue, Mar 09, 2021 at 03:39:09PM +0800, Qu Wenruo wrote:
> >> Since commit 90020a760584 ("btrfs-progs: mkfs: refactor how we handle
> >> sectorsize override") we have extra warning message if the sectorsize of
> >> mkfs doesn't match page size.
> >>
> >> But this warning is shown on stderr, which makes a lot of fstests cases
> >> fail due to golden output mismatch.
> >
> > Well, no. Using message helpers in progs is what we want to do
> > everywhere, working around fstests output matching design is fixing the
> > problem in the wrong place. That this is fragile has been is known and
> > I want to keep the liberty to adjust output in progs as users need, not
> > as fstests require.
> 
> OK, then I guess the best way to fix the problem is to add sysfs
> interface to export supported rw/ro sectorsize.
> 
> It shouldn't be that complex and would be small enough for next merge
> window.

The subpage support should be advertised somewhere in sysfs, so the
range of supported sector sizes sounds like a good idea.


Re: Re: nfs subvolume access?

2021-03-10 Thread Graham Cobb
On 10/03/2021 08:09, Ulli Horlacher wrote:
> On Wed 2021-03-10 (07:59), Hugo Mills wrote:
> 
>>> On tsmsrvj I have in /etc/exports:
>>>
>>> /data/fex   tsmsrvi(rw,async,no_subtree_check,no_root_squash)
>>>
>>> This is a btrfs subvolume with snapshots:
>>>
>>> root@tsmsrvj:~# btrfs subvolume list /data
>>> ID 257 gen 35 top level 5 path fex
>>> ID 270 gen 36 top level 257 path fex/spool
>>> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
>>> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
>>> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
>>> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
>>>
>>> root@tsmsrvj:~# find /data/fex | wc -l
>>> 489887
> 
>>I can't remember if this is why, but I've had to put a distinct
>> fsid field in each separate subvolume being exported:
>>
>> /srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
> 
> I must export EACH subvolume?!

I have had similar problems. I *think* the current case is that modern
NFS, using NFS V4, can cope with the whole disk being accessible without
giving each subvolume its own FSID (which I have stopped doing).

HOWEVER, I think that find (and anything else which uses fsids and inode
numbers) will see subvolumes as having duplicated inodes.
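
If you want to check, something like this on the client (a hypothetical
command, using the paths from Ulli's report) should show both
directories with the same device and the same inode number -- btrfs
subvolume roots all report inode 256 -- which is what find's loop
detection trips over:

  stat -c 'dev=%d ino=%i %n' /nfs/tsmsrvj/fex /nfs/tsmsrvj/fex/spool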

> The snapshots are generated automatically (via cron)!
> I cannot add them to /etc/exports

Well, you could write some scripts... but I don't think it is necessary.
I *think* it is only necessary if you want `find` to be able to cross
between subvolumes on the NFS mounted disks.
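
If you did want each snapshot exported with its own fsid, the cron job
that creates the snapshots could also (re-)export them. An untested
sketch, with made-up fsid numbers and Ulli's paths:

  #!/bin/sh
  # give every snapshot subvolume its own export with a unique fsid
  fsid=4096
  for snap in /data/fex/spool/.snapshot/*; do
      exportfs -o rw,async,no_subtree_check,no_root_squash,fsid=$fsid \
          "tsmsrvi:$snap"
      fsid=$((fsid + 1))
  done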

However, I am NOT an NFS expert, nor have I done a lot of work on this.
I might be wrong. But I do NFS mount my snapshots disk remotely and use
it. And I do see occasional complaints from find, but I live with it.


Re: [PATCH] btrfs: add test for cases when a dio write has to fallback to a buffered write

2021-03-10 Thread Filipe Manana
On Sun, Mar 7, 2021 at 3:24 PM Eryu Guan  wrote:
>
> On Sun, Mar 07, 2021 at 03:07:43PM +, Filipe Manana wrote:
> > On Sun, Mar 7, 2021 at 2:41 PM Eryu Guan  wrote:
> > >
> > > On Thu, Feb 11, 2021 at 05:01:18PM +, fdman...@kernel.org wrote:
> > > > From: Filipe Manana 
> > > >
> > > > Test cases where a direct IO write, with O_DSYNC, can not be done and has
> > > > to fallback to a buffered write.
> > > >
> > > > This is motivated by a regression that was introduced in kernel 5.10 by
> > > > commit 0eb79294dbe328 ("btrfs: dio iomap DSYNC workaround") and was
> > > > fixed in kernel 5.11 by commit ecfdc08b8cc65d ("btrfs: remove dio iomap
> > > > DSYNC workaround").
> > > >
> > > > Signed-off-by: Filipe Manana 
> > >
> > > Sorry for the late review..
> > >
> > > So this is supposed to fail with v5.10 kernel, right? But I got it
> > > passed
> >
> > Because either you are testing with a patched 5.10.x kernel, or you
> > don't have CONFIG_BTRFS_ASSERT=y in your config.
> > The fix landed in 5.10.18:
>
> You're right, I don't have CONFIG_BTRFS_ASSERT=y. As the test dumps the
> od output of the file content, I thought the failure would be a data
> corruption and expected an od output diff failure.

I see the test has not been merged yet; do you expect me to update anything
in the patch?

Thanks.

>
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.10.18&id=a6703c71153438d3ebdf58a75d53dd5e57b33095
> >
> > >
> > >   [root@fedoravm xfstests]# ./check -s btrfs btrfs/231
> > >   SECTION   -- btrfs
> > >   RECREATING-- btrfs on /dev/mapper/testvg-lv1
> > >   FSTYP -- btrfs
> > >   PLATFORM  -- Linux/x86_64 fedoravm 5.10.0 #6 SMP Sun Mar 7 22:25:35 
> > > CST 2021
> > >   MKFS_OPTIONS  -- /dev/mapper/testvg-lv2
> > >   MOUNT_OPTIONS -- /dev/mapper/testvg-lv2 /mnt/scratch
> > >
> > >   btrfs/231 13s ...  8s
> > >   Ran: btrfs/231
> > >   Passed all 1 tests
> > >
> > >   SECTION   -- btrfs
> > >   =
> > >   Ran: btrfs/231
> > >   Passed all 1 tests
> > >
> > > > ---
> > > >  tests/btrfs/231 | 61 +
> > > >  tests/btrfs/231.out | 21 
> > > >  tests/btrfs/group   |  1 +
> > > >  3 files changed, 83 insertions(+)
> > > >  create mode 100755 tests/btrfs/231
> > > >  create mode 100644 tests/btrfs/231.out
> > > >
> > > > diff --git a/tests/btrfs/231 b/tests/btrfs/231
> > > > new file mode 100755
> > > > index ..9a404f57
> > > > --- /dev/null
> > > > +++ b/tests/btrfs/231
> > > > @@ -0,0 +1,61 @@
> > > > +#! /bin/bash
> > > > +# SPDX-License-Identifier: GPL-2.0
> > > > +# Copyright (C) 2021 SUSE Linux Products GmbH. All Rights Reserved.
> > > > +#
> > > > +# FS QA Test No. btrfs/231
> > > > +#
> > > > +# Test cases where a direct IO write, with O_DSYNC, can not be done 
> > > > and has to
> > > > +# fallback to a buffered write.
> > > > +#
> > > > +seq=`basename $0`
> > > > +seqres=$RESULT_DIR/$seq
> > > > +echo "QA output created by $seq"
> > > > +
> > > > +tmp=/tmp/$$
> > > > +status=1 # failure is the default!
> > > > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > > > +
> > > > +_cleanup()
> > > > +{
> > > > + cd /
> > > > + rm -f $tmp.*
> > > > +}
> > > > +
> > > > +# get standard environment, filters and checks
> > > > +. ./common/rc
> > > > +. ./common/filter
> > > > +. ./common/attr
> > > > +
> > > > +# real QA test starts here
> > > > +_supported_fs btrfs
> > > > +_require_scratch
> > > > +_require_odirect
> > > > +_require_chattr c
> > > > +
> > > > +rm -f $seqres.full
> > > > +
> > > > +_scratch_mkfs >>$seqres.full 2>&1
> > > > +_scratch_mount
> > > > +
> > > > +# First lets test with an attempt to write into a file range with 
> > > > compressed
> > > > +# extents.
> > > > +touch $SCRATCH_MNT/foo
> > > > +$CHATTR_PROG +c $SCRATCH_MNT/foo
> > >
> > > It's not so clear to me why writing into a compressed file is required,
> > > would you please add more comments?
> >
> > The test is meant to test cases where we can deterministically make a
> > direct IO write fallback to buffered IO.
> > There are 2 such cases:
> >
> > 1) Attempting to write to an unaligned offset - this was the bug in
> > 5.10 that resulted in a crash when CONFIG_BTRFS_ASSERT=y (default in
> > many distros, such as openSUSE).
> >
> > 2) Writing to a range that has compressed extents. This has nothing to
> > do with the 5.10 regression, I just added it since there's no existing
> > test that explicitly and deterministically triggers this.
> > So yes, I decided to add a test case for all possible cases of
> > direct IO falling back to buffered instead of adding one just to test
> > a regression (and to help detect any possible future regressions).
>
> Sounds good, thanks!
>
> Eryu
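For reference, the two deterministic fallback cases Filipe describes can be
reproduced by hand with xfs_io; this is only a sketch, with made-up file names
and arbitrary offsets, not an excerpt from the test itself:

  MNT=/mnt/scratch   # any mounted btrfs filesystem

  # Case 1: an unaligned offset/length makes the O_DIRECT+O_SYNC write fall
  # back to a buffered write instead of failing with EINVAL
  xfs_io -f -d -s -c "pwrite -S 0xab 1111 1111" $MNT/unaligned

  # Case 2: writing with O_DIRECT over a range backed by compressed extents
  # also falls back to buffered IO
  touch $MNT/compressed
  chattr +c $MNT/compressed
  xfs_io -c "pwrite -S 0xcd 0 128K" -c "fsync" $MNT/compressed
  xfs_io -d -s -c "pwrite -S 0xef 0 64K" $MNT/compressed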


no memory is freed after snapshots are deleted

2021-03-10 Thread telsch
Dear devs,

After my root partition was full, I deleted the last monthly snapshots. However,
no memory was freed.
So far, rebalancing has helped:

btrfs balance start -v -musage=0 /
btrfs balance start -v -dusage=0 /

I have deleted all snapshots, but no memory is being freed this time.

du -hcsx /
16G /
16G total

btrfs-progs v5.10.1
Linux arch-server 5.10.21-1-lts #1 SMP Sun, 07 Mar 2021 11:56:15 + x86_64 
GNU/Linux

btrfs fi show /
Label: none  uuid: 3d242677-6a15-4ce7-853a-5c82f0427769
Total devices 1 FS bytes used 37.24GiB
devid1 size 39.95GiB used 39.95GiB path /dev/mapper/root

btrfs fi df /
Data, single: total=36.45GiB, used=35.86GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.72GiB, used=1.38GiB
GlobalReserve, single: total=215.94MiB, used=0.00B


Any ideas how to solve this without recreating the filesystem?

thx!


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Neal Gompa
On Thu, Feb 25, 2021 at 7:23 PM Shiyang Ruan  wrote:
>
> This patchset is attempt to add CoW support for fsdax, and take XFS,
> which has both reflink and fsdax feature, as an example.
>
> Changes from V1:
>  - Factor some helper functions to simplify dax fault code
>  - Introduce iomap_apply2() for dax_dedupe_file_range_compare()
>  - Fix mistakes and other problems
>  - Rebased on v5.11
>
> One of the key mechanisms that needs to be implemented in fsdax is CoW.  Copy
> the data from the srcmap before we actually write data to the destination
> iomap.  And we only copy the range in which data won't be changed.
>
> Another mechanism is range comparison.  In the page cache case, readpage()
> is used to load data from disk into the page cache in order to be able to
> compare data.  In the fsdax case, readpage() does not work.  So we need
> another way to compare data, with direct access support.
>
> With the two mechanisms implemented in fsdax, we are able to make reflink
> and fsdax work together in XFS.
>
>
> Some of the patches are picked up from Goldwyn's patchset.  I made some
> changes to adapt to this patchset.
>
> (Rebased on v5.11)

Forgive my ignorance, but is there a reason why this isn't wired up to
Btrfs at the same time? It seems weird to me that adding a feature
like DAX to work with CoW filesystems is not being wired into *the*
CoW filesystem in the Linux kernel that fully takes advantage of
copy-on-write. I'm aware that XFS supports reflinks and does some
datacow stuff, but I don't know if I would consider XFS integration
sufficient for integrating this feature now, especially if it's
possible that the design might not work with Btrfs (I hadn't seen any
feedback from Btrfs developers, though given how much email there is
here, it's entirely possible that I missed it).


-- 
真実はいつも一つ!/ Always, there's only one truth!


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Matthew Wilcox
On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> Forgive my ignorance, but is there a reason why this isn't wired up to
> Btrfs at the same time? It seems weird to me that adding a feature

btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.

If you think about it, btrfs and DAX are diametrically opposite things.
DAX is about giving raw access to the hardware.  btrfs is about offering
extra value (RAID, checksums, ...), none of which can be done if the
filesystem isn't in the read/write path.

That's why there's no DAX support in btrfs.  If you want DAX, you have
to give up all the features you like in btrfs.  So you may as well use
a different filesystem.


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Neal Gompa
On Wed, Mar 10, 2021 at 8:02 AM Matthew Wilcox  wrote:
>
> On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > Forgive my ignorance, but is there a reason why this isn't wired up to
> > Btrfs at the same time? It seems weird to me that adding a feature
>
> btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.
>
> If you think about it, btrfs and DAX are diametrically opposite things.
> DAX is about giving raw access to the hardware.  btrfs is about offering
> extra value (RAID, checksums, ...), none of which can be done if the
> filesystem isn't in the read/write path.
>
> That's why there's no DAX support in btrfs.  If you want DAX, you have
> to give up all the features you like in btrfs.  So you may as well use
> a different filesystem.

So does that mean that DAX is incompatible with those filesystems when
layered on DM (e.g. through LVM)?

Also, based on what you're saying, that means that DAX'd resources
would not be able to use reflinks on XFS, right? That'd put it in
similar territory as swap files on Btrfs, I would think.



--
真実はいつも一つ!/ Always, there's only one truth!


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Matthew Wilcox
On Wed, Mar 10, 2021 at 08:36:06AM -0500, Neal Gompa wrote:
> On Wed, Mar 10, 2021 at 8:02 AM Matthew Wilcox  wrote:
> >
> > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > Btrfs at the same time? It seems weird to me that adding a feature
> >
> > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.
> >
> > If you think about it, btrfs and DAX are diametrically opposite things.
> > DAX is about giving raw access to the hardware.  btrfs is about offering
> > extra value (RAID, checksums, ...), none of which can be done if the
> > filesystem isn't in the read/write path.
> >
> > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > to give up all the features you like in btrfs.  So you may as well use
> > a different filesystem.
> 
> So does that mean that DAX is incompatible with those filesystems when
> layered on DM (e.g. through LVM)?

Yes.  It might be possible to work through RAID-0 or read-only through
RAID-1, but I'm not sure anybody's bothered to do that work.

> Also, based on what you're saying, that means that DAX'd resources
> would not be able to use reflinks on XFS, right? That'd put it in
> similar territory as swap files on Btrfs, I would think.

You can use DAX with reflinks because the CPU can do read-only mmaps.
On a write fault, we break the reflink, copy the data and put in a
writable PTE.


Re: no memory is freed after snapshots are deleted

2021-03-10 Thread Graham Cobb
On 10/03/2021 12:07, telsch wrote:
> Dear devs,
> 
> after my root partiton was full, i deleted the last monthly snapshots. 
> however, no memory was freed.
> so far rebalancing helped:
> 
>   btrfs balance start -v -musage=0 /
>   btrfs balance start -v -dusage=0 /
> 
> i have deleted all snapshots, but no memory is being freed this time.

Don't forget that, in general, deleting a snapshot does nothing - if the
original files are still there (or any other snapshots of the same files
are still there). In my experience, if you *really* need space urgently
you are best off starting with deleting some big files *and* all the
snapshots containing them, rather than starting by deleting snapshots.

If you are doing balances with low space, I find it useful to watch
dmesg to see if the balance is hitting problems finding space to even
free things up.

However, one big advantage of btrfs is that you can easily temporarily
add a small amount of space while you sort things out. Just plug in a
USB memory stick, and add it to the filesystem using 'btrfs device add'.

I don't recommend leaving it as part of the filesystem for long - it is
too easy for the memory stick to fail, or for you to remove it, forgetting
how important it is, but it can be useful when you are trying to do
things like remove snapshots and files or run balance. Don't forget to
use btrfs device remove to remove it - not just unplugging it!
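A minimal sketch of that workflow (/dev/sdX is a placeholder for the temporary
device; double-check the name before adding it):

  # Temporarily add a spare device to gain working space
  btrfs device add /dev/sdX /

  # Delete snapshots / big files, then compact partially used block groups
  btrfs balance start -dusage=20 /

  # When done, migrate the data back off and return to a single device
  btrfs device remove /dev/sdX /

The remove step relocates any chunks that ended up on the stick, so wait for it
to finish before unplugging the device.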



Re: [PATCH] btrfs-progs: build: Use PKG_CONFIG instead of pkg-config

2021-03-10 Thread David Sterba
On Tue, Mar 09, 2021 at 10:24:40PM +0100, Heiko Becker wrote:
> Hard-coding the pkg-config executable might result in build errors
> on system and cross environments that have prefixed toolchains. The
> PKG_CONFIG variable already holds the proper one and is already used
> in a few other places.
> 
> Signed-off-by: Heiko Becker 

Added to devel, thanks.


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Goldwyn Rodrigues
On 13:02 10/03, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > Forgive my ignorance, but is there a reason why this isn't wired up to
> > Btrfs at the same time? It seems weird to me that adding a feature
> 
> btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.
> 
> If you think about it, btrfs and DAX are diametrically opposite things.
> DAX is about giving raw access to the hardware.  btrfs is about offering
> extra value (RAID, checksums, ...), none of which can be done if the
> filesystem isn't in the read/write path.
> 
> That's why there's no DAX support in btrfs.  If you want DAX, you have
> to give up all the features you like in btrfs.  So you may as well use
> a different filesystem.

DAX on btrfs has been attempted[1]. Of course, we could not
have checksums or multi-device with it. However, we got stuck on
associating a shared extent with the same page mapping: basically the
TODO above dax_associate_entry().

Shiyang has proposed a way to disassociate the existing mapping, but I
don't think that is the best solution. DAX for CoW will not work until
we have a way of mapping a page to multiple inodes (page->mapping),
which will convert a 1-N inode-page mapping to an M-N inode-page mapping.

[1] https://lore.kernel.org/linux-btrfs/20190429172649.8288-1-rgold...@suse.de/

-- 
Goldwyn


Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Matthew Wilcox
On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> On 13:02 10/03, Matthew Wilcox wrote:
> > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > Btrfs at the same time? It seems weird to me that adding a feature
> > 
> > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX support.
> > 
> > If you think about it, btrfs and DAX are diametrically opposite things.
> > DAX is about giving raw access to the hardware.  btrfs is about offering
> > extra value (RAID, checksums, ...), none of which can be done if the
> > filesystem isn't in the read/write path.
> > 
> > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > to give up all the features you like in btrfs.  So you may as well use
> > a different filesystem.
> 
> DAX on btrfs has been attempted[1]. Of course, we could not

But why?  A completeness fetish?  I don't understand why you decided
to do this work.

> have checksums or multi-device with it. However, got stuck on
> associating a shared extent on the same page mapping: basically the
> TODO above dax_associate_entry().
> 
> Shiyang has proposed a way to disassociate existing mapping, but I
> don't think that is the best solution. DAX for CoW will not work until
> we have a way of mapping a page to multiple inodes (page->mapping),
> which will convert a 1-N inode-page mapping to M-N inode-page mapping.

If you're still thinking in terms of pages, you're doing DAX wrong.
DAX should work without a struct page.


Aw: Re: no memory is freed after snapshots are deleted

2021-03-10 Thread telsch
> Don't forget that, in general, deleting a snapshot does nothing - if the
> original files are still there (or any other snapshots of the same files
> are still there). In my experience, if you *really* need space urgently
> you are best of starting with deleting some big files *and* all the
> snapshots containing them, rather than starting by deleting snapshots.
>
> If you are doing balances with low space, I find it useful to watch
> dmesg to see if the balance is hitting problems finding space to even
> free things up.
>
> However, one big advantage of btrfs is that you can easily temporarily
> add a small amount of space while you sort things out. Just plug in a
> USB memory stick, and add it to the filesystem using 'btrfs device add'.
>
> I don't recommend leaving it as part of the filesystem for long - it is
> too easy for the memory stick to fail, or for you remove it forgetting
> how important it is, but it can be useful when you are trying to do
> things like remove snapshots and files or run balance. Don't forget to
> use btrfs device remove to remove it - not just unplugging it!

Yes, that's why I deleted all snapshots.
I had also added a ramdisk to work around the low memory problem during balance,
but without success.

Any other ideas to fix this?


Re: nfs subvolume access?

2021-03-10 Thread Ulli Horlacher
On Wed 2021-03-10 (09:35), Graham Cobb wrote:

> >>> root@tsmsrvj:~# find /data/fex | wc -l
> >>> 489887
> > 
> >>I can't remember if this is why, but I've had to put a distinct
> >> fsid field in each separate subvolume being exported:
> >>
> >> /srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
> > 
> > I must export EACH subvolume?!
> 
> I have had similar problems. I *think* the current case is that modern
> NFS, using NFS V4, can cope with the whole disk being accessible without
> giving each subvolume its own FSID (which I have stopped doing).

I cannot use NFS4 (for several reasons). I must use NFS3


> > The snapshots are generated automatically (via cron)!
> > I cannot add them to /etc/exports
> 
> Well, you could write some scripts... but I don't think it is necessary.
> I *think* it is only necessary if you want `find` to be able to cross
> between subvolumes on the NFS mounted disks.

It is not only a find problem:

root@fex:/nfs/tsmsrvj/fex# ls -R
:
spool
ls: ./spool: not listing already-listed directory


And as I wrote: there is no such problem with Ubuntu 18.04!
So, is it a btrfs or an NFS bug?


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<5bded122-8adf-e5e7-dceb-37a3875f1...@cobb.uk.net>


Re: Aw: Re: no memory is freed after snapshots are deleted

2021-03-10 Thread Remi Gauvin
On 2021-03-10 10:49 a.m., telsch wrote:

> 
> Any other ideas to fix this?
> 


We can check that there are, in fact, no unexpected subvolumes.

btrfs sub list /

In particular, I wonder if you have subvolumes/snapshots hidden behind
the mounted subvolume.

Also, I don't think it's really the case here with only 16GB reported by
du, but do you have any large, heavily fragmented files, such as VM
virtual disks?  Without defragmentation or compression, I've seen those
consume more than twice their reported file size on btrfs.








Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Goldwyn Rodrigues
On 14:26 10/03, Matthew Wilcox wrote:
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > On 13:02 10/03, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > > Btrfs at the same time? It seems weird to me that adding a feature
> > > 
> > > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX 
> > > support.
> > > 
> > > If you think about it, btrfs and DAX are diametrically opposite things.
> > > DAX is about giving raw access to the hardware.  btrfs is about offering
> > > extra value (RAID, checksums, ...), none of which can be done if the
> > > filesystem isn't in the read/write path.
> > > 
> > > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > > to give up all the features you like in btrfs.  So you may as well use
> > > a different filesystem.
> > 
> > DAX on btrfs has been attempted[1]. Of course, we could not
> 
> But why?  A completeness fetish?  I don't understand why you decided
> to do this work.

If only I had a penny every time I heard "why would you want to do that?"

> 
> > have checksums or multi-device with it. However, got stuck on
> > associating a shared extent on the same page mapping: basically the
> > TODO above dax_associate_entry().
> > 
> > Shiyang has proposed a way to disassociate existing mapping, but I
> > don't think that is the best solution. DAX for CoW will not work until
> > we have a way of mapping a page to multiple inodes (page->mapping),
> > which will convert a 1-N inode-page mapping to M-N inode-page mapping.
> 
> If you're still thinking in terms of pages, you're doing DAX wrong.
> DAX should work without a struct page.

Not pages specifically, but mappings.
fsdax needs the mappings during the page fault, and it breaks when both
files fault on the same shared extent.

For Reference: WARN_ON_ONCE(page->mapping && page->mapping != mapping)
in dax_disassociate_entry().

-- 
Goldwyn


Re: no memory is freed after snapshots are deleted

2021-03-10 Thread Zygo Blaxell
On Wed, Mar 10, 2021 at 01:07:47PM +0100, telsch wrote:
> Dear devs,
> 
> after my root partiton was full, i deleted the last monthly snapshots. 
> however, no memory was freed.
> so far rebalancing helped:
> 
>   btrfs balance start -v -musage=0 /
>   btrfs balance start -v -dusage=0 /
> 
> i have deleted all snapshots, but no memory is being freed this time.
> 
> du -hcsx /
> 16G /
> 16G total
> 
> btrfs-progs v5.10.1
> Linux arch-server 5.10.21-1-lts #1 SMP Sun, 07 Mar 2021 11:56:15 + x86_64 
> GNU/Linux
> 
> btrfs fi show /
> Label: none  uuid: 3d242677-6a15-4ce7-853a-5c82f0427769
> Total devices 1 FS bytes used 37.24GiB
> devid1 size 39.95GiB used 39.95GiB path /dev/mapper/root
> 
> btrfs fi df /
> Data, single: total=36.45GiB, used=35.86GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=1.72GiB, used=1.38GiB
> GlobalReserve, single: total=215.94MiB, used=0.00B

Check

btrfs sub list -d /

to make sure there are no deleted snapshots pending.  If a snapshot
has a single open file on it (or a bind mount or similar equivalent to
an open file), the cleaner will not delete it until the last open file
descriptor is closed.  You'll have to find the process with the open
file and convince it to close the file (or kill the process).  This can
be tricky since lsof and fuser are not able to identify open files on
deleted snapshots, so these tools are not usable.  Rebooting will force
all the files to be closed.

You can also use 'compsize' and measure the difference in size
between 'referenced' and 'usage' columns.  If referenced is below
usage then you have some big extents with small references (this can
be caused by prealloc and some database write patterns, or by using
a non-btrfs-extent-aware dedupe tool).  Defrag will get rid of those
if you have no snapshots.  You will have to start at the top of the
filesystem tree and work your way down until you find the offending files,
as compsize can only give you a summary.
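A sketch of those checks (compsize is a separate tool that may need to be
installed; adjust the paths to the directories you suspect):

  # Any snapshots still queued for deletion by the cleaner?
  btrfs subvolume list -d /

  # Optionally block until the cleaner has finished removing them
  btrfs subvolume sync /

  # Compare referenced vs. actual on-disk usage for a directory tree
  compsize -x /path/to/suspect/dir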

> any ideas how to solve this without recreating filesystem?
> 
> thx!


Re: Re: Re: no memory is freed after snapshots are deleted

2021-03-10 Thread telsch


> We can check that there are, in fact, no unexpected subvolumes.
>
> btrfs sub list /
>
> In particular, I wonder if you have subvolumes/snapshots hidden behind
> the mounted subvolume.
>
> Also, I don't think it's really the case here with only 16GB reported by
> du,, but do you have any large, heavily fragmented files, such as VM
> virtual disks?  Without defragmentation or compression, I've seen those
> consume more than twice their reported file size on btrfs.

Thanks for the hint! I had only considered snapshots, but not subvolumes. Since
they are treated as file system boundaries and I had set the option -x
(--one-file-system), it is now clear where I am wasting my storage space.
My fault - thank you!


Multiple files with the same name in one directory

2021-03-10 Thread Martin Raiber
Hi,

I have this in a btrfs directory. Linux kernel 5.10.16, no errors in dmesg, no 
scrub errors:

ls -lh
total 19G
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
-rwxr-x--- 1 root root  783 Mar 10 14:56 disk_config.dat
...

disk_config.dat gets written using fsync+rename (write the new version to
disk_config.dat.new, fsync disk_config.dat.new, then rename it to disk_config.dat
-- the parent directory fsync is missing).

So far no negative consequences... (except that programs might get confused).

echo 3 > /proc/sys/vm/drop_caches doesn't help.

Regards,
Martin Raiber
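One way to narrow this down, as a suggestion only (device and path names below
are examples): check whether the duplicate entries share an inode number, and
run an offline, read-only metadata check.

  # Do the duplicate directory entries all point at the same inode?
  ls -lai /path/to/that/directory

  # With the filesystem unmounted (e.g. from a rescue environment):
  btrfs check --readonly /dev/sdX

If they share one inode, the directory items themselves are duplicated; if not,
something stranger is going on.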



Re: nfs subvolume access?

2021-03-10 Thread Forza



From: Ulli Horlacher -- Sent: 2021-03-10 16:55

> On Wed 2021-03-10 (09:35), Graham Cobb wrote:
> 
>> >>> root@tsmsrvj:~# find /data/fex | wc -l
>> >>> 489887
>> > 
>> >>I can't remember if this is why, but I've had to put a distinct
>> >> fsid field in each separate subvolume being exported:
>> >>
>> >> /srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash
>> > 
>> > I must export EACH subvolume?!
>> 
>> I have had similar problems. I *think* the current case is that modern
>> NFS, using NFS V4, can cope with the whole disk being accessible without
>> giving each subvolume its own FSID (which I have stopped doing).
> 
> I cannot use NFS4 (for several reasons). I must use NFS3
> 
> 
>> > The snapshots are generated automatically (via cron)!
>> > I cannot add them to /etc/exports
>> 
>> Well, you could write some scripts... but I don't think it is necessary.
>> I *think* it is only necessary if you want `find` to be able to cross
>> between subvolumes on the NFS mounted disks.
> 
> It is not only a find problem:
> 
> root@fex:/nfs/tsmsrvj/fex# ls -R
> :
> spool
> ls: ./spool: not listing already-listed directory
> 
> 
> And as I wrote: there is no such problem with Ubuntu 18.04!
> So, is it a btrfs or a nfs bug?
> 
>

Did you try the fsid on the export (not separate exports for all subvols)?
Without it, the NFS server tries to derive the fsid from the filesystem itself,
which can cause weird issues. It is good practice to always use fsid on all
exports in any case.

At least with the NFS4 server on my Ubuntu NFS servers at work, there are no issues
with subvols for clients that mount with vers=3.

You may want to enable debug logging on your server. 
https://wiki.tnonline.net/w/Blog/NFS_Server_Logging

/Forza



Re: nfs subvolume access?

2021-03-10 Thread Ulli Horlacher
On Wed 2021-03-10 (18:29), Forza wrote:

> Did you try the fsid on the export?

Yes:

root@tsmsrvj:/etc# grep tsm exports 
/data/fex   tsmsrvi(rw,async,no_subtree_check,no_root_squash,fsid=0x0011)

root@tsmsrvj:/etc# exportfs -va
exporting fex.rus.uni-stuttgart.de:/data/fex
exporting tsmsrvi.rus.uni-stuttgart.de:/data/fex


root@tsmsrvi:~# umount /nfs/tsmsrvj/fex

root@tsmsrvi:~# mount -o nfsvers=3,proto=tcp tsmsrvj:/data/fex /nfs/tsmsrvj/fex

root@tsmsrvi:~# find /nfs/tsmsrvj/fex
/nfs/tsmsrvj/fex
find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same 
file system loop as '/nfs/tsmsrvj/fex'.



> You may want to enable debug logging on your server.
> https://wiki.tnonline.net/w/Blog/NFS_Server_Logging

root@tsmsrvj:/etc# rpcdebug -m nfsd all
nfsd   sock fh export svc proc fileop auth repcache xdr lockd

root@tsmsrvj:/var/log# tailf kern.log
2021-03-10 18:45:17 [106259.649850] nfsd_dispatch: vers 3 proc 1
2021-03-10 18:45:17 [106259.649854] nfsd: GETATTR(3)  8: 00010001 0011 
   
2021-03-10 18:45:17 [106259.649856] nfsd: fh_verify(8: 00010001 0011 
   )
2021-03-10 18:45:17 [106259.650306] nfsd_dispatch: vers 3 proc 4
2021-03-10 18:45:17 [106259.650310] nfsd: ACCESS(3)   8: 00010001 0011 
    0x1f
2021-03-10 18:45:17 [106259.650313] nfsd: fh_verify(8: 00010001 0011 
   )
2021-03-10 18:45:17 [106259.650869] nfsd_dispatch: vers 3 proc 17
2021-03-10 18:45:17 [106259.650874] nfsd: READDIR+(3) 8: 00010001 0011 
    32768 bytes at 0
2021-03-10 18:45:17 [106259.650877] nfsd: fh_verify(8: 00010001 0011 
   )
2021-03-10 18:45:17 [106259.650883] nfsd: fh_verify(8: 00010001 0011 
   )
2021-03-10 18:45:17 [106259.650903] nfsd: fh_compose(exp 00:31/256 /fex, 
ino=256)
2021-03-10 18:45:17 [106259.650907] nfsd: fh_compose(exp 00:31/256 /, ino=256)
2021-03-10 18:45:17 [106259.651454] nfsd_dispatch: vers 3 proc 3
2021-03-10 18:45:17 [106259.651459] nfsd: LOOKUP(3)   8: 00010001 0011 
    spool
2021-03-10 18:45:17 [106259.651463] nfsd: fh_verify(8: 00010001 0011 
   )
2021-03-10 18:45:17 [106259.651471] nfsd: nfsd_lookup(fh 8: 00010001 0011 
   , spool)
2021-03-10 18:45:17 [106259.651477] nfsd: fh_compose(exp 00:31/256 fex/spool, 
ino=256)

Hmmm... and now?

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<55bb7f3.9ce44d1.1781d2fe...@tnonline.net>


Re: misc-next a646ddc2bba2: kernel BUG at fs/btrfs/ctree.c:1210! tree mod log

2021-03-10 Thread Zygo Blaxell
On Fri, Mar 05, 2021 at 12:45:08PM +, Filipe Manana wrote:
> On Fri, Mar 5, 2021 at 1:08 AM Zygo Blaxell
>  wrote:
> >
> > On Tue, Mar 02, 2021 at 04:24:19PM +, Filipe Manana wrote:
> > > On Sat, Feb 27, 2021 at 3:53 PM Zygo Blaxell
> > >  wrote:
> > > >
> > > > Hit this twice so far, while running the usual
> > > > balance/dedupe/rsync/snapshots/all at once on:
> > > >
> > > > a646ddc2bba2 (kdave-gitlab/misc-next) btrfs: unlock extents in 
> > > > btrfs_zero_range in case of quota reservation errors
> > > >
> > > > Looks like tree mod log bugs are back (or never went away?).
> > >
> > > Different bugs causing similar problems.
> > >
> > > Try this:   https://pastebin.com/VkesNs4R
> >
> > I put that patch on top of a646ddc2bba2 and ran it on the same test VM
> > for a few days.  It has now reached its previous uptime record without
> > incident.
> >
> > It looks like a good fix.  I'll leave it running for a few days more to
> > be sure.
> 
> Great!
> 
> Ok, so that seems to confirm what I suspected and what made me run into
> other sorts of weirdness during logical ino calls (returning
> unexpected results).
> I haven't hit the BUG_ON() as you do, but if this is indeed caused by
> allowing unwritten extent buffers to be reused in the same transaction,
> it's no wonder that the BUG_ON() and many other weird issues happen.
> 
> Can you now try the following version?
> 
> https://pastebin.com/raw/5VHjzdn6
> 
> Leave it for at least as many days as you tested the previous patch,
> hell, even a week or more if you can.

Just to clean up the email thread:

The new patch (5VHjzdn6) has now run 87 hours on top of the original
misc-next a646ddc2bba2, exceeding the best uptime without a patch by
about 30 hours.

Now that I read this again, I notice I forgot to ask if you wanted the new
patch instead of the old one, or on top of it.  I guess it doesn't matter
now--I ran each one separately and they both worked for my test case.

> Thanks, much appreciated.
> 
> >
> > Thanks!
> >
> > > Thanks.
> > >
> > > >
> > > > [40422.398920][T28995] BTRFS info (device dm-0): balance: 
> > > > canceled
> > > > [40607.394003][T11577] BTRFS info (device dm-0): balance: start 
> > > > -dlimit=9
> > > > [40607.398597][T11577] BTRFS info (device dm-0): relocating 
> > > > block group 315676950528 flags data
> > > > [40643.279661][T11577] BTRFS info (device dm-0): found 12686 
> > > > extents, loops 1, stage: move data extents
> > > > [40692.752695][T11577] BTRFS info (device dm-0): found 12686 
> > > > extents, loops 2, stage: update data pointers
> > > > [40704.860522][T11577] BTRFS info (device dm-0): relocating 
> > > > block group 314603208704 flags data
> > > > [40704.919977][T19054] [ cut here ]
> > > > [40704.921895][T19054] kernel BUG at fs/btrfs/ctree.c:1210!
> > > > [40704.923497][T19054] invalid opcode:  [#1] SMP KASAN PTI
> > > > [40704.925549][T19054] CPU: 1 PID: 19054 Comm: crawl_335 
> > > > Tainted: GW 5.11.0-2d11c0084b02-misc-next+ #89
> > > > [40704.929192][T19054] Hardware name: QEMU Standard PC (i440FX 
> > > > + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > > [40704.931640][T19054] RIP: 
> > > > 0010:__tree_mod_log_rewind+0x3b1/0x3c0
> > > > [40704.933301][T19054] Code: 05 48 8d 74 10 65 ba 19 00 00 00 
> > > > e8 89 f3 06 00 e9 a7 fd ff ff 4c 8d 7b 2c 4c 89 ff e8 f8 bd c8 ff 48 63 
> > > > 43 2c e9 a2 fe ff ff <0f> 0b 0f 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 
> > > > 1f 44 00 00 55 48
> > > > [40704.938566][T19054] RSP: 0018:c90001eb70b8 EFLAGS: 
> > > > 00010297
> > > > [40704.940483][T19054] RAX:  RBX: 
> > > > 88812344e400 RCX: b28933b6
> > > > [40704.942668][T19054] RDX: 0007 RSI: 
> > > > dc00 RDI: 88812344e42c
> > > > [40704.945002][T19054] RBP: c90001eb7108 R08: 
> > > > 111020b60a20 R09: ed1020b60a20
> > > > [40704.948513][T19054] R10: 888105b050f9 R11: 
> > > > ed1020b60a1f R12: 00ee
> > > > [40704.951601][T19054] R13: 8880195520c0 R14: 
> > > > 8881bc958500 R15: 88812344e42c
> > > > [40704.954607][T19054] FS:  7fd1955e8700() 
> > > > GS:8881f560() knlGS:
> > > > [40704.957704][T19054] CS:  0010 DS:  ES:  CR0: 
> > > > 80050033
> > > > [40704.960125][T19054] CR2: 7efdb7928718 CR3: 
> > > > 00010103a006 CR4: 00170ee0
> > > > [40704.963186][T19054] Call Trace:
> > > > [40704.964229][T19054]  btrfs_search_old_slot+0x265/0x10d0
> > > > [40704.967068][T19054]  ? lock_acquired+0xbb/0x600
> > > > [40704.969148][T19054]  ? btrfs_search_slot+0x1090/0x1090
> > > > [40704.971106][T19054]  ? free_extent_buffer.part.61+0xd7/0x140
> > > > [40704.973020][T19054]  ? free_extent_buffer+0x

Re: [PATCH] fstest: random read fio test for read policy

2021-03-10 Thread Anand Jain



Hi,

 How about a review of this test case, or suggestions for any better ideas?
 If we are OK with this, I will add 2 other types of workloads
 that we need for testing read policies.

Thanks, Anand

On 22/2/21 10:48 pm, Anand Jain wrote:

This test case runs fio for the raid1/10/1c3/1c4 profiles and all the
available read policies in the system. At the end of the test case,
a comparative summary of the results is written to the $seqres.full file.

The LOAD_FACTOR parameter controls the fio scalability. For
LOAD_FACTOR = 1 (the default), this runs fio with file size = 1G and
number of jobs = 1, which takes approximately 65s to finish.

There are two objectives for this test case: 1. by default, with
LOAD_FACTOR = 1, it sanity-tests the read policies; and 2. run the
test case individually with a larger LOAD_FACTOR, for example 10,
for a comparative study of read policy performance, as sketched below.

I chose tests/btrfs as the place for this test case, as it contains
many things which are btrfs specific and didn't fit well under perf.
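For example, assuming a normal fstests setup with a SCRATCH_DEV_POOL of at least
four devices configured in local.config, the two modes would be driven like this:

  # 1. Sanity run with the default load
  ./check btrfs/231

  # 2. Heavier run for comparing read policy performance; see $seqres.full
  LOAD_FACTOR=10 ./check btrfs/231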

Signed-off-by: Anand Jain 
---
  tests/btrfs/231 | 145 
  tests/btrfs/231.out |   2 +
  tests/btrfs/group   |   1 +
  3 files changed, 148 insertions(+)
  create mode 100755 tests/btrfs/231
  create mode 100644 tests/btrfs/231.out

diff --git a/tests/btrfs/231 b/tests/btrfs/231
new file mode 100755
index ..c08b5826f60a
--- /dev/null
+++ b/tests/btrfs/231
@@ -0,0 +1,145 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2021 Anand Jain.  All Rights Reserved.
+#
+# FS QA Test 231
+#
+# Random read fio test for raid1(10)(c3)(c4) with available
+# read policy.
+#
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+fio_config=$tmp.fio
+fio_results=$tmp.fio_out
+
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_require_scratch_dev_pool 4
+
+njob=$LOAD_FACTOR
+size=$LOAD_FACTOR
+_require_scratch_size $(($size * 2 * 1024 * 1024))
+echo size=$size njob=$njob >> $seqres.full
+
+make_fio_config()
+{
+   #Set direct IO to true, help to avoid buffered IO so that read happens
+   #from the devices.
+   cat >$fio_config <> $fio_config
+   echo  "filename=$SCRATCH_MNT/$job/file" >> $fio_config
+done
+_require_fio $fio_config
+cat $fio_config >> $seqres.full
+
+work()
+{
+   raid=$1
+
+   echo - profile: $raid -- >> $seqres.full
+   echo >> $seqres.full
+   _scratch_pool_mkfs $raid >> $seqres.full 2>&1
+   _scratch_mount
+
+   fsid=$($BTRFS_UTIL_PROG filesystem show -m $SCRATCH_MNT | grep uuid: | \
+$AWK_PROG '{print $4}')
+   readpolicy_path="/sys/fs/btrfs/$fsid/read_policy"
+   policies=$(cat $readpolicy_path | sed 's/\[//g' | sed 's/\]//g')
+
+   for policy in $policies; do
+   echo $policy > $readpolicy_path || _fail "Fail to set readpolicy"
+   echo -n "activating readpolicy: " >> $seqres.full
+   cat $readpolicy_path >> $seqres.full
+   echo >> $seqres.full
+
+   > $fio_results
+   $FIO_PROG --output=$fio_results $fio_config
+   cat $fio_results >> $seqres.full
+   done
+
+   _scratch_unmount
+   _scratch_dev_pool_put
+}
+
+_scratch_dev_pool_get 2
+work "-m raid1 -d single"
+
+_scratch_dev_pool_get 2
+work "-m raid1 -d raid1"
+
+_scratch_dev_pool_get 4
+work "-m raid10 -d raid10"
+
+_scratch_dev_pool_get 3
+work "-m raid1c3 -d raid1c3"
+
+_scratch_dev_pool_get 4
+work "-m raid1c4 -d raid1c4"
+
+
+# Now benchmark the raw device performance
+> $fio_config
+make_fio_config
+_scratch_dev_pool_get 4
+for dev in $SCRATCH_DEV_POOL; do
+   echo "[$dev]" >> $fio_config
+   echo  "filename=$dev" >> $fio_config
+done
+_require_fio $fio_config
+cat $fio_config >> $seqres.full
+
+echo - profile: raw disk -- >> $seqres.full
+echo >> $seqres.full
+> $fio_results
+$FIO_PROG --output=$fio_results $fio_config
+cat $fio_results >> $seqres.full
+
+echo >> $seqres.full
+echo "==== Summary ====" >> $seqres.full
+cat $seqres.full | egrep -A1 "Run status|Disk stats|profile:|readpolicy" >> $seqres.full
+
+echo "Silence is golden"
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/231.out b/tests/btrfs/231.out
new file mode 100644
index ..a31b87a289bf
--- /dev/null
+++ b/tests/btrfs/231.out
@@ -0,0 +1,2 @@
+QA output created by 231
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index a7c6598326c4..7f449d1db99e 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -233,3 +233,4 @@
  228 auto quick volume
  229 auto quick send clone
  230 a

Re: [PATCH 0/3] btrfs: Convert kmap/memset/kunmap to memzero_user()

2021-03-10 Thread Andrew Morton
On Tue,  9 Mar 2021 13:21:34 -0800 ira.we...@intel.com wrote:

> Previously this was submitted to convert to zero_user()[1].  zero_user() is not
> the same as memzero_user() and in fact some zero_user() calls may be better off
> as memzero_user().  Regardless it was incorrect to convert btrfs to
> zero_user().
> 
> This series corrects this by lifting memzero_user(), converting it to
> kmap_local_page(), and then using it in btrfs.

This impacts btrfs more than MM.  I suggest the btrfs developers grab
it, with my

Acked-by: Andrew Morton 



Re: [PATCH v2 00/10] fsdax,xfs: Add reflink&dedupe support for fsdax

2021-03-10 Thread Dan Williams
On Wed, Mar 10, 2021 at 6:27 AM Matthew Wilcox  wrote:
>
> On Wed, Mar 10, 2021 at 08:21:59AM -0600, Goldwyn Rodrigues wrote:
> > On 13:02 10/03, Matthew Wilcox wrote:
> > > On Wed, Mar 10, 2021 at 07:30:41AM -0500, Neal Gompa wrote:
> > > > Forgive my ignorance, but is there a reason why this isn't wired up to
> > > > Btrfs at the same time? It seems weird to me that adding a feature
> > >
> > > btrfs doesn't support DAX.  only ext2, ext4, XFS and FUSE have DAX 
> > > support.
> > >
> > > If you think about it, btrfs and DAX are diametrically opposite things.
> > > DAX is about giving raw access to the hardware.  btrfs is about offering
> > > extra value (RAID, checksums, ...), none of which can be done if the
> > > filesystem isn't in the read/write path.
> > >
> > > That's why there's no DAX support in btrfs.  If you want DAX, you have
> > > to give up all the features you like in btrfs.  So you may as well use
> > > a different filesystem.
> >
> > DAX on btrfs has been attempted[1]. Of course, we could not
>
> But why?  A completeness fetish?  I don't understand why you decided
> to do this work.

Isn't DAX useful for pagecache minimization on read even if it is
awkward for a copy-on-write fs?

Seems it would be a useful case to have COW'd VM images on BTRFS that
don't need superfluous page cache allocations.


Re: nfs subvolume access?

2021-03-10 Thread Ulli Horlacher
On Wed 2021-03-10 (08:46), Ulli Horlacher wrote:
> When I try to access a btrfs filesystem via nfs, I get the error:
> 
> root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same 
> file system loop as '/nfs/tsmsrvj/fex'.

It is even worse:

root@tsmsrvj:# grep localhost /etc/exports
/data/fex   localhost(rw,async,no_subtree_check,no_root_squash)

root@tsmsrvj:# mount localhost:/data/fex /nfs/localhost/fex

root@tsmsrvj:# du -s /data/fex
64282240	/data/fex

root@tsmsrvj:# du -s /nfs/localhost/fex
du: WARNING: Circular directory structure.
This almost certainly means that you have a corrupted file system.
NOTIFY YOUR SYSTEM MANAGER.
The following directory is part of the cycle:
  /nfs/localhost/fex/spool

0   /nfs/localhost/fex

root@tsmsrvj:# btrfs subvolume list /data
ID 257 gen 42 top level 5 path fex
ID 270 gen 42 top level 257 path fex/spool
ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test

root@tsmsrvj:# uname -a
Linux tsmsrvj 5.4.0-66-generic #74-Ubuntu SMP Wed Jan 27 22:54:38 UTC 2021 
x86_64 x86_64 x86_64 GNU/Linux

root@tsmsrvj:# btrfs version
btrfs-progs v5.4.1

root@tsmsrvj:# dpkg -l | grep nfs-
ii  nfs-common 1:1.3.4-2.5ubuntu3.3  
amd64NFS support files common to client and server
ii  nfs-kernel-server  1:1.3.4-2.5ubuntu3.3  
amd64support for NFS kernel server

The same bug appears if the nfs server and client are different hosts or the
client is an older Ubuntu 18.04 system.


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20210310074620.ga2...@tik.uni-stuttgart.de>