Re: [PATCH] btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids

2018-02-28 Thread Anand Jain



On 03/01/2018 12:42 PM, kbuild test robot wrote:

Hi Anand,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.16-rc3 next-20180228]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-rename-btrfs_close_extra_device-to-btrfs_free_extra_devids/20180301-120850
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: x86_64-randconfig-x016-201808 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
 # save the attached .config to linux build tree
 make ARCH=x86_64

All errors (new ones prefixed by >>):

fs/btrfs/disk-io.c: In function 'open_ctree':

>> fs/btrfs/disk-io.c:2783:2: error: implicit declaration of function 'btrfs_free_extra_devids'; did you mean 'btrfs_free_extra_devid'? [-Werror=implicit-function-declaration]

  btrfs_free_extra_devids(fs_devices, 0);
  ^~~
  btrfs_free_extra_devid
cc1: some warnings being treated as errors


 There is a v2 which already fixed this.

Thanks,
Anand

Re: [PATCH] btrfs: rename btrfs_close_extra_device to btrfs_free_extra_devids

2018-02-28 Thread kbuild test robot
Hi Anand,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.16-rc3 next-20180228]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Anand-Jain/btrfs-rename-btrfs_close_extra_device-to-btrfs_free_extra_devids/20180301-120850
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: x86_64-randconfig-x016-201808 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-1) 7.3.0
reproduce:
# save the attached .config to linux build tree
make ARCH=x86_64 

All errors (new ones prefixed by >>):

   fs/btrfs/disk-io.c: In function 'open_ctree':
>> fs/btrfs/disk-io.c:2783:2: error: implicit declaration of function 'btrfs_free_extra_devids'; did you mean 'btrfs_free_extra_devid'? [-Werror=implicit-function-declaration]
 btrfs_free_extra_devids(fs_devices, 0);
 ^~~
 btrfs_free_extra_devid
   cc1: some warnings being treated as errors

vim +2783 fs/btrfs/disk-io.c

  2396  
  2397  int open_ctree(struct super_block *sb,
  2398 struct btrfs_fs_devices *fs_devices,
  2399 char *options)
  2400  {
  2401  u32 sectorsize;
  2402  u32 nodesize;
  2403  u32 stripesize;
  2404  u64 generation;
  2405  u64 features;
  2406  struct btrfs_key location;
  2407  struct buffer_head *bh;
  2408  struct btrfs_super_block *disk_super;
  2409  struct btrfs_fs_info *fs_info = btrfs_sb(sb);
  2410  struct btrfs_root *tree_root;
  2411  struct btrfs_root *chunk_root;
  2412  int ret;
  2413  int err = -EINVAL;
  2414  int num_backups_tried = 0;
  2415  int backup_index = 0;
  2416  int max_active;
  2417  int clear_free_space_tree = 0;
  2418  
  2419  tree_root = fs_info->tree_root = btrfs_alloc_root(fs_info, GFP_KERNEL);
  2420  chunk_root = fs_info->chunk_root = btrfs_alloc_root(fs_info, GFP_KERNEL);
  2421  if (!tree_root || !chunk_root) {
  2422  err = -ENOMEM;
  2423  goto fail;
  2424  }
  2425  
  2426  ret = init_srcu_struct(&fs_info->subvol_srcu);
  2427  if (ret) {
  2428  err = ret;
  2429  goto fail;
  2430  }
  2431  
  2432  ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL);
  2433  if (ret) {
  2434  err = ret;
  2435  goto fail_srcu;
  2436  }
  2437  fs_info->dirty_metadata_batch = PAGE_SIZE *
  2438  (1 + ilog2(nr_cpu_ids));
  2439  
  2440  ret = percpu_counter_init(&fs_info->delalloc_bytes, 0, GFP_KERNEL);
  2441  if (ret) {
  2442  err = ret;
  2443  goto fail_dirty_metadata_bytes;
  2444  }
  2445  
  2446  ret = percpu_counter_init(&fs_info->bio_counter, 0, GFP_KERNEL);
  2447  if (ret) {
  2448  err = ret;
  2449  goto fail_delalloc_bytes;
  2450  }
  2451  
  2452  INIT_RADIX_TREE(&fs_info->fs_roots_radix, GFP_ATOMIC);
  2453  INIT_RADIX_TREE(&fs_info->buffer_radix, GFP_ATOMIC);
  2454  INIT_LIST_HEAD(&fs_info->trans_list);
  2455  INIT_LIST_HEAD(&fs_info->dead_roots);
  2456  INIT_LIST_HEAD(&fs_info->delayed_iputs);
  2457  INIT_LIST_HEAD(&fs_info->delalloc_roots);
  2458  INIT_LIST_HEAD(&fs_info->caching_block_groups);
  2459  spin_lock_init(&fs_info->delalloc_root_lock);
  2460  spin_lock_init(&fs_info->trans_lock);
  2461  spin_lock_init(&fs_info->fs_roots_radix_lock);
  2462  spin_lock_init(&fs_info->delayed_iput_lock);
  2463  spin_lock_init(&fs_info->defrag_inodes_lock);
  2464  spin_lock_init(&fs_info->tree_mod_seq_lock);
  2465  spin_lock_init(&fs_info->super_lock);
  2466  spin_lock_init(&fs_info->qgroup_op_lock);
  2467  spin_lock_init(&fs_info->buffer_lock);
  2468  spin_lock_init(&fs_info->unused_bgs_lock);
  2469  rwlock_init(&fs_info->tree_mod_log_lock);
  2470  mutex_init(&fs_info->unused_bg_unpin_mutex);
  2471  mutex_init(&fs_info->delete_unused_bgs_mutex);
  2472  mutex_init(&fs_info->reloc_mutex);
  2473  mutex_init(&fs_info->delalloc_root_mutex);
  2474  mutex_init(&fs_info->cleaner_delayed_iput_mutex);
  2475  seqlock_init(&fs_info->profiles_lock);
  2476  
  2477  

Re: [PATCH 3/4] btrfs-progs: check/lowmem mode: Check inline extent size

2018-02-28 Thread Su Yue



On 03/01/2018 10:47 AM, Qu Wenruo wrote:

Signed-off-by: Qu Wenruo 


Looks good to me.

Reviewed-by: Su Yue 

---
  check/mode-lowmem.c | 8 
  1 file changed, 8 insertions(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 62bcf3d2e126..44c58163f8f7 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -1417,6 +1417,7 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey,
u64 csum_found; /* In byte size, sectorsize aligned */
u64 search_start;   /* Logical range start we search for csum */
u64 search_len; /* Logical range len we search for csum */
+   u32 max_inline_extent_size = BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info);
unsigned int extent_type;
unsigned int is_hole;
int compressed = 0;
@@ -1440,6 +1441,13 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey,
root->objectid, fkey->objectid, fkey->offset);
err |= FILE_EXTENT_ERROR;
}
+   if (extent_num_bytes > max_inline_extent_size) {
+   error(
+   "root %llu EXTENT_DATA[%llu %llu] too large inline extent size, have 
%llu, max: %u",
+   root->objectid, fkey->objectid, fkey->offset,
+   extent_num_bytes, max_inline_extent_size);
+   err |= FILE_EXTENT_ERROR;
+   }
if (!compressed && extent_num_bytes != item_inline_len) {
error(
"root %llu EXTENT_DATA[%llu %llu] wrong inline size, have: %llu, 
expected: %u",




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 2/4] btrfs-progs: check/original mode: Check inline extent size

2018-02-28 Thread Su Yue



On 03/01/2018 10:47 AM, Qu Wenruo wrote:

The kernel doesn't allow an inline extent equal to or larger than 4K,
and for an inline extent larger than 4K, __btrfs_drop_extents() can
return -EOPNOTSUPP and cause an unexpected error.

Check it in original mode.

Signed-off-by: Qu Wenruo 


Looks good to me.

Reviewed-by: Su Yue 


---
  check/main.c  | 4 
  check/mode-original.h | 1 +
  2 files changed, 5 insertions(+)

diff --git a/check/main.c b/check/main.c
index 97baae583f04..ce41550ab16a 100644
--- a/check/main.c
+++ b/check/main.c
@@ -560,6 +560,8 @@ static void print_inode_error(struct btrfs_root *root, struct inode_record *rec)
fprintf(stderr, ", bad file extent");
if (errors & I_ERR_FILE_EXTENT_OVERLAP)
fprintf(stderr, ", file extent overlap");
+   if (errors & I_ERR_FILE_EXTENT_TOO_LARGE)
+   fprintf(stderr, ", inline file extent too large");
if (errors & I_ERR_FILE_EXTENT_DISCOUNT)
fprintf(stderr, ", file extent discount");
if (errors & I_ERR_DIR_ISIZE_WRONG)
@@ -1461,6 +1463,8 @@ static int process_file_extent(struct btrfs_root *root,
num_bytes = btrfs_file_extent_inline_len(eb, slot, fi);
if (num_bytes == 0)
rec->errors |= I_ERR_BAD_FILE_EXTENT;
+   if (num_bytes > BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info))
+   rec->errors |= I_ERR_FILE_EXTENT_TOO_LARGE;
rec->found_size += num_bytes;
num_bytes = (num_bytes + mask) & ~mask;
} else if (extent_type == BTRFS_FILE_EXTENT_REG ||
diff --git a/check/mode-original.h b/check/mode-original.h
index f859af478f0f..368de692fdd1 100644
--- a/check/mode-original.h
+++ b/check/mode-original.h
@@ -185,6 +185,7 @@ struct file_extent_hole {
  #define I_ERR_SOME_CSUM_MISSING     (1 << 12)
  #define I_ERR_LINK_COUNT_WRONG      (1 << 13)
  #define I_ERR_FILE_EXTENT_ORPHAN    (1 << 14)
+#define I_ERR_FILE_EXTENT_TOO_LARGE (1 << 15)
  
  struct inode_record {

struct list_head backrefs;




--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 4/4] btrfs-progs: test/convert: Add test case for invalid large inline data extent

2018-02-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 .../016-invalid-large-inline-extent/test.sh| 22 ++
 1 file changed, 22 insertions(+)
 create mode 100755 tests/convert-tests/016-invalid-large-inline-extent/test.sh

diff --git a/tests/convert-tests/016-invalid-large-inline-extent/test.sh b/tests/convert-tests/016-invalid-large-inline-extent/test.sh
new file mode 100755
index ..f37c7c09d2e7
--- /dev/null
+++ b/tests/convert-tests/016-invalid-large-inline-extent/test.sh
@@ -0,0 +1,22 @@
+#!/bin/bash
+# Check that btrfs-convert refuses to roll back the filesystem, and leaves the
+# fs and the convert image untouched
+
+source "$TEST_TOP/common"
+source "$TEST_TOP/common.convert"
+
+setup_root_helper
+prepare_test_dev
+check_prereq btrfs-convert
+check_global_prereq mke2fs
+
+convert_test_prep_fs ext4 mke2fs -t ext4 -b 4096
+
+# Create a 6K file, which should not be inlined
+run_check $SUDO_HELPER dd if=/dev/zero bs=2k count=3 of="$TEST_MNT/file1"
+
+run_check_umount_test_dev
+
+# convert_test_do_convert() will call btrfs check, which should expose any
+# invalid inline extent with too large size
+convert_test_do_convert
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/4] btrfs-progs: check/lowmem mode: Check inline extent size

2018-02-28 Thread Qu Wenruo
Signed-off-by: Qu Wenruo 
---
 check/mode-lowmem.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 62bcf3d2e126..44c58163f8f7 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -1417,6 +1417,7 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey,
u64 csum_found; /* In byte size, sectorsize aligned */
u64 search_start;   /* Logical range start we search for csum */
u64 search_len; /* Logical range len we search for csum */
+   u32 max_inline_extent_size = BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info);
unsigned int extent_type;
unsigned int is_hole;
int compressed = 0;
@@ -1440,6 +1441,13 @@ static int check_file_extent(struct btrfs_root *root, struct btrfs_key *fkey,
root->objectid, fkey->objectid, fkey->offset);
err |= FILE_EXTENT_ERROR;
}
+   if (extent_num_bytes > max_inline_extent_size) {
+   error(
+   "root %llu EXTENT_DATA[%llu %llu] too large inline extent size, 
have %llu, max: %u",
+   root->objectid, fkey->objectid, fkey->offset,
+   extent_num_bytes, max_inline_extent_size);
+   err |= FILE_EXTENT_ERROR;
+   }
if (!compressed && extent_num_bytes != item_inline_len) {
error(
"root %llu EXTENT_DATA[%llu %llu] wrong inline size, have: 
%llu, expected: %u",
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 0/4] Fix long standing -EOPNOTSUPP problem caused by

2018-02-28 Thread Qu Wenruo
The kernel doesn't support dropping a range inside an inline extent, and
prevents this from happening by limiting the max inline extent size to
min(max_inline, sectorsize - 1) in cow_file_range_inline().

However, btrfs-progs only inherits the BTRFS_MAX_INLINE_DATA_SIZE()
macro, which doesn't have the sectorsize check. And since btrfs-progs
defaults to a 16K nodesize, the macro above allows large inline extents
of over 15K.

This leads to unexpected kernel behavior.

The bug has existed since the very beginning of btrfs-convert, dating
back to 2008 when btrfs-convert was first introduced.
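
For a rough feel of the numbers, here is a small standalone sketch of the
arithmetic; the on-disk size constants below (header, item, inline data
offset) are illustrative assumptions, not values quoted from this patch set:

/*
 * Sketch: why a 16K nodesize allows >15K inline extents without the
 * sectorsize clamp.  HDR_SIZE, ITEM_SIZE and INLINE_START approximate
 * sizeof(struct btrfs_header), sizeof(struct btrfs_item) and
 * offsetof(struct btrfs_file_extent_item, disk_bytenr).
 */
#include <stdio.h>

#define NODESIZE     16384u  /* btrfs-progs default */
#define SECTORSIZE    4096u
#define HDR_SIZE       101u  /* assumed */
#define ITEM_SIZE       25u  /* assumed */
#define INLINE_START    21u  /* assumed */

int main(void)
{
	unsigned int max_item = NODESIZE - HDR_SIZE - ITEM_SIZE;
	unsigned int old_limit = max_item - INLINE_START;   /* unpatched */
	unsigned int new_limit = old_limit < SECTORSIZE - 1 ?
				 old_limit : SECTORSIZE - 1;    /* patched */

	printf("old inline limit: %u bytes (~%.1fK)\n",
	       old_limit, old_limit / 1024.0);
	printf("new inline limit: %u bytes\n", new_limit);
	return 0;
}

With those assumed constants the unpatched limit comes out near 15.9K,
matching the "over 15K" above, while the patched limit caps at 4095
bytes (sectorsize - 1).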

Qu Wenruo (4):
  btrfs-progs: Limit inline extent below page size
  btrfs-progs: check/original mode: Check inline extent size
  btrfs-progs: check/lowmem mode: Check inline extent size
  btrfs-progs: test/convert: Add test case for invalid large inline data
extent

 check/main.c   |  4 
 check/mode-lowmem.c|  8 
 check/mode-original.h  |  1 +
 ctree.h| 11 +--
 .../016-invalid-large-inline-extent/test.sh| 22 ++
 5 files changed, 44 insertions(+), 2 deletions(-)
 create mode 100755 tests/convert-tests/016-invalid-large-inline-extent/test.sh

-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/4] btrfs-progs: check/original mode: Check inline extent size

2018-02-28 Thread Qu Wenruo
The kernel doesn't allow an inline extent equal to or larger than 4K,
and for an inline extent larger than 4K, __btrfs_drop_extents() can
return -EOPNOTSUPP and cause an unexpected error.

Check it in original mode.

Signed-off-by: Qu Wenruo 
---
 check/main.c  | 4 
 check/mode-original.h | 1 +
 2 files changed, 5 insertions(+)

diff --git a/check/main.c b/check/main.c
index 97baae583f04..ce41550ab16a 100644
--- a/check/main.c
+++ b/check/main.c
@@ -560,6 +560,8 @@ static void print_inode_error(struct btrfs_root *root, struct inode_record *rec)
fprintf(stderr, ", bad file extent");
if (errors & I_ERR_FILE_EXTENT_OVERLAP)
fprintf(stderr, ", file extent overlap");
+   if (errors & I_ERR_FILE_EXTENT_TOO_LARGE)
+   fprintf(stderr, ", inline file extent too large");
if (errors & I_ERR_FILE_EXTENT_DISCOUNT)
fprintf(stderr, ", file extent discount");
if (errors & I_ERR_DIR_ISIZE_WRONG)
@@ -1461,6 +1463,8 @@ static int process_file_extent(struct btrfs_root *root,
num_bytes = btrfs_file_extent_inline_len(eb, slot, fi);
if (num_bytes == 0)
rec->errors |= I_ERR_BAD_FILE_EXTENT;
+   if (num_bytes > BTRFS_MAX_INLINE_DATA_SIZE(root->fs_info))
+   rec->errors |= I_ERR_FILE_EXTENT_TOO_LARGE;
rec->found_size += num_bytes;
num_bytes = (num_bytes + mask) & ~mask;
} else if (extent_type == BTRFS_FILE_EXTENT_REG ||
diff --git a/check/mode-original.h b/check/mode-original.h
index f859af478f0f..368de692fdd1 100644
--- a/check/mode-original.h
+++ b/check/mode-original.h
@@ -185,6 +185,7 @@ struct file_extent_hole {
 #define I_ERR_SOME_CSUM_MISSING     (1 << 12)
 #define I_ERR_LINK_COUNT_WRONG      (1 << 13)
 #define I_ERR_FILE_EXTENT_ORPHAN    (1 << 14)
+#define I_ERR_FILE_EXTENT_TOO_LARGE (1 << 15)
 
 struct inode_record {
struct list_head backrefs;
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/4] btrfs-progs: Limit inline extent below page size

2018-02-28 Thread Qu Wenruo
The kernel doesn't support dropping an extent inside an inline extent,
and it limits inline extents to just below sectorsize, so apply the same
limit in btrfs-progs.

This fixes unexpected -EOPNOTSUPP error from __btrfs_drop_extents() on
converted btrfs.

Fixes: 806528b8755f ("Add Yan Zheng's ext3->btrfs conversion program")
Reported-by: Peter Y. Chuang 
Signed-off-by: Qu Wenruo 
---
 ctree.h | 11 +--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/ctree.h b/ctree.h
index 17cdac76c58c..0282deef339b 100644
--- a/ctree.h
+++ b/ctree.h
@@ -20,6 +20,7 @@
 #define __BTRFS_CTREE_H__
 
 #include 
+#include "internal.h"
 
 #if BTRFS_FLAT_INCLUDES
 #include "list.h"
@@ -1195,8 +1196,14 @@ static inline u32 BTRFS_NODEPTRS_PER_BLOCK(const struct btrfs_fs_info *info)
(offsetof(struct btrfs_file_extent_item, disk_bytenr))
 static inline u32 BTRFS_MAX_INLINE_DATA_SIZE(const struct btrfs_fs_info *info)
 {
-   return BTRFS_MAX_ITEM_SIZE(info) -
-   BTRFS_FILE_EXTENT_INLINE_DATA_START;
+	/*
+	 * An inline extent larger than page size could lead to unexpected
+	 * kernel errors when dropping extents, so we need to limit the
+	 * inline extent size to less than sectorsize.
+	 */
+   return min_t(u32, info->sectorsize - 1,
+BTRFS_MAX_ITEM_SIZE(info) -
+BTRFS_FILE_EXTENT_INLINE_DATA_START);
 }
 
 static inline u32 BTRFS_MAX_XATTR_SIZE(const struct btrfs_fs_info *info)
-- 
2.16.2

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug report] BTRFS partition re-mounted as read-only after few minutes of use

2018-02-28 Thread Qu Wenruo


On 2018-03-01 02:36, Filipe Manana wrote:
> On Wed, Feb 28, 2018 at 5:50 PM, David Sterba  wrote:
>> On Wed, Feb 28, 2018 at 05:43:40PM +0100, peteryuchu...@gmail.com wrote:
>>> On my laptop, which has just been switched to BTRFS, the root partition
>>> (a BTRFS partition inside an encrypted LVM. The drive is an NVMe) is
>>> re-mounted as read-only few minutes after boot.
>>>
>>> Trace:
>>
>> By any chance, are there other messages from btrfs above the line?
>>>
>>> [  199.974591] [ cut here ]
>>> [  199.974593] BTRFS: Transaction aborted (error -95)
>>
>> -95 is EOPNOTSUPP, ie operation not supported
>>
>>> [  199.974647] WARNING: CPU: 0 PID: 324 at fs/btrfs/inode.c:3042 
>>> btrfs_finish_ordered_io+0x7ab/0x850 [btrfs]
>>
>> btrfs_finish_ordered_io::
>>
>>  3038 btrfs_ordered_update_i_size(inode, 0, ordered_extent);
>>  3039 ret = btrfs_update_inode_fallback(trans, root, inode);
>>  3040 if (ret) {
>>  3041 btrfs_abort_transaction(trans, ret);
>>  3042 goto out;
>>  3043 }
> 
> I don't know what's exactly in Arch's kernel, but looking at the
> 4.15.5 stable tag from kernel.org:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/inode.c?h=v4.15.5#n3042
> 
> The -EOPNOTSUPP error can come from btrfs_drop_extents() through the
> call to insert_reserved_file_extent().

__btrfs_drop_extents() will return -EOPNOTSUPP if we're dropping part of
an inline extent.

Something could be wrong with the inline extent generation in convert.


> We've had several reports of this kind of error in this location in
> the past and they happened to be on filesystems converted from extN to
> btrfs.
> I don't know however if this filesystem was from such a conversion nor
> if those old bugs in the conversion tool were fixed.

And since the user is using Arch and the kernel is the latest, that
normally means btrfs-progs is also the latest.

I need to double-check the convert inline extent code to ensure we
don't create too-large inline extents.

Thanks,
Qu

> 
> 
>>
>> the return code is unexpected here. And seeing 'operation not supported'
>> after an inode size change looks strange but EOPNOTSUPP could be returned
>> from some places.
>>
>> The transaction is aborted from a thread that finalizes some processing
>> so we don't have enough information here to see how it started. I
>> suspect there's a file that gets modified shortly after boot and hits the
>> problem. I don't think the EOPNOTSUPP is returned from the lower layers
>> (lvm encryption or nvme), so at this point seems like a btrfs bug.
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 



signature.asc
Description: OpenPGP digital signature


Re: spurious full btrfs corruption

2018-02-28 Thread Qu Wenruo


On 2018-02-28 23:50, Christoph Anton Mitterer wrote:
> Hey Qu
> 
> Thanks for still looking into this.
> I'm still in the recovery process (and there are other troubles at the
> university where I work, so everything will take me some time), but I
> have made a dd image of the broken fs, before I put a backup on the
> SSD, so that still exist in the case we need to do further debugging.
> 
> To thoroughly describe what has happened, let me go a bit back.
> 
> - Until last ~ September, I was using some Fujitsu E782, for at least
>   4 years, with no signs of data corruptions.

That's pretty good.

> - For my personal data, I have one[0] Seagate 8 TB SMR HDD, which I
>   backup (send/receive) on two further such HDDs (all these are
>   btrfs), and (rsync) on one further with ext4.
>   These files have all their SHA512 sums attached as XATTRs, which I
>   regularly test. So I think I can be pretty sure, that there was never
>   a case of silent data corruption and the RAM on the E782 is fine.

Backup practice like that couldn't be better.

> - In October I got a new notebook from the university... brand new
>   Fujitsu U757 in basically the best possible configuration.
>   I ran memtest86+ in it's normal (non-SMP) mode for roughly a day,
>   with no errors.
>   In SMP mode (which is considered experimental, I think) it crashes
>   reproducible on the same position. Many people seem to have this
>   (with exactly the same test, address range where it freezes) so I
>   considered it a bug in memtest86+ SMP mode, which it likely is.
>   A patch[1], didn't help me.

Normally I won't blame memory unless strange behavior happens, from
unexpected freeze to strange kernel panic.

But when this happens, a lot of things can go wrong.

> - Unfortunately from the beginning on that notebook showed many further
>   issues.
>   - CPU overheating[2]
>   - boot freezes, when the initramfs tool of Debian isn't configured
> to 
> blindly add all modules to the initramfs[3].
>   - spurious freezes, which I couldn't really debug any further since
> there is no serial port...

Netconsole would help here, especially since the U757 has an RJ45 port.
As long as you have another system able to run nc, it should catch any
kernel message and help us analyse whether it's really memory
corruption.

> in that cases neither Magic-SysRq nor
> even NumLock LEDs and so worked anymore.
> These freezes caused me some troubles with dpkg[4].
> The issue I describe there, could also shed some light on the whole
> situation, since it resulted out of the freezes.
> - The dealer replaced the thermal paste on the CPU and when the CPU
>   overheating and the freezes didn't go away, they sent the notebook
>   for one week to Fujitsu in Germany, who allegedly thoroughly tested
>   it with Windows, and found no errors.

That's unfortunately very common for consumer electronics, as few people
and corporations really care about Linux users on consumer laptops.

And since there are problems with the system (either hardware or
software), I already see a much higher possibility of hard resets.

> 
> - The notebooks SSD is a Samsung SSD 850 PRO 1TB, the same which I
>   already used with the old notebook.
>   A long SMART check after the corruption, brought no errors.

Also, as I use that SSD model myself (with a smaller capacity), the SSD
is a less likely culprit.

> 
> 
> - Just before the corruption on the btrfs happened, I decided it's
> time 
>   for a backup of the notebooks SSD (what an irony, I know), so I made
>   a snapshot of my one and only subvol, removed and non-precious data
>   from that snapshot, made anotjer ro-snapshot of that and removed the
>   rw snapshot.
> - The kernel was some 4.14.
> 
> - More or less after that, I saw the "BUG: unable to handle kernel
>   paging request at 9fb75f827100" which I reported here.
>   I'm not sure whether this had to do with btrfs at all, and even if
>   whether it was the fs on the SSD, or another one on an external HDD

It could be btrfs, and it would block the btrfs module from continuing,
which is almost a hard reset.

>   I've had mounted at that time.
>   sync/umount/remount,rw/shutdown all didn't work, and I had to power
>   off the node.
> - After that things went on basically as I described in my previous
>   mails to the list already.
>   - There were some csum errors.
>   - Checking these files with debsums (Debian stores MD5s of the
>     package's files) found no errors.
>   - A scrub brought no errors.
>   - Shortly after the scrub, further csum errors as well as:
> BTRFS critical (device dm-0): unable to find logical 4503658729209856 
> length 4096
>   - Then I booted from a rescue USB stick with kernel/btrfs-progs 4.12.
>   - fsck in normal/lowmem mode were okay except:
> Couldn't find free space inode 1
>   - I cleared the v1 free space cache
>   - a scrub failed with "ret=-1, errno=5 (Input/output error)"
>   - Things like these in the kernel log:
> Feb 21 17:43:09 heisenberg kernel

Re: [PATCH] Btrfs: fix unexpected -EEXIST when creating new inode

2018-02-28 Thread Liu Bo
On Wed, Feb 28, 2018 at 04:06:40PM +, Filipe Manana wrote:
> On Thu, Jan 25, 2018 at 6:02 PM, Liu Bo  wrote:
> > The highest objectid, which is assigned to new inode, is decided at
> > the time of initializing fs roots.  However, in cases where log replay
> > gets processed, the btree which fs root owns might be changed, so we
> > have to search it again for the highest objectid, otherwise creating
> > new inode would end up with -EEXIST.
> >
> > cc:  v4.4-rc6+
> > Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when 
> > loading tree root and subvolume roots")
> > Signed-off-by: Liu Bo 
> 
> Hi Bo,
> 
> Any reason to not have submitted a test case for fstests?
> Unless I missed something this should be easy to reproduce, deterministic 
> issue.
>

It's been on my todo list for a while, until I forgot about it... I will
do it after I fix the bugs I have now.

I found this originally from running generic/475.

Thanks,

-liubo
> thanks
> 
> > ---
> >  fs/btrfs/tree-log.c | 19 +++
> >  1 file changed, 19 insertions(+)
> >
> > diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> > index a7e6235..646cdbf 100644
> > --- a/fs/btrfs/tree-log.c
> > +++ b/fs/btrfs/tree-log.c
> > @@ -28,6 +28,7 @@
> >  #include "hash.h"
> >  #include "compression.h"
> >  #include "qgroup.h"
> > +#include "inode-map.h"
> >
> >  /* magic values for the inode_only field in btrfs_log_inode:
> >   *
> > @@ -5715,6 +5716,24 @@ int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
> >   path);
> > }
> >
> > +   if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) {
> > +   struct btrfs_root *root = wc.replay_dest;
> > +
> > +   btrfs_release_path(path);
> > +
> > +   /*
> > +* We have just replayed everything, and the highest
> > +* objectid of fs roots probably has changed in case
> > +* some inode_item's got replayed.
> > +*/
> > +   /*
> > +* root->objectid_mutex is not acquired as log replay
> > +* could only happen during mount.
> > +*/
> > +   ret = btrfs_find_highest_objectid(root,
> > + &root->highest_objectid);
> > +   }
> > +
> > key.offset = found_key.offset - 1;
> > wc.replay_dest->log_root = NULL;
> > free_extent_buffer(log->node);
> > --
> > 2.9.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> -- 
> Filipe David Manana,
> 
> “Whether you think you can, or you think you can't — you're right.”
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
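
For context on what the added btrfs_find_highest_objectid() call does:
roughly, it looks up the last key in the fs root and derives the highest
in-use objectid from it.  The following is a sketch reconstructed from
memory of that era's fs/btrfs/inode-map.c, not a verbatim quote:

/* Sketch (not verbatim kernel source): find the last key in the fs
 * tree; the highest in-use objectid is derived from that key. */
int btrfs_find_highest_objectid(struct btrfs_root *root, u64 *objectid)
{
	struct btrfs_path *path;
	struct btrfs_key search_key;
	struct btrfs_key found_key;
	int ret;

	path = btrfs_alloc_path();
	if (!path)
		return -ENOMEM;

	/* Search past any valid objectid... */
	search_key.objectid = BTRFS_LAST_FREE_OBJECTID;
	search_key.type = (u8)-1;
	search_key.offset = (u64)-1;
	ret = btrfs_search_slot(NULL, root, &search_key, path, 0, 0);
	if (ret < 0)
		goto out;

	/* ...so the previous slot holds the last existing item. */
	if (path->slots[0] > 0) {
		btrfs_item_key_to_cpu(path->nodes[0], path->slots[0] - 1,
				      &found_key);
		*objectid = max_t(u64, found_key.objectid,
				  BTRFS_FIRST_FREE_OBJECTID - 1);
	} else {
		*objectid = BTRFS_FIRST_FREE_OBJECTID - 1;
	}
	ret = 0;
out:
	btrfs_free_path(path);
	return ret;
}

Re-running this after log replay picks up any inode items the replay has
added, which is what avoids the -EEXIST on the next inode allocation.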


Re: btrfs space used issue

2018-02-28 Thread Austin S. Hemmelgarn

On 2018-02-28 14:54, Duncan wrote:

> Austin S. Hemmelgarn posted on Wed, 28 Feb 2018 14:24:40 -0500 as
> excerpted:
> 
>>> I believe this effect is what Austin was referencing when he suggested
>>> the defrag, tho defrag won't necessarily /entirely/ clear it up.  One
>>> way to be /sure/ it's cleared up would be to rewrite the entire file,
>>> deleting the original, either by copying it to a different filesystem
>>> and back (with the off-filesystem copy guaranteeing that it can't use
>>> reflinks to the existing extents), or by using cp's --reflink=never
>>> option.
>>> (FWIW, I prefer the former, just to be sure, using temporary copies to
>>> a suitably sized tmpfs for speed where possible, tho obviously if the
>>> file is larger than your memory size that's not possible.)
>>
>> Correct, this is why I recommended trying a defrag.  I've actually never
>> seen things so bad that a simple defrag didn't fix them however (though
>> I have seen a few cases where the target extent size had to be set
>> higher than the default of 20MB).
> 
> Good to know.  I knew larger target extent sizes could help, but between
> not being sure they'd entirely fix it and not wanting to get too far down
> into the detail when the copy-off-the-filesystem-and-back option is
> /sure/ to fix the problem, I decided to handwave that part of it. =:^)

FWIW, a target size of 128M has fixed it on all 5 cases I've seen where
the default didn't.  In theory, there's probably some really
pathological case where that won't work, but I've just gotten into the
habit of using that by default on all my systems now and haven't seen
any issues so far (but like you I'm pretty much exclusively on SSD's,
and the small handful of things I have on traditional hard disks are all
archival storage with WORM access patterns).

>> Also, as counter-intuitive as it
>> might sound, autodefrag really doesn't help much with this, and can
>> actually make things worse.
> 
> I hadn't actually seen that here, but suspect I might, now, as previous
> autodefrag behavior on my system tended to rewrite the entire file[1],
> thereby effectively giving me the benefit of the copy-away-and-back
> technique without actually bothering, while that "bug" has now been fixed.
> 
> I sort of wish the old behavior remained an option, maybe
> radicalautodefrag or something, and must confess to being a bit concerned
> over the eventual impact here now that autodefrag does /not/ rewrite the
> entire file any more, but oh, well...  Chances are it's not going to be
> /that/ big a deal since I /am/ on fast ssd, and if it becomes one, I
> guess I can just setup say firefox-profile-defrag.timer jobs or whatever,
> as necessary.
> 
> ---
> [1] I forgot whether it was ssd behavior, or compression, or what, but
> something I'm using here apparently forced autodefrag to rewrite the
> entire file, and a recent "bugfix" changed that so it's more in line with
> the normal autodefrag behavior.  I rather preferred the old behavior,
> especially since I'm on fast ssd and all my large files tend to be write-
> once no-rewrite anyway, but I understand the performance implications on
> large active-rewrite files such as gig-plus database and VM-image files,
> so...

Hmm.  I've actually never seen such behavior myself.  I do know that
compression impacts how autodefrag works (autodefrag tries to rewrite up
to 64k around a random write, but compression operates in 128k blocks),
but beyond that I'm not sure what might have caused this.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs space used issue

2018-02-28 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 28 Feb 2018 14:24:40 -0500 as
excerpted:

>> I believe this effect is what Austin was referencing when he suggested
>> the defrag, tho defrag won't necessarily /entirely/ clear it up.  One
>> way to be /sure/ it's cleared up would be to rewrite the entire file,
>> deleting the original, either by copying it to a different filesystem
>> and back (with the off-filesystem copy guaranteeing that it can't use
>> reflinks to the existing extents), or by using cp's --reflink=never
>> option.
>> (FWIW, I prefer the former, just to be sure, using temporary copies to
>> a suitably sized tmpfs for speed where possible, tho obviously if the
>> file is larger than your memory size that's not possible.)

> Correct, this is why I recommended trying a defrag.  I've actually never
> seen things so bad that a simple defrag didn't fix them however (though
> I have seen a few cases where the target extent size had to be set
> higher than the default of 20MB).

Good to know.  I knew larger target extent sizes could help, but between 
not being sure they'd entirely fix it and not wanting to get too far down 
into the detail when the copy-off-the-filesystem-and-back option is 
/sure/ to fix the problem, I decided to handwave that part of it. =:^)

> Also, as counter-intuitive as it
> might sound, autodefrag really doesn't help much with this, and can
> actually make things worse.

I hadn't actually seen that here, but suspect I might, now, as previous 
autodefrag behavior on my system tended to rewrite the entire file[1], 
thereby effectively giving me the benefit of the copy-away-and-back 
technique without actually bothering, while that "bug" has now been fixed.

I sort of wish the old behavior remained an option, maybe 
radicalautodefrag or something, and must confess to being a bit concerned 
over the eventual impact here now that autodefrag does /not/ rewrite the 
entire file any more, but oh, well...  Chances are it's not going to be 
/that/ big a deal since I /am/ on fast ssd, and if it becomes one, I 
guess I can just setup say firefox-profile-defrag.timer jobs or whatever, 
as necessary.

---
[1] I forgot whether it was ssd behavior, or compression, or what, but 
something I'm using here apparently forced autodefrag to rewrite the 
entire file, and a recent "bugfix" changed that so it's more in line with 
the normal autodefrag behavior.  I rather preferred the old behavior, 
especially since I'm on fast ssd and all my large files tend to be write-
once no-rewrite anyway, but I understand the performance implications on 
large active-rewrite files such as gig-plus database and VM-image files, 
so...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs space used issue

2018-02-28 Thread Austin S. Hemmelgarn

On 2018-02-28 14:09, Duncan wrote:

> I believe this effect is what Austin was referencing when he suggested
> the defrag, tho defrag won't necessarily /entirely/ clear it up.  One way
> to be /sure/ it's cleared up would be to rewrite the entire file,
> deleting the original, either by copying it to a different filesystem and
> back (with the off-filesystem copy guaranteeing that it can't use reflinks
> to the existing extents), or by using cp's --reflink=never option.
> (FWIW, I prefer the former, just to be sure, using temporary copies to a
> suitably sized tmpfs for speed where possible, tho obviously if the file
> is larger than your memory size that's not possible.)

Correct, this is why I recommended trying a defrag.  I've actually never
seen things so bad that a simple defrag didn't fix them however (though
I have seen a few cases where the target extent size had to be set
higher than the default of 20MB).

Re: btrfs space used issue

2018-02-28 Thread Duncan
vinayak hegde posted on Tue, 27 Feb 2018 18:39:51 +0530 as excerpted:

> I am using btrfs, But I am seeing du -sh and df -h showing huge size
> difference on ssd.
> 
> mount:
> /dev/drbd1 on /dc/fileunifier.datacache type btrfs
> (rw,noatime,nodiratime,flushoncommit,discard,nospace_cache,recovery,commit=5,subvolid=5,subvol=/)
> 
> 
> du -sh /dc/fileunifier.datacache/ -  331G
> 
> df -h /dev/drbd1  746G  346G  398G  47% /dc/fileunifier.datacache
> 
> btrfs fi usage /dc/fileunifier.datacache/
> Overall:
> Device size:         745.19GiB
> Device allocated:    368.06GiB
> Device unallocated:  377.13GiB
> Device missing:      0.00B
> Used:                346.73GiB
> Free (estimated):    396.36GiB  (min: 207.80GiB)
> Data ratio:          1.00
> Metadata ratio:      2.00
> Global reserve:      176.00MiB  (used: 0.00B)
> 
> Data,single: Size:365.00GiB, Used:345.76GiB
>/dev/drbd1 365.00GiB
> 
> Metadata,DUP: Size:1.50GiB, Used:493.23MiB
>/dev/drbd1   3.00GiB
> 
> System,DUP: Size:32.00MiB, Used:80.00KiB
>/dev/drbd1  64.00MiB
> 
> Unallocated:
>/dev/drbd1 377.13GiB
> 
> 
> Even if we consider 6G metadata its 331+6 = 337.
> where is 9GB used?
> 
> Please explain.

Taking a somewhat higher level view than Austin's reply, on btrfs, plain 
df and to a somewhat lessor extent du[1] are at best good /estimations/ 
of usage, and for df, space remaining.  Due to btrfs' COW/copy-on-write 
semantics and features such as the various replication/raid schemes, 
snapshotting, etc, btrfs makes available, that df/du don't really 
understand as they simply don't have and weren't /designed/ to have that 
level of filesystem-specific insight, they, particularly df due to its 
whole-filesystem focus, aren't particularly accurate on btrfs.  Consider 
their output more a "best estimate given the rough data we have 
available" sort of report.

To get the real filesystem focused picture, use btrfs filesystem usage, 
or btrfs filesystem show combined with btrfs filesystem df.  That's what 
you should trust, altho various utilities that check for available space 
before doing something often use the kernel-call equivalent of (plain) df 
to ensure they have the required space, so it's worthwhile to keep an eye 
on it as the filesystem fills, as well.  If it gets too out of sync with 
btrfs filesystem usage, or if btrfs filesystem usage unallocated drops 
below say five gigs or data or metadata size vs used shows a spread of 
multiple gigs (your data shows a spread of ~20 gigs ATM, but with 377 
gigs still unallocated it's no big deal; it would be a big deal if those 
were reversed, tho, only 20 gigs unallocated and a spread of 300+ gigs in 
data size vs used), then corrective action such as a filtered rebalance 
may be necessary.

There are entries in the FAQ discussing free space issues that you should 
definitely read if you haven't, altho they obviously address the general 
case, so if you have more questions about an individual case after having 
read them, here is a good place to ask. =:^)

Everything having to do with "space" (see both the 1/Important-questions 
and 4/Common-questions sections) here:

https://btrfs.wiki.kernel.org/index.php/FAQ

Meanwhile, it's worth noting that not entirely intuitively, btrfs' COW 
implementation can "waste" space on larger files that are mostly, but not 
entirely, rewritten.  An example is the best way to demonstrate.  
Consider each x a used block and each - an unused but still referenced 
block:

Original file, written as a single extent (diagram works best with 
monospace, not arbitrarily rewrapped):

xxx

First rewrite of part of it:

xxx--xx
   xx


Nth rewrite, where some blocks of the original still remain as originally 
written:

--xxx--
   xxx---
xxx

 x---xx
  xxx
  xxx


As you can see, that first really large extent remains fully referenced, 
altho only three blocks of it remain in actual use.  All those -- won't 
be returned to free space until those last three blocks get rewritten as 
well, thus freeing the entire original extent.

I believe this effect is what Austin was referencing when he suggested 
the defrag, tho defrag won't necessarily /entirely/ clear it up.  One way 
to be /sure/ it's cleared up would be to rewrite the entire file, 
deleting the original, either by copying it to a different filesystem and 
back (with the off-filesystem copy guaranteeing that it can't use reflinks 
to the existing extents), or by using cp's --reflink=never option.  
(FWIW, I prefer the former, just to be sure, using temporary copies to a 
suitably sized tmpfs for speed where possible, tho obviously if the file 
is larger than your memory size that's not possible.)
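
As a concrete illustration of the rewrite-and-replace approach described
above, here is a minimal user-space sketch (illustrative paths, minimal
error handling); writing the data back through plain write() allocates
fresh extents, so blocks pinned by old partially-overwritten extents can
be freed:

/* Copy FILE to TMPFILE block by block, fsync, then rename over the
 * original.  The rewrite allocates new, unshared extents. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int rewrite_file(const char *path, const char *tmp_path)
{
	char buf[65536];
	ssize_t n;
	int in = open(path, O_RDONLY);
	int out = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (in < 0 || out < 0)
		return -1;
	while ((n = read(in, buf, sizeof(buf))) > 0)
		if (write(out, buf, n) != n)
			return -1;
	if (n < 0 || fsync(out) < 0)
		return -1;
	close(in);
	close(out);
	/* Atomically replace the original; its old extents become free
	 * once the last reference is dropped. */
	return rename(tmp_path, path);
}

int main(int argc, char **argv)
{
	if (argc != 3) {
		fprintf(stderr, "usage: %s FILE TMPFILE\n", argv[0]);
		return 1;
	}
	return rewrite_file(argv[1], argv[2]) ? 1 : 0;
}

This is effectively what cp --reflink=never followed by a rename gives
you, without the tmpfs round trip.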

Re: [Bug report] BTRFS partition re-mounted as read-only after few minutes of use

2018-02-28 Thread peteryuchuang
On Wed, 2018-02-28 at 18:36 +, Filipe Manana wrote:
> On Wed, Feb 28, 2018 at 5:50 PM, David Sterba 
> wrote:
> > On Wed, Feb 28, 2018 at 05:43:40PM +0100, peteryuchu...@gmail.com
> > wrote:
> > > On my laptop, which has just been switched to BTRFS, the root
> > > partition
> > > (a BTRFS partition inside an encrypted LVM. The drive is an NVMe)
> > > is
> > > re-mounted as read-only few minutes after boot.
> > > 
> > > Trace:
> > 
> > By any chance, are there other messages from btrfs above the line?
> > > 
> > > [  199.974591] [ cut here ]
> > > [  199.974593] BTRFS: Transaction aborted (error -95)
> > 
> > -95 is EOPNOTSUPP, ie operation not supported
> > 
> > > [  199.974647] WARNING: CPU: 0 PID: 324 at fs/btrfs/inode.c:3042
> > > btrfs_finish_ordered_io+0x7ab/0x850 [btrfs]
> > 
> > btrfs_finish_ordered_io::
> > 
> >  3038 btrfs_ordered_update_i_size(inode, 0,
> > ordered_extent);
> >  3039 ret = btrfs_update_inode_fallback(trans, root,
> > inode);
> >  3040 if (ret) {
> >  3041 btrfs_abort_transaction(trans, ret);
> >  3042 goto out;
> >  3043 }
> 
> I don't know what's exactly in Arch's kernel, but looking at the
> 4.15.5 stable tag from kernel.org:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.g
> it/tree/fs/btrfs/inode.c?h=v4.15.5#n3042
> 
> The -EOPNOTSUPP error can come from btrfs_drop_extents() through the
> call to insert_reserved_file_extent().
> We've had several reports of this kind of error in this location in
> the past and they happened to be on filesystems converted from extN
> to
> btrfs.
> I don't know however if this filesystem was from such a conversion
> nor
> if those old bugs in the conversion tool were fixed.
> 
> 

Indeed it was converted from ext4. I may try to rebuild the system from
scratch when I have more time, but I'm afraid I have to revert back to
ext4 for now.

> > 
> > the return code is unexpected here. And seeing 'operation not
> > supported' after an inode size change looks strange but EOPNOTSUPP
> > could be returned from some places.
> > 
> > The transaction is aborted from a thread that finalizes some
> > processing so we don't have enough information here to see how it
> > started. I suspect there's a file that gets modified shortly after
> > boot and hits the problem. I don't think the EOPNOTSUPP is returned
> > from the lower layers (lvm encryption or nvme), so at this point
> > seems like a btrfs bug.
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-
> > btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug report] BTRFS partition re-mounted as read-only after few minutes of use

2018-02-28 Thread Filipe Manana
On Wed, Feb 28, 2018 at 5:50 PM, David Sterba  wrote:
> On Wed, Feb 28, 2018 at 05:43:40PM +0100, peteryuchu...@gmail.com wrote:
>> On my laptop, which has just been switched to BTRFS, the root partition
>> (a BTRFS partition inside an encrypted LVM. The drive is an NVMe) is
>> re-mounted as read-only few minutes after boot.
>>
>> Trace:
>
> By any chance, are there other messages from btrfs above the line?
>>
>> [  199.974591] [ cut here ]
>> [  199.974593] BTRFS: Transaction aborted (error -95)
>
> -95 is EOPNOTSUPP, ie operation not supported
>
>> [  199.974647] WARNING: CPU: 0 PID: 324 at fs/btrfs/inode.c:3042 
>> btrfs_finish_ordered_io+0x7ab/0x850 [btrfs]
>
> btrfs_finish_ordered_io::
>
>  3038 btrfs_ordered_update_i_size(inode, 0, ordered_extent);
>  3039 ret = btrfs_update_inode_fallback(trans, root, inode);
>  3040 if (ret) {
>  3041 btrfs_abort_transaction(trans, ret);
>  3042 goto out;
>  3043 }

I don't know what's exactly in Arch's kernel, but looking at the
4.15.5 stable tag from kernel.org:

https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/tree/fs/btrfs/inode.c?h=v4.15.5#n3042

The -EOPNOTSUPP error can come from btrfs_drop_extents() through the
call to insert_reserved_file_extent().
We've had several reports of this kind of error in this location in
the past and they happened to be on filesystems converted from extN to
btrfs.
I don't know however if this filesystem was from such a conversion nor
if those old bugs in the conversion tool were fixed.


>
> the return code is unexpected here. And seeing 'operation not supported'
> after an inode size change looks strange but EOPNOTSUPP could be returned
> from some places.
>
> The transaction is aborted from a thread that finalizes some processing
> so we don't have enough information here to see how it started. I
> suspect there's a file that gets modified shortly after boot and hits the
> problem. I don't think the EOPNOTSUPP is returned from the lower layers
> (lvm encryption or nvme), so at this point seems like a btrfs bug.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug report] BTRFS partition re-mounted as read-only after few minutes of use

2018-02-28 Thread David Sterba
On Wed, Feb 28, 2018 at 05:43:40PM +0100, peteryuchu...@gmail.com wrote:
> On my laptop, which has just been switched to BTRFS, the root partition
> (a BTRFS partition inside an encrypted LVM. The drive is an NVMe) is
> re-mounted as read-only few minutes after boot. 
> 
> Trace:

By any chance, are there other messages from btrfs above the line?
> 
> [  199.974591] [ cut here ]
> [  199.974593] BTRFS: Transaction aborted (error -95)

-95 is EOPNOTSUPP, ie operation not supported

> [  199.974647] WARNING: CPU: 0 PID: 324 at fs/btrfs/inode.c:3042 
> btrfs_finish_ordered_io+0x7ab/0x850 [btrfs]

btrfs_finish_ordered_io::

 3038 btrfs_ordered_update_i_size(inode, 0, ordered_extent);
 3039 ret = btrfs_update_inode_fallback(trans, root, inode);
 3040 if (ret) {
 3041 btrfs_abort_transaction(trans, ret);
 3042 goto out;
 3043 }

the return code is unexpected here. And seeing 'operation not supported'
after an inode size change looks strange but EOPNOTSUPP could be returned
from some places.

The transaction is aborted from a thread that finalizes some processing
so we don't have enough information here to see how it started. I
suspect there's a file that gets modified shortly after boot and hits the
problem. I don't think the EOPNOTSUPP is returned from the lower layers
(lvm encryption or nvme), so at this point seems like a btrfs bug.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs space used issue

2018-02-28 Thread Andrei Borzenkov
On Wed, Feb 28, 2018 at 9:01 AM, vinayak hegde  wrote:
> I ran full defragement and balance both, but didnt help.

Showing the same information immediately after full defragment would be helpful.

> My created and accounting usage files are matching the du -sh output.
> But I am not getting why btrfs internals use so much extra space.
> My worry is, will get no space error earlier than I expect.
> Is it expected with btrfs internal that it will use so much extra space?
>

Did you try to reboot? A deleted but still-open file could well cause
this effect.
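
A minimal way to see that effect (arbitrary file name and size; assumes
about 1 GiB free on the filesystem):

/* Sketch: a deleted file's blocks stay allocated (visible to df, not
 * to du) while a process still holds the file open. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char buf[4096] = {0};
	long i;
	int fd = open("big.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return 1;
	for (i = 0; i < 256 * 1024; i++)   /* ~1 GiB of zeros */
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			return 1;
	unlink("big.tmp");   /* du stops counting it immediately */
	fsync(fd);
	puts("deleted but open: compare du and df, then press Enter");
	getchar();           /* df only drops ~1 GiB after we exit */
	close(fd);
	return 0;
}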

> Vinayak
>
>
>
>
> On Tue, Feb 27, 2018 at 7:24 PM, Austin S. Hemmelgarn
>  wrote:
>> On 2018-02-27 08:09, vinayak hegde wrote:
>>>
>>> I am using btrfs, But I am seeing du -sh and df -h showing huge size
>>> difference on ssd.
>>>
>>> mount:
>>> /dev/drbd1 on /dc/fileunifier.datacache type btrfs
>>>
>>> (rw,noatime,nodiratime,flushoncommit,discard,nospace_cache,recovery,commit=5,subvolid=5,subvol=/)
>>>
>>>
>>> du -sh /dc/fileunifier.datacache/ -  331G
>>>
>>> df -h
>>> /dev/drbd1  746G  346G  398G  47% /dc/fileunifier.datacache
>>>
>>> btrfs fi usage /dc/fileunifier.datacache/
>>> Overall:
>>>  Device size: 745.19GiB
>>>  Device allocated: 368.06GiB
>>>  Device unallocated: 377.13GiB
>>>  Device missing: 0.00B
>>>  Used: 346.73GiB
>>>  Free (estimated): 396.36GiB(min: 207.80GiB)
>>>  Data ratio:  1.00
>>>  Metadata ratio:  2.00
>>>  Global reserve: 176.00MiB(used: 0.00B)
>>>
>>> Data,single: Size:365.00GiB, Used:345.76GiB
>>> /dev/drbd1 365.00GiB
>>>
>>> Metadata,DUP: Size:1.50GiB, Used:493.23MiB
>>> /dev/drbd1   3.00GiB
>>>
>>> System,DUP: Size:32.00MiB, Used:80.00KiB
>>> /dev/drbd1  64.00MiB
>>>
>>> Unallocated:
>>> /dev/drbd1 377.13GiB
>>>
>>>
>>> Even if we consider 6G metadata its 331+6 = 337.
>>> where is 9GB used?
>>>
>>> Please explain.
>>
>> First, you're counting the metadata wrong.  The value shown per-device by
>> `btrfs filesystem usage` already accounts for replication (so it's only 3 GB
>> of metadata allocated, not 6 GB).  Neither `df` nor `du` looks at the chunk
>> level allocations though.
>>
>> Now, with that out of the way, the discrepancy almost certainly comes form
>> differences in how `df` and `du` calculate space usage.  In particular, `df`
>> calls statvfs and looks at the f_blocks and f_bfree values to compute space
>> usage, while `du` walks the filesystem tree calling stat on everything and
>> looking at st_blksize and st_blocks (or instead at st_size if you pass in
>> `--apparent-size` as an option).  This leads to a couple of differences in
>> what they will count:
>>
>> 1. `du` may or may not properly count hardlinks, sparse files, and
>> transparently compressed data, depending on whether or not you use
>> `--apparent-size` (by default, it does properly count all of those), while
>> `df` will always account for those properly.
>> 2. `du` does not properly account for reflinked blocks (from deduplication,
>> snapshots, or use of the CLONE ioctl), and will count each reflink of every
>> block as part of the total size, while `df` will always count each block
>> exactly once no matter how many reflinks it has.
>> 3. `du` does not account for all of the BTRFS metadata allocations,
>> functionally ignoring space allocated for anything but inline data, while
>> `df` accounts for all BTRFS metadata properly.
>> 4. `du` will recurse into other filesystems if you don't pass the `-x`
>> option to it, while `df` will only report for each filesystem separately.
>> 5. `du` will only count data usage under the given mount point, and won't
>> account for data on other subvolumes that may be mounted elsewhere (and if
>> you pass in `-x` won't count data on other subvolumes located under the
>> given path either), while `df` will count all the data in all subvolumes.
>> 6. There are a couple of other differences too, but they're rather complex
>> and dependent on the internals of BTRFS.
>>
>> In your case, I think the issue is probably one of the various things under
>> item 6.  Items 1, 2 and 4 will cause `du` to report more space usage than
>> `df`, item 3 is irrelevant because `du` shows less space than the total data
>> chunk usage reported by `btrfs filesystem usage`, and item 5 is irrelevant
>> because you're mounting the root subvolume and not using the `-x` option on
>> `du` (and therefore there can't be other subvolumes you're missing).
>>
>> Try running a full defrag of the given mount point.  If what I think is
>> causing this is in fact the issue, that should bring the numbers back
>> in-line with each other.
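
For example, something along these lines (a sketch reusing the mount
point from the report; note that defragmenting can unshare reflinked
extents, so use it with care on snapshotted filesystems):

--
btrfs filesystem defragment -r /dc/fileunifier.datacache
sync
btrfs filesystem usage /dc/fileunifier.datacache
df -h /dc/fileunifier.datacache
--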

[PATCH] generic: add test for fsync after renaming and linking special file

2018-02-28 Thread fdmanana
From: Filipe Manana 

Test that when a fsync journal/log exists, if we rename a special file
(fifo, symbolic link or device), create a hard link for it with its old
name and then commit the journal/log, if a power loss happens the
filesystem will not fail to replay the journal/log when it is mounted
the next time.

This test is motivated by a bug found in btrfs, which is fixed by the
following patch for the linux kernel:

  "Btrfs: fix log replay failure after linking special file and fsync"

Signed-off-by: Filipe Manana 
---
 tests/generic/479 | 112 ++
 tests/generic/479.out |   2 +
 tests/generic/group   |   1 +
 3 files changed, 115 insertions(+)
 create mode 100644 tests/generic/479
 create mode 100644 tests/generic/479.out

diff --git a/tests/generic/479 b/tests/generic/479
new file mode 100644
index ..7e4ba7d0
--- /dev/null
+++ b/tests/generic/479
@@ -0,0 +1,112 @@
+#! /bin/bash
+# FSQA Test No. 479
+#
+# Test that when a fsync journal/log exists, if we rename a special file (fifo,
+# symbolic link or device), create a hard link for it with its old name and then
+# commit the journal/log, if a power loss happens the filesystem will not fail
+# to replay the journal/log when it is mounted the next time.
+#
+#---
+#
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+run_test()
+{
+   local file_type=$1
+
+   _scratch_mkfs >>$seqres.full 2>&1
+   _require_metadata_journaling $SCRATCH_DEV
+   _init_flakey
+   _mount_flakey
+
+   mkdir $SCRATCH_MNT/testdir
+   case $file_type in
+   symlink)
+   ln -s xxx $SCRATCH_MNT/testdir/foo
+   ;;
+   fifo)
+   mkfifo $SCRATCH_MNT/testdir/foo
+   ;;
+   dev)
+   mknod $SCRATCH_MNT/testdir/foo c 0 0
+   ;;
+   *)
+   _fail "Invalid file type argument: $file_type"
+   esac
+   # Make sure everything done so far is durably persisted.
+   sync
+
+   # Create a file and fsync it just to create a journal/log. This file
+   # must be in the same directory as our special file "foo".
+   touch $SCRATCH_MNT/testdir/f1
+   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/f1
+
+   # Rename our special file and then create link that has its old name.
+   mv $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
+   ln $SCRATCH_MNT/testdir/bar $SCRATCH_MNT/testdir/foo
+
+   # Create a second file and fsync it. This is just to durably persist the
+   # fsync journal/log which is typically modified by the previous rename
+   # and link operations. This file does not need to be placed in the same
+   # directory as our special file.
+   touch $SCRATCH_MNT/f2
+   $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/f2
+
+   # Simulate a power failure and mount the filesystem to check that
+   # replaying the fsync log/journal succeeds, that is the mount operation
+   # does not fail.
+   _flakey_drop_and_remount
+   _unmount_flakey
+   _cleanup_flakey
+}
+
+run_test symlink
+run_test fifo
+run_test dev
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/generic/479.out b/tests/generic/479.out
new file mode 100644
index ..290f18b3
--- /dev/null
+++ b/tests/generic/479.out
@@ -0,0 +1,2 @@
+QA output created by 479
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 1e808865..3b9b47e3 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -481,3 +481,4 @@
 476 auto rw
 477 auto quick exportfs
 478 auto quick
+479 auto quick metadata
-- 
2.11.0

--
To unsubscribe from this list:

[PATCH] generic: test fsync new file after removing hard link

2018-02-28 Thread fdmanana
From: Filipe Manana 

Test that if we have a file with two hard links in the same parent
directory, then remove one of the links, create a new file in the same
parent directory with the name of the removed link, fsync the new
file and suffer a power loss, mounting the filesystem succeeds.

This test is motivated by a bug found in btrfs, which is fixed by
the linux kernel patch titled:

  "Btrfs: fix log replay failure after unlink and link combination"

Signed-off-by: Filipe Manana 
---
 tests/generic/480 | 83 +++
 tests/generic/480.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 86 insertions(+)
 create mode 100755 tests/generic/480
 create mode 100644 tests/generic/480.out

diff --git a/tests/generic/480 b/tests/generic/480
new file mode 100755
index ..a287684b
--- /dev/null
+++ b/tests/generic/480
@@ -0,0 +1,83 @@
+#! /bin/bash
+# FSQA Test No. 480
+#
+# Test that if we have a file with two hard links in the same parent directory,
+# then remove one of the links, create a new file in the same parent directory and
+# with the name of the link removed, fsync the new file and have a power loss,
+# mounting the filesystem succeeds.
+#
+#---
+#
+# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_require_metadata_journaling $SCRATCH_DEV
+_init_flakey
+_mount_flakey
+
+mkdir $SCRATCH_MNT/testdir
+touch $SCRATCH_MNT/testdir/foo
+ln $SCRATCH_MNT/testdir/foo $SCRATCH_MNT/testdir/bar
+
+# Make sure everything done so far is durably persisted.
+sync
+
+# Now remove one of the links of our file and create a new file with the same name
+# and in the same parent directory, and finally fsync this new file.
+unlink $SCRATCH_MNT/testdir/bar
+touch $SCRATCH_MNT/testdir/bar
+$XFS_IO_PROG -c "fsync" $SCRATCH_MNT/testdir/bar
+
+# Simulate a power failure and mount the filesystem to check that replaying
+# the fsync log/journal succeeds, that is the mount operation does not fail.
+_flakey_drop_and_remount
+
+_unmount_flakey
+_cleanup_flakey
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/generic/480.out b/tests/generic/480.out
new file mode 100644
index ..a40a718e
--- /dev/null
+++ b/tests/generic/480.out
@@ -0,0 +1,2 @@
+QA output created by 480
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index 3b9b47e3..ea2056b1 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -482,3 +482,4 @@
 477 auto quick exportfs
 478 auto quick
 479 auto quick metadata
+480 auto quick metadata
-- 
2.11.0



[Bug report] BTRFS partition re-mounted as read-only after few minutes of use

2018-02-28 Thread peteryuchuang
Hi,

On my laptop, which has just been switched to BTRFS, the root partition
(a BTRFS partition inside an encrypted LVM. The drive is an NVMe) is
re-mounted as read-only few minutes after boot. 

Trace:

[  199.974591] [ cut here ]
[  199.974593] BTRFS: Transaction aborted (error -95)
[  199.974647] WARNING: CPU: 0 PID: 324 at fs/btrfs/inode.c:3042 btrfs_finish_ordered_io+0x7ab/0x850 [btrfs]
[  199.974648] Modules linked in: tun fuse cmac rfcomm bnep
snd_hda_codec_hdmi ip6t_REJECT snd_hda_codec_generic nf_reject_ipv6
nf_log_ipv6 xt_hl nf_conntrack_ipv6 nf_defrag_ipv6 ip6t_rt ipt_REJECT
nf_reject_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp
nls_iso8859_1 nls_cp437 vfat fat nf_conntrack_ipv4 nf_defrag_ipv4
xt_addrtype xt_conntrack snd_soc_skl snd_soc_skl_ipc snd_hda_ext_core
snd_soc_sst_dsp snd_soc_sst_ipc snd_soc_acpi brcmfmac snd_soc_core
ip6table_filter ip6_tables brcmutil nf_conntrack_netbios_ns
snd_compress nf_conntrack_broadcast nf_nat_ftp ac97_bus cfg80211
snd_pcm_dmaengine nf_nat nf_conntrack_ftp nf_conntrack libcrc32c
crc32c_generic thunderbolt iptable_filter iTCO_wdt mmc_core
iTCO_vendor_support crypto_user msr intel_rapl x86_pkg_temp_thermal
intel_powerclamp coretemp
[  199.974675]  kvm_intel snd_hda_intel snd_hda_codec kvm snd_hda_core
applesmc snd_hwdep input_polldev irqbypass snd_pcm intel_cstate
snd_timer intel_uncore intel_rapl_perf pcspkr i915 snd i2c_i801
soundcore joydev mousedev input_leds i2c_algo_bit drm_kms_helper
hci_uart btbcm btqca btintel drm bluetooth mei_me 8250_dw intel_gtt mei
agpgart acpi_als shpchp syscopyarea idma64 sysfillrect sbs sysimgblt
fb_sys_fops ecdh_generic rfkill kfifo_buf sbshc industrialio rtc_cmos
evdev mac_hid ac apple_bl facetimehd(O) videobuf2_dma_sg
videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media ip_tables
x_tables btrfs xor zstd_decompress zstd_compress xxhash raid6_pq
dm_crypt algif_skcipher af_alg dm_mod crct10dif_pclmul crc32_pclmul
crc32c_intel ghash_clmulni_intel pcbc aesni_intel aes_x86_64
crypto_simd
[  199.974710]  glue_helper cryptd xhci_pci xhci_hcd usbcore usb_common
applespi(O) crc16 led_class intel_lpss_pci intel_lpss
spi_pxa2xx_platform
[  199.974718] CPU: 0 PID: 324 Comm: kworker/u8:6 Tainted: G U O 4.15.5-1-ARCH #1
[  199.974718] Hardware name: Apple Inc. MacBookPro14,1/Mac-B4831CEBD52A0C4C, BIOS MBP141.88Z.0169.B00.1712141501 12/14/2017
[  199.974734] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[  199.974746] RIP: 0010:btrfs_finish_ordered_io+0x7ab/0x850 [btrfs]
[  199.974747] RSP: 0018:b3310128bdc8 EFLAGS: 00010286
[  199.974749] RAX:  RBX: a25b1e150b60 RCX: 0001
[  199.974749] RDX: 8001 RSI: 9fe47fd0 RDI:
[  199.974750] RBP: a25a2df7dad0 R08: 0001 R09: 0412
[  199.974751] R10:  R11:  R12: a25a2df7d8e0
[  199.974752] R13: a25a2df7d8c0 R14: a25b1f037f78 R15: 0001
[  199.974753] FS:  () GS:a25b2ec0() knlGS:
[  199.974754] CS:  0010 DS:  ES:  CR0: 80050033
[  199.974755] CR2: 7fd9cc5dd3e7 CR3: 6300a002 CR4: 003606f0
[  199.974756] DR0:  DR1:  DR2:
[  199.974756] DR3:  DR6: fffe0ff0 DR7: 0400
[  199.974757] Call Trace:
[  199.974774]  normal_work_helper+0x39/0x370 [btrfs]
[  199.974779]  process_one_work+0x1ce/0x410
[  199.974782]  worker_thread+0x2b/0x3d0
[  199.974784]  ? process_one_work+0x410/0x410
[  199.974785]  kthread+0x113/0x130
[  199.974787]  ? kthread_create_on_node+0x70/0x70
[  199.974789]  ? do_syscall_64+0x74/0x190
[  199.974791]  ? SyS_exit_group+0x10/0x10
[  199.974793]  ret_from_fork+0x35/0x40
[  199.974795] Code: 08 01 e9 a4 fb ff ff 49 8b 46 60 f0 0f ba a8 50 12 00 00 02 72 17 8b 74 24 10 83 fe fb 74 32 48 c7 c7 38 a7 6c c0 e8 85 7a a3 de <0f> 0b 8b 4c 24 10 ba e2 0b 00 00 eb b1 4c 8b 23 4c 8b 53 10 41
[  199.974820] ---[ end trace c8ed62ff6a525901 ]---
[  199.974822] BTRFS: error (device dm-2) in btrfs_finish_ordered_io:3042: errno=-95 unknown
[  199.974824] BTRFS info (device dm-2): forced readonly
[  199.976696] BTRFS error (device dm-2): pending csums is 6447104

Bug report: https://bugzilla.kernel.org/show_bug.cgi?id=198945
Kernel version: 4.15.5
Distro: Arch Linux
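
For reference, error -95 in the trace resolves to EOPNOTSUPP, e.g. with
the errno tool from the moreutils package:

--
errno 95
# EOPNOTSUPP 95 Operation not supported
--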


Re: [PATCH] Btrfs: fix unexpected -EEXIST when creating new inode

2018-02-28 Thread Filipe Manana
On Thu, Jan 25, 2018 at 6:02 PM, Liu Bo  wrote:
> The highest objectid, which is assigned to a new inode, is decided at
> the time the fs roots are initialized.  However, in cases where log
> replay gets processed, the btree which the fs root owns might have
> changed, so we have to search it again for the highest objectid,
> otherwise creating a new inode would end up with -EEXIST.
>
> cc:  v4.4-rc6+
> Fixes: f32e48e92596 ("Btrfs: Initialize btrfs_root->highest_objectid when loading tree root and subvolume roots")
> Signed-off-by: Liu Bo 

Hi Bo,

Any reason not to have submitted a test case for fstests?
Unless I missed something, this should be an easy to reproduce,
deterministic issue.
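
For what it's worth, a rough reproduction sketch based on the commit
message (untested; the device name is hypothetical, and the power
failure could be simulated with dm-flakey):

--
mkfs.btrfs -f /dev/sdb
mount /dev/sdb /mnt
sync
touch /mnt/f1
xfs_io -c fsync /mnt/f1      # inode for f1 lives only in the log tree
# <power failure>
mount /dev/sdb /mnt          # log replay re-creates f1's inode item
touch /mnt/f2                # would trip the -EEXIST described above
--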

thanks

> ---
>  fs/btrfs/tree-log.c | 19 +++
>  1 file changed, 19 insertions(+)
>
> diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
> index a7e6235..646cdbf 100644
> --- a/fs/btrfs/tree-log.c
> +++ b/fs/btrfs/tree-log.c
> @@ -28,6 +28,7 @@
>  #include "hash.h"
>  #include "compression.h"
>  #include "qgroup.h"
> +#include "inode-map.h"
>
>  /* magic values for the inode_only field in btrfs_log_inode:
>   *
> @@ -5715,6 +5716,24 @@ int btrfs_recover_log_trees(struct btrfs_root *log_root_tree)
>   path);
> }
>
> +   if (!ret && wc.stage == LOG_WALK_REPLAY_ALL) {
> +   struct btrfs_root *root = wc.replay_dest;
> +
> +   btrfs_release_path(path);
> +
> +   /*
> +* We have just replayed everything, and the highest
> +* objectid of fs roots probably has changed in case
> +* some inode_item's got replayed.
> +*/
> +   /*
> +* root->objectid_mutex is not acquired as log replay
> +* could only happen during mount.
> +*/
> +   ret = btrfs_find_highest_objectid(root,
> + &root->highest_objectid);
> +   }
> +
> key.offset = found_key.offset - 1;
> wc.replay_dest->log_root = NULL;
> free_extent_buffer(log->node);
> --
> 2.9.4
>



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


[PATCH 1/2] Btrfs: fix log replay failure after linking special file and fsync

2018-02-28 Thread fdmanana
From: Filipe Manana 

If in the same transaction we rename a special file (fifo, character/block
device or symbolic link), create a hard link for it having its old name
and then sync the log, we will end up with a log that can not be replayed,
and when attempting to replay it an EEXIST error is returned and mounting
the filesystem fails. Example scenario:

  $ mkfs.btrfs -f /dev/sdc
  $ mount /dev/sdc /mnt
  $ mkdir /mnt/testdir
  $ mkfifo /mnt/testdir/foo
  # Make sure everything done so far is durably persisted.
  $ sync

  # Create some unrelated file and fsync it, this is just to create a log
  # tree. The file must be in the same directory as our special file.
  $ touch /mnt/testdir/f1
  $ xfs_io -c "fsync" /mnt/testdir/f1

  # Rename our special file and then create a hard link with its old name.
  $ mv /mnt/testdir/foo /mnt/testdir/bar
  $ ln /mnt/testdir/bar /mnt/testdir/foo

  # Create some other unrelated file and fsync it, this is just to persist
  # the log tree which was modified by the previous rename and link
  # operations. Alternatively we could have modified file f1 and fsync it.
  $ touch /mnt/f2
  $ xfs_io -c "fsync" /mnt/f2

  

  $ mount /dev/sdc /mnt
  mount: mount /dev/sdc on /mnt failed: File exists

This happens because both the log tree and the subvolume's tree have an
entry in the directory "testdir" with the same name, that is, there is
one key (258 INODE_REF 257) in the subvolume tree and another one in the
log tree (where 258 is the inode number of our special file and 257 is
the inode of directory "testdir"). Only the data of those two keys
differs: in the subvolume tree the index field of the inode reference
has a value of 3, while in the log tree it has a value of 5. Because the
same key exists in both trees but has a different index, the log replay
fails with an -EEXIST error when attempting to replay the inode
reference from the log tree.

Fix this by setting the last_unlink_trans field of the inode (our special
file) to the current transaction id when a hard link is created, as this
forces logging the parent directory inode, solving the conflict at log
replay time.

A new generic test case for fstests was also submitted.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 28d0de199b05..411a022489e4 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -5841,7 +5841,7 @@ int btrfs_log_new_name(struct btrfs_trans_handle *trans,
 * this will force the logging code to walk the dentry chain
 * up for the file
 */
-   if (S_ISREG(inode->vfs_inode.i_mode))
+   if (!S_ISDIR(inode->vfs_inode.i_mode))
inode->last_unlink_trans = trans->transid;
 
/*
-- 
2.11.0



[PATCH] Btrfs: fix log replay failure after unlink and link combination

2018-02-28 Thread fdmanana
From: Filipe Manana 

If we have a file with 2 (or more) hard links in the same directory,
remove one of the hard links, create a new file (or link an existing file)
in the same directory with the name of the removed hard link, and then
finally fsync the new file, we end up with a log that fails to replay,
causing a mount failure.

Example:

  $ mkfs.btrfs -f /dev/sdb
  $ mount /dev/sdb /mnt

  $ mkdir /mnt/testdir
  $ touch /mnt/testdir/foo
  $ ln /mnt/testdir/foo /mnt/testdir/bar

  $ sync

  $ unlink /mnt/testdir/bar
  $ touch /mnt/testdir/bar
  $ xfs_io -c "fsync" /mnt/testdir/bar

  

  $ mount /dev/sdb /mnt
  mount: mount(2) failed: /mnt: No such file or directory

When replaying the log, for that example, we also see the following in
dmesg/syslog:

  [71813.671307] BTRFS info (device dm-0): failed to delete reference to bar, inode 258 parent 257
  [71813.674204] [ cut here ]
  [71813.675694] BTRFS: Transaction aborted (error -2)
  [71813.677236] WARNING: CPU: 1 PID: 13231 at fs/btrfs/inode.c:4128 __btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] Modules linked in: btrfs xfs f2fs dm_flakey dm_mod dax 
ghash_clmulni_intel ppdev pcbc aesni_intel aes_x86_64 crypto_simd cryptd 
glue_helper evdev psmouse i2c_piix4 parport_pc i2c_core pcspkr sg serio_raw 
parport button sunrpc loop autofs4 ext4 crc16 mbcache jbd2 zstd_decompress 
zstd_compress xxhash raid10 raid456 async_raid6_recov async_memcpy async_pq 
async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath 
linear md_mod ata_generic sd_mod virtio_scsi ata_piix libata virtio_pci 
virtio_ring crc32c_intel floppy virtio e1000 scsi_mod [last unloaded: btrfs]
  [71813.679669] CPU: 1 PID: 13231 Comm: mount Tainted: GW 4.15.0-rc9-btrfs-next-56+ #1
  [71813.679669] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
  [71813.679669] RIP: 0010:__btrfs_unlink_inode+0x17b/0x355 [btrfs]
  [71813.679669] RSP: 0018:c90001cef738 EFLAGS: 00010286
  [71813.679669] RAX: 0025 RBX: 880217ce4708 RCX: 0001
  [71813.679669] RDX:  RSI: 81c14bae RDI:
  [71813.679669] RBP: c90001cef7c0 R08: 0001 R09: 0001
  [71813.679669] R10: c90001cef5e0 R11: 8343f007 R12: 880217d474c8
  [71813.679669] R13: fffe R14: 88021ccf1548 R15: 0101
  [71813.679669] FS:  7f7cee84c480() GS:88023fc8() knlGS:
  [71813.679669] CS:  0010 DS:  ES:  CR0: 80050033
  [71813.679669] CR2: 7f7cedc1abf9 CR3: 0002354b4003 CR4: 001606e0
  [71813.679669] Call Trace:
  [71813.679669]  btrfs_unlink_inode+0x17/0x41 [btrfs]
  [71813.679669]  drop_one_dir_item+0xfa/0x131 [btrfs]
  [71813.679669]  add_inode_ref+0x71e/0x851 [btrfs]
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  ? replay_one_buffer+0x53/0x53a [btrfs]
  [71813.679669]  replay_one_buffer+0x4a4/0x53a [btrfs]
  [71813.679669]  ? rcu_read_unlock+0x3a/0x57
  [71813.679669]  ? __lock_is_held+0x39/0x71
  [71813.679669]  walk_up_log_tree+0x101/0x1d2 [btrfs]
  [71813.679669]  walk_log_tree+0xad/0x188 [btrfs]
  [71813.679669]  btrfs_recover_log_trees+0x1fa/0x31e [btrfs]
  [71813.679669]  ? replay_one_extent+0x544/0x544 [btrfs]
  [71813.679669]  open_ctree+0x1cf6/0x2209 [btrfs]
  [71813.679669]  btrfs_mount_root+0x368/0x482 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  btrfs_mount+0x13e/0x772 [btrfs]
  [71813.679669]  ? trace_hardirqs_on_caller+0x14c/0x1a6
  [71813.679669]  ? __lockdep_init_map+0x176/0x1c2
  [71813.679669]  ? mount_fs+0x64/0x10b
  [71813.679669]  mount_fs+0x64/0x10b
  [71813.679669]  vfs_kern_mount+0x68/0xce
  [71813.679669]  do_mount+0x6e5/0x973
  [71813.679669]  ? memdup_user+0x3e/0x5c
  [71813.679669]  SyS_mount+0x72/0x98
  [71813.679669]  entry_SYSCALL_64_fastpath+0x1e/0x8b
  [71813.679669] RIP: 0033:0x7f7cedf150ba
  [71813.679669] RSP: 002b:7ffca71da688 EFLAGS: 0206
  [71813.679669] Code: 7f a0 e8 51 0c fd ff 48 8b 43 50 f0 0f ba a8 30 2c 00 00 02 72 17 41 83 fd fb 74 11 44 89 ee 48 c7 c7 7d 11 7f a0 e8 38 f5 8d e0 <0f> ff 44 89 e9 ba 20 10 00 00 eb 4d 48 8b 4d b0 48 8b 75 88 4c
  [71813.679669] ---[ end trace 83bd473fc5b4663b ]---
  [71813.854764] BTRFS: error (device dm-0) in __btrfs_unlink_inode:4128: errno=-2 No such entry
  [71813.886994] BTRFS: error (device dm-0) in btrfs_replay_log:2307: errno=-2 No such entry (Failed to recover log tree)
  [71813.903357] BTRFS error (device dm-0): cleaner transaction attach returned -30
  [71814.128078] BTRFS error (device dm-0): open_ctree failed

This happens because the log has inode reference items for both inode 25

Re: Btrfs occupies more space than du reports...

2018-02-28 Thread Andrei Borzenkov
On Wed, Feb 28, 2018 at 2:26 PM, Shyam Prasad N  wrote:
> Hi,
>
> Thanks for the reply.
>
>> * `df` calls `statvfs` to get its data, which tries to count physical
>> allocation accounting for replication profiles.  In other words, data in
>> chunks with the dup, raid1, and raid10 profiles gets counted twice, data in
>> raid5 and raid6 chunks gets counted with a bit of extra space for the
>> parity, etc.
>
> We have data not using raid (single), metadata using dup, we've not
> used compression, subvols have not been created yet (other than the
> default subvol), there are no other mount points within the tree.
> Taking into account all that you're saying, the numbers don't make
> sense to me. "btrfs fi usage" says that the data "used" is much more
> than it should be. I agree more with the disk usage that du reports.
> I tried an experiment: I filled up the available space (as per what
> btrfs believes is available) with huge files. As soon as the usage
> reached 100%, further writes started to return ENOSPC. This is what
> I'm scared will happen when these filesystems eventually fill up.
> This would normally be the expected behaviour, but on many of these
> servers the actual data in use is much less (60-70 GBs in some cases).
> To me, it looks like some btrfs internal refcounting has gone wrong.
> Maybe it thinks that some data blocks (which are actually free)
> are in use?

One reason could be overwrites inside extents. What happens is that
btrfs does not (always) physically split an extent when it is partially
overwritten, so some space remains free but unavailable.

Filesystem 1K-blocks  Used Available Use% Mounted on
/dev/sdb18387584 16704   7531456   1% /mnt
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.580041 s, 181 MB/s
localhost:~ # sync
localhost:~ # df -k /mnt
Filesystem 1K-blocks   Used Available Use% Mounted on
/dev/sdb18387584 119552   7428864   2% /mnt
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=1 conv=notrunc seek=25
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00781892 s, 134 MB/s
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=1 conv=notrunc seek=50
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00780386 s, 134 MB/s
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=1 conv=notrunc seek=75
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00761908 s, 138 MB/s
localhost:~ # sync
localhost:~ # df -k /mnt
Filesystem 1K-blocks   Used Available Use% Mounted on
/dev/sdb18387584 122624   7425792   2% /mnt

So 3M is lost. And if you write 50M in the middle, you will get 50M of
"lost" space.

I do not know how btrfs decides when to split an extent.

Defragmenting the file should free those partial extents again:

btrfs fi defrag -r /mnt
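
To see the effect, filefrag can list the file's physical extents (a
sketch for the example above):

--
# -v prints each extent; before the defrag the original large extent is
# still referenced alongside the small overwrite extents:
filefrag -v /mnt/file
--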

> Or some other refcounting issue?
> We've tried "btrfs check" as well as "btrfs scrub" so far. Neither has
> reported any errors.
>
> Regards,
> Shyam
>
> On Fri, Feb 23, 2018 at 6:53 PM, Austin S. Hemmelgarn
>  wrote:
>> On 2018-02-23 06:21, Shyam Prasad N wrote:
>>>
>>> Hi,
>>>
>>> Can someone explain me why there is a difference in the number of
>>> blocks reported by df and du commands below?
>>>
>>> =
>>> # df -h /dc
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> /dev/drbd1  746G  519G  225G  70% /dc
>>>
>>> # btrfs filesystem df -h /dc/
>>> Data, single: total=518.01GiB, used=516.58GiB
>>> System, DUP: total=8.00MiB, used=80.00KiB
>>> Metadata, DUP: total=2.00GiB, used=1019.72MiB
>>> GlobalReserve, single: total=352.00MiB, used=0.00B
>>>
>>> # du -sh /dc
>>> 467G/dc
>>> =
>>>
>>> df shows 519G is used. While recursive check using du shows only 467G.
>>> The filesystem doesn't contain any snapshots/extra subvolumes.
>>> Neither does it contain any mounted filesystem under /dc.
>>> I also considered that it could be a void left behind by one of the
>>> open FDs held by a process. So I rebooted the system. Still no
>>> changes.
>>>
>>> The situation is even worse on a few other systems with similar
>>> configuration.
>>>
>>
>> At least part of this is a difference in how each tool computes space usage.
>>
>> * `df` calls `statvfs` to get its data, which tries to count physical
>> allocation accounting for replication profiles.  In other words, data in
>> chunks with the dup, raid1, and raid10 profiles gets counted twice, data in
>> raid5 and raid6 chunks gets counted with a bit of extra space for the
>> parity, etc.
>>
>> * `btrfs fi df` looks directly at the filesystem itself and counts how much
>> space is available to each chunk type in the `total` values and how much
>> space is used in each chunk type in the `used` values, after replication.
>> If you add together the data used value and twice the system and metadata
>> used values

Re: Btrfs occupies more space than du reports...

2018-02-28 Thread Shyam Prasad N
Hi,

Thanks for the reply.

> * `df` calls `statvfs` to get its data, which tries to count physical
> allocation accounting for replication profiles.  In other words, data in
> chunks with the dup, raid1, and raid10 profiles gets counted twice, data in
> raid5 and raid6 chunks gets counted with a bit of extra space for the
> parity, etc.

We have data not using raid (single), metadata using dup, we've not
used compression, subvols have not been created yet (other than the
default subvol), there are no other mount points within the tree.
Taking into account all that you're saying, the numbers don't make
sense to me. "btrfs fi usage" says that the data "used" is much more
than it should be. I agree more with the disk usage that du reports.
I tried an experiment: I filled up the available space (as per what
btrfs believes is available) with huge files. As soon as the usage
reached 100%, further writes started to return ENOSPC. This is what
I'm scared will happen when these filesystems eventually fill up.
This would normally be the expected behaviour, but on many of these
servers the actual data in use is much less (60-70 GBs in some cases).
To me, it looks like some btrfs internal refcounting has gone wrong.
Maybe it thinks that some data blocks (which are actually free)
are in use? Or some other refcounting issue?
We've tried "btrfs check" as well as "btrfs scrub" so far. Neither has
reported any errors.
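
For reference, the commands in question on this setup (a sketch; note
that btrfs check must run against the unmounted device):

--
btrfs scrub start -B /dc
btrfs check --readonly /dev/drbd1
--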

Regards,
Shyam

On Fri, Feb 23, 2018 at 6:53 PM, Austin S. Hemmelgarn
 wrote:
> On 2018-02-23 06:21, Shyam Prasad N wrote:
>>
>> Hi,
>>
>> Can someone explain me why there is a difference in the number of
>> blocks reported by df and du commands below?
>>
>> =
>> # df -h /dc
>> Filesystem  Size  Used Avail Use% Mounted on
>> /dev/drbd1  746G  519G  225G  70% /dc
>>
>> # btrfs filesystem df -h /dc/
>> Data, single: total=518.01GiB, used=516.58GiB
>> System, DUP: total=8.00MiB, used=80.00KiB
>> Metadata, DUP: total=2.00GiB, used=1019.72MiB
>> GlobalReserve, single: total=352.00MiB, used=0.00B
>>
>> # du -sh /dc
>> 467G/dc
>> =
>>
>> df shows 519G is used. While recursive check using du shows only 467G.
>> The filesystem doesn't contain any snapshots/extra subvolumes.
>> Neither does it contain any mounted filesystem under /dc.
>> I also considered that it could be a void left behind by one of the
>> open FDs held by a process. So I rebooted the system. Still no
>> changes.
>>
>> The situation is even worse on a few other systems with similar
>> configuration.
>>
>
> At least part of this is a difference in how each tool computes space usage.
>
> * `df` calls `statvfs` to get its data, which tries to count physical
> allocation accounting for replication profiles.  In other words, data in
> chunks with the dup, raid1, and raid10 profiles gets counted twice, data in
> raid5 and raid6 chunks gets counted with a bit of extra space for the
> parity, etc.
>
> * `btrfs fi df` looks directly at the filesystem itself and counts how much
> space is available to each chunk type in the `total` values and how much
> space is used in each chunk type in the `used` values, after replication.
> If you add together the data used value and twice the system and metadata
> used values, you get the used value reported by regular `df` (well, close to
> it that is, `df` rounds at a lower precision than `btrfs fi df` does).
>
> * `du` scans the directory tree and looks at the file allocation values
> returned from `stat` calls (or just looks at file sizes if you pass the
> `--apparent-size` flag to it).  Like `btrfs fi df`, it reports values after
> replication, it has a couple of nasty caveats on BTRFS, namely that it will
> report sizes for natively compressed files _before_ compression, and will
> count reflinked blocks once for each link.
>
> Now, this doesn't explain the entirety of the discrepancy with `du`, but it
> should cover the whole difference between `df` and `btrfs fi df`.
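
Worked out with the numbers from the start of this thread, that adds up:

--
# 516.58 GiB data used
# + 2 x 1019.72 MiB metadata used (DUP)  ~= 1.99 GiB
# + 2 x 80 KiB system used (DUP), negligible
# = ~518.57 GiB, which df rounds to the reported 519G
--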



-- 
-Shyam


[PATCH 3/3] btrfs-progs: fsck-tests: Introduce test case with keyed data backref with the extent offset

2018-02-28 Thread Lu Fengqi
Add a test case for the false alert about a lost data extent backref when
the file extent has a non-zero extent offset.

The image can be reproduced by the following commands:
--
dev=~/test.img
mnt=/mnt/btrfs

umount $mnt &> /dev/null
fallocate -l 128M $dev

mkfs.btrfs $dev
mount $dev $mnt

for i in `seq 1 10`; do
xfs_io -f -c "pwrite 0 2K" $mnt/file$i
done

xfs_io -f -c "falloc 0 64K" $mnt/file11

for i in `seq 1 32`; do
xfs_io -f -c "reflink $mnt/file11 0 $(($i * 64))K 64K" $mnt/file11
done

xfs_io -f -c "reflink $mnt/file11 32K $((33 * 64))K 32K" $mnt/file11

btrfs subvolume snapshot $mnt $mnt/snap1

umount $mnt
btrfs-image -c9 $dev extent_data_ref.img
--
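
With the image in place, the existing fsck test suite picks it up; a
sketch of how such a case is typically run from a btrfs-progs tree:

--
make test-fsck TEST=020\*
--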

Signed-off-by: Lu Fengqi 
---
 .../fsck-tests/020-extent-ref-cases/extent_data_ref.img  | Bin 0 -> 6144 bytes
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/extent_data_ref.img

diff --git a/tests/fsck-tests/020-extent-ref-cases/extent_data_ref.img b/tests/fsck-tests/020-extent-ref-cases/extent_data_ref.img
new file mode 100644
index ..3ab2396ba9c810d98f16a5efcf7fe23ee4b12ab5
GIT binary patch
literal 6144
[base85-encoded binary image data, truncated in this archive]

[PATCH 2/3] btrfs-progs: check/lowmem: Fix false alert of data extent backref lost for snapshot

2018-02-28 Thread Lu Fengqi
Btrfs lowmem check reports the following false alert:
--
ERROR: file extent[267 2162688] root 256 owner 5 backref lost
--

The file extent is in a leaf which is shared by file tree 256 and the fs
tree.
--
leaf 30605312 items 46 free space 4353 generation 7 owner 5
..
item 45 key (267 EXTENT_DATA 2162688) itemoff 5503 itemsize 53
generation 7 type 2 (prealloc)
prealloc data disk byte 13631488 nr 65536
prealloc data offset 32768 nr 32768
--

And there is the corresponding extent_data_ref item in the extent tree.
--
item 1 key (13631488 EXTENT_DATA_REF 1007496934287921081) itemoff 15274 itemsize 28
extent data backref root 5 objectid 267 offset 2129920 count 1
--

The offset of the EXTENT_DATA_REF key is the hash of the owner root
objectid, the inode number and the calculated offset (file offset -
extent offset).

What caused the false alert is that the code mixed up the owner root
objectid and the file tree objectid.

Fixes: b0d360b541f0 ("btrfs-progs: check: introduce function to check data backref in extent tree")
Signed-off-by: Lu Fengqi 
---
 check/mode-lowmem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index f37b1b2c1571..6f1ea8db341d 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -2689,8 +2689,8 @@ static int check_extent_data_item(struct btrfs_root *root,
/* Didn't find inlined data backref, try EXTENT_DATA_REF_KEY */
dbref_key.objectid = btrfs_file_extent_disk_bytenr(eb, fi);
dbref_key.type = BTRFS_EXTENT_DATA_REF_KEY;
-   dbref_key.offset = hash_extent_data_ref(root->objectid,
-   fi_key.objectid, fi_key.offset - offset);
+   dbref_key.offset = hash_extent_data_ref(owner, fi_key.objectid,
+   fi_key.offset - offset);
 
ret = btrfs_search_slot(NULL, root->fs_info->extent_root,
&dbref_key, &path, 0, 0);
-- 
2.16.2





[PATCH v2 1/3] btrfs-progs: check/lowmem: Fix the incorrect error message of check_extent_data_item

2018-02-28 Thread Lu Fengqi
Instead of the disk_bytenr and disk_num_bytes of the extent_item which the
file extent references, we should output the objectid and offset of the
file extent. And since the leaf may be shared by file trees, we should also
print the objectid of the root and the owner of the leaf.

Fixes: b0d360b541f0 ("btrfs-progs: check: introduce function to check data backref in extent tree")
Signed-off-by: Lu Fengqi 
---
V2: Output the objectid of the root and the owner of the leaf.

 check/mode-lowmem.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index 62bcf3d2e126..f37b1b2c1571 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -2631,9 +2631,9 @@ static int check_extent_data_item(struct btrfs_root *root,
 
if (!(extent_flags & BTRFS_EXTENT_FLAG_DATA)) {
error(
-   "extent[%llu %llu] backref type mismatch, wanted bit: %llx",
-   disk_bytenr, disk_num_bytes,
-   BTRFS_EXTENT_FLAG_DATA);
+"file extent[%llu %llu] root %llu owner %llu backref type mismatch, wanted 
bit: %llx",
+   fi_key.objectid, fi_key.offset, root->objectid, owner,
+   BTRFS_EXTENT_FLAG_DATA);
err |= BACKREF_MISMATCH;
}
 
@@ -2722,8 +2722,9 @@ out:
err |= BACKREF_MISSING;
btrfs_release_path(&path);
if (err & BACKREF_MISSING) {
-   error("data extent[%llu %llu] backref lost",
- disk_bytenr, disk_num_bytes);
+   error(
+   "file extent[%llu %llu] root %llu owner %llu backref lost",
+   fi_key.objectid, fi_key.offset, root->objectid, owner);
}
return err;
 }
-- 
2.16.2





Re: Fwd: Re: BUG: unable to handle kernel paging request at ffff9fb75f827100

2018-02-28 Thread Qu Wenruo
Hi Christoph,

Since I'm still digging into the unexpected corruption (although without
much progress yet), would you please describe how the corruption
happened?

In my current investigation, btrfs is indeed bullet-proof (unlike my
original assumption), at least when tested with the newer dm-log-writes
tool.

But the free space cache file (v1 free space cache) is not CoW protected,
so it's vulnerable to power loss.

So my current assumption is that at least 2 power losses happened during
the problem.

The 1st power loss corrupted the free space cache in a way its checksum
did not detect, and btrfs used the corrupted free space cache to allocate
tree blocks.

Then the 2nd power loss happened. Since newly allocated tree blocks can
overwrite existing tree blocks, this breaks the metadata CoW of btrfs
and leads to the final corruption.
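
As an aside, if a corrupted v1 cache is suspected, it can be invalidated,
or replaced by the CoW-protected free space tree (a sketch; the device
name is hypothetical):

--
# Rebuild the v1 free space cache on the next mount:
mount -o clear_cache /dev/sdX /mnt
# Or switch to the v2 cache (free space tree), which is CoW protected:
mount -o space_cache=v2 /dev/sdX /mnt
--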

Would you please provide some detailed info about the corruption?

Thanks,
Qu

On 2018-02-23 03:21, Christoph Anton Mitterer wrote:
> On 22 February 2018 04:57:53 CET, Qu Wenruo  wrote:
>>
>>
>> On 2018-02-22 10:56, Christoph Anton Mitterer wrote:
>>> Just one last for today... I did a quick run with the byte nr from
>>> the last mail... See screenshot
>>>
>>> It still gives these mapping errors... But does seem to write
>>> files...
>>
>> From your previous picture, it seems that FS_TREE is your primary
>> subvolume, and 257 would be your snapshot.
>>
>> And for that block which can't be mapped, it seems to be a corruption
>> and it's really too large.
>>
>> So ignoring it wouldn't be a problem.
>>
>> And just keep btrfs restore running to see what it salvages?
>>
>>>
>>> But these mapping errors... Wtf?! 
>>>
>>>
>>> Thanks and until tomorrow.
>>>
>>> Chris
>>>
>>> Oh and in my panic (I still fear that my main data fs, which is on
>>> other hard disks, could be affected by that strange bug too, and have
>>> no idea how to verify they are not) I forgot: you are from China,
>>> aren't you? So a blessed happy new year. :-)
>>
>> Happy new year too.
>>
>> Thanks,
>> Qu
>>
>>>
> 
> Hey
> 
> Have you written more after the mail below? My normal email account 
> ran full and I cannot retrieve anything there right now from this computer.
> 
> Anyway... I tried the restore now and it seems to give back some data 
> (haven't looked at it yet)... I also made a dd copy of the whole fs image to 
> another freshly created btrfs fs (as an image file).
> 
> That seemed to work well, but when I diffed that image against the original, 
> new csum errors on that file were found. (see attached image)
> 
> Could that be a pointer to some hardware defect? Perhaps the memory? Though I 
> did run an extensive memtest86+ a while ago.
> 
> And that could be the reason for the corruption in the first place...
> 
> Thanks,
> Chris.
> 


