Re: btrfs_remove_chunk call trace?
...and can it be related to the Samsung 840 SSDs not supporting NCQ Trim? (Although I can't tell which device this trace is from -- it could be a mechanical Western Digital.)

On Sun, Sep 10, 2017 at 10:16 PM, Rich Rauenzahn wrote:
> Is this something to be concerned about?
>
> I'm running the latest mainline kernel on CentOS 7.
>
> [ 1338.882288] [ cut here ]
> [ 1338.883058] WARNING: CPU: 2 PID: 790 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
> [ 1338.883809] Modules linked in: xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype overlay loop nf_conntrack_ftp nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_comment xt_recent xt_multiport xt_conntrack iptable_filter xt_REDIRECT nf_nat_redirect iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nct6775 nf_nat nf_conntrack hwmon_vid jc42 vfat fat dm_mirror dm_region_hash dm_log dm_mod dax xfs libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel pcbc snd_hda_codec aesni_intel snd_hda_core iTCO_wdt snd_hwdep crypto_simd glue_helper cryptd iTCO_vendor_support snd_seq mei_wdt snd_seq_device intel_cstate cdc_acm snd_pcm intel_rapl_perf
> [ 1338.888639] input_leds snd_timer lpc_ich i2c_i801 pcspkr intel_pch_thermal snd mfd_core sg mei_me soundcore mei acpi_pad shpchp ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables btrfs xor raid6_pq sd_mod crc32c_intel ahci e1000e libahci firewire_ohci igb i915 dca firewire_core ptp i2c_algo_bit crc_itu_t libata pps_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm video
> [ 1338.891412] CPU: 2 PID: 790 Comm: btrfs-cleaner Tainted: G W 4.13.1-1.el7.elrepo.x86_64 #1
> [ 1338.892171] Hardware name: Supermicro X10SAE/X10SAE, BIOS 2.0a 05/09/2014
> [ 1338.892884] task: 880407cec5c0 task.stack: c90002624000
> [ 1338.893613] RIP: 0010:btrfs_update_device+0x1c5/0x1d0 [btrfs]
> [ 1338.894299] RSP: 0018:c90002627d00 EFLAGS: 00010206
> [ 1338.894956] RAX: 0fff RBX: 880407cd9930 RCX: 01d1c1011e00
> [ 1338.895647] RDX: 8800 RSI: 880336e80f9e RDI: 88028395bd88
> [ 1338.896304] RBP: c90002627d48 R08: 3fc2 R09: c90002627cb8
> [ 1338.896934] R10: 1000 R11: 0003 R12: 880405f68c00
> [ 1338.897618] R13: R14: 88028395bd88 R15: 3f9e
> [ 1338.898251] FS: () GS:88041fa8() knlGS:
> [ 1338.898867] CS: 0010 DS: ES: CR0: 80050033
> [ 1338.899522] CR2: 7ff82f2cb000 CR3: 01c09000 CR4: 001406e0
> [ 1338.900157] DR0: DR1: DR2:
> [ 1338.900772] DR3: DR6: fffe0ff0 DR7: 0400
> [ 1338.901402] Call Trace:
> [ 1338.902017] btrfs_remove_chunk+0x2fb/0x8b0 [btrfs]
> [ 1338.902673] btrfs_delete_unused_bgs+0x363/0x440 [btrfs]
> [ 1338.903304] cleaner_kthread+0x150/0x180 [btrfs]
> [ 1338.903908] kthread+0x109/0x140
> [ 1338.904593] ? btree_invalidatepage+0xa0/0xa0 [btrfs]
> [ 1338.905207] ? kthread_park+0x60/0x60
> [ 1338.905803] ret_from_fork+0x25/0x30
> [ 1338.906416] Code: 10 00 00 00 4c 89 fe e8 8a 30 ff ff 4c 89 f7 e8 32 f6 fc ff e9 d3 fe ff ff b8 f4 ff ff ff e9 d4 fe ff ff 0f 1f 00 e8 cb 3e be e0 <0f> ff eb af 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 31 d2 be 02
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: Output time elapsed for each major tree it checked
Marc reported that "btrfs check --repair" runs much faster than "btrfs check", which is quite weird.

This patch will add the time elapsed for each major tree it checked, for both original mode and lowmem mode, so we can have a clue what's going wrong.

Reported-by: Marc MERLIN
Signed-off-by: Qu Wenruo
---
 cmds-check.c | 21 +++++++++++++++++++--
 utils.h      | 24 ++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 006edbde..fee806cd 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -5318,13 +5318,16 @@ static int do_check_fs_roots(struct btrfs_fs_info *fs_info,
 			     struct cache_tree *root_cache)
 {
 	int ret;
+	struct timer timer;
 
 	if (!ctx.progress_enabled)
 		fprintf(stderr, "checking fs roots\n");
+	start_timer(&timer);
 	if (check_mode == CHECK_MODE_LOWMEM)
 		ret = check_fs_roots_v2(fs_info);
 	else
 		ret = check_fs_roots(fs_info, root_cache);
+	printf("done in %d seconds\n", stop_timer(&timer));
 
 	return ret;
 }
@@ -11584,14 +11587,16 @@ out:
 
 static int do_check_chunks_and_extents(struct btrfs_fs_info *fs_info)
 {
 	int ret;
+	struct timer timer;
 
 	if (!ctx.progress_enabled)
 		fprintf(stderr, "checking extents\n");
+	start_timer(&timer);
 	if (check_mode == CHECK_MODE_LOWMEM)
 		ret = check_chunks_and_extents_v2(fs_info);
 	else
 		ret = check_chunks_and_extents(fs_info);
-
+	printf("done in %d seconds\n", stop_timer(&timer));
 	return ret;
 }
@@ -12772,6 +12777,7 @@ int cmd_check(int argc, char **argv)
 	int qgroups_repaired = 0;
 	unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
 	int force = 0;
+	struct timer timer;
 
 	while(1) {
 		int c;
@@ -12953,8 +12959,11 @@ int cmd_check(int argc, char **argv)
 	if (repair)
 		ctree_flags |= OPEN_CTREE_PARTIAL;
 
+	printf("opening btrfs filesystem\n");
+	start_timer(&timer);
 	info = open_ctree_fs_info(argv[optind], bytenr, tree_root_bytenr,
 				  chunk_root_bytenr, ctree_flags);
+	printf("done in %d seconds\n", stop_timer(&timer));
 	if (!info) {
 		error("cannot open file system");
 		ret = -EIO;
@@ -13115,8 +13124,10 @@ int cmd_check(int argc, char **argv)
 		else
 			fprintf(stderr, "checking free space cache\n");
 	}
+	start_timer(&timer);
 	ret = check_space_cache(root);
 	err |= !!ret;
+	printf("done in %d seconds\n", stop_timer(&timer));
 	if (ret) {
 		if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 			error("errors found in free space tree");
@@ -13140,18 +13151,22 @@ int cmd_check(int argc, char **argv)
 	}
 
 	fprintf(stderr, "checking csums\n");
+	start_timer(&timer);
 	ret = check_csums(root);
 	err |= !!ret;
+	printf("done in %d seconds\n", stop_timer(&timer));
 	if (ret) {
 		error("errors found in csum tree");
 		goto out;
 	}
 
-	fprintf(stderr, "checking root refs\n");
 	/* For low memory mode, check_fs_roots_v2 handles root refs */
 	if (check_mode != CHECK_MODE_LOWMEM) {
+		fprintf(stderr, "checking root refs\n");
+		start_timer(&timer);
 		ret = check_root_refs(root, &root_cache);
 		err |= !!ret;
+		printf("done in %d seconds\n", stop_timer(&timer));
 		if (ret) {
 			error("errors found in root refs");
 			goto out;
@@ -13186,8 +13201,10 @@ int cmd_check(int argc, char **argv)
 
 	if (info->quota_enabled) {
 		fprintf(stderr, "checking quota groups\n");
+		start_timer(&timer);
 		ret = qgroup_verify_all(info);
 		err |= !!ret;
+		printf("done in %d seconds\n", stop_timer(&timer));
 		if (ret) {
 			error("failed to check quota groups");
 			goto out;
diff --git a/utils.h b/utils.h
index d28a05a6..159487db 100644
--- a/utils.h
+++ b/utils.h
@@ -172,4 +172,28 @@ u64 rand_u64(void);
 unsigned int rand_range(unsigned int upper);
 void init_rand_seed(u64 seed);
 
+/* Utils to report time duration */
+struct timer {
+	time_t start;
+};
+
+static inline void start_timer(struct timer *t)
+{
+	time(&t->start);
+}
+
+/*
+ * Stop timer and return the time elapsed as int
+ *
+ * int should be large enough for "btrfs check" and avoids
+ * type converting.
+ */
+static inline int stop_timer(struct timer *t)
+{
+	time_t end;
+
+	time(&end);
+
+	return (int)(difftime(end, t->start));
+}
 #endif
-- 
2.14.1
Re: Regarding handling of file renames in Btrfs
On 2017-09-10 22:34, Martin Raiber wrote:
> Hi,
>
> On 10.09.2017 08:45 Qu Wenruo wrote:
>> On 2017-09-10 14:41, Qu Wenruo wrote:
>>> On 2017-09-10 07:50, Rohan Kadekodi wrote:
>>>> Hello,
>>>>
>>>> I was trying to understand how file renames are handled in Btrfs. I
>>>> read the code documentation, but had a problem understanding a few
>>>> things. During a file rename, btrfs_commit_transaction() is called,
>>>> which is because Btrfs has to commit the whole FS before storing the
>>>> information related to the new renamed file. It has to commit the FS
>>>> because a rename first does an unlink, which is not recorded in the
>>>> btrfs_rename() transaction and so is not logged in the log tree. Is
>>>> my understanding correct? If yes, my questions are as follows:
>>>
>>> Not familiar with the rename kernel code, so not much help for the
>>> rename operation.
>>>
>>>> 1. What does committing the whole FS mean?
>>>
>>> Committing the whole fs means a lot of things, but generally speaking,
>>> it makes the on-disk data consistent with each other.
>>>
>>> For the obvious part, it writes modified fs/subvolume trees to disk
>>> (with handling of tree operations so there are no half-modified
>>> trees), and also other trees like the extent tree (very hot since
>>> every CoW will update it, and the most complicated one) and the csum
>>> tree if modified.
>>>
>>> After the transaction is committed, the on-disk btrfs will represent
>>> the state at the time commit trans was called, and every tree should
>>> match each other.
>>>
>>> Besides this, after a transaction is committed, the generation of the
>>> fs gets increased and modified tree blocks will have the same
>>> generation number.
>>>
>>>> Blktrace shows that there are 2 256KB writes, which are essentially
>>>> writes to the data of the root directory of the file system (which I
>>>> found out through btrfs-debug-tree).
>>>
>>> I'd say you didn't check the btrfs-debug-tree output carefully enough.
>>> I strongly recommend doing a vimdiff to see what trees are modified.
>>>
>>> At least the following trees are modified:
>>>
>>> 1) fs/subvolume tree
>>>    Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>>    updates the inode time. So the fs/subvolume tree must be CoWed.
>>>
>>> 2) extent tree
>>>    CoW of the above metadata will definitely cause extent allocation
>>>    and freeing, so the extent tree will also get updated.
>>>
>>> 3) root tree
>>>    Both the extent tree and fs/subvolume tree are modified, so their
>>>    root bytenr needs to be updated and the root tree must be updated.
>>>
>>> And finally the superblocks.
>>
>> I just verified the behavior with an empty btrfs created on a 1G file,
>> with only one file to do the rename. In that case (with 4K sectorsize
>> and 16K nodesize), the total IO should be (3 * 16K) * 2 + 4K * 2 = 104K.
>>
>> "3" = number of tree blocks that get modified
>> "16K" = nodesize
>> 1st "*2" = DUP profile for metadata
>> "4K" = superblock size
>> 2nd "*2" = 2 superblocks for a 1G fs.
>>
>> If your extent/root/fs trees have a higher level, then more tree
>> blocks need to be updated. And if your fs is very large, you may have
>> 3 superblocks.
>>
>>>> Is this equivalent to doing a shell sync, as the same block groups
>>>> are written during a shell sync too?
>>>
>>> For shell "sync" the difference is that "sync" will write all dirty
>>> data pages to disk, and then commit the transaction, while only
>>> calling btrfs_commit_transaction() doesn't trigger dirty page
>>> writeback. So there is a difference.
>
> this conversation made me realize why btrfs has sub-optimal meta-data
> performance. Cow b-trees are not the best data structure for such
> small changes. In my application I have multiple operations (e.g.
> renames) which can be bundled up, and (mostly) one writer.

Things are more complicated in fact.

For example, even if you are only renaming/moving one file, you're going to modify at least 6 items. They are:

1) Removing the DIR_INDEX of the original parent dir inode
   Assume the original parent dir inode number is 300.
   We are removing (300 DIR_INDEX ).

2) Removing the DIR_ITEM of the original parent dir inode
   We are removing (300 DIR_ITEM )

3) Removing the INODE_REF of the renamed inode
   Assume the renamed inode number is 400.
   We are removing (400 INODE_REF 300).

4) Adding a new DIR_INDEX to the new parent dir inode
   Assume the new parent dir inode number is 500.
   We are adding (500 DIR_INDEX )

5) Adding a new DIR_ITEM to the new parent dir inode
   We are adding (500 DIR_ITEM )

6) Adding a new INODE_REF to the renamed inode
   We are adding (400 INODE_REF 500)

As you can see, there are 6 key modifications, and we can't ensure they are all in one leaf. In the worst case, we need to CoW the tree 6 times for different leaves. (Although a CoWed tree block won't be CoWed again until written to disk, which reduces the overhead.)

And even more: if you modify one tree, you must also modify the ROOT_ITEM pointing to that tree, which leads to root tree CoW.

I have a crazy idea of double-buffering tree blocks. That's to say, one tree block actually consists of 2 real tree blocks, and when CoW happens, we just switch to the other tree block. Then we don't really need to update its parent pointer, so we can limit the CoW-affected range to a minimum. But it's
[PATCH] btrfs-progs: update btrfs-completion
This patch updates btrfs-completion:

- add "filesystem du" and "rescue zero-log"
- restrict _btrfs_mnts to show btrfs type only
- add more completion in the last case statements

(This file contains both spaces/tabs and may need cleanup.)

Signed-off-by: Tomohiro Misono
---
 btrfs-completion | 43 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/btrfs-completion b/btrfs-completion
index 3ede77b..1f00add 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -16,7 +16,7 @@ _btrfs_mnts()
 {
 	local MNTS
 	MNTS=''
 	while read mnt; do MNTS+="$mnt "
-	done < <(mount | awk '{print $3}')
+	done < <(mount -t btrfs | awk '{print $3}')
 	COMPREPLY+=( $( compgen -W "$MNTS" -- "$cur" ) )
 }
@@ -31,11 +31,11 @@ _btrfs()
 
 	commands='subvolume filesystem balance device scrub check rescue restore inspect-internal property send receive quota qgroup replace help version'
 	commands_subvolume='create delete list snapshot find-new get-default set-default show sync'
-	commands_filesystem='defragment sync resize show df label usage'
+	commands_filesystem='defragment sync resize show df du label usage'
 	commands_balance='start pause cancel resume status'
 	commands_device='scan add delete remove ready stats usage'
 	commands_scrub='start cancel resume status'
-	commands_rescue='chunk-recover super-recover'
+	commands_rescue='chunk-recover super-recover zero-log'
 	commands_inspect_internal='inode-resolve logical-resolve subvolid-resolve rootid min-dev-size dump-tree dump-super tree-stats'
 	commands_property='get set list'
 	commands_quota='enable disable rescan'
@@ -114,6 +114,10 @@ _btrfs()
 			_filedir
 			return 0
 			;;
+		df|usage)
+			_btrfs_mnts
+			return 0
+			;;
 		label)
 			_btrfs_mnts
 			_btrfs_devs
@@ -125,6 +129,26 @@ _btrfs()
 			_btrfs_devs
 			return 0
 			;;
+		inspect-internal)
+			case $prev in
+				min-dev-size)
+					_btrfs_mnts
+					return 0
+					;;
+				rootid)
+					_filedir
+					return 0
+					;;
+			esac
+			;;
+		receive)
+			case $prev in
+				-f)
+					_filedir
+					return 0
+					;;
+			esac
+			;;
 		replace)
 			case $prev in
 				status|cancel)
@@ -137,14 +161,17 @@ _btrfs()
 					;;
 			esac
 			;;
+		subvolume)
+			case $prev in
+				list)
+					_btrfs_mnts
+					return 0
+					;;
+			esac
+			;;
 		esac
 	fi
 
-	if [[ "$cmd" == "receive" && "$prev" == "-f" ]]; then
-		_filedir
-		return 0
-	fi
-
 	_filedir -d
 	return 0
 }
-- 
2.9.5
btrfs_remove_chunk call trace?
Is this something to be concerned about?

I'm running the latest mainline kernel on CentOS 7.

[ 1338.882288] [ cut here ]
[ 1338.883058] WARNING: CPU: 2 PID: 790 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
[ 1338.883809] Modules linked in: xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype overlay loop nf_conntrack_ftp nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_comment xt_recent xt_multiport xt_conntrack iptable_filter xt_REDIRECT nf_nat_redirect iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nct6775 nf_nat nf_conntrack hwmon_vid jc42 vfat fat dm_mirror dm_region_hash dm_log dm_mod dax xfs libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel pcbc snd_hda_codec aesni_intel snd_hda_core iTCO_wdt snd_hwdep crypto_simd glue_helper cryptd iTCO_vendor_support snd_seq mei_wdt snd_seq_device intel_cstate cdc_acm snd_pcm intel_rapl_perf
[ 1338.888639] input_leds snd_timer lpc_ich i2c_i801 pcspkr intel_pch_thermal snd mfd_core sg mei_me soundcore mei acpi_pad shpchp ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables btrfs xor raid6_pq sd_mod crc32c_intel ahci e1000e libahci firewire_ohci igb i915 dca firewire_core ptp i2c_algo_bit crc_itu_t libata pps_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm video
[ 1338.891412] CPU: 2 PID: 790 Comm: btrfs-cleaner Tainted: G W 4.13.1-1.el7.elrepo.x86_64 #1
[ 1338.892171] Hardware name: Supermicro X10SAE/X10SAE, BIOS 2.0a 05/09/2014
[ 1338.892884] task: 880407cec5c0 task.stack: c90002624000
[ 1338.893613] RIP: 0010:btrfs_update_device+0x1c5/0x1d0 [btrfs]
[ 1338.894299] RSP: 0018:c90002627d00 EFLAGS: 00010206
[ 1338.894956] RAX: 0fff RBX: 880407cd9930 RCX: 01d1c1011e00
[ 1338.895647] RDX: 8800 RSI: 880336e80f9e RDI: 88028395bd88
[ 1338.896304] RBP: c90002627d48 R08: 3fc2 R09: c90002627cb8
[ 1338.896934] R10: 1000 R11: 0003 R12: 880405f68c00
[ 1338.897618] R13: R14: 88028395bd88 R15: 3f9e
[ 1338.898251] FS: () GS:88041fa8() knlGS:
[ 1338.898867] CS: 0010 DS: ES: CR0: 80050033
[ 1338.899522] CR2: 7ff82f2cb000 CR3: 01c09000 CR4: 001406e0
[ 1338.900157] DR0: DR1: DR2:
[ 1338.900772] DR3: DR6: fffe0ff0 DR7: 0400
[ 1338.901402] Call Trace:
[ 1338.902017] btrfs_remove_chunk+0x2fb/0x8b0 [btrfs]
[ 1338.902673] btrfs_delete_unused_bgs+0x363/0x440 [btrfs]
[ 1338.903304] cleaner_kthread+0x150/0x180 [btrfs]
[ 1338.903908] kthread+0x109/0x140
[ 1338.904593] ? btree_invalidatepage+0xa0/0xa0 [btrfs]
[ 1338.905207] ? kthread_park+0x60/0x60
[ 1338.905803] ret_from_fork+0x25/0x30
[ 1338.906416] Code: 10 00 00 00 4c 89 fe e8 8a 30 ff ff 4c 89 f7 e8 32 f6 fc ff e9 d3 fe ff ff b8 f4 ff ff ff e9 d4 fe ff ff 0f 1f 00 e8 cb 3e be e0 <0f> ff eb af 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 31 d2 be 02
Re: Help me understand what is going on with my RAID1 FS
10.09.2017 23:17, Dmitrii Tcvetkov wrote:
>>> Drive1  Drive2  Drive3
>>> X       X
>>> X               X
>>>         X       X
>>>
>>> Where X is a chunk of raid1 block group.
>>
>> But this table clearly shows that adding the third drive increases free
>> space by 50%. You need to reallocate data to actually make use of it,
>> but it was done in this case.
>
> It increases it but I don't see how this space is in any way useful
> unless data is in single profile. After a full balance chunks will be
> spread over 3 devices; how does it help in the raid1 data profile case?

A1 A2  =>  A1 A2 -  =>  A1 A2 B1  =>  A1 A2 B1
B1 B2      B1 B2 -      -  B2 -       C1 B2 C2

It is raid1 profile on three disks, fully utilizing them (assuming equal sizes, of course). Where "raid1" means: each data block has two copies on different devices.
Re: Regarding handling of file renames in Btrfs
On 2017-09-10 22:32, Rohan Kadekodi wrote:
> Thank you for the prompt and elaborate answers! However, I think I was
> unclear in my questions, and I apologize for the confusion.
>
> What I meant was that for a file rename, when I check the blktrace
> output, there are 2 writes of 256KB each starting from byte number
> 13373440. When I check btrfs-debug-tree, I see that the following
> items are related to it:
>
> 1) root tree:
>    key (256 EXTENT_DATA 0) itemoff 13649 itemsize 53
>    extent data disk byte 13373440 nr 262144
>    extent data offset 0 nr 262144 ram 262144
>    extent compression 0
>
> 2) extent tree:
>    key (13373440 EXTENT_ITEM 262144) itemoff 15040 itemsize 53
>    extent refs 1 gen 12 flags DATA
>    extent data backref root 1 objectid 256 offset 0 count 1
>
> So this means that the extent allocated to the root folder (mount
> point) is getting written twice, right? Here I am not talking about
> any metadata, but the data in the extent allocated to the root folder,
> that is inode number 256.

Such extent data is used by the free space cache. If using the nospace_cache or space_cache=v2 mount option, there will be no such thing.

The free space cache is used for recording free and used space for each chunk (or block group, which is mostly the same thing). Since CoW happens for the metadata chunk, its used/free space mapping gets modified and then the free space cache will also be updated.

BTW, some difference in term usage makes me a little confused. Personally speaking, we call root 1 the "tree root" or "root tree", not the root directory, as in fact such a tree doesn't contain any real file/directory.

> When I was analyzing the code, I saw that these writes happened from
> btrfs_start_dirty_block_groups(), which is in
> btrfs_commit_transaction(). This is the same thing that is getting
> written on a filesystem commit. So my questions were:
>
> 1) Why are there 2 256KB writes happening during a filesystem commit
> to the same location instead of just 1? Also, what exactly is written
> in the root folder of the file system? Again, I am talking about the
> data held in the extent allocated to inode 256 and not about any
> metadata or any tree.

As stated above, the EXTENT_DATA in the root tree is for the space cache (v1), which uses a NoCOW file extent to record free space. And there is such a space cache for each block group.

Furthermore, since it's EXTENT_DATA, it counts as DATA, so it follows your data profile (default to single for a single device and RAID0 for multiple devices). If not using DUP as the data profile, then you have 2 block groups getting modified.

> 2) I understand by the on-disk format that all the child dir/inode
> info in one subvolume are in the same tree, but these writes that I am
> talking about are not to any tree, they are to the data held in inode
> 256, which happens to be the mount point. So by root directory, I mean
> the mount point or inode 256 (not any tree).

As mentioned before, it's better to call it the "root tree" as it doesn't really represent a directory.

> And even though metadata-wise there is no hierarchy as such in the
> file system, each folder's data will only contain the data belonging
> to its children, right?

The sentence is confusing to me now. By "folder" did you mean a normal directory? And how do you define "data belonging to its children"?

As stated before, there is no real boundary for an inode (including normal files and directories). All inode data (including EXTENT_DATA for a regular file and DIR_INDEX/DIR_ITEM for a directory inode) are just sequential keys (with their data) in a subvolume. So without your definition of "belonging to" I can't get the point.

> Hence my question was: why does the data in the extent allocated to
> inode 256 need to be rewritten, instead of just the parent folder, for
> a rename?

My first paragraph explained this.

BTW, for your concerned EXTENT_DATA in root 1 (root tree), it's used by the following sequence (BTRFS_ prefix omitted; all keys are in root 1):

(FREE_SPACE_OBJECTID, 0, )
Its structure, btrfs_free_space_header, contains a key referring to an inode, which is a regular file inode. The inode key will be (, INODE_ITEM, 0).

Then, still in the tree root (rootid 1), search using the (, INODE_ITEM, 0) key to locate the free space cache inode.

Finally btrfs will just read the data stored for this inode, using its (, EXTENT_DATA, ) to locate its real data on disk, and read it out.

For details like what the space cache looks like, you need to check the free space cache code. (In short, it's a mess, so we have space_cache=v2, which uses a normal btrfs B-tree to store such info, and btrfs-debug-tree can show it easily.)

And of course, for a transaction commit, each dirty block group needs to update its free space cache. The free space cache file has the NODATACOW flag, so the free space cache itself has its own checksum mechanism, and normally the whole free space cache file is updated.

Thanks,
Qu

> Thanks,
> Rohan
>
> On 10 September 2017 at 01:45, Qu Wenruo
Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)
On Sun, Sep 10, 2017 at 01:16:26PM +, Josef Bacik wrote:
> Great, if the free space cache is fucked again after the next go
> around then I need to expand the verifier to watch entries being added
> to the cache as well. Thanks,

Well, I copied about 1TB of data, and nothing happened. So it seems clearing it and fsck may have fixed this fault I had been carrying for quite a while. If so, yeah!

I'm not sure if this needs a kernel fix to not get triggered, and if btrfs check should also be improved to catch this, but hopefully you know what makes sense there.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Help me understand what is going on with my RAID1 FS
FLJ posted on Sun, 10 Sep 2017 15:45:42 +0200 as excerpted:

> I have a BTRFS RAID1 volume running for the past year. I avoided all
> pitfalls known to me that would mess up this volume. I never
> experimented with quotas, no-COW, snapshots, defrag, nothing really.
> The volume is a RAID1 from day 1 and is working reliably until now.
>
> Until yesterday it consisted of two 3 TB drives, something along the
> lines:
>
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>     Total devices 2 FS bytes used 2.47TiB
>     devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
>     devid 2 size 2.73TiB used 2.47TiB path /dev/sdc

I'm going to try a different approach than I see in the two existing subthreads, so I started from scratch with my own subthread...

So the above looks reasonable so far...

> Yesterday I've added a new drive to the FS and did a full rebalance
> (without filters) over night, which went through without any issues.
>
> Now I have:
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>     Total devices 3 FS bytes used 2.47TiB
>     devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
>     devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
>     devid 3 size 7.28TiB used 2.48TiB path /dev/sda

That's exactly as expected after a balance.

Note the size, 2.73 TiB (twos-power) for the smaller two, not 3 (tho it's probably 3 TB, tens-power), and 7.28 TiB, not 8, for the larger one.

The most-free-space chunk allocation, with raid1-paired chunks, means the first chunk of every pair will get allocated to the largest, 7.28 TiB device. The other two devices are equal in size, 2.73 TiB each, and the second chunk can't get allocated to the largest device, as only one chunk of the pair can go there, so the allocator will in general alternate allocations from the smaller two for the second chunk of each pair.
(I say in general, because metadata chunks are smaller than data chunks, so it's possible that two chunks in a row, a metadata chunk and a data chunk, will be allocated from the same device before it switches to the other.)

Because the larger device is larger than the other two combined, it'll always get one copy, while the others fill up evenly at half the usage of the larger device, until both smaller devices are full, at which point you won't be able to allocate further raid1 chunks and you'll get ENOSPC.

> # btrfs fi df /mnt/BigVault/
> Data, RAID1: total=2.47TiB, used=2.47TiB
> System, RAID1: total=32.00MiB, used=384.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.74GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

Still looks reasonable.

Note that assuming you're using a reasonably current btrfs-progs, there are also the btrfs fi usage and btrfs dev usage commands. Btrfs fi df is an older form that has much less information than the fi and dev usage commands, tho between btrfs fi show and btrfs fi df, /most/ of the filesystem-level information in btrfs fi usage can be deduced, tho not necessarily the device-level detail. Btrfs fi usage is thus preferred, assuming it's available to you. (In addition to btrfs fi usage being newer, both it and btrfs fi df require a mounted btrfs. If the filesystem refuses to mount, btrfs fi show may be all that's available.)

While I'm digressing: I'm guessing you know this already, but for others, the global reserve is reserved from and comes out of metadata, so you can add the global reserve total to metadata used. Normally, btrfs won't use anything from the global reserve, so usage there will be zero.
If it's not, that's a very strong indication that your filesystem believes it is very short on space (even if data and metadata say they both have lots of unused space left -- for some reason, very likely a bug in that case, the filesystem believes otherwise), and you need to take corrective action immediately, or risk the filesystem effectively going read-only when nothing else can be written.

> But still df -h is giving me:
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>
> Although I've heard and read about the difficulty in reporting free
> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
> but I only have a single volume, no subvolumes, no snapshots, no quotas
> and both data and metadata are RAID1.

The most practical advice I've seen regarding "normal" df (that is, the one from coreutils, not btrfs fi df) in the case of uneven device sizes in particular is: simply ignore its numbers -- they're not reliable. The only thing you need to be sure of is that it says you have enough space for whatever you're actually doing ATM, since various applications will trust its numbers and may refuse to do whatever filesystem operation at all if it says there's not enough space.

The algorithm reasonably new coreutils df (and the kernel calls it depends on) uses is much better
Re: Help me understand what is going on with my RAID1 FS
On Sun, 10 Sep 2017 20:15:52 +0200, Ferenc-Levente Juhos wrote:
>> Problem is that each raid1 block group contains two chunks on two
>> separate devices, it can't fully utilize three devices no matter
>> what. If that doesn't suit you then you need to add a 4th disk. After
>> that the FS will be able to use all unallocated space on all disks in
>> raid1 profile. But even then you'll be able to safely lose only one
>> disk since BTRFS will still be storing only 2 copies of data.
>
> I hope I didn't say that I want to utilize all three devices fully. It
> was clear to me that there will be 2 TB of wasted space.
> Also I'm not questioning the chunk allocator for RAID1 at all. It's
> clear and it always has been clear that for RAID1 the chunks need to
> be allocated on different physical devices.
> If I understood Kai's point of view, he even suggested that I might
> need to do balancing to make sure that the free space on the three
> devices is being used smartly. Hence the questions about balancing.

It will allocate chunks from the device with the most space available. So while you fill your disks, space usage will distribute evenly. The problem comes when you start deleting stuff: some chunks may even be freed, and everything becomes messed up. In an aging file system you may notice that the chunks are no longer evenly distributed.

A balance is a way to fix that, because it will reallocate chunks and coalesce data back into fewer chunks, making free space for new allocations. In this process it will actually evenly distribute your data again. You may want to use this rebalance script: https://www.spinics.net/lists/linux-btrfs/msg52076.html

> I mean in the worst case it could happen like this:
>
> Again I have disks of sizes 3, 3, 8:
> Fig. 1
> Drive1(8)  Drive2(3)  Drive3(3)
> -          X1         X1
> -          X2         X2
> -          X3         X3
>
> Here the new drive is completely unused. Even if one X1 chunk would be
> on Drive1 it would still be a sub-optimal allocation.

This won't happen while filling a fresh btrfs.
Chunks are always allocated from the device with the most free space (within the raid1 constraints). Thus it will allocate space alternating between disk1+2 and disk1+3.

> This is the optimal allocation. Will btrfs allocate like this?
> Considering that Drive1 has the most free space.
> Fig. 2
> Drive1(8)  Drive2(3)  Drive3(3)
> X1         X1         -
> X2         -          X2
> X3         X3         -
> X4         -          X4

Yes.

> From my point of view Fig. 2 shows the optimal allocation: by the time
> the disks Drive2 and Drive3 are full (3 TB), Drive1 must have 6 TB
> (because it is exclusively holding the mirrors for both Drive2 and 3).
> For sure now btrfs can say, since two of the drives are completely
> full, it can't allocate any more chunks, and the remaining 2 TB of
> space from Drive1 is wasted. This is clear; it's even pointed out by
> the btrfs size calculator.

Yes.

> But again, if the above statements are true, then df might as well tell
> the "truth" and report that I have 3.5 TB of space free and not 1.5 TB
> (as it is reported now). Again here I fully understand Kai's
> explanation. Because coming back to my first e-mail, my "problem" was
> that df is reporting 1.5 TB free, whereas the whole FS holds 2.5 TB of
> data.

The size calculator has undergone some revisions. I think it currently estimates the free space from the net-data-to-raw-data ratio across all devices, taking the current raid constraints into account. Calculating free space in btrfs is difficult because in the future btrfs may even support different raid levels for different subvolumes. It's probably best to calculate for the worst-case scenario then.

Even today it's already difficult if you use different raid levels for metadata and content data: the filesystem cannot predict the future of allocations. It can only give an educated guess. And the calculation was revised a few times to not "overshoot".
> So the question still remains, is it just that df is intentionally not
> smart enough to give a more accurate estimation,

The df utility doesn't know anything about btrfs allocations. The value is estimated by btrfs itself. To get more detailed info for capacity planning, you should use "btrfs fi df" and its various siblings.

> or is the assumption
> that the allocator picks the drive with most free space mistaken?
> If I continue along the lines of what Kai said, and I need to do a
> re-balance because the allocation is not like shown above (Fig.2),
> then my question is still legitimate. Are there any filters that one
> might use to speed up or to selectively balance in my case? Or will I
> need to do a full balance?

Your assumption is misguided. The total free space estimation is a totally different thing than what the allocator bases its decision on. See "btrfs dev usage". The allocator uses space from the biggest unallocated space
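The allocator behavior discussed in this thread (each new raid1 block group takes space from the two devices with the most unallocated space) can be sketched with a tiny simulation. This is an illustrative model only, not btrfs code; `allocate_raid1` is a made-up helper and the 1 TB chunk granularity is purely for readability:

```python
def allocate_raid1(sizes, chunk=1):
    """Greedy raid1 allocation sketch: each block group takes `chunk`
    from the two devices with the most unallocated space."""
    free = list(sizes)
    placements = []
    while True:
        # Rank devices by unallocated space, largest first.
        order = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
        a, b = order[0], order[1]
        if free[b] < chunk:
            break  # fewer than two devices have room: allocation stops
        free[a] -= chunk
        free[b] -= chunk
        placements.append((a, b))
    return placements, free

# Disks of 8, 3 and 3 TB, allocating 1 TB chunks:
placements, leftover = allocate_raid1([8, 3, 3])
print(len(placements), leftover)  # → 6 [2, 0, 0]
```

Six block groups fit, i.e. 6 TB of mirrored data, with 2 TB stranded on the big disk: the greedy rule reproduces Fig. 2 rather than Fig. 1.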
Re: Help me understand what is going on with my RAID1 FS
> > Drive1  Drive2  Drive3
> > X       X
> > X               X
> >         X       X
> >
> > Where X is a chunk of raid1 block group.
>
> But this table clearly shows that adding the third drive increases free
> space by 50%. You need to reallocate data to actually make use of it,
> but it was done in this case.

It increases it, but I don't see how this space is in any way useful unless data is in the single profile. After a full balance chunks will be spread over 3 devices; how does that help in the raid1 data profile case?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Help me understand what is going on with my RAID1 FS
On 10.09.2017 19:11, Dmitrii Tcvetkov wrote:
>> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
>> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
>> but that only 1.5 TB is available is even stranger.
>>
>> Could anyone explain what I did wrong or why my expectations are wrong?
>>
>> Thank you in advance
>
> I'd say df and the website calculate different things. In btrfs the raid1 profile
> stores exactly 2 copies of data, each copy on a separate device.
> So by adding a third drive, no matter how big, effective free space didn't
> expand, because btrfs still needs space on one of the other two drives to
> store the second half of each raid1 chunk stored on that third drive.
>
> Basically:
>
> Drive1  Drive2  Drive3
> X       X
> X               X
>         X       X
>
> Where X is a chunk of raid1 block group.

But this table clearly shows that adding the third drive increases free space by 50%. You need to reallocate data to actually make use of it, but it was done in this case.
Re: Help me understand what is going on with my RAID1 FS
On 10.09.2017 18:47, Kai Krakow wrote:
> On Sun, 10 Sep 2017 15:45:42 +0200, FLJ wrote:
>
>> Hello all,
>>
>> I have a BTRFS RAID1 volume running for the past year. I avoided all
>> pitfalls known to me that would mess up this volume. I never
>> experimented with quotas, no-COW, snapshots, defrag, nothing really.
>> The volume is a RAID1 from day 1 and is working reliably until now.
>>
>> Until yesterday it consisted of two 3 TB drives, something along the
>> lines:
>>
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 2 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
>> devid 2 size 2.73TiB used 2.47TiB path /dev/sdc
>>
>> Yesterday I've added a new drive to the FS and did a full rebalance
>> (without filters) over night, which went through without any issues.
>>
>> Now I have:
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 3 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
>> devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
>> devid 3 size 7.28TiB used 2.48TiB path /dev/sda
>>
>> # btrfs fi df /mnt/BigVault/
>> Data, RAID1: total=2.47TiB, used=2.47TiB
>> System, RAID1: total=32.00MiB, used=384.00KiB
>> Metadata, RAID1: total=4.00GiB, used=2.74GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> But still df -h is giving me:
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>>
>> Although I've heard and read about the difficulty in reporting free
>> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
>> I only have a single volume, no subvolumes, no snapshots, no
>> quotas, and both data and metadata are RAID1.
>>
>> My expectation would've been that in the case of BigVault Size == Used +
>> Avail.
>>
>> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
>> would've expected 6 TB of usable space.
>> Here I get 6.4 which is odd,

Total size is an estimation, which in this case is computed as (sum of device sizes)/2, which is approximately 6.4 TiB.

>> but that only 1.5 TB is available is even stranger.
>>
>> Could anyone explain what I did wrong or why my expectations are
>> wrong?
>>
>> Thank you in advance
>
> Btrfs reports estimated free space from the free space of the smallest
> member as it can only guarantee that.

It's not exactly true. For three devices with free space of 1 TiB, 2 TiB and 3 TiB it would return 2 TiB as available space. But it is not sophisticated enough to notice that it actually has 3 TiB available. I wonder if this is only the free space calculation, or whether the actual allocation algorithm behaves similarly (effectively ignoring part of the available space).
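Andrei's 1/2/3 TiB example can be checked against the usual pairing bound for mirrored allocation (the same reasoning the carfax calculator applies to raid1): attainable capacity is capped both by half the total free space and by how much the other devices can mirror against the largest one. This is a back-of-the-envelope sketch, not the estimator btrfs actually implements, and `raid1_attainable` is a made-up name:

```python
def raid1_attainable(free):
    """Upper bound on raid1 data fitting on devices with the given
    unallocated space: every chunk lands on two different devices, so
    capacity is min(total/2, total - largest)."""
    total = sum(free)
    return min(total // 2, total - max(free))

print(raid1_attainable([1, 2, 3]))  # → 3 (while btrfs reports only 2)
print(raid1_attainable([8, 3, 3]))  # → 6 (2 left stranded on the big disk)
```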
Re: Help me understand what is going on with my RAID1 FS
> Problem is that each raid1 block group contains two chunks on two
> separate devices, it can't fully utilize three devices no matter what.
> If that doesn't suit you then you need to add a 4th disk. After
> that the FS will be able to use all unallocated space on all disks in
> the raid1 profile. But even then you'll be able to safely lose only one
> disk since BTRFS will still be storing only 2 copies of data.

I hope I didn't say that I want to utilize all three devices fully. It was clear to me that there would be 2 TB of wasted space. Also, I'm not questioning the chunk allocator for RAID1 at all. It's clear, and it always has been clear, that for RAID1 the chunks need to be allocated on different physical devices. If I understood Kai's point of view, he even suggested that I might need to do balancing to make sure that the free space on the three devices is being used smartly. Hence the questions about balancing.

I mean in the worst case it could happen like this:

Again I have disks of sizes 3, 3, 8:

Fig.1
Drive1(8)  Drive2(3)  Drive3(3)
-          X1         X1
-          X2         X2
-          X3         X3

Here the new drive is completely unused. Even if one X1 chunk were on Drive1 it would still be a sub-optimal allocation.

This is the optimal allocation. Will btrfs allocate like this? Considering that Drive1 has the most free space.

Fig. 2
Drive1(8)  Drive2(3)  Drive3(3)
X1         X1         -
X2         -          X2
X3         X3         -
X4         -          X4

From my point of view Fig.2 shows the optimal allocation: by the time the disks Drive2 and Drive3 are full (3 TB), Drive1 must have 6 TB (because it is exclusively holding the mirrors for both Drive2 and 3). For sure btrfs can now say, since two of the drives are completely full, that it can't allocate any more chunks, and the remaining 2 TB of space from Drive1 is wasted. This is clear; it's even pointed out by the btrfs size calculator.

But again, if the above statements are true, then df might as well tell the "truth" and report that I have 3.5 TB of space free and not 1.5 TB (as it is reported now). Again, here I fully understand Kai's explanation.
Because, coming back to my first e-mail, my "problem" was that df is reporting 1.5 TB free, whereas the whole FS holds 2.5 TB of data.

So the question still remains: is it just that df is intentionally not smart enough to give a more accurate estimation, or is the assumption that the allocator picks the drive with the most free space mistaken? If I continue along the lines of what Kai said, and I need to do a re-balance because the allocation is not like shown above (Fig.2), then my question is still legitimate. Are there any filters that one might use to speed up or to selectively balance in my case? Or will I need to do a full balance?

On Sun, Sep 10, 2017 at 7:19 PM, Dmitrii Tcvetkov wrote:
>> @Kai and Dmitrii
>> thank you for your explanations. If I understand you correctly, you're
>> saying that btrfs makes no attempt to "optimally" use the physical
>> devices it has in the FS; once a new RAID1 block group needs to be
>> allocated it will semi-randomly pick two devices with enough space and
>> allocate two equal sized chunks, one on each. This new chunk may or
>> may not fall onto my newly added 8 TB drive. Am I understanding this
>> correctly?
>
> If I remember correctly the chunk allocator allocates new chunks on the
> device which has the most unallocated space.
>
>> Is there some sort of balance filter that would speed up this sort of
>> balancing? Will balance be smart enough to make the "right" decision?
>> As far as I read, the chunk allocator used during balance is the same
>> one that is used during normal operation. If the allocator is already
>> sub-optimal during normal operations, what's the guarantee that it
>> will make a "better" decision during balancing?
>
> I don't really see any way that being possible in the raid1 profile. How
> can you fill all three devices if you can split data only twice? There
> will be a moment when two of three disks are full and BTRFS can't
> allocate a new raid1 block group because it has only one drive with
> unallocated space.
> >> When I say "right" and "better" I mean this:
> >> Drive1(8)  Drive2(3)  Drive3(3)
> >> X1         X1
> >> X2                    X2
> >> X3         X3
> >> X4                    X4
> >> I was convinced until now that the chunk allocator at least tries the
> >> best possible allocation. I'm sure it's complicated to develop a
> >> generic algorithm to fit all setups, but it should be possible.
>
> Problem is that each raid1 block group contains two chunks on two
> separate devices, it can't fully utilize three devices no matter what.
> If that doesn't suit you then you need to add a 4th disk. After
> that the FS will be able to use all unallocated space on all disks in
> the raid1 profile. But even then you'll be able to safely lose only one
> disk since BTRFS still will be storing only 2
Re: Help me understand what is going on with my RAID1 FS
> @Kai and Dmitrii
> thank you for your explanations. If I understand you correctly, you're
> saying that btrfs makes no attempt to "optimally" use the physical
> devices it has in the FS; once a new RAID1 block group needs to be
> allocated it will semi-randomly pick two devices with enough space and
> allocate two equal sized chunks, one on each. This new chunk may or
> may not fall onto my newly added 8 TB drive. Am I understanding this
> correctly?

If I remember correctly the chunk allocator allocates new chunks on the device which has the most unallocated space.

> Is there some sort of balance filter that would speed up this sort of
> balancing? Will balance be smart enough to make the "right" decision?
> As far as I read, the chunk allocator used during balance is the same
> one that is used during normal operation. If the allocator is already
> sub-optimal during normal operations, what's the guarantee that it
> will make a "better" decision during balancing?

I don't really see any way that being possible in the raid1 profile. How can you fill all three devices if you can split data only twice? There will be a moment when two of three disks are full and BTRFS can't allocate a new raid1 block group because it has only one drive with unallocated space.

> When I say "right" and "better" I mean this:
> Drive1(8)  Drive2(3)  Drive3(3)
> X1         X1
> X2                    X2
> X3         X3
> X4                    X4
> I was convinced until now that the chunk allocator at least tries the
> best possible allocation. I'm sure it's complicated to develop a
> generic algorithm to fit all setups, but it should be possible.

Problem is that each raid1 block group contains two chunks on two separate devices, it can't fully utilize three devices no matter what. If that doesn't suit you then you need to add a 4th disk. After that the FS will be able to use all unallocated space on all disks in the raid1 profile. But even then you'll be able to safely lose only one disk since BTRFS will still be storing only 2 copies of data.
This behavior is not relevant for single or raid0 profiles of multidevice BTRFS filesystems.
Re: Help me understand what is going on with my RAID1 FS
@Kai and Dmitrii,

thank you for your explanations. If I understand you correctly, you're saying that btrfs makes no attempt to "optimally" use the physical devices it has in the FS; once a new RAID1 block group needs to be allocated it will semi-randomly pick two devices with enough space and allocate two equal sized chunks, one on each. This new chunk may or may not fall onto my newly added 8 TB drive. Am I understanding this correctly?

> You will probably need to
> run balance once in a while to evenly redistribute allocated chunks
> across all disks.

Is there some sort of balance filter that would speed up this sort of balancing? Will balance be smart enough to make the "right" decision? As far as I read, the chunk allocator used during balance is the same one that is used during normal operation. If the allocator is already sub-optimal during normal operations, what's the guarantee that it will make a "better" decision during balancing?

When I say "right" and "better" I mean this:
Drive1(8)  Drive2(3)  Drive3(3)
X1         X1
X2                    X2
X3         X3
X4                    X4
I was convinced until now that the chunk allocator at least tries the best possible allocation. I'm sure it's complicated to develop a generic algorithm to fit all setups, but it should be possible.

On Sun, Sep 10, 2017 at 5:47 PM, Kai Krakow wrote:
> On Sun, 10 Sep 2017 15:45:42 +0200, FLJ wrote:
>
>> Hello all,
>>
>> I have a BTRFS RAID1 volume running for the past year. I avoided all
>> pitfalls known to me that would mess up this volume. I never
>> experimented with quotas, no-COW, snapshots, defrag, nothing really.
>> The volume is a RAID1 from day 1 and is working reliably until now.
>>
>> Until yesterday it consisted of two 3 TB drives, something along the
>> lines:
>>
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 2 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
>> devid 2 size 2.73TiB used 2.47TiB path /dev/sdc
>>
>> Yesterday I've added a new drive to the FS and did a full rebalance
>> (without filters) over night, which went through without any issues.
>>
>> Now I have:
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 3 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
>> devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
>> devid 3 size 7.28TiB used 2.48TiB path /dev/sda
>>
>> # btrfs fi df /mnt/BigVault/
>> Data, RAID1: total=2.47TiB, used=2.47TiB
>> System, RAID1: total=32.00MiB, used=384.00KiB
>> Metadata, RAID1: total=4.00GiB, used=2.74GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> But still df -h is giving me:
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>>
>> Although I've heard and read about the difficulty in reporting free
>> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
>> I only have a single volume, no subvolumes, no snapshots, no
>> quotas, and both data and metadata are RAID1.
>>
>> My expectation would've been that in the case of BigVault Size == Used +
>> Avail.
>>
>> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
>> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
>> but that only 1.5 TB is available is even stranger.
>>
>> Could anyone explain what I did wrong or why my expectations are
>> wrong?
>>
>> Thank you in advance
>
> Btrfs reports estimated free space from the free space of the smallest
> member as it can only guarantee that. In your case this is 2.73 minus
> 1.24 free, which is roughly around 1.5T.
> But since this free space
> distributes across three disks with one having much more free space, it
> probably will use up that space at half the rate of actual allocation.
> But due to how btrfs allocates from free space in chunks, that may not
> be possible - thus the unexpectedly low value. You will probably need to
> run balance once in a while to evenly redistribute allocated chunks
> across all disks.
>
> It may give you better estimates if you combine sdb and sdc into one
> logical device, e.g. using raid0 or jbod via md or lvm.
>
> --
> Regards,
> Kai
>
> Replies to list-only preferred.
Re: Help me understand what is going on with my RAID1 FS
> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
> but that only 1.5 TB is available is even stranger.
>
> Could anyone explain what I did wrong or why my expectations are wrong?
>
> Thank you in advance

I'd say df and the website calculate different things. In btrfs the raid1 profile stores exactly 2 copies of data, each copy on a separate device. So by adding a third drive, no matter how big, effective free space didn't expand, because btrfs still needs space on one of the other two drives to store the second half of each raid1 chunk stored on that third drive.

Basically:

Drive1  Drive2  Drive3
X       X
X               X
        X       X

Where X is a chunk of raid1 block group.
Re: Help me understand what is going on with my RAID1 FS
On Sun, 10 Sep 2017 15:45:42 +0200, FLJ wrote:

> Hello all,
>
> I have a BTRFS RAID1 volume running for the past year. I avoided all
> pitfalls known to me that would mess up this volume. I never
> experimented with quotas, no-COW, snapshots, defrag, nothing really.
> The volume is a RAID1 from day 1 and is working reliably until now.
>
> Until yesterday it consisted of two 3 TB drives, something along the
> lines:
>
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
> Total devices 2 FS bytes used 2.47TiB
> devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
> devid 2 size 2.73TiB used 2.47TiB path /dev/sdc
>
> Yesterday I've added a new drive to the FS and did a full rebalance
> (without filters) over night, which went through without any issues.
>
> Now I have:
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
> Total devices 3 FS bytes used 2.47TiB
> devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
> devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
> devid 3 size 7.28TiB used 2.48TiB path /dev/sda
>
> # btrfs fi df /mnt/BigVault/
> Data, RAID1: total=2.47TiB, used=2.47TiB
> System, RAID1: total=32.00MiB, used=384.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.74GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> But still df -h is giving me:
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>
> Although I've heard and read about the difficulty in reporting free
> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
> I only have a single volume, no subvolumes, no snapshots, no
> quotas, and both data and metadata are RAID1.
>
> My expectation would've been that in the case of BigVault Size == Used +
> Avail.
>
> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
> but that only 1.5 TB is available is even stranger.
>
> Could anyone explain what I did wrong or why my expectations are
> wrong?
>
> Thank you in advance

Btrfs reports estimated free space from the free space of the smallest member as it can only guarantee that. In your case this is 2.73 minus 1.24 free, which is roughly around 1.5T. But since this free space distributes across three disks, with one having much more free space, it probably will use up that space at half the rate of actual allocation. But due to how btrfs allocates from free space in chunks, that may not be possible - thus the unexpectedly low value. You will probably need to run balance once in a while to evenly redistribute allocated chunks across all disks.

It may give you better estimates if you combine sdb and sdc into one logical device, e.g. using raid0 or jbod via md or lvm.

--
Regards,
Kai

Replies to list-only preferred.
Re: netapp-alike snapshots?
On Sat, Sep 09, 2017 at 10:43:16PM +0300, Andrei Borzenkov wrote:
> On 09.09.2017 16:44, Ulli Horlacher wrote:
> >
> > Your tool does not create .snapshot subdirectories in EVERY directory like
>
> Neither does NetApp. Those "directories" are magic handles that do not
> really exist.

Correct, thanks for saving me typing the same thing (I actually did work at NetApp many years back, so I'm familiar with how they work).

> > Netapp does.
> > Example:
> >
> > framstag@fex:~: cd ~/Mail/.snapshot/
> > framstag@fex:~/Mail/.snapshot: l
> > lR-X - 2017-09-09 09:55 2017-09-09_.daily ->
> > /local/home/.snapshot/2017-09-09_.daily/framstag/Mail
>
> Apart from the obvious problem with recursive directory traversal (NetApp
> .snapshot are not visible with a normal directory list), those will also be
> captured in snapshots and cannot be removed. NetApp snapshots themselves
> do not expose .snapshot "directories".

Correct. NetApp knows this of course, which is why those .snapshot directories are "magic" and hidden to ls(1), find(1) and others when they do a readdir(3).

> > lR-X - 2017-09-09 14:00 2017-09-09_1400.hourly ->
> > /local/home/.snapshot/2017-09-09_1400.hourly/framstag/Mail
> > lR-X - 2017-09-09 15:00 2017-09-09_1500.hourly ->
> > /local/home/.snapshot/2017-09-09_1500.hourly/framstag/Mail
> > lR-X - 2017-09-09 15:18 2017-09-09_1518.single ->
> > /local/home/.snapshot/2017-09-09_1518.single/framstag/Mail
> > lR-X - 2017-09-09 15:20 2017-09-09_1520.single ->
> > /local/home/.snapshot/2017-09-09_1520.single/framstag/Mail
> > lR-X - 2017-09-09 15:22 2017-09-09_1522.single ->
> > /local/home/.snapshot/2017-09-09_1522.single/framstag/Mail
> >
> > My users (and I) need snapshots in this way.

You are used to them being there, I was too :) While you could create lots of symlinks, I opted not to since it would have littered the filesystem. I can simply cd $(SNAPROOT)/volname_hourly/$(PWD) and end up where I wanted to be.
I suppose you could make a snapcd shell function that does this for you. The only issue is that volname_hourly comes before the rest of the path, so you aren't given a list of all the snapshots available for a given path; you have to cd into the given snapshot first, and then add the path.

I agree it's not as nice as NetApp, but honestly I don't think you can do better with btrfs at this point.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Regarding handling of file renames in Btrfs
Hi, On 10.09.2017 08:45 Qu Wenruo wrote: > > > On 2017年09月10日 14:41, Qu Wenruo wrote: >> >> >> On 2017年09月10日 07:50, Rohan Kadekodi wrote: >>> Hello, >>> >>> I was trying to understand how file renames are handled in Btrfs. I >>> read the code documentation, but had a problem understanding a few >>> things. >>> >>> During a file rename, btrfs_commit_transaction() is called which is >>> because Btrfs has to commit the whole FS before storing the >>> information related to the new renamed file. It has to commit the FS >>> because a rename first does an unlink, which is not recorded in the >>> btrfs_rename() transaction and so is not logged in the log tree. Is my >>> understanding correct? If yes, my questions are as follows: >> >> Not familiar with rename kernel code, so not much help for rename >> opeartion. >> >>> >>> 1. What does committing the whole FS mean? >> >> Committing the whole fs means a lot of things, but generally >> speaking, it makes that the on-disk data is inconsistent with each >> other. > >> For obvious part, it writes modified fs/subvolume trees to disk (with >> handling of tree operations so no half modified trees). >> >> Also other trees like extent tree (very hot since every CoW will >> update it, and the most complicated one), csum tree if modified. >> >> After transaction is committed, the on-disk btrfs will represent the >> states when commit trans is called, and every tree should match each >> other. >> >> Despite of this, after a transaction is committed, generation of the >> fs get increased and modified tree blocks will have the same >> generation number. >> >>> Blktrace shows that there >>> are 2 256KB writes, which are essentially writes to the data of >>> the root directory of the file system (which I found out through >>> btrfs-debug-tree). >> >> I'd say you didn't check btrfs-debug-tree output carefully enough. >> I strongly recommend to do vimdiff to get what tree is modified. 
>> >> At least the following trees are modified: >> >> 1) fs/subvolume tree >> Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and >> updated inode time. >> So fs/subvolume tree must be CoWed. >> >> 2) extent tree >> CoW of above metadata operation will definitely cause extent >> allocation and freeing, extent tree will also get updated. >> >> 3) root tree >> Both extent tree and fs/subvolume tree modified, their root bytenr >> needs to be updated and root tree must be updated. >> >> And finally superblocks. >> >> I just verified the behavior with empty btrfs created on a 1G file, >> only one file to do the rename. >> >> In that case (with 4K sectorsize and 16K nodesize), the total IO >> should be (3 * 16K) * 2 + 4K * 2 = 104K. >> >> "3" = number of tree blocks get modified >> "16K" = nodesize >> 1st "*2" = DUP profile for metadata >> "4K" = superblock size >> 2nd "*2" = 2 superblocks for 1G fs. >> >> If your extent/root/fs trees have higher level, then more tree blocks >> needs to be updated. >> And if your fs is very large, you may have 3 superblocks. >> >>> Is this equivalent to doing a shell sync, as the >>> same block groups are written during a shell sync too? >> >> For shell "sync" the difference is that, "sync" will write all dirty >> data pages to disk, and then commit transaction. >> While only calling btrfs_commit_transacation() doesn't trigger dirty >> page writeback. >> >> So there is a difference. this conversation made me realize why btrfs has sub-optimal meta-data performance. Cow b-trees are not the best data structure for such small changes. In my application I have multiple operations (e.g. renames) which can be bundles up and (mostly) one writer. I guess using BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END would be one way to reduce the cow overhead, but those are dangerous wrt. to ENOSPC and there have been discussions about removing them. 
Best would be if there were delayed metadata, where metadata is handled the same as delayed allocations and data changes, i.e. commit on fsync, commit interval or fssync. I assumed this was already the case... Please correct me if I got this wrong.

Regards,
Martin Raiber
Re: Regarding handling of file renames in Btrfs
Thank you for the prompt and elaborate answers! However, I think I was unclear in my questions, and I apologize for the confusion. What I meant was that for a file rename, when I check the blktrace output, there are 2 writes of 256KB each starting from byte number: 13373440 When I check btrfs-debug-tree, I see that the following items are related to it: 1) root tree: key (256 EXTENT_DATA 0) itemoff 13649 itemsize 53 extent data disk byte 13373440 nr 262144 extent data offset 0 nr 262144 ram 262144 extent compression 0 2) extent tree: key (13373440 EXTENT_ITEM 262144) itemoff 15040 itemsize 53 extent refs 1 gen 12 flags DATA extent data backref root 1 objectid 256 offset 0 count 1 So this means that the extent allocated to the root folder (mount point) is getting written twice right? Here I am not talking about any metadata, but the data in the extent allocated to the root folder, that is inode number 256. When I was analyzing the code, I saw that these writes happened from btrfs_start_dirty_block_groups() which is in btrfs_commit_transaction(). This is the same thing that is getting written on a filesystem commit. So my questions were: 1) Why are there 2 256KB writes happening during a filesystem commit to the same location instead of just 1? Also, what exactly is written in the root folder of the file system? Again, I am talking about the data held in the extent allocated inode 256 and not about any metadata or any tree. 2) I understand by the on-disk format that all the child dir/inode info in one subvolume are in the same tree, but these writes that I am talking about are not to any tree, they to the data held in inode 256, which happens to be the mount point. So by root directory, I mean the mount point or the inode 256 (not any tree). And even though metadata wise there is no hierarchy as such in the file system, each folder data will only contain the data belonging to its children right? 
Hence my question was that why does the data in the extent allocated to inode 256 need to be rewritten instead of just the parent folder for a rename? Thanks, Rohan On 10 September 2017 at 01:45, Qu Wenruowrote: > > > On 2017年09月10日 14:41, Qu Wenruo wrote: >> >> >> >> On 2017年09月10日 07:50, Rohan Kadekodi wrote: >>> >>> Hello, >>> >>> I was trying to understand how file renames are handled in Btrfs. I >>> read the code documentation, but had a problem understanding a few >>> things. >>> >>> During a file rename, btrfs_commit_transaction() is called which is >>> because Btrfs has to commit the whole FS before storing the >>> information related to the new renamed file. It has to commit the FS >>> because a rename first does an unlink, which is not recorded in the >>> btrfs_rename() transaction and so is not logged in the log tree. Is my >>> understanding correct? If yes, my questions are as follows: >> >> >> Not familiar with rename kernel code, so not much help for rename >> opeartion. >> >>> >>> 1. What does committing the whole FS mean? >> >> >> Committing the whole fs means a lot of things, but generally speaking, it >> makes that the on-disk data is inconsistent with each other. > > ^consistent > Sorry for the typo. > > Thanks, > Qu > >> >> For obvious part, it writes modified fs/subvolume trees to disk (with >> handling of tree operations so no half modified trees). >> >> Also other trees like extent tree (very hot since every CoW will update >> it, and the most complicated one), csum tree if modified. >> >> After transaction is committed, the on-disk btrfs will represent the >> states when commit trans is called, and every tree should match each other. >> >> Despite of this, after a transaction is committed, generation of the fs >> get increased and modified tree blocks will have the same generation number. 
>>> Blktrace shows that there are 2 256KB writes, which are essentially
>>> writes to the data of the root directory of the file system (which I
>>> found out through btrfs-debug-tree).
>>
>> I'd say you didn't check the btrfs-debug-tree output carefully enough.
>> I strongly recommend doing a vimdiff to see which trees are modified.
>>
>> At least the following trees are modified:
>>
>> 1) fs/subvolume tree
>>    Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>    updates the inode times. So the fs/subvolume tree must be CoWed.
>>
>> 2) extent tree
>>    CoW of the above metadata will definitely cause extent allocation
>>    and freeing, so the extent tree also gets updated.
>>
>> 3) root tree
>>    Both the extent tree and the fs/subvolume tree were modified, so
>>    their root bytenrs need to be updated, and the root tree must be
>>    updated.
>>
>> And finally the superblocks.
>>
>> I just verified the behavior with an empty btrfs created on a 1G file,
>> with only one file to rename.
>>
>> In that case (with 4K sectorsize and 16K nodesize), the total IO
>> should be (3 * 16K) * 2 + 4K * 2 = 104K.
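Qu's 104K figure can be checked mechanically. A sketch, assuming the standard btrfs superblock mirror offsets of 64KiB, 64MiB and 256GiB (only the copies that fit inside the device get written); the function is illustrative, not kernel code:

```python
# Reproduce the minimum-IO estimate for a rename on an empty 1 GiB btrfs:
# 3 modified tree blocks (fs, extent and root trees) x 16K nodesize x 2
# (DUP metadata), plus 4K per superblock copy that fits in the device.

SUPERBLOCK_OFFSETS = [64 << 10, 64 << 20, 256 << 30]  # fixed btrfs mirror offsets

def rename_io_bytes(dev_size, nodesize=16 << 10, trees_modified=3, dup=2):
    supers = sum(1 for off in SUPERBLOCK_OFFSETS if off < dev_size)
    return trees_modified * nodesize * dup + supers * (4 << 10)

print(rename_io_bytes(1 << 30) // 1024)  # -> 104, matching (3 * 16K) * 2 + 4K * 2
```

For a 1 GiB device only the 64KiB and 64MiB copies fit, hence the "2 superblocks" in Qu's breakdown; a device larger than 256GiB gets all 3.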
Re: generic name for volume and subvolume root?
> As I am writing some documentation about creating snapshots:
> Is there a generic name for both volume and subvolume root?

Yes: from the UNIX side it is "root directory", and from the Btrfs side "subvolume".

Like some other things in Btrfs, the terminology is often inconsistent, but "volume" *usually* means "the set of devices [and contained root directories] with the same Btrfs 'fsid'". I think that the top-level subvolume should not be called the "volume": while there is no reason why a UNIX-like filesystem should be limited to a single block device, one of the fundamental properties of UNIX-like filesystems is that hard links are only possible (if at all possible) within a filesystem, and that 'statfs' returns a different "device id" per filesystem. Therefore a Btrfs volume is not properly a filesystem, but potentially a filesystem forest, as it may contain multiple filesystems, each with its own root directory.

> Is there a simple name for directories I can snapshot?

You can only snapshot *root directories*, of which in Btrfs there are two types: subvolumes (an unfortunate name perhaps) and snapshots.

In UNIX-like OSes every filesystem has a "root directory", and some filesystem types like Btrfs, NILFS2, and potentially JFS can have more than one; some can even mount more than one simultaneously. The root directory mounted as '/' is called the "system root directory". When unmounted, all filesystem root directories have no names, just an inode number. Conceivably the root inode of a UNIX-like filesystem could be an inode of any type, but I have never seen a recent UNIX-like OS able to mount anything other than a directory-type root inode (Plan 9 is not a UNIX-like OS :->).
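The 'statfs' device-id point above is easy to check from a script. A minimal sketch (the helper name is mine, not a standard API): two paths whose st_dev values differ are in different filesystems, which is exactly when hard links and plain rename(2) fail with EXDEV, and on Btrfs each separately mounted subvolume reports its own device id:

```python
import os

def same_filesystem(a, b):
    """True when both paths report the same device id, i.e. live in the
    same POSIX filesystem, so hard links between them are possible."""
    return os.stat(a).st_dev == os.stat(b).st_dev

print(same_filesystem("/", "/"))  # -> True
# On a Btrfs box, comparing '/' against a separately mounted subvolume
# would print False, even though both belong to one Btrfs "volume" (fsid).
```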
As someone else observed, the word "root" is overloaded in UNIX-like OS discourse, like the word "filesystem", and that's unfortunate, but it can always be resolved verbosely by using the appropriate qualifier: "root directory", "system root directory", "'root' user", "uid 0 capabilities", etc.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Help me understand what is going on with my RAID1 FS
Hello all,

I have a BTRFS RAID1 volume running for the past year. I avoided all pitfalls known to me that would mess up this volume: I never experimented with quotas, no-COW, snapshots, defrag, nothing really. The volume has been RAID1 from day 1 and has worked reliably until now.

Until yesterday it consisted of two 3 TB drives, something along the lines of:

Label: 'BigVault'  uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
        Total devices 2  FS bytes used 2.47TiB
        devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
        devid 2 size 2.73TiB used 2.47TiB path /dev/sdc

Yesterday I added a new drive to the FS and did a full rebalance (without filters) overnight, which went through without any issues. Now I have:

Label: 'BigVault'  uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
        Total devices 3  FS bytes used 2.47TiB
        devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
        devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
        devid 3 size 7.28TiB used 2.48TiB path /dev/sda

# btrfs fi df /mnt/BigVault/
Data, RAID1: total=2.47TiB, used=2.47TiB
System, RAID1: total=32.00MiB, used=384.00KiB
Metadata, RAID1: total=4.00GiB, used=2.74GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

But df -h is still giving me:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault

I've heard and read about the difficulty of reporting free space due to the flexibility of BTRFS, snapshots and subvolumes, etc., but I only have a single volume, no subvolumes, no snapshots, no quotas, and both data and metadata are RAID1. My expectation would have been that for BigVault, Size == Used + Avail. Actually, based on http://carfax.org.uk/btrfs-usage/index.html I would have expected 6 TB of usable space. Here I get 6.4, which is odd, but that only 1.5 TB is available is even stranger.

Could anyone explain what I did wrong, or why my expectations are wrong?
Thank you in advance
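As a side note, the carfax calculator's 6 TB figure for these drives can be reproduced by hand: each RAID1 chunk needs a copy on two distinct devices, so the largest device can never hold more data than all the others combined. An illustrative sketch of that estimate (not btrfs code):

```python
# Estimate usable RAID1 capacity: total/2, capped by the fact that every
# chunk on the largest device must be paired with a copy elsewhere.

def raid1_usable(sizes):
    total, largest = sum(sizes), max(sizes)
    return min(total / 2, total - largest)

# The three drives from above, in TiB: 2x 2.73 TiB + 1x 7.28 TiB.
print(round(raid1_usable([2.73, 2.73, 7.28]), 2))  # -> 5.46 (TiB)
```

So about 5.46 TiB, which is roughly 6.0 TB in decimal units, matching the calculator. Why df's "Avail" column disagrees is a separate question about how btrfs estimates free space for statfs.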
Re: btrfs check --repair now runs in minutes instead of hours? aborting
On Sun, Sep 10, 2017 at 02:01:58PM +0800, Qu Wenruo wrote:
> On 2017年09月10日 01:44, Marc MERLIN wrote:
>> So, should I assume that btrfs progs git has some issue, since there
>> is no plausible way that a check --repair should be faster than a
>> regular check?
>
> Yes, the assumption that repair should be no faster than an RO check is
> correct. Especially for a clean fs, repair should just behave the same
> as an RO check.
>
> And I'll first submit a patch (or patches) to output the consumed time
> for each tree, so we can get a clue about what is going wrong.
> (Digging the code is just a little too boring for me)

Cool. Let me know when I should sync and re-try.

In the meantime, though, my check --repair went back to 170mn after triggering an FS bug for Josef, so it seems back to normal.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)
Great, if the free space cache is fucked again after the next go-around then I need to expand the verifier to watch entries being added to the cache as well. Thanks,

Josef

Sent from my iPhone

> On Sep 10, 2017, at 9:14 AM, Marc MERLIN wrote:
>
>> On Sun, Sep 10, 2017 at 03:12:16AM +0000, Josef Bacik wrote:
>> Ok mount -o clear_cache, umount and run fsck again just to make sure.
>> Then if it comes out clean mount with ref_verify again and wait for it
>> to blow up again. Thanks,
>
> Ok, just did the 2nd fsck; it came back clean after mount -o clear_cache.
>
> I'll re-trigger the exact same bug and repeat the whole cycle then.
>
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)
On Sun, Sep 10, 2017 at 03:12:16AM +0000, Josef Bacik wrote:
> Ok mount -o clear_cache, umount and run fsck again just to make sure.
> Then if it comes out clean mount with ref_verify again and wait for it
> to blow up again. Thanks,

Ok, just did the 2nd fsck; it came back clean after mount -o clear_cache.

I'll re-trigger the exact same bug and repeat the whole cycle then.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: [PATCH] btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
On 2017年09月10日 19:19, Christophe JAILLET wrote:
> If 'btrfs_alloc_path()' fails, we must free the resources already
> allocated, as done in the other error handling paths in this function.
>
> Signed-off-by: Christophe JAILLET

Reviewed-by: Qu Wenruo

BTW, I also checked all the btrfs_alloc_path() calls in the self tests; no such leak remains.

Thanks,
Qu

> ---
>  fs/btrfs/tests/free-space-tree-tests.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/tests/free-space-tree-tests.c b/fs/btrfs/tests/free-space-tree-tests.c
> index 1458bb0ea124..8444a018cca2 100644
> --- a/fs/btrfs/tests/free-space-tree-tests.c
> +++ b/fs/btrfs/tests/free-space-tree-tests.c
> @@ -500,7 +500,8 @@ static int run_test(test_func_t test_func, int bitmaps, u32 sectorsize,
>  	path = btrfs_alloc_path();
>  	if (!path) {
>  		test_msg("Couldn't allocate path\n");
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto out;
>  	}
>
>  	ret = add_block_group_free_space(&trans, root->fs_info, cache);
[PATCH] btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
If 'btrfs_alloc_path()' fails, we must free the resources already allocated, as done in the other error handling paths in this function.

Signed-off-by: Christophe JAILLET
---
 fs/btrfs/tests/free-space-tree-tests.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tests/free-space-tree-tests.c b/fs/btrfs/tests/free-space-tree-tests.c
index 1458bb0ea124..8444a018cca2 100644
--- a/fs/btrfs/tests/free-space-tree-tests.c
+++ b/fs/btrfs/tests/free-space-tree-tests.c
@@ -500,7 +500,8 @@ static int run_test(test_func_t test_func, int bitmaps, u32 sectorsize,
 	path = btrfs_alloc_path();
 	if (!path) {
 		test_msg("Couldn't allocate path\n");
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}

 	ret = add_block_group_free_space(&trans, root->fs_info, cache);
--
2.11.0
Re: Please help with exact actions for raid1 hot-swap
On 10 September 2017 at 08:33, Marat Khalili wrote:
> It doesn't need replaced disk to be readable, right?

Only enough to be mountable, which it already is, so your read errors on /dev/sdb aren't a problem.

> Then what prevents same procedure to work without a spare bay?

It is basically the same procedure, but with a bunch of gotchas due to bugs and odd behaviour. Only having one shot at it, before the filesystem can only be mounted read-only, is especially problematic (this will be fixed in Linux 4.14).

> --
>
> With Best Regards,
> Marat Khalili
>
> On September 9, 2017 1:29:08 PM GMT+03:00, Patrik Lundquist wrote:
>> On 9 September 2017 at 12:05, Marat Khalili wrote:
>>> Forgot to add, I've got a spare empty bay if it can be useful here.
>>
>> That makes it much easier since you don't have to mount it degraded,
>> with the risks involved.
>>
>> Add and partition the disk.
>>
>> # btrfs replace start /dev/sdb7 /dev/sdc(?)7 /mnt/data
>>
>> Remove the old disk when it is done.
>>
>>> --
>>>
>>> With Best Regards,
>>> Marat Khalili
>>>
>>> On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili wrote:
>>>> Dear list,
>>>>
>>>> I'm going to replace one hard drive (partition actually) of a btrfs
>>>> raid1. Can you please spell out exactly what I need to do in order
>>>> to get my filesystem working as RAID1 again after replacement,
>>>> exactly as it was before? I saw some bad examples of drive
>>>> replacement on this list, so I'm afraid to just follow random
>>>> instructions on the wiki, and putting this system out of action even
>>>> temporarily would be very inconvenient.
> For this filesystem:
>
> $ sudo btrfs fi show /dev/sdb7
> Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
>         Total devices 2  FS bytes used 106.23GiB
>         devid 1 size 2.71TiB used 126.01GiB path /dev/sda7
>         devid 2 size 2.71TiB used 126.01GiB path /dev/sdb7
>
> $ grep /mnt/data /proc/mounts
> /dev/sda7 /mnt/data btrfs rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0
>
> $ sudo btrfs fi df /mnt/data
> Data, RAID1: total=123.00GiB, used=104.57GiB
> System, RAID1: total=8.00MiB, used=48.00KiB
> Metadata, RAID1: total=3.00GiB, used=1.67GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> $ uname -a
> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I've got this in dmesg:

> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 action 0x0
> [ +0.51] ata6.00: irq_stat 0x4008
> [ +0.29] ata6.00: failed command: READ FPDMA QUEUED
> [ +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 ncq 57344 in
>          res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask 0x409 (media error)
> [ +0.94] ata6.00: status: { DRDY ERR }
> [ +0.26] ata6.00: error: { UNC }
> [ +0.001195] ata6.00: configured for UDMA/133
> [ +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error [current] [descriptor]
> [ +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
> [ +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00 79 f3 6c 50 00 00 00 70 00 00
> [ +0.03] blk_update_request: I/O error, dev sdb, sector 2045996136
> [ +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
> [ +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> [ +0.77] ata6: EH complete

There's still 1 in the Current_Pending_Sector line of smartctl output as of now, so it probably won't heal by itself.
--

With Best Regards,
Marat Khalili
Re: netapp-alike snapshots?
Perhaps NetApp is using a VFS overlay. There is really only one snapshot, but it is shown in the overlay on every folder. Kind of the same as Samba Shadow Copies.

From: Ulli Horlacher
Sent: 2017-09-09 - 21:52

> On Sat 2017-09-09 (22:43), Andrei Borzenkov wrote:
>
>> > Your tool does not create .snapshot subdirectories in EVERY directory like
>>
>> Neither does NetApp. Those "directories" are magic handles that do not
>> really exist.
>
> I know.
> But symbolic links are the next closest thing (I am not a kernel programmer).
>
>> Apart from obvious problem with recursive directory traversal (NetApp
>> .snapshot are not visible with normal directory list)
>
> Yes, they are, at least sometimes, e.g. tar includes the snapshots.
>
> --
> Ullrich Horlacher          Server und Virtualisierung
> Rechenzentrum TIK
> Universitaet Stuttgart     E-Mail: horlac...@tik.uni-stuttgart.de
> Allmandring 30a            Tel: ++49-711-68565868
> 70569 Stuttgart (Germany)  WWW: http://www.tik.uni-stuttgart.de/
> REF:<14c87878-a5a0-d7d3-4a76-c55812e75...@gmail.com>
Re: Regarding handling of file renames in Btrfs
On 2017年09月10日 14:41, Qu Wenruo wrote:
>
> On 2017年09月10日 07:50, Rohan Kadekodi wrote:
>>
>> Hello,
>>
>> I was trying to understand how file renames are handled in Btrfs. I
>> read the code documentation, but had a problem understanding a few
>> things.
>>
>> During a file rename, btrfs_commit_transaction() is called, because
>> Btrfs has to commit the whole FS before storing the information
>> related to the renamed file. It has to commit the FS because a rename
>> first does an unlink, which is not recorded in the btrfs_rename()
>> transaction and so is not logged in the log tree. Is my understanding
>> correct? If yes, my questions are as follows:
>
> Not familiar with the rename kernel code, so not much help for the
> rename operation.
>
>> 1. What does committing the whole FS mean?
>
> Committing the whole fs means a lot of things, but generally speaking,
> it makes that the on-disk data is inconsistent with each other.

^consistent
Sorry for the typo.

Thanks,
Qu

> For the obvious part, it writes modified fs/subvolume trees to disk
> (with handling of tree operations, so there are no half-modified
> trees).
>
> Also other trees like the extent tree (very hot, since every CoW will
> update it, and the most complicated one), and the csum tree if
> modified.
>
> After the transaction is committed, the on-disk btrfs will represent
> the state when commit was called, and every tree should match the
> others.
>
> Besides this, after a transaction is committed, the generation of the
> fs gets increased and modified tree blocks will have the same
> generation number.
>
>> Blktrace shows that there are 2 256KB writes, which are essentially
>> writes to the data of the root directory of the file system (which I
>> found out through btrfs-debug-tree).
>
> I'd say you didn't check the btrfs-debug-tree output carefully enough.
> I strongly recommend doing a vimdiff to see which trees are modified.
>
> At least the following trees are modified:
>
> 1) fs/subvolume tree
>    Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>    updates the inode times. So the fs/subvolume tree must be CoWed.
> 2) extent tree
>    CoW of the above metadata will definitely cause extent allocation
>    and freeing, so the extent tree also gets updated.
>
> 3) root tree
>    Both the extent tree and the fs/subvolume tree were modified, so
>    their root bytenrs need to be updated, and the root tree must be
>    updated.
>
> And finally the superblocks.
>
> I just verified the behavior with an empty btrfs created on a 1G file,
> with only one file to rename.
>
> In that case (with 4K sectorsize and 16K nodesize), the total IO
> should be (3 * 16K) * 2 + 4K * 2 = 104K.
>
> "3" = number of tree blocks that get modified
> "16K" = nodesize
> 1st "*2" = DUP profile for metadata
> "4K" = superblock size
> 2nd "*2" = 2 superblocks for a 1G fs.
>
> If your extent/root/fs trees have a higher level, then more tree
> blocks need to be updated. And if your fs is very large, you may have
> 3 superblocks.
>
>> Is this equivalent to doing a shell sync, as the same block groups
>> are written during a shell sync too?
>
> For shell "sync" the difference is that "sync" will write all dirty
> data pages to disk, and then commit the transaction, while only
> calling btrfs_commit_transaction() doesn't trigger dirty page
> writeback. So there is a difference.
>
> And furthermore, if there is nothing modified at all, sync will just
> skip the fs, so btrfs_commit_transaction() is not ensured if you call
> "sync".
>
>> Also, does it imply that all the metadata held by the log tree is now
>> checkpointed to the respective trees?
>
> The log tree part is a little tricky, as the log tree is not really a
> journal for btrfs. Btrfs uses CoW for metadata, so in theory (and in
> fact) btrfs doesn't need any journal. The log tree is mainly used to
> enhance btrfs fsync performance.
>
> You can totally disable the log tree with the notreelog mount option
> and btrfs will behave just fine.
>
> And furthermore, I'm not very familiar with the log tree; I need to
> check the code to see if the log tree is used in rename, so I can't
> say much right now. But to make things easy, I strongly recommend
> ignoring the log tree for now.
>
>> 2.
>> Why are there 2 complete writes to the data held by the root
>> directory and not just 1? These writes are 256KB each, which is the
>> size of the extent allocated to the root directory.
>
> Check my first calculation and verify the debug-tree output before and
> after the rename. I think there are some extra factors affecting the
> number, from the tree height to your fs tree organization.
>
>> 3. Why are the writes being done to the root directory of the file
>> system / subvolume and not just the parent directory where the unlink
>> happened?
>
> That's why I strongly recommend understanding the btrfs on-disk format
> first. A lot of things can be answered after understanding the on-disk
> layout, without asking any other guys.
>
> The short answer is, btrfs puts all its child dir/inode info into one
> tree for one subvolume. (And the term "root directory" here is a
> little confusing: are you talking about the fs tree root or the root
> tree?) Not the
Re: Regarding handling of file renames in Btrfs
On 2017年09月10日 07:50, Rohan Kadekodi wrote:
>
> Hello,
>
> I was trying to understand how file renames are handled in Btrfs. I
> read the code documentation, but had a problem understanding a few
> things.
>
> During a file rename, btrfs_commit_transaction() is called, because
> Btrfs has to commit the whole FS before storing the information
> related to the renamed file. It has to commit the FS because a rename
> first does an unlink, which is not recorded in the btrfs_rename()
> transaction and so is not logged in the log tree. Is my understanding
> correct? If yes, my questions are as follows:

Not familiar with the rename kernel code, so not much help for the rename operation.

> 1. What does committing the whole FS mean?

Committing the whole fs means a lot of things, but generally speaking, it makes that the on-disk data is inconsistent with each other.

For the obvious part, it writes modified fs/subvolume trees to disk (with handling of tree operations, so there are no half-modified trees).

Also other trees like the extent tree (very hot, since every CoW will update it, and the most complicated one), and the csum tree if modified.

After the transaction is committed, the on-disk btrfs will represent the state when commit was called, and every tree should match the others.

Besides this, after a transaction is committed, the generation of the fs gets increased and modified tree blocks will have the same generation number.

> Blktrace shows that there are 2 256KB writes, which are essentially
> writes to the data of the root directory of the file system (which I
> found out through btrfs-debug-tree).

I'd say you didn't check the btrfs-debug-tree output carefully enough. I strongly recommend doing a vimdiff to see which trees are modified.

At least the following trees are modified:

1) fs/subvolume tree
   Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and updates the inode times. So the fs/subvolume tree must be CoWed.
2) extent tree
   CoW of the above metadata will definitely cause extent allocation and freeing, so the extent tree also gets updated.

3) root tree
   Both the extent tree and the fs/subvolume tree were modified, so their root bytenrs need to be updated, and the root tree must be updated.

And finally the superblocks.

I just verified the behavior with an empty btrfs created on a 1G file, with only one file to rename.

In that case (with 4K sectorsize and 16K nodesize), the total IO should be (3 * 16K) * 2 + 4K * 2 = 104K.

"3" = number of tree blocks that get modified
"16K" = nodesize
1st "*2" = DUP profile for metadata
"4K" = superblock size
2nd "*2" = 2 superblocks for a 1G fs.

If your extent/root/fs trees have a higher level, then more tree blocks need to be updated. And if your fs is very large, you may have 3 superblocks.

> Is this equivalent to doing a shell sync, as the same block groups are
> written during a shell sync too?

For shell "sync" the difference is that "sync" will write all dirty data pages to disk, and then commit the transaction, while only calling btrfs_commit_transaction() doesn't trigger dirty page writeback. So there is a difference.

And furthermore, if there is nothing modified at all, sync will just skip the fs, so btrfs_commit_transaction() is not ensured if you call "sync".

> Also, does it imply that all the metadata held by the log tree is now
> checkpointed to the respective trees?

The log tree part is a little tricky, as the log tree is not really a journal for btrfs. Btrfs uses CoW for metadata, so in theory (and in fact) btrfs doesn't need any journal. The log tree is mainly used to enhance btrfs fsync performance.

You can totally disable the log tree with the notreelog mount option and btrfs will behave just fine.

And furthermore, I'm not very familiar with the log tree; I need to check the code to see if the log tree is used in rename, so I can't say much right now. But to make things easy, I strongly recommend ignoring the log tree for now.

> 2.
> Why are there 2 complete writes to the data held by the root directory
> and not just 1? These writes are 256KB each, which is the size of the
> extent allocated to the root directory.

Check my first calculation and verify the debug-tree output before and after the rename. I think there are some extra factors affecting the number, from the tree height to your fs tree organization.

> 3. Why are the writes being done to the root directory of the file
> system / subvolume and not just the parent directory where the unlink
> happened?

That's why I strongly recommend understanding the btrfs on-disk format first. A lot of things can be answered after understanding the on-disk layout, without asking any other guys.

The short answer is, btrfs puts all its child dir/inode info into one tree for one subvolume. (And the term "root directory" here is a little confusing: are you talking about the fs tree root or the root tree?) Not the common one-tree-per-inode layout. So if you rename one file in a subvolume, the subvolume tree gets CoWed, which means from the
Re: Please help with exact actions for raid1 hot-swap
It doesn't need replaced disk to be readable, right? Then what prevents same procedure to work without a spare bay?

--

With Best Regards,
Marat Khalili

On September 9, 2017 1:29:08 PM GMT+03:00, Patrik Lundquist wrote:
> On 9 September 2017 at 12:05, Marat Khalili wrote:
>> Forgot to add, I've got a spare empty bay if it can be useful here.
>
> That makes it much easier since you don't have to mount it degraded,
> with the risks involved.
>
> Add and partition the disk.
>
> # btrfs replace start /dev/sdb7 /dev/sdc(?)7 /mnt/data
>
> Remove the old disk when it is done.
>
>> On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili wrote:
>>> Dear list,
>>>
>>> I'm going to replace one hard drive (partition actually) of a btrfs
>>> raid1. Can you please spell out exactly what I need to do in order to
>>> get my filesystem working as RAID1 again after replacement, exactly
>>> as it was before? I saw some bad examples of drive replacement on
>>> this list, so I'm afraid to just follow random instructions on the
>>> wiki, and putting this system out of action even temporarily would be
>>> very inconvenient.
>>> For this filesystem:
>>>
>>> $ sudo btrfs fi show /dev/sdb7
>>> Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
>>>         Total devices 2  FS bytes used 106.23GiB
>>>         devid 1 size 2.71TiB used 126.01GiB path /dev/sda7
>>>         devid 2 size 2.71TiB used 126.01GiB path /dev/sdb7
>>>
>>> $ grep /mnt/data /proc/mounts
>>> /dev/sda7 /mnt/data btrfs rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0
>>>
>>> $ sudo btrfs fi df /mnt/data
>>> Data, RAID1: total=123.00GiB, used=104.57GiB
>>> System, RAID1: total=8.00MiB, used=48.00KiB
>>> Metadata, RAID1: total=3.00GiB, used=1.67GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> $ uname -a
>>> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> I've got this in dmesg:
>>>
>>> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 action 0x0
>>> [ +0.51] ata6.00: irq_stat 0x4008
>>> [ +0.29] ata6.00: failed command: READ FPDMA QUEUED
>>> [ +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 ncq 57344 in
>>>          res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask 0x409 (media error)
>>> [ +0.94] ata6.00: status: { DRDY ERR }
>>> [ +0.26] ata6.00: error: { UNC }
>>> [ +0.001195] ata6.00: configured for UDMA/133
>>> [ +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>>> [ +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error [current] [descriptor]
>>> [ +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
>>> [ +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00 79 f3 6c 50 00 00 00 70 00 00
>>> [ +0.03] blk_update_request: I/O error, dev sdb, sector 2045996136
>>> [ +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
>>> [ +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
>>> [ +0.77] ata6: EH complete
>>>
>>> There's still 1 in the Current_Pending_Sector line of smartctl output as of now, so it probably won't heal by itself.
>>>
>>> --
>>>
>>> With Best Regards,
>>> Marat Khalili
Re: btrfs check --repair now runs in minutes instead of hours? aborting
On 2017年09月10日 01:44, Marc MERLIN wrote:
> So, should I assume that btrfs progs git has some issue, since there
> is no plausible way that a check --repair should be faster than a
> regular check?

Yes, the assumption that repair should be no faster than an RO check is correct. Especially for a clean fs, repair should just behave the same as an RO check.

And I'll first submit a patch (or patches) to output the consumed time for each tree, so we can get a clue about what is going wrong.
(Digging the code is just a little too boring for me)

Thanks,
Qu

> Thanks,
> Marc
>
> On Tue, Sep 05, 2017 at 07:45:25AM -0700, Marc MERLIN wrote:
>> On Tue, Sep 05, 2017 at 04:05:04PM +0800, Qu Wenruo wrote:
>>>> gargamel:~# btrfs fi df /mnt/btrfs_pool1
>>>> Data, single: total.60TiB, used.54TiB
>>>> System, DUP: total2.00MiB, used=1.19MiB
>>>> Metadata, DUP: totalX.00GiB, used.69GiB
>>>
>>> Wait for a minute.
>>> Is that .69GiB means 706 MiB?
>>> Or my email client/GMX screwed up the format (again)?
>>>
>>> This output format must be changed, at least to 0.69 GiB, or 706 MiB.
>>
>> Email client problem. I see control characters in what you quoted.
>> Let's try again:
>>
>> gargamel:~# btrfs fi df /mnt/btrfs_pool1
>> Data, single: total=10.66TiB, used=10.60TiB        => 10TB
>> System, DUP: total=64.00MiB, used=1.20MiB          => 1.2MB
>> Metadata, DUP: total=57.50GiB, used=12.76GiB       => 13GB
>> GlobalReserve, single: total=512.00MiB, used=0.00B => 0
>>
>>> You mean lowmem is actually FASTER than original mode?
>>> That's very surprising.
>>
>> Correct, unless I add --repair, and then original mode is 2x faster
>> than lowmem.
>>
>>> Is there any special operation done for that btrfs?
>>> Like offline dedupe or tons of reflinks?
>>
>> In this case, no.
>> Note that btrfs check used to take many hours overnight until I did a
>> git pull of btrfs progs and built the latest from TOT.
>>
>>> BTW, how many subvolumes do you have in the fs?
>>
>> gargamel:/mnt/btrfs_pool1# btrfs subvolume list . | wc -l
>> 91
>>
>> If I remove snapshots for btrfs send and historical 'backups':
>> gargamel:/mnt/btrfs_pool1# btrfs subvolume list .
>>   | grep -Ev '(hourly|daily|weekly|rw|ro)' | wc -l
>> 5
>>
>>> This looks like a bug.
>>> My first guess is related to the number of subvolumes/reflinks, but
>>> I'm not sure since I don't have many real-world btrfs.
>>> I'll take some time to look into it.
>>>
>>> Thanks for the very interesting report,
>>
>> Thanks for having a look :)
>>
>> Marc