Re: btrfs_remove_chunk call trace?
...and can it be related to the Samsung 840 SSDs not supporting NCQ Trim? (Although I can't tell which device this trace is from -- it could be a mechanical Western Digital.)

On Sun, Sep 10, 2017 at 10:16 PM, Rich Rauenzahn wrote:
> Is this something to be concerned about?
>
> I'm running the latest mainline kernel on CentOS 7.
>
> [ 1338.882288] [ cut here ]
> [ 1338.883058] WARNING: CPU: 2 PID: 790 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
> [ 1338.883809] Modules linked in: xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype overlay loop nf_conntrack_ftp nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_comment xt_recent xt_multiport xt_conntrack iptable_filter xt_REDIRECT nf_nat_redirect iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nct6775 nf_nat nf_conntrack hwmon_vid jc42 vfat fat dm_mirror dm_region_hash dm_log dm_mod dax xfs libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel pcbc snd_hda_codec aesni_intel snd_hda_core iTCO_wdt snd_hwdep crypto_simd glue_helper cryptd iTCO_vendor_support snd_seq mei_wdt snd_seq_device intel_cstate cdc_acm snd_pcm intel_rapl_perf
> [ 1338.888639] input_leds snd_timer lpc_ich i2c_i801 pcspkr intel_pch_thermal snd mfd_core sg mei_me soundcore mei acpi_pad shpchp ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables btrfs xor raid6_pq sd_mod crc32c_intel ahci e1000e libahci firewire_ohci igb i915 dca firewire_core ptp i2c_algo_bit crc_itu_t libata pps_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm video
> [ 1338.891412] CPU: 2 PID: 790 Comm: btrfs-cleaner Tainted: G W 4.13.1-1.el7.elrepo.x86_64 #1
> [ 1338.892171] Hardware name: Supermicro X10SAE/X10SAE, BIOS 2.0a 05/09/2014
> [ 1338.892884] task: 880407cec5c0 task.stack: c90002624000
> [ 1338.893613] RIP: 0010:btrfs_update_device+0x1c5/0x1d0 [btrfs]
> [ 1338.894299] RSP: 0018:c90002627d00 EFLAGS: 00010206
> [ 1338.894956] RAX: 0fff RBX: 880407cd9930 RCX: 01d1c1011e00
> [ 1338.895647] RDX: 8800 RSI: 880336e80f9e RDI: 88028395bd88
> [ 1338.896304] RBP: c90002627d48 R08: 3fc2 R09: c90002627cb8
> [ 1338.896934] R10: 1000 R11: 0003 R12: 880405f68c00
> [ 1338.897618] R13: R14: 88028395bd88 R15: 3f9e
> [ 1338.898251] FS: () GS:88041fa8() knlGS:
> [ 1338.898867] CS: 0010 DS: ES: CR0: 80050033
> [ 1338.899522] CR2: 7ff82f2cb000 CR3: 01c09000 CR4: 001406e0
> [ 1338.900157] DR0: DR1: DR2:
> [ 1338.900772] DR3: DR6: fffe0ff0 DR7: 0400
> [ 1338.901402] Call Trace:
> [ 1338.902017] btrfs_remove_chunk+0x2fb/0x8b0 [btrfs]
> [ 1338.902673] btrfs_delete_unused_bgs+0x363/0x440 [btrfs]
> [ 1338.903304] cleaner_kthread+0x150/0x180 [btrfs]
> [ 1338.903908] kthread+0x109/0x140
> [ 1338.904593] ? btree_invalidatepage+0xa0/0xa0 [btrfs]
> [ 1338.905207] ? kthread_park+0x60/0x60
> [ 1338.905803] ret_from_fork+0x25/0x30
> [ 1338.906416] Code: 10 00 00 00 4c 89 fe e8 8a 30 ff ff 4c 89 f7 e8 32 f6 fc ff e9 d3 fe ff ff b8 f4 ff ff ff e9 d4 fe ff ff 0f 1f 00 e8 cb 3e be e0 <0f> ff eb af 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 31 d2 be 02
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: Output time elapsed for each major tree it checked
Marc reported that "btrfs check --repair" runs much faster than "btrfs check", which is quite weird.

This patch will add the time elapsed for each major tree it checked, for both original mode and lowmem mode, so we can have a clue what's going wrong.

Reported-by: Marc MERLIN
Signed-off-by: Qu Wenruo
---
 cmds-check.c | 21 +++++++++++++++++++--
 utils.h      | 24 ++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 006edbde..fee806cd 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -5318,13 +5318,16 @@ static int do_check_fs_roots(struct btrfs_fs_info *fs_info,
 			     struct cache_tree *root_cache)
 {
 	int ret;
+	struct timer timer;
 
 	if (!ctx.progress_enabled)
 		fprintf(stderr, "checking fs roots\n");
+	start_timer(&timer);
 	if (check_mode == CHECK_MODE_LOWMEM)
 		ret = check_fs_roots_v2(fs_info);
 	else
 		ret = check_fs_roots(fs_info, root_cache);
+	printf("done in %d seconds\n", stop_timer(&timer));
 
 	return ret;
 }
@@ -11584,14 +11587,16 @@ out:
 
 static int do_check_chunks_and_extents(struct btrfs_fs_info *fs_info)
 {
 	int ret;
+	struct timer timer;
 
 	if (!ctx.progress_enabled)
 		fprintf(stderr, "checking extents\n");
+	start_timer(&timer);
 	if (check_mode == CHECK_MODE_LOWMEM)
 		ret = check_chunks_and_extents_v2(fs_info);
 	else
 		ret = check_chunks_and_extents(fs_info);
-
+	printf("done in %d seconds\n", stop_timer(&timer));
 	return ret;
 }
@@ -12772,6 +12777,7 @@ int cmd_check(int argc, char **argv)
 	int qgroups_repaired = 0;
 	unsigned ctree_flags = OPEN_CTREE_EXCLUSIVE;
 	int force = 0;
+	struct timer timer;
 
 	while(1) {
 		int c;
@@ -12953,8 +12959,11 @@ int cmd_check(int argc, char **argv)
 	if (repair)
 		ctree_flags |= OPEN_CTREE_PARTIAL;
 
+	printf("opening btrfs filesystem\n");
+	start_timer(&timer);
 	info = open_ctree_fs_info(argv[optind], bytenr, tree_root_bytenr,
 				  chunk_root_bytenr, ctree_flags);
+	printf("done in %d seconds\n", stop_timer(&timer));
 	if (!info) {
 		error("cannot open file system");
 		ret = -EIO;
@@ -13115,8 +13124,10 @@ int cmd_check(int argc, char **argv)
 		else
 			fprintf(stderr, "checking free space cache\n");
 	}
+	start_timer(&timer);
 	ret = check_space_cache(root);
 	err |= !!ret;
+	printf("done in %d seconds\n", stop_timer(&timer));
 	if (ret) {
 		if (btrfs_fs_compat_ro(info, FREE_SPACE_TREE))
 			error("errors found in free space tree");
@@ -13140,18 +13151,22 @@ int cmd_check(int argc, char **argv)
 	}
 
 	fprintf(stderr, "checking csums\n");
+	start_timer(&timer);
 	ret = check_csums(root);
 	err |= !!ret;
+	printf("done in %d seconds\n", stop_timer(&timer));
 	if (ret) {
 		error("errors found in csum tree");
 		goto out;
 	}
 
-	fprintf(stderr, "checking root refs\n");
 	/* For low memory mode, check_fs_roots_v2 handles root refs */
 	if (check_mode != CHECK_MODE_LOWMEM) {
+		fprintf(stderr, "checking root refs\n");
+		start_timer(&timer);
 		ret = check_root_refs(root, &root_cache);
 		err |= !!ret;
+		printf("done in %d seconds\n", stop_timer(&timer));
 		if (ret) {
 			error("errors found in root refs");
 			goto out;
@@ -13186,8 +13201,10 @@ int cmd_check(int argc, char **argv)
 
 	if (info->quota_enabled) {
 		fprintf(stderr, "checking quota groups\n");
+		start_timer(&timer);
 		ret = qgroup_verify_all(info);
 		err |= !!ret;
+		printf("done in %d seconds\n", stop_timer(&timer));
 		if (ret) {
 			error("failed to check quota groups");
 			goto out;
diff --git a/utils.h b/utils.h
index d28a05a6..159487db 100644
--- a/utils.h
+++ b/utils.h
@@ -172,4 +172,28 @@ u64 rand_u64(void);
 unsigned int rand_range(unsigned int upper);
 void init_rand_seed(u64 seed);
 
+/* Utils to report time duration */
+struct timer {
+	time_t start;
+};
+
+static inline void start_timer(struct timer *t)
+{
+	time(&t->start);
+}
+
+/*
+ * Stop timer and return the time elapsed as int
+ *
+ * int should be large enough for "btrfs check" and avoids
+ * type converting.
+ */
+static inline int stop_timer(struct timer *t)
+{
+	time_t end;
+
+	time(&end);
+
+	return (int)(difftime(end, t->start));
+}
 #endif
-- 
2.14.1
Re: Regarding handling of file renames in Btrfs
On 2017-09-10 22:34, Martin Raiber wrote:
> Hi,
>
> On 10.09.2017 08:45 Qu Wenruo wrote:
>> On 2017-09-10 14:41, Qu Wenruo wrote:
>>> On 2017-09-10 07:50, Rohan Kadekodi wrote:
>>>> Hello,
>>>>
>>>> I was trying to understand how file renames are handled in Btrfs. I
>>>> read the code documentation, but had a problem understanding a few
>>>> things. During a file rename, btrfs_commit_transaction() is called,
>>>> which is because Btrfs has to commit the whole FS before storing the
>>>> information related to the new renamed file. It has to commit the FS
>>>> because a rename first does an unlink, which is not recorded in the
>>>> btrfs_rename() transaction and so is not logged in the log tree. Is
>>>> my understanding correct? If yes, my questions are as follows:
>>>
>>> Not familiar with the rename kernel code, so not much help for the
>>> rename operation.
>>>
>>>> 1. What does committing the whole FS mean?
>>>
>>> Committing the whole fs means a lot of things, but generally speaking,
>>> it makes the on-disk data consistent with each other.
>>>
>>> For the obvious part, it writes modified fs/subvolume trees to disk
>>> (with handling of tree operations so there are no half-modified
>>> trees), and also other trees like the extent tree (very hot since
>>> every CoW will update it, and the most complicated one) and the csum
>>> tree if modified.
>>>
>>> After the transaction is committed, the on-disk btrfs will represent
>>> the state at the time commit trans was called, and every tree should
>>> match each other.
>>>
>>> Besides this, after a transaction is committed, the generation of the
>>> fs gets increased and modified tree blocks will have the same
>>> generation number.
>>>
>>>> Blktrace shows that there are 2 256KB writes, which are essentially
>>>> writes to the data of the root directory of the file system (which I
>>>> found out through btrfs-debug-tree).
>>>
>>> I'd say you didn't check the btrfs-debug-tree output carefully enough.
>>> I strongly recommend doing a vimdiff to see what trees are modified.
>>>
>>> At least the following trees are modified:
>>>
>>> 1) fs/subvolume tree
>>>    Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>>    updates the inode time. So the fs/subvolume tree must be CoWed.
>>>
>>> 2) extent tree
>>>    CoW of the above metadata will definitely cause extent allocation
>>>    and freeing, so the extent tree will also get updated.
>>>
>>> 3) root tree
>>>    Both the extent tree and fs/subvolume tree are modified, so their
>>>    root bytenr needs to be updated and the root tree must be updated.
>>>
>>> And finally the superblocks.
>>
>> I just verified the behavior with an empty btrfs created on a 1G file,
>> with only one file to do the rename. In that case (with 4K sectorsize
>> and 16K nodesize), the total IO should be (3 * 16K) * 2 + 4K * 2 = 104K.
>>
>> "3" = number of tree blocks that get modified
>> "16K" = nodesize
>> 1st "*2" = DUP profile for metadata
>> "4K" = superblock size
>> 2nd "*2" = 2 superblocks for a 1G fs.
>>
>> If your extent/root/fs trees have a higher level, then more tree
>> blocks need to be updated. And if your fs is very large, you may have
>> 3 superblocks.
>>
>>>> Is this equivalent to doing a shell sync, as the same block groups
>>>> are written during a shell sync too?
>>>
>>> For shell "sync" the difference is that "sync" will write all dirty
>>> data pages to disk, and then commit the transaction, while only
>>> calling btrfs_commit_transaction() doesn't trigger dirty page
>>> writeback. So there is a difference.
>
> this conversation made me realize why btrfs has sub-optimal meta-data
> performance. Cow b-trees are not the best data structure for such
> small changes. In my application I have multiple operations (e.g.
> renames) which can be bundled up, and (mostly) one writer.

Things are more complicated in fact.

For example, even if you are only renaming/moving one file, you're going to modify at least 6 items. They are:

1) Removing the DIR_INDEX of the original parent dir inode
   Assume the original parent dir inode number is 300.
   We are removing (300 DIR_INDEX ).

2) Removing the DIR_ITEM of the original parent dir inode
   We are removing (300 DIR_ITEM )

3) Removing the INODE_REF of the renamed inode
   Assume the renamed inode number is 400.
   We are removing (400 INODE_REF 300).

4) Adding a new DIR_INDEX to the new parent dir inode
   Assume the new parent dir inode number is 500.
   We are adding (500 DIR_INDEX )

5) Adding a new DIR_ITEM to the new parent dir inode
   We are adding (500 DIR_ITEM )

6) Adding a new INODE_REF to the renamed inode
   We are adding (400 INODE_REF 500)

As you can see, there are 6 key modifications, and we can't ensure they are all in one leaf. In the worst case, we need to CoW the tree 6 times for different leaves. (Although a CoWed tree block won't be CoWed again until written to disk, which reduces the overhead.)

And even more: if you modify one tree, you must also modify the ROOT_ITEM pointing to that tree, which leads to root tree CoW.

I have a crazy idea of double-buffering tree blocks. That's to say, one tree block actually consists of 2 real tree blocks, and when CoW happens, we just switch to the other tree block. Then we don't really need to update its parent pointer, so we can limit the CoW-affected range to a minimum. But it's
[PATCH] btrfs-progs: update btrfs-completion
This patch updates btrfs-completion:

- add "filesystem du" and "rescue zero-log"
- restrict _btrfs_mnts to show btrfs type only
- add more completion in the last case statements

(This file contains both spaces/tabs and may need cleanup.)

Signed-off-by: Tomohiro Misono
---
 btrfs-completion | 43 +++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 35 insertions(+), 8 deletions(-)

diff --git a/btrfs-completion b/btrfs-completion
index 3ede77b..1f00add 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -16,7 +16,7 @@ _btrfs_mnts()
 {
 	local MNTS
 	MNTS=''
 	while read mnt; do MNTS+="$mnt "
-	done < <(mount | awk '{print $3}')
+	done < <(mount -t btrfs | awk '{print $3}')
 	COMPREPLY+=( $( compgen -W "$MNTS" -- "$cur" ) )
 }
@@ -31,11 +31,11 @@ _btrfs()
 
 	commands='subvolume filesystem balance device scrub check rescue restore inspect-internal property send receive quota qgroup replace help version'
 	commands_subvolume='create delete list snapshot find-new get-default set-default show sync'
-	commands_filesystem='defragment sync resize show df label usage'
+	commands_filesystem='defragment sync resize show df du label usage'
 	commands_balance='start pause cancel resume status'
 	commands_device='scan add delete remove ready stats usage'
 	commands_scrub='start cancel resume status'
-	commands_rescue='chunk-recover super-recover'
+	commands_rescue='chunk-recover super-recover zero-log'
 	commands_inspect_internal='inode-resolve logical-resolve subvolid-resolve rootid min-dev-size dump-tree dump-super tree-stats'
 	commands_property='get set list'
 	commands_quota='enable disable rescan'
@@ -114,6 +114,10 @@ _btrfs()
 			_filedir
 			return 0
 			;;
+		df|usage)
+			_btrfs_mnts
+			return 0
+			;;
 		label)
 			_btrfs_mnts
 			_btrfs_devs
@@ -125,6 +129,26 @@ _btrfs()
 			_btrfs_devs
 			return 0
 			;;
+		inspect-internal)
+			case $prev in
+				min-dev-size)
+					_btrfs_mnts
+					return 0
+					;;
+				rootid)
+					_filedir
+					return 0
+					;;
+			esac
+			;;
+		receive)
+			case $prev in
+				-f)
+					_filedir
+					return 0
+					;;
+			esac
+			;;
 		replace)
 			case $prev in
 				status|cancel)
@@ -137,14 +161,17 @@ _btrfs()
 					;;
 			esac
 			;;
+		subvolume)
+			case $prev in
+				list)
+					_btrfs_mnts
+					return 0
+					;;
+			esac
+			;;
 		esac
 	fi
 
-	if [[ "$cmd" == "receive" && "$prev" == "-f" ]]; then
-		_filedir
-		return 0
-	fi
-
 	_filedir -d
 	return 0
 }
-- 
2.9.5
btrfs_remove_chunk call trace?
Is this something to be concerned about?

I'm running the latest mainline kernel on CentOS 7.

[ 1338.882288] [ cut here ]
[ 1338.883058] WARNING: CPU: 2 PID: 790 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1c5/0x1d0 [btrfs]
[ 1338.883809] Modules linked in: xt_nat veth ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_addrtype overlay loop nf_conntrack_ftp nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_comment xt_recent xt_multiport xt_conntrack iptable_filter xt_REDIRECT nf_nat_redirect iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nct6775 nf_nat nf_conntrack hwmon_vid jc42 vfat fat dm_mirror dm_region_hash dm_log dm_mod dax xfs libcrc32c x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic irqbypass crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel pcbc snd_hda_codec aesni_intel snd_hda_core iTCO_wdt snd_hwdep crypto_simd glue_helper cryptd iTCO_vendor_support snd_seq mei_wdt snd_seq_device intel_cstate cdc_acm snd_pcm intel_rapl_perf
[ 1338.888639] input_leds snd_timer lpc_ich i2c_i801 pcspkr intel_pch_thermal snd mfd_core sg mei_me soundcore mei acpi_pad shpchp ie31200_edac nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables btrfs xor raid6_pq sd_mod crc32c_intel ahci e1000e libahci firewire_ohci igb i915 dca firewire_core ptp i2c_algo_bit crc_itu_t libata pps_core drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm video
[ 1338.891412] CPU: 2 PID: 790 Comm: btrfs-cleaner Tainted: G W 4.13.1-1.el7.elrepo.x86_64 #1
[ 1338.892171] Hardware name: Supermicro X10SAE/X10SAE, BIOS 2.0a 05/09/2014
[ 1338.892884] task: 880407cec5c0 task.stack: c90002624000
[ 1338.893613] RIP: 0010:btrfs_update_device+0x1c5/0x1d0 [btrfs]
[ 1338.894299] RSP: 0018:c90002627d00 EFLAGS: 00010206
[ 1338.894956] RAX: 0fff RBX: 880407cd9930 RCX: 01d1c1011e00
[ 1338.895647] RDX: 8800 RSI: 880336e80f9e RDI: 88028395bd88
[ 1338.896304] RBP: c90002627d48 R08: 3fc2 R09: c90002627cb8
[ 1338.896934] R10: 1000 R11: 0003 R12: 880405f68c00
[ 1338.897618] R13: R14: 88028395bd88 R15: 3f9e
[ 1338.898251] FS: () GS:88041fa8() knlGS:
[ 1338.898867] CS: 0010 DS: ES: CR0: 80050033
[ 1338.899522] CR2: 7ff82f2cb000 CR3: 01c09000 CR4: 001406e0
[ 1338.900157] DR0: DR1: DR2:
[ 1338.900772] DR3: DR6: fffe0ff0 DR7: 0400
[ 1338.901402] Call Trace:
[ 1338.902017] btrfs_remove_chunk+0x2fb/0x8b0 [btrfs]
[ 1338.902673] btrfs_delete_unused_bgs+0x363/0x440 [btrfs]
[ 1338.903304] cleaner_kthread+0x150/0x180 [btrfs]
[ 1338.903908] kthread+0x109/0x140
[ 1338.904593] ? btree_invalidatepage+0xa0/0xa0 [btrfs]
[ 1338.905207] ? kthread_park+0x60/0x60
[ 1338.905803] ret_from_fork+0x25/0x30
[ 1338.906416] Code: 10 00 00 00 4c 89 fe e8 8a 30 ff ff 4c 89 f7 e8 32 f6 fc ff e9 d3 fe ff ff b8 f4 ff ff ff e9 d4 fe ff ff 0f 1f 00 e8 cb 3e be e0 <0f> ff eb af 0f 1f 80 00 00 00 00 0f 1f 44 00 00 55 31 d2 be 02
Re: Help me understand what is going on with my RAID1 FS
10.09.2017 23:17, Dmitrii Tcvetkov wrote:
>>> Drive1  Drive2  Drive3
>>> X       X
>>> X               X
>>>         X       X
>>>
>>> Where X is a chunk of raid1 block group.
>>
>> But this table clearly shows that adding the third drive increases free
>> space by 50%. You need to reallocate data to actually make use of it,
>> but it was done in this case.
>
> It increases it but I don't see how this space is in any way useful
> unless data is in single profile. After a full balance chunks will be
> spread over 3 devices; how does it help in the raid1 data profile case?

A1 A2  =>  A1 A2 -  =>  A1 A2 B1  =>  A1 A2 B1
B1 B2      B1 B2 -      -  B2 -       C1 B2 C2

It is raid1 profile on three disks, fully utilizing them (assuming equal sizes, of course). Where "raid1" means: each data block has two copies on different devices.
Re: Regarding handling of file renames in Btrfs
On 2017-09-10 22:32, Rohan Kadekodi wrote:
> Thank you for the prompt and elaborate answers! However, I think I was
> unclear in my questions, and I apologize for the confusion.
>
> What I meant was that for a file rename, when I check the blktrace
> output, there are 2 writes of 256KB each starting from byte number
> 13373440. When I check btrfs-debug-tree, I see that the following
> items are related to it:
>
> 1) root tree:
>    key (256 EXTENT_DATA 0) itemoff 13649 itemsize 53
>    extent data disk byte 13373440 nr 262144
>    extent data offset 0 nr 262144 ram 262144
>    extent compression 0
>
> 2) extent tree:
>    key (13373440 EXTENT_ITEM 262144) itemoff 15040 itemsize 53
>    extent refs 1 gen 12 flags DATA
>    extent data backref root 1 objectid 256 offset 0 count 1
>
> So this means that the extent allocated to the root folder (mount
> point) is getting written twice, right? Here I am not talking about
> any metadata, but the data in the extent allocated to the root folder,
> that is inode number 256.

Such extent data is used by the free space cache. If using the nospace_cache or space_cache=v2 mount option, there will be no such thing.

The free space cache is used for recording free and used space for each chunk (or block group, which is mostly the same thing). Since CoW happens for the metadata chunk, its used/free space mapping gets modified and then the free space cache will also be updated.

BTW, some difference in term usage makes me a little confused. Personally speaking, we call root 1 the "tree root" or "root tree", not the root directory, as in fact such a tree doesn't contain any real file/directory.

> When I was analyzing the code, I saw that these writes happened from
> btrfs_start_dirty_block_groups(), which is in
> btrfs_commit_transaction(). This is the same thing that is getting
> written on a filesystem commit. So my questions were:
>
> 1) Why are there 2 256KB writes happening during a filesystem commit
> to the same location instead of just 1? Also, what exactly is written
> in the root folder of the file system? Again, I am talking about the
> data held in the extent allocated to inode 256 and not about any
> metadata or any tree.

As stated above, the EXTENT_DATA in the root tree is for the space cache (v1), which uses a NoCOW file extent to record free space. And there is such a space cache for each block group.

Furthermore, since it's EXTENT_DATA, it counts as DATA, so it follows your data profile (default to single for a single device and RAID0 for multiple devices). If not using DUP as the data profile, then you have 2 block groups getting modified.

> 2) I understand by the on-disk format that all the child dir/inode
> info in one subvolume are in the same tree, but these writes that I am
> talking about are not to any tree, they are to the data held in inode
> 256, which happens to be the mount point. So by root directory, I mean
> the mount point or inode 256 (not any tree).

As mentioned before, it's better to call it the "root tree" as it doesn't really represent a directory.

> And even though metadata-wise there is no hierarchy as such in the
> file system, each folder's data will only contain the data belonging
> to its children, right?

The sentence is confusing to me now. By "folder" did you mean a normal directory? And how do you define "data belonging to its children"?

As stated before, there is no real boundary for an inode (including normal files and directories). All inode data (including EXTENT_DATA for a regular file and DIR_INDEX/DIR_ITEM for a directory inode) are just sequential keys (with their data) in a subvolume. So without your definition of "belonging to" I can't get the point.

> Hence my question was: why does the data in the extent allocated to
> inode 256 need to be rewritten, instead of just the parent folder, for
> a rename?

My first paragraph explained this.

BTW, for your concerned EXTENT_DATA in root 1 (root tree), it's used by the following sequence (BTRFS_ prefix omitted; all keys are in root 1):

(FREE_SPACE_OBJECTID, 0, )
Its structure, btrfs_free_space_header, contains a key referring to an inode, which is a regular file inode. The inode key will be (, INODE_ITEM, 0).

Then, still in the tree root (rootid 1), search using the (, INODE_ITEM, 0) key to locate the free space cache inode.

Finally btrfs will just read the data stored for this inode, using its (, EXTENT_DATA, ) to locate its real data on disk, and read it out.

For details like what the space cache looks like, you need to check the free space cache code. (In short, it's a mess, so we have space_cache=v2, which uses a normal btrfs B-tree to store such info, and btrfs-debug-tree can show it easily.)

And of course, for a transaction commit, each dirty block group needs to update its free space cache. The free space cache file has the NODATACOW flag, so the free space cache itself has its own checksum mechanism, and normally the whole free space cache file is updated.

Thanks,
Qu

> Thanks,
> Rohan
>
> On 10 September 2017 at 01:45, Qu Wenruo
Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)
On Sun, Sep 10, 2017 at 01:16:26PM +, Josef Bacik wrote:
> Great, if the free space cache is fucked again after the next go
> around then I need to expand the verifier to watch entries being added
> to the cache as well. Thanks,

Well, I copied about 1TB of data, and nothing happened. So it seems clearing it and fsck may have fixed this fault I had been carrying for quite a while. If so, yeah!

I'm not sure if this needs a kernel fix to not get triggered, and if btrfs check should also be improved to catch this, but hopefully you know what makes sense there.

Thanks,
Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Help me understand what is going on with my RAID1 FS
FLJ posted on Sun, 10 Sep 2017 15:45:42 +0200 as excerpted:

> I have a BTRFS RAID1 volume running for the past year. I avoided all
> pitfalls known to me that would mess up this volume. I never
> experimented with quotas, no-COW, snapshots, defrag, nothing really.
> The volume is a RAID1 from day 1 and is working reliably until now.
>
> Until yesterday it consisted of two 3 TB drives, something along the
> lines:
>
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>     Total devices 2 FS bytes used 2.47TiB
>     devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
>     devid 2 size 2.73TiB used 2.47TiB path /dev/sdc

I'm going to try a different approach than I see in the two existing subthreads, so I started from scratch with my own subthread...

So the above looks reasonable so far...

> Yesterday I've added a new drive to the FS and did a full rebalance
> (without filters) over night, which went through without any issues.
>
> Now I have:
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>     Total devices 3 FS bytes used 2.47TiB
>     devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
>     devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
>     devid 3 size 7.28TiB used 2.48TiB path /dev/sda

That's exactly as expected after a balance.

Note the size, 2.73 TiB (twos-power) for the smaller two, not 3 (tho it's probably 3 TB, tens-power), and 7.28 TiB, not 8, for the larger one.

The most-free-space chunk allocation, with raid1-paired chunks, means the first chunk of every pair will get allocated to the largest, 7.28 TiB device. The other two devices are equal in size, 2.73 TiB each, and the second chunk can't get allocated to the largest device, as only one chunk of the pair can go there, so the allocator will in general alternate allocations from the smaller two for the second chunk of each pair.
(I say in general, because metadata chunks are smaller than data chunks, so it's possible that two chunks in a row, a metadata chunk and a data chunk, will be allocated from the same device before it switches to the other.)

Because the larger device is larger than the other two combined, it'll always get one copy, while the others fill up evenly at half the usage of the larger device, until both smaller devices are full, at which point you won't be able to allocate further raid1 chunks and you'll get ENOSPC.

> # btrfs fi df /mnt/BigVault/
> Data, RAID1: total=2.47TiB, used=2.47TiB
> System, RAID1: total=32.00MiB, used=384.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.74GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

Still looks reasonable.

Note that assuming you're using a reasonably current btrfs-progs, there are also the btrfs fi usage and btrfs dev usage commands. Btrfs fi df is an older form that has much less information than the fi and dev usage commands, tho between btrfs fi show and btrfs fi df, /most/ of the filesystem-level information in btrfs fi usage can be deduced, tho not necessarily the device-level detail. Btrfs fi usage is thus preferred, assuming it's available to you. (In addition to btrfs fi usage being newer, both it and btrfs fi df require a mounted btrfs. If the filesystem refuses to mount, btrfs fi show may be all that's available.)

While I'm digressing: I'm guessing you know this already, but for others, the global reserve is reserved from and comes out of metadata, so you can add the global reserve total to metadata used. Normally, btrfs won't use anything from the global reserve, so usage there will be zero.
If it's not, that's a very strong indication that your filesystem believes it is very short on space (even if data and metadata say they both have lots of unused space left -- for some reason, very likely a bug in that case, the filesystem believes otherwise), and you need to take corrective action immediately, or risk the filesystem effectively going read-only when nothing else can be written.

> But still df -h is giving me:
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>
> Although I've heard and read about the difficulty in reporting free
> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
> but I only have a single volume, no subvolumes, no snapshots, no quotas
> and both data and metadata are RAID1.

The most practical advice I've seen regarding "normal" df (that is, the one from coreutils, not btrfs fi df) in the case of uneven device sizes in particular is: simply ignore its numbers -- they're not reliable. The only thing you need to be sure of is that it says you have enough space for whatever you're actually doing ATM, since various applications will trust its numbers and may refuse to do whatever filesystem operation at all if it says there's not enough space.

The algorithm reasonably new coreutils df (and the kernel calls it depends on) uses is much better
Re: Help me understand what is going on with my RAID1 FS
On Sun, 10 Sep 2017 20:15:52 +0200, Ferenc-Levente Juhos wrote:
>> Problem is that each raid1 block group contains two chunks on two
>> separate devices, it can't fully utilize three devices no matter
>> what. If that doesn't suit you then you need to add a 4th disk. After
>> that the FS will be able to use all unallocated space on all disks in
>> raid1 profile. But even then you'll be able to safely lose only one
>> disk since BTRFS will still be storing only 2 copies of data.
>
> I hope I didn't say that I want to utilize all three devices fully. It
> was clear to me that there will be 2 TB of wasted space.
> Also I'm not questioning the chunk allocator for RAID1 at all. It's
> clear and it always has been clear that for RAID1 the chunks need to
> be allocated on different physical devices.
> If I understood Kai's point of view, he even suggested that I might
> need to do balancing to make sure that the free space on the three
> devices is being used smartly. Hence the questions about balancing.

It will allocate chunks from the device with the most space available. So while you fill your disks, space usage will distribute evenly. The problem comes when you start deleting stuff: some chunks may even be freed, and everything becomes messed up. In an aging file system you may notice that the chunks are no longer evenly distributed.

A balance is a way to fix that, because it will reallocate chunks and coalesce data back into fewer chunks, making free space for new allocations. In this process it will actually evenly distribute your data again. You may want to use this rebalance script: https://www.spinics.net/lists/linux-btrfs/msg52076.html

> I mean in the worst case it could happen like this:
>
> Again I have disks of sizes 3, 3, 8:
> Fig. 1
> Drive1(8)  Drive2(3)  Drive3(3)
> -          X1         X1
> -          X2         X2
> -          X3         X3
>
> Here the new drive is completely unused. Even if one X1 chunk would be
> on Drive1 it would still be a sub-optimal allocation.

This won't happen while filling a fresh btrfs.
Chunks are always allocated from the device with the most free space (within the raid1 constraints). Thus it will allocate space alternating between disk1+2 and disk1+3.

> This is the optimal allocation. Will btrfs allocate like this?
> Considering that Drive1 has the most free space.
> Fig. 2
> Drive1(8)  Drive2(3)  Drive3(3)
> X1         X1         -
> X2         -          X2
> X3         X3         -
> X4         -          X4

Yes.

> From my point of view Fig. 2 shows the optimal allocation: by the time
> the disks Drive2 and Drive3 are full (3 TB), Drive1 must have 6 TB
> (because it is exclusively holding the mirrors for both Drive2 and 3).
> For sure now btrfs can say, since two of the drives are completely
> full, it can't allocate any more chunks, and the remaining 2 TB of
> space from Drive1 is wasted. This is clear; it's even pointed out by
> the btrfs size calculator.

Yes.

> But again, if the above statements are true, then df might as well tell
> the "truth" and report that I have 3.5 TB of space free and not 1.5 TB
> (as it is reported now). Again here I fully understand Kai's
> explanation. Because coming back to my first e-mail, my "problem" was
> that df is reporting 1.5 TB free, whereas the whole FS holds 2.5 TB of
> data.

The size calculator has undergone some revisions. I think it currently estimates the free space from the net-data-to-raw-data ratio across all devices, taking the current raid constraints into account. Calculating free space in btrfs is difficult because in the future btrfs may even support different raid levels for different subvolumes. It's probably best to calculate for the worst-case scenario then.

Even today it's already difficult if you use different raid levels for metadata and content data: the filesystem cannot predict the future of allocations. It can only give an educated guess. And the calculation was revised a few times to not "overshoot".
> So the question still remains, is it just that df is intentionally not
> smart enough to give a more accurate estimation,

The df utility doesn't know anything about btrfs allocations. The value is estimated by btrfs itself. To get more detailed info for capacity planning, you should use "btrfs fi df" and its various siblings.

> or is the assumption
> that the allocator picks the drive with most free space mistaken?
> If I continue along the lines of what Kai said, and I need to do a
> re-balance because the allocation is not like shown above (Fig.2),
> then my question is still legitimate. Are there any filters that one
> might use to speed up or to selectively balance in my case? Or will I
> need to do a full balance?

Your assumption is misguided. The total free space estimation is a totally different thing than what the allocator bases its decision on. See "btrfs dev usage". The allocator uses space from the biggest unallocated space
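The allocator behavior discussed in this thread (each new raid1 block group takes space from the two devices with the most unallocated space) can be sketched with a tiny simulation. This is an illustrative model only, not btrfs code; `allocate_raid1` is a made-up helper and the 1 TB chunk granularity is purely for readability:

```python
def allocate_raid1(sizes, chunk=1):
    """Greedy raid1 allocation sketch: each block group takes `chunk`
    from the two devices with the most unallocated space."""
    free = list(sizes)
    placements = []
    while True:
        # Rank devices by unallocated space, largest first.
        order = sorted(range(len(free)), key=lambda i: free[i], reverse=True)
        a, b = order[0], order[1]
        if free[b] < chunk:
            break  # fewer than two devices have room: allocation stops
        free[a] -= chunk
        free[b] -= chunk
        placements.append((a, b))
    return placements, free

# Disks of 8, 3 and 3 TB, allocating 1 TB chunks:
placements, leftover = allocate_raid1([8, 3, 3])
print(len(placements), leftover)  # → 6 [2, 0, 0]
```

Six block groups fit, i.e. 6 TB of mirrored data, with 2 TB stranded on the big disk: the greedy rule reproduces Fig. 2 rather than Fig. 1.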
Re: Help me understand what is going on with my RAID1 FS
> > Drive1  Drive2  Drive3
> > X       X
> > X               X
> >         X       X
> >
> > Where X is a chunk of raid1 block group.
>
> But this table clearly shows that adding the third drive increases free
> space by 50%. You need to reallocate data to actually make use of it,
> but it was done in this case.

It increases it, but I don't see how this space is in any way useful unless data is in the single profile. After a full balance chunks will be spread over 3 devices; how does that help in the raid1 data profile case?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Help me understand what is going on with my RAID1 FS
On 10.09.2017 19:11, Dmitrii Tcvetkov wrote:
>> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
>> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
>> but that only 1.5 TB is available is even stranger.
>>
>> Could anyone explain what I did wrong or why my expectations are wrong?
>>
>> Thank you in advance
>
> I'd say df and the website calculate different things. In btrfs the raid1 profile
> stores exactly 2 copies of data, each copy on a separate device.
> So by adding a third drive, no matter how big, effective free space didn't
> expand, because btrfs still needs space on one of the other two drives to
> store the second half of each raid1 chunk stored on that third drive.
>
> Basically:
>
> Drive1  Drive2  Drive3
> X       X
> X               X
>         X       X
>
> Where X is a chunk of raid1 block group.

But this table clearly shows that adding the third drive increases free space by 50%. You need to reallocate data to actually make use of it, but it was done in this case.
Re: Help me understand what is going on with my RAID1 FS
On 10.09.2017 18:47, Kai Krakow wrote:
> On Sun, 10 Sep 2017 15:45:42 +0200, FLJ wrote:
>
>> Hello all,
>>
>> I have a BTRFS RAID1 volume running for the past year. I avoided all
>> pitfalls known to me that would mess up this volume. I never
>> experimented with quotas, no-COW, snapshots, defrag, nothing really.
>> The volume is a RAID1 from day 1 and is working reliably until now.
>>
>> Until yesterday it consisted of two 3 TB drives, something along the
>> lines:
>>
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 2 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
>> devid 2 size 2.73TiB used 2.47TiB path /dev/sdc
>>
>> Yesterday I've added a new drive to the FS and did a full rebalance
>> (without filters) over night, which went through without any issues.
>>
>> Now I have:
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 3 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
>> devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
>> devid 3 size 7.28TiB used 2.48TiB path /dev/sda
>>
>> # btrfs fi df /mnt/BigVault/
>> Data, RAID1: total=2.47TiB, used=2.47TiB
>> System, RAID1: total=32.00MiB, used=384.00KiB
>> Metadata, RAID1: total=4.00GiB, used=2.74GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> But still df -h is giving me:
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>>
>> Although I've heard and read about the difficulty in reporting free
>> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
>> I only have a single volume, no subvolumes, no snapshots, no
>> quotas, and both data and metadata are RAID1.
>>
>> My expectation would've been that in the case of BigVault Size == Used +
>> Avail.
>>
>> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
>> would've expected 6 TB of usable space.
>> Here I get 6.4 which is odd,

Total size is an estimation, which in this case is computed as (sum of device sizes)/2, which is approximately 6.4 TiB.

>> but that only 1.5 TB is available is even stranger.
>>
>> Could anyone explain what I did wrong or why my expectations are
>> wrong?
>>
>> Thank you in advance
>
> Btrfs reports estimated free space from the free space of the smallest
> member as it can only guarantee that.

It's not exactly true. For three devices with free space of 1 TiB, 2 TiB and 3 TiB it would return 2 TiB as available space. But it is not sophisticated enough to notice that it actually has 3 TiB available. I wonder if this is only the free space calculation, or whether the actual allocation algorithm behaves similarly (effectively ignoring part of the available space).
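Andrei's 1/2/3 TiB example can be checked against the usual pairing bound for mirrored allocation (the same reasoning the carfax calculator applies to raid1): attainable capacity is capped both by half the total free space and by how much the other devices can mirror against the largest one. This is a back-of-the-envelope sketch, not the estimator btrfs actually implements, and `raid1_attainable` is a made-up name:

```python
def raid1_attainable(free):
    """Upper bound on raid1 data fitting on devices with the given
    unallocated space: every chunk lands on two different devices, so
    capacity is min(total/2, total - largest)."""
    total = sum(free)
    return min(total // 2, total - max(free))

print(raid1_attainable([1, 2, 3]))  # → 3 (while btrfs reports only 2)
print(raid1_attainable([8, 3, 3]))  # → 6 (2 left stranded on the big disk)
```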
Re: Help me understand what is going on with my RAID1 FS
> Problem is that each raid1 block group contains two chunks on two
> separate devices, it can't fully utilize three devices no matter what.
> If that doesn't suit you then you need to add a 4th disk. After
> that the FS will be able to use all unallocated space on all disks in
> the raid1 profile. But even then you'll be able to safely lose only one
> disk since BTRFS will still be storing only 2 copies of data.

I hope I didn't say that I want to utilize all three devices fully. It was clear to me that there would be 2 TB of wasted space. Also, I'm not questioning the chunk allocator for RAID1 at all. It's clear, and it always has been clear, that for RAID1 the chunks need to be allocated on different physical devices. If I understood Kai's point of view, he even suggested that I might need to do balancing to make sure that the free space on the three devices is being used smartly. Hence the questions about balancing.

I mean in the worst case it could happen like this:

Again I have disks of sizes 3, 3, 8:

Fig.1
Drive1(8)  Drive2(3)  Drive3(3)
-          X1         X1
-          X2         X2
-          X3         X3

Here the new drive is completely unused. Even if one X1 chunk were on Drive1 it would still be a sub-optimal allocation.

This is the optimal allocation. Will btrfs allocate like this? Considering that Drive1 has the most free space.

Fig. 2
Drive1(8)  Drive2(3)  Drive3(3)
X1         X1         -
X2         -          X2
X3         X3         -
X4         -          X4

From my point of view Fig.2 shows the optimal allocation: by the time the disks Drive2 and Drive3 are full (3 TB), Drive1 must have 6 TB (because it is exclusively holding the mirrors for both Drive2 and 3). For sure btrfs can now say, since two of the drives are completely full, that it can't allocate any more chunks, and the remaining 2 TB of space from Drive1 is wasted. This is clear; it's even pointed out by the btrfs size calculator.

But again, if the above statements are true, then df might as well tell the "truth" and report that I have 3.5 TB of space free and not 1.5 TB (as it is reported now). Again, here I fully understand Kai's explanation.
Because, coming back to my first e-mail, my "problem" was that df is reporting 1.5 TB free, whereas the whole FS holds 2.5 TB of data.

So the question still remains: is it just that df is intentionally not smart enough to give a more accurate estimation, or is the assumption that the allocator picks the drive with the most free space mistaken? If I continue along the lines of what Kai said, and I need to do a re-balance because the allocation is not like shown above (Fig.2), then my question is still legitimate. Are there any filters that one might use to speed up or to selectively balance in my case? Or will I need to do a full balance?

On Sun, Sep 10, 2017 at 7:19 PM, Dmitrii Tcvetkov wrote:
>> @Kai and Dmitrii
>> thank you for your explanations. If I understand you correctly, you're
>> saying that btrfs makes no attempt to "optimally" use the physical
>> devices it has in the FS; once a new RAID1 block group needs to be
>> allocated it will semi-randomly pick two devices with enough space and
>> allocate two equal sized chunks, one on each. This new chunk may or
>> may not fall onto my newly added 8 TB drive. Am I understanding this
>> correctly?
>
> If I remember correctly the chunk allocator allocates new chunks on the
> device which has the most unallocated space.
>
>> Is there some sort of balance filter that would speed up this sort of
>> balancing? Will balance be smart enough to make the "right" decision?
>> As far as I read, the chunk allocator used during balance is the same
>> one that is used during normal operation. If the allocator is already
>> sub-optimal during normal operations, what's the guarantee that it
>> will make a "better" decision during balancing?
>
> I don't really see any way that being possible in the raid1 profile. How
> can you fill all three devices if you can split data only twice? There
> will be a moment when two of three disks are full and BTRFS can't
> allocate a new raid1 block group because it has only one drive with
> unallocated space.
> >> When I say "right" and "better" I mean this:
> >> Drive1(8)  Drive2(3)  Drive3(3)
> >> X1         X1
> >> X2                    X2
> >> X3         X3
> >> X4                    X4
> >> I was convinced until now that the chunk allocator at least tries the
> >> best possible allocation. I'm sure it's complicated to develop a
> >> generic algorithm to fit all setups, but it should be possible.
>
> Problem is that each raid1 block group contains two chunks on two
> separate devices, it can't fully utilize three devices no matter what.
> If that doesn't suit you then you need to add a 4th disk. After
> that the FS will be able to use all unallocated space on all disks in
> the raid1 profile. But even then you'll be able to safely lose only one
> disk since BTRFS still will be storing only 2
Re: Help me understand what is going on with my RAID1 FS
> @Kai and Dmitrii
> thank you for your explanations. If I understand you correctly, you're
> saying that btrfs makes no attempt to "optimally" use the physical
> devices it has in the FS; once a new RAID1 block group needs to be
> allocated it will semi-randomly pick two devices with enough space and
> allocate two equal sized chunks, one on each. This new chunk may or
> may not fall onto my newly added 8 TB drive. Am I understanding this
> correctly?

If I remember correctly the chunk allocator allocates new chunks on the device which has the most unallocated space.

> Is there some sort of balance filter that would speed up this sort of
> balancing? Will balance be smart enough to make the "right" decision?
> As far as I read, the chunk allocator used during balance is the same
> one that is used during normal operation. If the allocator is already
> sub-optimal during normal operations, what's the guarantee that it
> will make a "better" decision during balancing?

I don't really see any way that being possible in the raid1 profile. How can you fill all three devices if you can split data only twice? There will be a moment when two of three disks are full and BTRFS can't allocate a new raid1 block group because it has only one drive with unallocated space.

> When I say "right" and "better" I mean this:
> Drive1(8)  Drive2(3)  Drive3(3)
> X1         X1
> X2                    X2
> X3         X3
> X4                    X4
> I was convinced until now that the chunk allocator at least tries the
> best possible allocation. I'm sure it's complicated to develop a
> generic algorithm to fit all setups, but it should be possible.

Problem is that each raid1 block group contains two chunks on two separate devices, it can't fully utilize three devices no matter what. If that doesn't suit you then you need to add a 4th disk. After that the FS will be able to use all unallocated space on all disks in the raid1 profile. But even then you'll be able to safely lose only one disk since BTRFS will still be storing only 2 copies of data.
This behavior is not relevant for single or raid0 profiles of multidevice BTRFS filesystems.
Re: Help me understand what is going on with my RAID1 FS
@Kai and Dmitrii,

thank you for your explanations. If I understand you correctly, you're saying that btrfs makes no attempt to "optimally" use the physical devices it has in the FS; once a new RAID1 block group needs to be allocated it will semi-randomly pick two devices with enough space and allocate two equal sized chunks, one on each. This new chunk may or may not fall onto my newly added 8 TB drive. Am I understanding this correctly?

> You will probably need to
> run balance once in a while to evenly redistribute allocated chunks
> across all disks.

Is there some sort of balance filter that would speed up this sort of balancing? Will balance be smart enough to make the "right" decision? As far as I read, the chunk allocator used during balance is the same one that is used during normal operation. If the allocator is already sub-optimal during normal operations, what's the guarantee that it will make a "better" decision during balancing?

When I say "right" and "better" I mean this:
Drive1(8)  Drive2(3)  Drive3(3)
X1         X1
X2                    X2
X3         X3
X4                    X4
I was convinced until now that the chunk allocator at least tries the best possible allocation. I'm sure it's complicated to develop a generic algorithm to fit all setups, but it should be possible.

On Sun, Sep 10, 2017 at 5:47 PM, Kai Krakow wrote:
> On Sun, 10 Sep 2017 15:45:42 +0200, FLJ wrote:
>
>> Hello all,
>>
>> I have a BTRFS RAID1 volume running for the past year. I avoided all
>> pitfalls known to me that would mess up this volume. I never
>> experimented with quotas, no-COW, snapshots, defrag, nothing really.
>> The volume is a RAID1 from day 1 and is working reliably until now.
>>
>> Until yesterday it consisted of two 3 TB drives, something along the
>> lines:
>>
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 2 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
>> devid 2 size 2.73TiB used 2.47TiB path /dev/sdc
>>
>> Yesterday I've added a new drive to the FS and did a full rebalance
>> (without filters) over night, which went through without any issues.
>>
>> Now I have:
>> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
>> Total devices 3 FS bytes used 2.47TiB
>> devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
>> devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
>> devid 3 size 7.28TiB used 2.48TiB path /dev/sda
>>
>> # btrfs fi df /mnt/BigVault/
>> Data, RAID1: total=2.47TiB, used=2.47TiB
>> System, RAID1: total=32.00MiB, used=384.00KiB
>> Metadata, RAID1: total=4.00GiB, used=2.74GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> But still df -h is giving me:
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>>
>> Although I've heard and read about the difficulty in reporting free
>> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
>> I only have a single volume, no subvolumes, no snapshots, no
>> quotas, and both data and metadata are RAID1.
>>
>> My expectation would've been that in the case of BigVault Size == Used +
>> Avail.
>>
>> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
>> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
>> but that only 1.5 TB is available is even stranger.
>>
>> Could anyone explain what I did wrong or why my expectations are
>> wrong?
>>
>> Thank you in advance
>
> Btrfs reports estimated free space from the free space of the smallest
> member as it can only guarantee that. In your case this is 2.73 minus
> 1.24 free, which is roughly around 1.5T.
> But since this free space
> distributes across three disks with one having much more free space, it
> probably will use up that space at half the rate of actual allocation.
> But due to how btrfs allocates from free space in chunks, that may not
> be possible - thus the unexpectedly low value. You will probably need to
> run balance once in a while to evenly redistribute allocated chunks
> across all disks.
>
> It may give you better estimates if you combine sdb and sdc into one
> logical device, e.g. using raid0 or jbod via md or lvm.
>
> --
> Regards,
> Kai
>
> Replies to list-only preferred.
Re: Help me understand what is going on with my RAID1 FS
> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
> but that only 1.5 TB is available is even stranger.
>
> Could anyone explain what I did wrong or why my expectations are wrong?
>
> Thank you in advance

I'd say df and the website calculate different things. In btrfs the raid1 profile stores exactly 2 copies of data, each copy on a separate device. So by adding a third drive, no matter how big, effective free space didn't expand, because btrfs still needs space on one of the other two drives to store the second half of each raid1 chunk stored on that third drive.

Basically:

Drive1  Drive2  Drive3
X       X
X               X
        X       X

Where X is a chunk of raid1 block group.
Re: Help me understand what is going on with my RAID1 FS
On Sun, 10 Sep 2017 15:45:42 +0200, FLJ wrote:

> Hello all,
>
> I have a BTRFS RAID1 volume running for the past year. I avoided all
> pitfalls known to me that would mess up this volume. I never
> experimented with quotas, no-COW, snapshots, defrag, nothing really.
> The volume is a RAID1 from day 1 and is working reliably until now.
>
> Until yesterday it consisted of two 3 TB drives, something along the
> lines:
>
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
> Total devices 2 FS bytes used 2.47TiB
> devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
> devid 2 size 2.73TiB used 2.47TiB path /dev/sdc
>
> Yesterday I've added a new drive to the FS and did a full rebalance
> (without filters) over night, which went through without any issues.
>
> Now I have:
> Label: 'BigVault' uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
> Total devices 3 FS bytes used 2.47TiB
> devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
> devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
> devid 3 size 7.28TiB used 2.48TiB path /dev/sda
>
> # btrfs fi df /mnt/BigVault/
> Data, RAID1: total=2.47TiB, used=2.47TiB
> System, RAID1: total=32.00MiB, used=384.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.74GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> But still df -h is giving me:
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault
>
> Although I've heard and read about the difficulty in reporting free
> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
> I only have a single volume, no subvolumes, no snapshots, no
> quotas, and both data and metadata are RAID1.
>
> My expectation would've been that in the case of BigVault Size == Used +
> Avail.
>
> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
> but that only 1.5 TB is available is even stranger.
>
> Could anyone explain what I did wrong or why my expectations are
> wrong?
>
> Thank you in advance

Btrfs reports estimated free space from the free space of the smallest member as it can only guarantee that. In your case this is 2.73 minus 1.24 free, which is roughly around 1.5T. But since this free space distributes across three disks, with one having much more free space, it probably will use up that space at half the rate of actual allocation. But due to how btrfs allocates from free space in chunks, that may not be possible - thus the unexpectedly low value. You will probably need to run balance once in a while to evenly redistribute allocated chunks across all disks.

It may give you better estimates if you combine sdb and sdc into one logical device, e.g. using raid0 or jbod via md or lvm.

--
Regards,
Kai

Replies to list-only preferred.
Re: netapp-alike snapshots?
On Sat, Sep 09, 2017 at 10:43:16PM +0300, Andrei Borzenkov wrote:
> On 09.09.2017 16:44, Ulli Horlacher wrote:
> >
> > Your tool does not create .snapshot subdirectories in EVERY directory like
>
> Neither does NetApp. Those "directories" are magic handles that do not
> really exist.

Correct, thanks for saving me typing the same thing (I actually did work at NetApp many years back, so I'm familiar with how they work).

> > Netapp does.
> > Example:
> >
> > framstag@fex:~: cd ~/Mail/.snapshot/
> > framstag@fex:~/Mail/.snapshot: l
> > lR-X - 2017-09-09 09:55 2017-09-09_.daily ->
> > /local/home/.snapshot/2017-09-09_.daily/framstag/Mail
>
> Apart from the obvious problem with recursive directory traversal (NetApp
> .snapshot are not visible with a normal directory list), those will also be
> captured in snapshots and cannot be removed. NetApp snapshots themselves
> do not expose .snapshot "directories".

Correct. NetApp knows this of course, which is why those .snapshot directories are "magic" and hidden to ls(1), find(1) and others when they do a readdir(3).

> > lR-X - 2017-09-09 14:00 2017-09-09_1400.hourly ->
> > /local/home/.snapshot/2017-09-09_1400.hourly/framstag/Mail
> > lR-X - 2017-09-09 15:00 2017-09-09_1500.hourly ->
> > /local/home/.snapshot/2017-09-09_1500.hourly/framstag/Mail
> > lR-X - 2017-09-09 15:18 2017-09-09_1518.single ->
> > /local/home/.snapshot/2017-09-09_1518.single/framstag/Mail
> > lR-X - 2017-09-09 15:20 2017-09-09_1520.single ->
> > /local/home/.snapshot/2017-09-09_1520.single/framstag/Mail
> > lR-X - 2017-09-09 15:22 2017-09-09_1522.single ->
> > /local/home/.snapshot/2017-09-09_1522.single/framstag/Mail
> >
> > My users (and I) need snapshots in this way.

You are used to them being there, I was too :) While you could create lots of symlinks, I opted not to since it would have littered the filesystem. I can simply cd $(SNAPROOT)/volname_hourly/$(PWD) and end up where I wanted to be.
I suppose you could make a snapcd shell function that does this for you. The only issue is that volname_hourly comes before the rest of the path, so you aren't given a list of all the snapshots available for a given path; you have to cd into the given snapshot first, and then add the path.

I agree it's not as nice as NetApp, but honestly I don't think you can do better with btrfs at this point.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: Regarding handling of file renames in Btrfs
Hi, On 10.09.2017 08:45 Qu Wenruo wrote: > > > On 2017年09月10日 14:41, Qu Wenruo wrote: >> >> >> On 2017年09月10日 07:50, Rohan Kadekodi wrote: >>> Hello, >>> >>> I was trying to understand how file renames are handled in Btrfs. I >>> read the code documentation, but had a problem understanding a few >>> things. >>> >>> During a file rename, btrfs_commit_transaction() is called which is >>> because Btrfs has to commit the whole FS before storing the >>> information related to the new renamed file. It has to commit the FS >>> because a rename first does an unlink, which is not recorded in the >>> btrfs_rename() transaction and so is not logged in the log tree. Is my >>> understanding correct? If yes, my questions are as follows: >> >> Not familiar with rename kernel code, so not much help for rename >> opeartion. >> >>> >>> 1. What does committing the whole FS mean? >> >> Committing the whole fs means a lot of things, but generally >> speaking, it makes that the on-disk data is inconsistent with each >> other. > >> For obvious part, it writes modified fs/subvolume trees to disk (with >> handling of tree operations so no half modified trees). >> >> Also other trees like extent tree (very hot since every CoW will >> update it, and the most complicated one), csum tree if modified. >> >> After transaction is committed, the on-disk btrfs will represent the >> states when commit trans is called, and every tree should match each >> other. >> >> Despite of this, after a transaction is committed, generation of the >> fs get increased and modified tree blocks will have the same >> generation number. >> >>> Blktrace shows that there >>> are 2 256KB writes, which are essentially writes to the data of >>> the root directory of the file system (which I found out through >>> btrfs-debug-tree). >> >> I'd say you didn't check btrfs-debug-tree output carefully enough. >> I strongly recommend to do vimdiff to get what tree is modified. 
>> >> At least the following trees are modified: >> >> 1) fs/subvolume tree >> Rename modified the DIR_INDEX/DIR_ITEM/INODE_REF at least, and >> updated inode time. >> So fs/subvolume tree must be CoWed. >> >> 2) extent tree >> CoW of above metadata operation will definitely cause extent >> allocation and freeing, extent tree will also get updated. >> >> 3) root tree >> Both extent tree and fs/subvolume tree modified, their root bytenr >> needs to be updated and root tree must be updated. >> >> And finally superblocks. >> >> I just verified the behavior with empty btrfs created on a 1G file, >> only one file to do the rename. >> >> In that case (with 4K sectorsize and 16K nodesize), the total IO >> should be (3 * 16K) * 2 + 4K * 2 = 104K. >> >> "3" = number of tree blocks get modified >> "16K" = nodesize >> 1st "*2" = DUP profile for metadata >> "4K" = superblock size >> 2nd "*2" = 2 superblocks for 1G fs. >> >> If your extent/root/fs trees have higher level, then more tree blocks >> needs to be updated. >> And if your fs is very large, you may have 3 superblocks. >> >>> Is this equivalent to doing a shell sync, as the >>> same block groups are written during a shell sync too? >> >> For shell "sync" the difference is that, "sync" will write all dirty >> data pages to disk, and then commit transaction. >> While only calling btrfs_commit_transacation() doesn't trigger dirty >> page writeback. >> >> So there is a difference. this conversation made me realize why btrfs has sub-optimal meta-data performance. Cow b-trees are not the best data structure for such small changes. In my application I have multiple operations (e.g. renames) which can be bundles up and (mostly) one writer. I guess using BTRFS_IOC_TRANS_START and BTRFS_IOC_TRANS_END would be one way to reduce the cow overhead, but those are dangerous wrt. to ENOSPC and there have been discussions about removing them. 
Best would be if there were delayed metadata, where metadata is handled the same as delayed allocations and data changes, i.e. commit on fsync, commit interval or fssync. I assumed this was already the case... Please correct me if I got this wrong.

Regards,
Martin Raiber
Re: Regarding handling of file renames in Btrfs
Thank you for the prompt and elaborate answers! However, I think I was unclear in my questions, and I apologize for the confusion. What I meant was that for a file rename, when I check the blktrace output, there are 2 writes of 256KB each starting from byte number: 13373440 When I check btrfs-debug-tree, I see that the following items are related to it: 1) root tree: key (256 EXTENT_DATA 0) itemoff 13649 itemsize 53 extent data disk byte 13373440 nr 262144 extent data offset 0 nr 262144 ram 262144 extent compression 0 2) extent tree: key (13373440 EXTENT_ITEM 262144) itemoff 15040 itemsize 53 extent refs 1 gen 12 flags DATA extent data backref root 1 objectid 256 offset 0 count 1 So this means that the extent allocated to the root folder (mount point) is getting written twice right? Here I am not talking about any metadata, but the data in the extent allocated to the root folder, that is inode number 256. When I was analyzing the code, I saw that these writes happened from btrfs_start_dirty_block_groups() which is in btrfs_commit_transaction(). This is the same thing that is getting written on a filesystem commit. So my questions were: 1) Why are there 2 256KB writes happening during a filesystem commit to the same location instead of just 1? Also, what exactly is written in the root folder of the file system? Again, I am talking about the data held in the extent allocated inode 256 and not about any metadata or any tree. 2) I understand by the on-disk format that all the child dir/inode info in one subvolume are in the same tree, but these writes that I am talking about are not to any tree, they to the data held in inode 256, which happens to be the mount point. So by root directory, I mean the mount point or the inode 256 (not any tree). And even though metadata wise there is no hierarchy as such in the file system, each folder data will only contain the data belonging to its children right? 
Hence my question was that why does the data in the extent allocated to inode 256 need to be rewritten instead of just the parent folder for a rename? Thanks, Rohan On 10 September 2017 at 01:45, Qu Wenruowrote: > > > On 2017年09月10日 14:41, Qu Wenruo wrote: >> >> >> >> On 2017年09月10日 07:50, Rohan Kadekodi wrote: >>> >>> Hello, >>> >>> I was trying to understand how file renames are handled in Btrfs. I >>> read the code documentation, but had a problem understanding a few >>> things. >>> >>> During a file rename, btrfs_commit_transaction() is called which is >>> because Btrfs has to commit the whole FS before storing the >>> information related to the new renamed file. It has to commit the FS >>> because a rename first does an unlink, which is not recorded in the >>> btrfs_rename() transaction and so is not logged in the log tree. Is my >>> understanding correct? If yes, my questions are as follows: >> >> >> Not familiar with rename kernel code, so not much help for rename >> opeartion. >> >>> >>> 1. What does committing the whole FS mean? >> >> >> Committing the whole fs means a lot of things, but generally speaking, it >> makes that the on-disk data is inconsistent with each other. > > ^consistent > Sorry for the typo. > > Thanks, > Qu > >> >> For obvious part, it writes modified fs/subvolume trees to disk (with >> handling of tree operations so no half modified trees). >> >> Also other trees like extent tree (very hot since every CoW will update >> it, and the most complicated one), csum tree if modified. >> >> After transaction is committed, the on-disk btrfs will represent the >> states when commit trans is called, and every tree should match each other. >> >> Despite of this, after a transaction is committed, generation of the fs >> get increased and modified tree blocks will have the same generation number. 
>>> Blktrace shows that there are 2 256KB writes, which are essentially
>>> writes to the data of the root directory of the file system (which I
>>> found out through btrfs-debug-tree).
>>
>> I'd say you didn't check the btrfs-debug-tree output carefully enough.
>> I strongly recommend doing a vimdiff to see which trees are modified.
>>
>> At least the following trees are modified:
>>
>> 1) fs/subvolume tree
>>    Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>>    updates the inode times. So the fs/subvolume tree must be CoWed.
>>
>> 2) extent tree
>>    CoW of the above metadata will definitely cause extent allocation
>>    and freeing, so the extent tree also gets updated.
>>
>> 3) root tree
>>    Both the extent tree and the fs/subvolume tree were modified, so
>>    their root bytenrs need to be updated, and the root tree must be
>>    updated.
>>
>> And finally the superblocks.
>>
>> I just verified the behavior with an empty btrfs created on a 1G file,
>> with only one file to rename.
>>
>> In that case (with 4K sectorsize and 16K nodesize), the total IO
>> should be (3 * 16K) * 2 + 4K * 2 = 104K.
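Qu's 104K figure can be checked mechanically. A sketch, assuming the standard btrfs superblock mirror offsets of 64KiB, 64MiB and 256GiB (only the copies that fit inside the device get written); the function is illustrative, not kernel code:

```python
# Reproduce the minimum-IO estimate for a rename on an empty 1 GiB btrfs:
# 3 modified tree blocks (fs, extent and root trees) x 16K nodesize x 2
# (DUP metadata), plus 4K per superblock copy that fits in the device.

SUPERBLOCK_OFFSETS = [64 << 10, 64 << 20, 256 << 30]  # fixed btrfs mirror offsets

def rename_io_bytes(dev_size, nodesize=16 << 10, trees_modified=3, dup=2):
    supers = sum(1 for off in SUPERBLOCK_OFFSETS if off < dev_size)
    return trees_modified * nodesize * dup + supers * (4 << 10)

print(rename_io_bytes(1 << 30) // 1024)  # -> 104, matching (3 * 16K) * 2 + 4K * 2
```

For a 1 GiB device only the 64KiB and 64MiB copies fit, hence the "2 superblocks" in Qu's breakdown; a device larger than 256GiB gets all 3.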
Re: generic name for volume and subvolume root?
> As I am writing some documentation about creating snapshots:
> Is there a generic name for both volume and subvolume root?

Yes: from the UNIX side it is "root directory", and from the Btrfs side "subvolume".

Like some other things in Btrfs, the terminology is often inconsistent, but "volume" *usually* means "the set of devices [and contained root directories] with the same Btrfs 'fsid'". I think that the top-level subvolume should not be called the "volume": while there is no reason why a UNIX-like filesystem should be limited to a single block device, one of the fundamental properties of UNIX-like filesystems is that hard links are only possible (if at all possible) within a filesystem, and that 'statfs' returns a different "device id" per filesystem. Therefore a Btrfs volume is not properly a filesystem, but potentially a filesystem forest, as it may contain multiple filesystems, each with its own root directory.

> Is there a simple name for directories I can snapshot?

You can only snapshot *root directories*, of which in Btrfs there are two types: subvolumes (an unfortunate name perhaps) and snapshots.

In UNIX-like OSes every filesystem has a "root directory", and some filesystem types like Btrfs, NILFS2, and potentially JFS can have more than one; some can even mount more than one simultaneously. The root directory mounted as '/' is called the "system root directory". When unmounted, all filesystem root directories have no names, just an inode number. Conceivably the root inode of a UNIX-like filesystem could be an inode of any type, but I have never seen a recent UNIX-like OS able to mount anything other than a directory-type root inode (Plan 9 is not a UNIX-like OS :->).
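The 'statfs' device-id point above is easy to check from a script. A minimal sketch (the helper name is mine, not a standard API): two paths whose st_dev values differ are in different filesystems, which is exactly when hard links and plain rename(2) fail with EXDEV, and on Btrfs each separately mounted subvolume reports its own device id:

```python
import os

def same_filesystem(a, b):
    """True when both paths report the same device id, i.e. live in the
    same POSIX filesystem, so hard links between them are possible."""
    return os.stat(a).st_dev == os.stat(b).st_dev

print(same_filesystem("/", "/"))  # -> True
# On a Btrfs box, comparing '/' against a separately mounted subvolume
# would print False, even though both belong to one Btrfs "volume" (fsid).
```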
As someone else observed, the word "root" is overloaded in UNIX-like OS discourse, like the word "filesystem", and that's unfortunate, but it can always be resolved verbosely by using the appropriate qualifier: "root directory", "system root directory", "'root' user", "uid 0 capabilities", etc.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Help me understand what is going on with my RAID1 FS
Hello all,

I have a BTRFS RAID1 volume running for the past year. I avoided all pitfalls known to me that would mess up this volume: I never experimented with quotas, no-COW, snapshots, defrag, nothing really. The volume has been RAID1 from day 1 and has worked reliably until now.

Until yesterday it consisted of two 3 TB drives, something along the lines of:

Label: 'BigVault'  uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
        Total devices 2  FS bytes used 2.47TiB
        devid 1 size 2.73TiB used 2.47TiB path /dev/sdb
        devid 2 size 2.73TiB used 2.47TiB path /dev/sdc

Yesterday I added a new drive to the FS and did a full rebalance (without filters) overnight, which went through without any issues. Now I have:

Label: 'BigVault'  uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
        Total devices 3  FS bytes used 2.47TiB
        devid 1 size 2.73TiB used 1.24TiB path /dev/sdb
        devid 2 size 2.73TiB used 1.24TiB path /dev/sdc
        devid 3 size 7.28TiB used 2.48TiB path /dev/sda

# btrfs fi df /mnt/BigVault/
Data, RAID1: total=2.47TiB, used=2.47TiB
System, RAID1: total=32.00MiB, used=384.00KiB
Metadata, RAID1: total=4.00GiB, used=2.74GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

But df -h is still giving me:

Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        6.4T  2.5T  1.5T  63% /mnt/BigVault

I've heard and read about the difficulty of reporting free space due to the flexibility of BTRFS, snapshots and subvolumes, etc., but I only have a single volume, no subvolumes, no snapshots, no quotas, and both data and metadata are RAID1. My expectation would have been that for BigVault, Size == Used + Avail. Actually, based on http://carfax.org.uk/btrfs-usage/index.html I would have expected 6 TB of usable space. Here I get 6.4, which is odd, but that only 1.5 TB is available is even stranger.

Could anyone explain what I did wrong, or why my expectations are wrong?
Thank you in advance
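As a side note, the carfax calculator's 6 TB figure for these drives can be reproduced by hand: each RAID1 chunk needs a copy on two distinct devices, so the largest device can never hold more data than all the others combined. An illustrative sketch of that estimate (not btrfs code):

```python
# Estimate usable RAID1 capacity: total/2, capped by the fact that every
# chunk on the largest device must be paired with a copy elsewhere.

def raid1_usable(sizes):
    total, largest = sum(sizes), max(sizes)
    return min(total / 2, total - largest)

# The three drives from above, in TiB: 2x 2.73 TiB + 1x 7.28 TiB.
print(round(raid1_usable([2.73, 2.73, 7.28]), 2))  # -> 5.46 (TiB)
```

So about 5.46 TiB, which is roughly 6.0 TB in decimal units, matching the calculator. Why df's "Avail" column disagrees is a separate question about how btrfs estimates free space for statfs.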
Re: btrfs check --repair now runs in minutes instead of hours? aborting
On Sun, Sep 10, 2017 at 02:01:58PM +0800, Qu Wenruo wrote:
> On 2017年09月10日 01:44, Marc MERLIN wrote:
>> So, should I assume that btrfs progs git has some issue, since there
>> is no plausible way that a check --repair should be faster than a
>> regular check?
>
> Yes, the assumption that repair should be no faster than an RO check is
> correct. Especially for a clean fs, repair should just behave the same
> as an RO check.
>
> And I'll first submit a patch (or patches) to output the consumed time
> for each tree, so we can get a clue about what is going wrong.
> (Digging the code is just a little too boring for me)

Cool. Let me know when I should sync and re-try.

In the meantime, though, my check --repair went back to 170mn after triggering an FS bug for Josef, so it seems back to normal.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)
Great, if the free space cache is fucked again after the next go-around then I need to expand the verifier to watch entries being added to the cache as well. Thanks,

Josef

Sent from my iPhone

> On Sep 10, 2017, at 9:14 AM, Marc MERLIN wrote:
>
>> On Sun, Sep 10, 2017 at 03:12:16AM +0000, Josef Bacik wrote:
>> Ok mount -o clear_cache, umount and run fsck again just to make sure.
>> Then if it comes out clean mount with ref_verify again and wait for it
>> to blow up again. Thanks,
>
> Ok, just did the 2nd fsck; it came back clean after mount -o clear_cache.
>
> I'll re-trigger the exact same bug and repeat the whole cycle then.
>
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: BTRFS: error (device dm-2) in btrfs_run_delayed_refs:2960: errno=-17 Object already exists (since 3.4 / 2012)
On Sun, Sep 10, 2017 at 03:12:16AM +0000, Josef Bacik wrote:
> Ok mount -o clear_cache, umount and run fsck again just to make sure.
> Then if it comes out clean mount with ref_verify again and wait for it
> to blow up again. Thanks,

Ok, just did the 2nd fsck; it came back clean after mount -o clear_cache.

I'll re-trigger the exact same bug and repeat the whole cycle then.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: [PATCH] btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
On 2017年09月10日 19:19, Christophe JAILLET wrote:
> If 'btrfs_alloc_path()' fails, we must free the resources already
> allocated, as done in the other error handling paths in this function.
>
> Signed-off-by: Christophe JAILLET

Reviewed-by: Qu Wenruo

BTW, I also checked all the btrfs_alloc_path() calls in the self tests; no such leak remains.

Thanks,
Qu

> ---
>  fs/btrfs/tests/free-space-tree-tests.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/tests/free-space-tree-tests.c b/fs/btrfs/tests/free-space-tree-tests.c
> index 1458bb0ea124..8444a018cca2 100644
> --- a/fs/btrfs/tests/free-space-tree-tests.c
> +++ b/fs/btrfs/tests/free-space-tree-tests.c
> @@ -500,7 +500,8 @@ static int run_test(test_func_t test_func, int bitmaps, u32 sectorsize,
>  	path = btrfs_alloc_path();
>  	if (!path) {
>  		test_msg("Couldn't allocate path\n");
> -		return -ENOMEM;
> +		ret = -ENOMEM;
> +		goto out;
>  	}
>
>  	ret = add_block_group_free_space(&trans, root->fs_info, cache);
[PATCH] btrfs: tests: Fix a memory leak in error handling path in 'run_test()'
If 'btrfs_alloc_path()' fails, we must free the resources already allocated, as done in the other error handling paths in this function.

Signed-off-by: Christophe JAILLET
---
 fs/btrfs/tests/free-space-tree-tests.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tests/free-space-tree-tests.c b/fs/btrfs/tests/free-space-tree-tests.c
index 1458bb0ea124..8444a018cca2 100644
--- a/fs/btrfs/tests/free-space-tree-tests.c
+++ b/fs/btrfs/tests/free-space-tree-tests.c
@@ -500,7 +500,8 @@ static int run_test(test_func_t test_func, int bitmaps, u32 sectorsize,
 	path = btrfs_alloc_path();
 	if (!path) {
 		test_msg("Couldn't allocate path\n");
-		return -ENOMEM;
+		ret = -ENOMEM;
+		goto out;
 	}

 	ret = add_block_group_free_space(&trans, root->fs_info, cache);
--
2.11.0
Re: Please help with exact actions for raid1 hot-swap
On 10 September 2017 at 08:33, Marat Khalili wrote:
> It doesn't need replaced disk to be readable, right?

Only enough to be mountable, which it already is, so your read errors on /dev/sdb aren't a problem.

> Then what prevents same procedure to work without a spare bay?

It is basically the same procedure, but with a bunch of gotchas due to bugs and odd behaviour. Only having one shot at it, before the filesystem can only be mounted read-only, is especially problematic (this will be fixed in Linux 4.14).

> --
>
> With Best Regards,
> Marat Khalili
>
> On September 9, 2017 1:29:08 PM GMT+03:00, Patrik Lundquist wrote:
>> On 9 September 2017 at 12:05, Marat Khalili wrote:
>>> Forgot to add, I've got a spare empty bay if it can be useful here.
>>
>> That makes it much easier since you don't have to mount it degraded,
>> with the risks involved.
>>
>> Add and partition the disk.
>>
>> # btrfs replace start /dev/sdb7 /dev/sdc(?)7 /mnt/data
>>
>> Remove the old disk when it is done.
>>
>>> --
>>>
>>> With Best Regards,
>>> Marat Khalili
>>>
>>> On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili wrote:
>>>> Dear list,
>>>>
>>>> I'm going to replace one hard drive (partition actually) of a btrfs
>>>> raid1. Can you please spell out exactly what I need to do in order
>>>> to get my filesystem working as RAID1 again after replacement,
>>>> exactly as it was before? I saw some bad examples of drive
>>>> replacement on this list, so I'm afraid to just follow random
>>>> instructions on the wiki, and putting this system out of action even
>>>> temporarily would be very inconvenient.
> For this filesystem:
>
> $ sudo btrfs fi show /dev/sdb7
> Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
>         Total devices 2  FS bytes used 106.23GiB
>         devid 1 size 2.71TiB used 126.01GiB path /dev/sda7
>         devid 2 size 2.71TiB used 126.01GiB path /dev/sdb7
>
> $ grep /mnt/data /proc/mounts
> /dev/sda7 /mnt/data btrfs rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0
>
> $ sudo btrfs fi df /mnt/data
> Data, RAID1: total=123.00GiB, used=104.57GiB
> System, RAID1: total=8.00MiB, used=48.00KiB
> Metadata, RAID1: total=3.00GiB, used=1.67GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> $ uname -a
> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

I've got this in dmesg:

> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 action 0x0
> [ +0.51] ata6.00: irq_stat 0x4008
> [ +0.29] ata6.00: failed command: READ FPDMA QUEUED
> [ +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 ncq 57344 in
>          res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask 0x409 (media error)
> [ +0.94] ata6.00: status: { DRDY ERR }
> [ +0.26] ata6.00: error: { UNC }
> [ +0.001195] ata6.00: configured for UDMA/133
> [ +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
> [ +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error [current] [descriptor]
> [ +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
> [ +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00 79 f3 6c 50 00 00 00 70 00 00
> [ +0.03] blk_update_request: I/O error, dev sdb, sector 2045996136
> [ +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
> [ +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> [ +0.77] ata6: EH complete

There's still 1 in the Current_Pending_Sector line of smartctl output as of now, so it probably won't heal by itself.
--

With Best Regards,
Marat Khalili
Re: netapp-alike snapshots?
Perhaps NetApp is using a VFS overlay. There is really only one snapshot, but it is shown in the overlay on every folder. Kind of the same as Samba Shadow Copies.

From: Ulli Horlacher
Sent: 2017-09-09 - 21:52

> On Sat 2017-09-09 (22:43), Andrei Borzenkov wrote:
>
>> > Your tool does not create .snapshot subdirectories in EVERY directory like
>>
>> Neither does NetApp. Those "directories" are magic handles that do not
>> really exist.
>
> I know.
> But symbolic links are the next closest thing (I am not a kernel programmer).
>
>> Apart from obvious problem with recursive directory traversal (NetApp
>> .snapshot are not visible with normal directory list)
>
> Yes, they are, at least sometimes, e.g. tar includes the snapshots.
>
> --
> Ullrich Horlacher          Server und Virtualisierung
> Rechenzentrum TIK
> Universitaet Stuttgart     E-Mail: horlac...@tik.uni-stuttgart.de
> Allmandring 30a            Tel: ++49-711-68565868
> 70569 Stuttgart (Germany)  WWW: http://www.tik.uni-stuttgart.de/
> REF:<14c87878-a5a0-d7d3-4a76-c55812e75...@gmail.com>
Re: Regarding handling of file renames in Btrfs
On 2017年09月10日 14:41, Qu Wenruo wrote:
>
> On 2017年09月10日 07:50, Rohan Kadekodi wrote:
>>
>> Hello,
>>
>> I was trying to understand how file renames are handled in Btrfs. I
>> read the code documentation, but had a problem understanding a few
>> things.
>>
>> During a file rename, btrfs_commit_transaction() is called, because
>> Btrfs has to commit the whole FS before storing the information
>> related to the renamed file. It has to commit the FS because a rename
>> first does an unlink, which is not recorded in the btrfs_rename()
>> transaction and so is not logged in the log tree. Is my understanding
>> correct? If yes, my questions are as follows:
>
> Not familiar with the rename kernel code, so not much help for the
> rename operation.
>
>> 1. What does committing the whole FS mean?
>
> Committing the whole fs means a lot of things, but generally speaking,
> it makes that the on-disk data is inconsistent with each other.

^consistent
Sorry for the typo.

Thanks,
Qu

> For the obvious part, it writes modified fs/subvolume trees to disk
> (with handling of tree operations, so there are no half-modified
> trees).
>
> Also other trees like the extent tree (very hot, since every CoW will
> update it, and the most complicated one), and the csum tree if
> modified.
>
> After the transaction is committed, the on-disk btrfs will represent
> the state when commit was called, and every tree should match the
> others.
>
> Besides this, after a transaction is committed, the generation of the
> fs gets increased and modified tree blocks will have the same
> generation number.
>
>> Blktrace shows that there are 2 256KB writes, which are essentially
>> writes to the data of the root directory of the file system (which I
>> found out through btrfs-debug-tree).
>
> I'd say you didn't check the btrfs-debug-tree output carefully enough.
> I strongly recommend doing a vimdiff to see which trees are modified.
>
> At least the following trees are modified:
>
> 1) fs/subvolume tree
>    Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and
>    updates the inode times. So the fs/subvolume tree must be CoWed.
> 2) extent tree
>    CoW of the above metadata will definitely cause extent allocation
>    and freeing, so the extent tree also gets updated.
>
> 3) root tree
>    Both the extent tree and the fs/subvolume tree were modified, so
>    their root bytenrs need to be updated, and the root tree must be
>    updated.
>
> And finally the superblocks.
>
> I just verified the behavior with an empty btrfs created on a 1G file,
> with only one file to rename.
>
> In that case (with 4K sectorsize and 16K nodesize), the total IO
> should be (3 * 16K) * 2 + 4K * 2 = 104K.
>
> "3" = number of tree blocks that get modified
> "16K" = nodesize
> 1st "*2" = DUP profile for metadata
> "4K" = superblock size
> 2nd "*2" = 2 superblocks for a 1G fs.
>
> If your extent/root/fs trees have a higher level, then more tree
> blocks need to be updated. And if your fs is very large, you may have
> 3 superblocks.
>
>> Is this equivalent to doing a shell sync, as the same block groups
>> are written during a shell sync too?
>
> For shell "sync" the difference is that "sync" will write all dirty
> data pages to disk, and then commit the transaction, while only
> calling btrfs_commit_transaction() doesn't trigger dirty page
> writeback. So there is a difference.
>
> And furthermore, if there is nothing modified at all, sync will just
> skip the fs, so btrfs_commit_transaction() is not ensured if you call
> "sync".
>
>> Also, does it imply that all the metadata held by the log tree is now
>> checkpointed to the respective trees?
>
> The log tree part is a little tricky, as the log tree is not really a
> journal for btrfs. Btrfs uses CoW for metadata, so in theory (and in
> fact) btrfs doesn't need any journal. The log tree is mainly used to
> enhance btrfs fsync performance.
>
> You can totally disable the log tree with the notreelog mount option
> and btrfs will behave just fine.
>
> And furthermore, I'm not very familiar with the log tree; I need to
> check the code to see if the log tree is used in rename, so I can't
> say much right now. But to make things easy, I strongly recommend
> ignoring the log tree for now.
>
>> 2.
>> Why are there 2 complete writes to the data held by the root
>> directory and not just 1? These writes are 256KB each, which is the
>> size of the extent allocated to the root directory.
>
> Check my first calculation and verify the debug-tree output before and
> after the rename. I think there are some extra factors affecting the
> number, from the tree height to your fs tree organization.
>
>> 3. Why are the writes being done to the root directory of the file
>> system / subvolume and not just the parent directory where the unlink
>> happened?
>
> That's why I strongly recommend understanding the btrfs on-disk format
> first. A lot of things can be answered after understanding the on-disk
> layout, without asking any other guys.
>
> The short answer is, btrfs puts all its child dir/inode info into one
> tree for one subvolume. (And the term "root directory" here is a
> little confusing: are you talking about the fs tree root or the root
> tree?) Not the
Re: Regarding handling of file renames in Btrfs
On 2017年09月10日 07:50, Rohan Kadekodi wrote:
>
> Hello,
>
> I was trying to understand how file renames are handled in Btrfs. I
> read the code documentation, but had a problem understanding a few
> things.
>
> During a file rename, btrfs_commit_transaction() is called, because
> Btrfs has to commit the whole FS before storing the information
> related to the renamed file. It has to commit the FS because a rename
> first does an unlink, which is not recorded in the btrfs_rename()
> transaction and so is not logged in the log tree. Is my understanding
> correct? If yes, my questions are as follows:

Not familiar with the rename kernel code, so not much help for the rename operation.

> 1. What does committing the whole FS mean?

Committing the whole fs means a lot of things, but generally speaking, it makes that the on-disk data is inconsistent with each other.

For the obvious part, it writes modified fs/subvolume trees to disk (with handling of tree operations, so there are no half-modified trees).

Also other trees like the extent tree (very hot, since every CoW will update it, and the most complicated one), and the csum tree if modified.

After the transaction is committed, the on-disk btrfs will represent the state when commit was called, and every tree should match the others.

Besides this, after a transaction is committed, the generation of the fs gets increased and modified tree blocks will have the same generation number.

> Blktrace shows that there are 2 256KB writes, which are essentially
> writes to the data of the root directory of the file system (which I
> found out through btrfs-debug-tree).

I'd say you didn't check the btrfs-debug-tree output carefully enough. I strongly recommend doing a vimdiff to see which trees are modified.

At least the following trees are modified:

1) fs/subvolume tree
   Rename modifies the DIR_INDEX/DIR_ITEM/INODE_REF at least, and updates the inode times. So the fs/subvolume tree must be CoWed.
2) extent tree
   CoW of the above metadata will definitely cause extent allocation and freeing, so the extent tree also gets updated.

3) root tree
   Both the extent tree and the fs/subvolume tree were modified, so their root bytenrs need to be updated, and the root tree must be updated.

And finally the superblocks.

I just verified the behavior with an empty btrfs created on a 1G file, with only one file to rename.

In that case (with 4K sectorsize and 16K nodesize), the total IO should be (3 * 16K) * 2 + 4K * 2 = 104K.

"3" = number of tree blocks that get modified
"16K" = nodesize
1st "*2" = DUP profile for metadata
"4K" = superblock size
2nd "*2" = 2 superblocks for a 1G fs.

If your extent/root/fs trees have a higher level, then more tree blocks need to be updated. And if your fs is very large, you may have 3 superblocks.

> Is this equivalent to doing a shell sync, as the same block groups are
> written during a shell sync too?

For shell "sync" the difference is that "sync" will write all dirty data pages to disk, and then commit the transaction, while only calling btrfs_commit_transaction() doesn't trigger dirty page writeback. So there is a difference.

And furthermore, if there is nothing modified at all, sync will just skip the fs, so btrfs_commit_transaction() is not ensured if you call "sync".

> Also, does it imply that all the metadata held by the log tree is now
> checkpointed to the respective trees?

The log tree part is a little tricky, as the log tree is not really a journal for btrfs. Btrfs uses CoW for metadata, so in theory (and in fact) btrfs doesn't need any journal. The log tree is mainly used to enhance btrfs fsync performance.

You can totally disable the log tree with the notreelog mount option and btrfs will behave just fine.

And furthermore, I'm not very familiar with the log tree; I need to check the code to see if the log tree is used in rename, so I can't say much right now. But to make things easy, I strongly recommend ignoring the log tree for now.

> 2.
> Why are there 2 complete writes to the data held by the root directory
> and not just 1? These writes are 256KB each, which is the size of the
> extent allocated to the root directory.

Check my first calculation and verify the debug-tree output before and after the rename. I think there are some extra factors affecting the number, from the tree height to your fs tree organization.

> 3. Why are the writes being done to the root directory of the file
> system / subvolume and not just the parent directory where the unlink
> happened?

That's why I strongly recommend understanding the btrfs on-disk format first. A lot of things can be answered after understanding the on-disk layout, without asking any other guys.

The short answer is, btrfs puts all its child dir/inode info into one tree for one subvolume. (And the term "root directory" here is a little confusing: are you talking about the fs tree root or the root tree?) Not the common one-tree-per-inode layout. So if you rename one file in a subvolume, the subvolume tree gets CoWed, which means from the
Re: Please help with exact actions for raid1 hot-swap
It doesn't need replaced disk to be readable, right? Then what prevents same procedure to work without a spare bay?

--

With Best Regards,
Marat Khalili

On September 9, 2017 1:29:08 PM GMT+03:00, Patrik Lundquist wrote:
> On 9 September 2017 at 12:05, Marat Khalili wrote:
>> Forgot to add, I've got a spare empty bay if it can be useful here.
>
> That makes it much easier since you don't have to mount it degraded,
> with the risks involved.
>
> Add and partition the disk.
>
> # btrfs replace start /dev/sdb7 /dev/sdc(?)7 /mnt/data
>
> Remove the old disk when it is done.
>
>> On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili wrote:
>>> Dear list,
>>>
>>> I'm going to replace one hard drive (partition actually) of a btrfs
>>> raid1. Can you please spell out exactly what I need to do in order to
>>> get my filesystem working as RAID1 again after replacement, exactly
>>> as it was before? I saw some bad examples of drive replacement on
>>> this list, so I'm afraid to just follow random instructions on the
>>> wiki, and putting this system out of action even temporarily would be
>>> very inconvenient.
>>> For this filesystem:
>>>
>>> $ sudo btrfs fi show /dev/sdb7
>>> Label: 'data'  uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0
>>>         Total devices 2  FS bytes used 106.23GiB
>>>         devid 1 size 2.71TiB used 126.01GiB path /dev/sda7
>>>         devid 2 size 2.71TiB used 126.01GiB path /dev/sdb7
>>>
>>> $ grep /mnt/data /proc/mounts
>>> /dev/sda7 /mnt/data btrfs rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0
>>>
>>> $ sudo btrfs fi df /mnt/data
>>> Data, RAID1: total=123.00GiB, used=104.57GiB
>>> System, RAID1: total=8.00MiB, used=48.00KiB
>>> Metadata, RAID1: total=3.00GiB, used=1.67GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> $ uname -a
>>> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
>>>
>>> I've got this in dmesg:
>>>
>>> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 action 0x0
>>> [ +0.51] ata6.00: irq_stat 0x4008
>>> [ +0.29] ata6.00: failed command: READ FPDMA QUEUED
>>> [ +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 ncq 57344 in
>>>          res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask 0x409 (media error)
>>> [ +0.94] ata6.00: status: { DRDY ERR }
>>> [ +0.26] ata6.00: error: { UNC }
>>> [ +0.001195] ata6.00: configured for UDMA/133
>>> [ +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
>>> [ +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error [current] [descriptor]
>>> [ +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed
>>> [ +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00 79 f3 6c 50 00 00 00 70 00 00
>>> [ +0.03] blk_update_request: I/O error, dev sdb, sector 2045996136
>>> [ +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0
>>> [ +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
>>> [ +0.77] ata6: EH complete
>>>
>>> There's still 1 in the Current_Pending_Sector line of smartctl output as of now, so it probably won't heal by itself.
>>>
>>> --
>>>
>>> With Best Regards,
>>> Marat Khalili
Re: btrfs check --repair now runs in minutes instead of hours? aborting
On 2017年09月10日 01:44, Marc MERLIN wrote:
> So, should I assume that btrfs progs git has some issue, since there
> is no plausible way that a check --repair should be faster than a
> regular check?

Yes, the assumption that repair should be no faster than an RO check is correct. Especially for a clean fs, repair should just behave the same as an RO check.

And I'll first submit a patch (or patches) to output the consumed time for each tree, so we can get a clue about what is going wrong.
(Digging the code is just a little too boring for me)

Thanks,
Qu

> Thanks,
> Marc
>
> On Tue, Sep 05, 2017 at 07:45:25AM -0700, Marc MERLIN wrote:
>> On Tue, Sep 05, 2017 at 04:05:04PM +0800, Qu Wenruo wrote:
>>>> gargamel:~# btrfs fi df /mnt/btrfs_pool1
>>>> Data, single: total.60TiB, used.54TiB
>>>> System, DUP: total2.00MiB, used=1.19MiB
>>>> Metadata, DUP: totalX.00GiB, used.69GiB
>>>
>>> Wait for a minute.
>>> Is that .69GiB means 706 MiB?
>>> Or my email client/GMX screwed up the format (again)?
>>>
>>> This output format must be changed, at least to 0.69 GiB, or 706 MiB.
>>
>> Email client problem. I see control characters in what you quoted.
>> Let's try again:
>>
>> gargamel:~# btrfs fi df /mnt/btrfs_pool1
>> Data, single: total=10.66TiB, used=10.60TiB        => 10TB
>> System, DUP: total=64.00MiB, used=1.20MiB          => 1.2MB
>> Metadata, DUP: total=57.50GiB, used=12.76GiB       => 13GB
>> GlobalReserve, single: total=512.00MiB, used=0.00B => 0
>>
>>> You mean lowmem is actually FASTER than original mode?
>>> That's very surprising.
>>
>> Correct, unless I add --repair, and then original mode is 2x faster
>> than lowmem.
>>
>>> Is there any special operation done for that btrfs?
>>> Like offline dedupe or tons of reflinks?
>>
>> In this case, no.
>> Note that btrfs check used to take many hours overnight until I did a
>> git pull of btrfs progs and built the latest from TOT.
>>
>>> BTW, how many subvolumes do you have in the fs?
>>
>> gargamel:/mnt/btrfs_pool1# btrfs subvolume list . | wc -l
>> 91
>>
>> If I remove snapshots for btrfs send and historical 'backups':
>> gargamel:/mnt/btrfs_pool1# btrfs subvolume list .
>>   | grep -Ev '(hourly|daily|weekly|rw|ro)' | wc -l
>> 5
>>
>>> This looks like a bug.
>>> My first guess is related to the number of subvolumes/reflinks, but
>>> I'm not sure since I don't have many real-world btrfs.
>>> I'll take some time to look into it.
>>>
>>> Thanks for the very interesting report,
>>
>> Thanks for having a look :)
>>
>> Marc