Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
On Wed, Jul 6, 2016 at 8:24 PM, Tomasz Kusmierz wrote:
> you are throwing a lot of useful data, maybe diverting some of it into the wiki?
> you know, us normal people might find it useful for making an educated choice
> at some point in the future? :)

There is a wiki, and it's difficult to keep up to date as it is. There are
just too many changes happening in Btrfs, and really only the devs have a
bird's-eye view of what's going on and what will happen sooner rather than later.

--
Chris Murphy
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On 7/6/16 8:35 AM, Holger Hoffstätte wrote:
> On 07/06/16 14:25, Wang Shilong wrote:
...
>> After the patch, it will look like:
>>      Total   Exclusive  Set shared  Filename
>> skipping not btrfs dir/file: boot
>> skipping not btrfs dir/file: dev
>> skipping not btrfs dir/file: proc
>> skipping not btrfs dir/file: run
>> skipping not btrfs dir/file: sys
>>      0.00B       0.00B           -  //root/.bash_logout
>>      0.00B       0.00B           -  //root/.bash_profile
>>      0.00B       0.00B           -  //root/.bashrc
>>      0.00B       0.00B           -  //root/.cshrc
>>      0.00B       0.00B           -  //root/.tcshrc
>>
>> This works for me to analyze system usage and performance.
>
> This is great, but can we please skip the "skipping .." messages?
> Maybe it's just me but I really don't see the value of printing them
> when they don't contribute to the result.
> They also mess up the display. :)

I agree, those messages add no value.

-Eric

> thanks,
> Holger
Re: [PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()
hello,

On 07/06/2016 08:27 PM, Holger Hoffstätte wrote:
> On 07/06/16 12:37, Wang Xiaoguang wrote:
>> The test script below can reproduce this false ENOSPC:
>>
>>     #!/bin/bash
>>     dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
>>     dev=$(losetup --show -f fs.img)
>>     mkfs.btrfs -f -M $dev
>>     mkdir /tmp/mntpoint
>>     mount $dev /tmp/mntpoint
>>     cd /tmp/mntpoint
>>     xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile
>>
>> The fallocate(2) operation above will fail with ENOSPC, but in fact the
>> fs still has enough free space to satisfy the request. The reason is
>> that btrfs_fallocate() does not decrease btrfs_space_info's
>> bytes_may_use in time; it calls btrfs_free_reserved_data_space_noquota()
>> only at the end of btrfs_fallocate(), which is too late and has already
>> added false, unnecessary pressure to the enospc system. See the call graph:
>>
>>     btrfs_fallocate()
>>     |-> btrfs_alloc_data_chunk_ondemand()
>>     |       Increases btrfs_space_info's bytes_may_use accordingly.
>>     |-> btrfs_prealloc_file_range()
>>             Calls btrfs_reserve_extent(), but note that the alloc type
>>             is RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes()
>>             will only increase btrfs_space_info's bytes_reserved and
>>             will not decrease bytes_may_use.
>>
>> We have therefore obviously overestimated the real disk space needed,
>> which impacts other processes doing write(2) or fallocate(2) operations
>> and can also impact metadata reservation in mixed mode, while
>> bytes_may_use is only decreased at the end of btrfs_fallocate().
>>
>> To fix this false ENOSPC, we need to decrease btrfs_space_info's
>> bytes_may_use in btrfs_prealloc_file_range() in time, as we do in
>> cow_file_range(). See the call graph there:
>>
>>     cow_file_range()
>>     |-> extent_clear_unlock_delalloc()
>>         |-> clear_extent_bit()
>>             |-> btrfs_clear_bit_hook()
>>                 |-> btrfs_free_reserved_data_space_noquota()
>>                         Decreases bytes_may_use accordingly.
>>
>> So this patch chooses to call btrfs_free_reserved_data_space() in
>> __btrfs_prealloc_file_range() for both the successful and failed paths.
>> It also removes some old and useless comments.
>>
>> Signed-off-by: Wang Xiaoguang
>
> Verified that the reproducer script indeed fails (with btrfs ~4.7) and
> that the patch (on top of 1/2) fixes it. Also ran a bunch of other
> fallocating things without problem. Free space also still seems sane,
> as far as I could tell. So for both patches:
>
> Tested-by: Holger Hoffstätte

Thanks very much :)

Regards,
Xiaoguang Wang

> cheers,
> Holger
Re: [PATCH 1/2] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
hello,

On 07/07/2016 03:54 AM, Liu Bo wrote:
> On Wed, Jul 06, 2016 at 06:37:52PM +0800, Wang Xiaoguang wrote:
>> In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses the
>> wrong file offset for reloc_inode: it uses cluster->start and cluster->end,
>> which are in fact extent bytenrs. The correct values are
>> cluster->[start|end] minus the block group's start bytenr.
>>
>>     start bytenr   cluster->start
>>     |              |
>>     |              | extent | extent | ... | extent |
>>     |<----------- block group (reloc_inode) ----------->|
>>
>> Signed-off-by: Wang Xiaoguang
>> ---
>>  fs/btrfs/relocation.c | 27 +++++++++++++++-------------
>>  1 file changed, 15 insertions(+), 12 deletions(-)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index 0477dca..abc2f69 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -3030,34 +3030,37 @@ int prealloc_file_extent_cluster(struct inode *inode,
>>  	u64 num_bytes;
>>  	int nr = 0;
>>  	int ret = 0;
>> +	u64 prealloc_start, prealloc_end;
>>
>>  	BUG_ON(cluster->start != cluster->boundary[0]);
>>  	inode_lock(inode);
>>
>> -	ret = btrfs_check_data_free_space(inode, cluster->start,
>> -					  cluster->end + 1 - cluster->start);
>> +	start = cluster->start - offset;
>> +	end = cluster->end - offset;
>> +	ret = btrfs_check_data_free_space(inode, start, end + 1 - start);
>>  	if (ret)
>>  		goto out;
>>
>>  	while (nr < cluster->nr) {
>> -		start = cluster->boundary[nr] - offset;
>> +		prealloc_start = cluster->boundary[nr] - offset;
>>  		if (nr + 1 < cluster->nr)
>> -			end = cluster->boundary[nr + 1] - 1 - offset;
>> +			prealloc_end = cluster->boundary[nr + 1] - 1 - offset;
>>  		else
>> -			end = cluster->end - offset;
>> +			prealloc_end = cluster->end - offset;
>>
>> -		lock_extent(&BTRFS_I(inode)->io_tree, start, end);
>> -		num_bytes = end + 1 - start;
>> -		ret = btrfs_prealloc_file_range(inode, 0, start,
>> +		lock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
>> +			    prealloc_end);
>> +		num_bytes = prealloc_end + 1 - prealloc_start;
>> +		ret = btrfs_prealloc_file_range(inode, 0, prealloc_start,
>>  						num_bytes, num_bytes,
>> -						end + 1, &alloc_hint);
>> -		unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
>> +						prealloc_end + 1, &alloc_hint);
>> +		unlock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
>> +			      prealloc_end);
>
> Changing names is unnecessary, we can pick other names for
> btrfs_{check/free}_data_free_space().

OK, then the changes will be small, thanks.

Regards,
Xiaoguang Wang

> Thanks,
> -liubo
>
>>  		if (ret)
>>  			break;
>>  		nr++;
>>  	}
>> -	btrfs_free_reserved_data_space(inode, cluster->start,
>> -				       cluster->end + 1 - cluster->start);
>> +	btrfs_free_reserved_data_space(inode, start, end + 1 - start);
>>  out:
>>  	inode_unlock(inode);
>>  	return ret;
>> --
>> 2.9.0
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 7 Jul 2016, at 02:46, Chris Murphy wrote:

Chaps, I didn't want this to spring up as a btrfs performance argument, BUT
you are throwing a lot of useful data; maybe divert some of it into the wiki?
You know, us normal people might find it useful for making an educated choice
at some point in the future? :)

Interestingly, on my RAID10 with 6 disks I only get:

dd if=/mnt/share/asdf of=/dev/zero bs=100M
113+1 records in
113+1 records out
11874643004 bytes (12 GB, 11 GiB) copied, 45.3123 s, 262 MB/s

filefrag -v:

 ext:   logical_offset:        physical_offset:  length:   expected:  flags:
   0:        0..    2471: 2101940598..2101943069:    2472:          1:
   1:     2472..   12583: 1938312686..1938322797:   10112: 2101943070:
   2:    12584..   12837: 1937534654..1937534907:     254: 1938322798:
   3:    12838..   12839: 1937534908..1937534909:       2:
   4:    12840..   34109: 1902954063..1902975332:   21270: 1937534910:
   5:    34110..   53671: 1900857931..1900877492:   19562: 1902975333:
   6:    53672..   54055: 1900877493..1900877876:     384:
   7:    54056..   54063: 1900877877..1900877884:       8:
   8:    54064..   98041: 1900877885..1900921862:   43978:
   9:    98042..  117671: 1900921863..1900941492:   19630:
  10:   117672..  118055: 1900941493..1900941876:     384:
  11:   118056..  161833: 1900941877..1900985654:   43778:
  12:   161834..  204013: 1900985655..1901027834:   42180:
  13:   204014..  214269: 1901027835..1901038090:   10256:
  14:   214270..  214401: 1901038091..1901038222:     132:
  15:   214402..  214407: 1901038223..1901038228:       6:
  16:   214408..  258089: 1901038229..1901081910:   43682:
  17:   258090..  300139: 1901081911..1901123960:   42050:
  18:   300140..  310559: 1901123961..1901134380:   10420:
  19:   310560..  310695: 1901134381..1901134516:     136:
  20:   310696..  354251: 1901134517..1901178072:   43556:
  21:   354252..  396389: 1901178073..1901220210:   42138:
  22:   396390..  406353: 1901220211..1901230174:    9964:
  23:   406354..  406515: 1901230175..1901230336:     162:
  24:   406516..  406519: 1901230337..1901230340:       4:
  25:   406520..  450115: 1901230341..1901273936:   43596:
  26:   450116..  492161: 1901273937..1901315982:   42046:
  27:   492162..  524199: 1901315983..1901348020:   32038:
  28:   524200..  535355: 1901348021..1901359176:   11156:
  29:   535356..  535591: 1901359177..1901359412:     236:
  30:   535592.. 1315369: 1899830240..1900610017:  779778: 1901359413:
  31:  1315370.. 1357435: 1901359413..1901401478:   42066: 1900610018:
  32:  1357436.. 1368091: 1928101070..1928111725:   10656: 1901401479:
  33:  1368092.. 1368231: 1928111726..1928111865:     140:
  34:  1368232.. 2113959: 1899043808..1899789535:  745728: 1928111866:
  35:  2113960.. 2899082: 1898257376..1899042498:  785123: 1899789536: last,eof

If it were possible to read from 6 disks at once, maybe this performance
would be better for a linear read. Anyway, this is a huge diversion from the
original question, so maybe we will end here?
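Since this thread keeps coming back to whether btrfs spreads a single reader
across devices, one way to test the multi-process case is to compare the
aggregate throughput of several concurrent readers against the single dd
above. A minimal sketch (file paths are placeholders):

    #!/bin/bash
    # Read several large files in parallel; sum the reported per-process
    # rates and compare against the 262 MB/s single-reader figure above.
    for f in /mnt/share/big1 /mnt/share/big2 /mnt/share/big3; do
        dd if="$f" of=/dev/null bs=100M 2>&1 | tail -n 1 &
    done
    wait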
[PATCH v1] btrfs: Avoid reading out unnecessary extent tree blocks when mounting
The btrfs_read_block_groups() function is the most time-consuming part of
mounting when the fs is large and filled with small extents. For a btrfs
filled with 100,000,000 16K-sized files, mounting the fs takes about 10
seconds, and ftrace shows that btrfs_read_block_groups() takes about 9 of
those seconds, i.e. 90% of the mount time. So it's worthwhile to speed up
btrfs_read_block_groups() to reduce the overall mount time.

Btrfs_read_block_groups() calls btrfs_search_slot() to find block group
items. However, the search key is (<block group start bytenr>,
BLOCK_GROUP_ITEM_KEY, 0), which makes btrfs_search_slot() always return the
previous slot. In most cases that's OK, since the block group item and the
previous slot are in the same leaf. But when a block group item is the first
item of a leaf, we must read out the next leaf to get the block group item.
This needs extra IO and makes btrfs_read_block_groups() slower.

In fact, before we call btrfs_read_block_groups(), we have already read out
all the chunks, which are 1:1 mapped to block group items. So we can get the
exact block group length for btrfs_search_slot(), avoiding any possible
btrfs_next_leaf() call and speeding up btrfs_read_block_groups().

With this patch, time spent in btrfs_read_block_groups() is reduced to 7.56s,
compared to the old 8.94s, about a 15% improvement.

Reported-by: Tsutomu Itoh
Signed-off-by: Qu Wenruo
---
v2: Update commit message
---
 fs/btrfs/extent-tree.c | 61 +++++++++++++++-----------------------------
 fs/btrfs/extent_map.h  | 22 ++++++++++++++++
 2 files changed, 46 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82b912a..874f5b3 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -9648,39 +9648,20 @@ out:
 	return ret;
 }
 
-static int find_first_block_group(struct btrfs_root *root,
-				  struct btrfs_path *path, struct btrfs_key *key)
+int find_block_group(struct btrfs_root *root,
+		     struct btrfs_path *path,
+		     struct extent_map *chunk_em)
 {
 	int ret = 0;
-	struct btrfs_key found_key;
-	struct extent_buffer *leaf;
-	int slot;
-
-	ret = btrfs_search_slot(NULL, root, key, path, 0, 0);
-	if (ret < 0)
-		goto out;
+	struct btrfs_key key;
 
-	while (1) {
-		slot = path->slots[0];
-		leaf = path->nodes[0];
-		if (slot >= btrfs_header_nritems(leaf)) {
-			ret = btrfs_next_leaf(root, path);
-			if (ret == 0)
-				continue;
-			if (ret < 0)
-				goto out;
-			break;
-		}
-		btrfs_item_key_to_cpu(leaf, &found_key, slot);
+	key.objectid = chunk_em->start;
+	key.offset = chunk_em->len;
+	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 
-		if (found_key.objectid >= key->objectid &&
-		    found_key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-			ret = 0;
-			goto out;
-		}
-		path->slots[0]++;
-	}
-out:
+	ret = btrfs_search_slot(NULL, root, &key, path, 0, 0);
+	if (ret > 0)
+		ret = -ENOENT;
 	return ret;
 }
 
@@ -9899,16 +9880,14 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 	struct btrfs_block_group_cache *cache;
 	struct btrfs_fs_info *info = root->fs_info;
 	struct btrfs_space_info *space_info;
-	struct btrfs_key key;
+	struct btrfs_mapping_tree *map_tree = &root->fs_info->mapping_tree;
+	struct extent_map *chunk_em;
 	struct btrfs_key found_key;
 	struct extent_buffer *leaf;
 	int need_clear = 0;
 	u64 cache_gen;
 
 	root = info->extent_root;
-	key.objectid = 0;
-	key.offset = 0;
-	key.type = BTRFS_BLOCK_GROUP_ITEM_KEY;
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
@@ -9921,10 +9900,16 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 	if (btrfs_test_opt(root, CLEAR_CACHE))
 		need_clear = 1;
 
+	/* Here we don't lock the map tree, as we are the only reader */
+	chunk_em = first_extent_mapping(&map_tree->map_tree);
+	/* Not really possible */
+	if (!chunk_em) {
+		ret = -ENOENT;
+		goto error;
+	}
+
 	while (1) {
-		ret = find_first_block_group(root, path, &key);
-		if (ret > 0)
-			break;
+		ret = find_block_group(root, path, chunk_em);
 		if (ret != 0)
 			goto error;
@@ -9958,7 +9943,6 @@ int btrfs_read_block_groups(struct btrfs_root *root)
 			   sizeof(cache->item));
 		cache->flags = btrfs_block_group_flags(&cache->item);
 
-		key.objectid = found_key.objectid +
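The kind of measurement described in the commit message can be reproduced
with the function_graph tracer. A rough sketch, assuming debugfs is mounted
at the usual path and the function appears in available_filter_functions
(device and mount point are placeholders):

    #!/bin/bash
    # Time btrfs_read_block_groups() during a mount with ftrace.
    cd /sys/kernel/debug/tracing
    echo btrfs_read_block_groups > set_graph_function
    echo function_graph > current_tracer
    echo 1 > tracing_on
    mount /dev/sdX /mnt
    echo 0 > tracing_on
    # The trace shows the wall-clock duration of the traced function.
    grep btrfs_read_block_groups trace | head
    echo nop > current_tracer
    umount /mnt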
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
On Wed, Jul 6, 2016 at 5:22 PM, Kai Krakow wrote:
> The current implementation of RAID0 in btrfs is probably not very
> optimized. RAID0 is a special case anyways: Stripes have a defined
> width - I'm not sure what it is for btrfs, probably it's per chunk, so
> it's 1GB, maybe it's 64k **.

The stripe element (a.k.a. strip, a.k.a. md chunk) size in Btrfs is fixed at
64KiB.

> That means your data is usually not read
> from multiple disks in parallel anyways as long as requests are below
> stripe width (which is probably true for most access patterns except
> copying files) - there's no immediate performance benefit.

Almost any write pattern benefits from raid0 due to less disk contention,
even if the typical file size is smaller than the stripe size.
Parallelization is improved even if it's suboptimal. This is really no
different from md raid striping with a 64KiB chunk size.

On Btrfs, it may be that some workloads benefit from metadata raid10 and
others don't. I also think it's hard to estimate without benchmarking an
actual workload with metadata as raid1 vs raid10.

> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for single
> process access patterns and neither for multi process access patterns.
> Btrfs can only benefit from RAID1 in multi process access patterns
> currently, as can btrfs RAID0 by design for usual small random access
> patterns (and maybe large sequential operations). But RAID1 with more
> than two disks and multi process access patterns is more or less equal
> to RAID10 because stripes are likely to be on different devices anyways.

I think that too would need to be benchmarked, and it would need to be aged
as well to see the effect of both file and block group free space
fragmentation. The devil is in the really minute details: all you have to do
is read a few weeks of XFS list traffic, with people talking about
optimization or bad performance, and almost always it's not the fault of the
file system. And when it is, it depends on the kernel version, as XFS has had
substantial changes even over its long career, including (somewhat) recent
changes for metadata-heavy workloads.

> In conclusion: RAID1 is simpler than RAID10 and thus its less likely to
> contain flaws or bugs.

I don't know about that. I think it's about the same. All multiple-device
support, except raid56, was introduced at the same time, practically from day
2. Btrfs raid1 and raid10 tolerate only exactly 1 device loss, *maybe* two if
you're very lucky, so neither of them is really scalable.

--
Chris Murphy
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
Am Thu, 7 Jul 2016 00:51:16 +0100
schrieb Tomasz Kusmierz:

> > On 7 Jul 2016, at 00:22, Kai Krakow wrote:
> >
> > Am Wed, 6 Jul 2016 13:20:15 +0100
> > schrieb Tomasz Kusmierz:
> >
> >> When I think of it, I did move this folder first when the filesystem
> >> was RAID 1 (or not even RAID at all), and then it was upgraded to
> >> RAID 1, then RAID 10. Was there a faulty balance around August
> >> 2014? Please remember that I'm using Ubuntu, so it was probably a
> >> kernel from Ubuntu 14.04 LTS.
> >>
> >> Also, I would like to hear it from the horse's mouth: dos & don'ts
> >> for long-term storage where you moderately care about the data:
> >> RAID10 - flaky? Would RAID1 give similar performance?
> >
> > The current implementation of RAID0 in btrfs is probably not very
> > optimized. RAID0 is a special case anyways: stripes have a defined
> > width - I'm not sure what it is for btrfs; probably it's per chunk,
> > so it's 1GB, maybe it's 64k **. That means your data is usually not
> > read from multiple disks in parallel anyway as long as requests
> > are below the stripe width (which is probably true for most access
> > patterns except copying files) - there's no immediate performance
> > benefit. This holds true for any RAID0 with read and write patterns
> > below the stripe size. Data is just more evenly distributed across
> > devices, and your application will only benefit performance-wise if
> > accesses spread semi-randomly across the span of the whole file. And
> > at least the last time I checked, it was stated that btrfs raid0
> > does not submit IOs in parallel yet but first reads one stripe, then
> > the next - so it doesn't submit IOs to different devices in parallel.
> >
> > Getting to RAID1, btrfs is even less optimized: the stripe decision
> > is based on process pids instead of device load, read accesses won't
> > distribute evenly to different stripes per single process, it's
> > only just reading from the same single device - always. Write
> > access isn't faster anyway: both stripes need to be written -
> > writing RAID1 is single-device performance only.
> >
> > So I guess, at this stage there's no big difference between RAID1
> > and RAID10 in btrfs (except maybe for large file copies), not for
> > single-process access patterns and neither for multi-process access
> > patterns. Btrfs can only benefit from RAID1 in multi-process access
> > patterns currently, as can btrfs RAID0 by design for usual small
> > random access patterns (and maybe large sequential operations). But
> > RAID1 with more than two disks and multi-process access patterns is
> > more or less equal to RAID10, because stripes are likely to be on
> > different devices anyway.
> >
> > In conclusion: RAID1 is simpler than RAID10 and thus it's less
> > likely to contain flaws or bugs.
> >
> > **: Please enlighten me, I couldn't find docs on this matter.
>
> :O
>
> It's an eye opener - I think that this should end up on the btrfs
> wiki … seriously!
>
> Anyway, my use case for this is "storage", therefore I predominantly
> copy large files.

Then RAID10 may be your best option - for local operations. When copying
large files, even a modern single SATA spindle can saturate a gigabit link.
So, if your use case is NAS, and you don't use server-side copies (like
modern versions of NFS and Samba support), you won't benefit from RAID10 vs
RAID1 - so just use the simpler implementation.

My personal recommendation: add a small, high-quality SSD to your array and
configure btrfs on top of bcache, configured for write-around caching to get
the best lifetime and data safety. This should cache mostly metadata access
in your use case and improve performance much more than RAID10 over RAID1. I
can recommend the Crucial MX series from personal experience; choose 250GB or
higher, as the 120GB versions of the Crucial MX suffer much lower durability
for caching purposes.

Adding bcache to an existing btrfs array is a little painful but easily
doable if you have enough free space to temporarily sacrifice one disk.

BTW: I'm using 3x 1TB btrfs mraid1/draid0 with a single 500GB bcache SSD in
write-back mode and local operation (it's my desktop machine). The
performance is great; bcache decouples some of the performance downsides the
current btrfs raid implementation has. I do daily backups, so write-back
caching is not a real problem (in case it fails), and btrfs draid0 is also
not a problem (mraid1 ensures metadata integrity, so only file contents are
at risk, and those are covered by backups). With this setup I can easily
saturate my 6Gb onboard SATA controller; the system boots to a usable desktop
in 30 seconds from cold start (including EFI firmware), including autologin
to full-blown KDE, autostart of Chrome and Steam, 2 virtual machine
containers (nspawn-based, one MySQL instance, one ElasticSearch instance),
plus local MySQL and ElasticSearch services (used for development and staging
purposes), and a
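For readers wanting to try the setup described above, a minimal sketch of the
bcache commands involved (device names and the cache-set UUID are
placeholders; make-bcache destroys existing data on the target device):

    # Format the SSD as a cache device; note the "Set UUID" it prints.
    make-bcache -C /dev/sdX

    # Format a backing HDD (for an existing btrfs array, free up and
    # convert one disk at a time, as described above).
    make-bcache -B /dev/sdY

    # Attach the backing device to the cache set.
    echo <cset-uuid> > /sys/block/bcache0/bcache/attach

    # Write-around: writes bypass the SSD; only reads populate the cache.
    echo writearound > /sys/block/bcache0/bcache/cache_mode

    # btrfs then goes on top of the bcache devices:
    mkfs.btrfs -m raid1 -d raid0 /dev/bcache0 /dev/bcache1 /dev/bcache2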
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 7 Jul 2016, at 00:22, Kai Krakow wrote:
>
> Am Wed, 6 Jul 2016 13:20:15 +0100
> schrieb Tomasz Kusmierz:
>
>> When I think of it, I did move this folder first when the filesystem
>> was RAID 1 (or not even RAID at all), and then it was upgraded to
>> RAID 1, then RAID 10. Was there a faulty balance around August 2014?
>> Please remember that I'm using Ubuntu, so it was probably a kernel
>> from Ubuntu 14.04 LTS.
>>
>> Also, I would like to hear it from the horse's mouth: dos & don'ts for
>> long-term storage where you moderately care about the data: RAID10 -
>> flaky? Would RAID1 give similar performance?
>
> The current implementation of RAID0 in btrfs is probably not very
> optimized. RAID0 is a special case anyways: stripes have a defined
> width - I'm not sure what it is for btrfs; probably it's per chunk, so
> it's 1GB, maybe it's 64k **. That means your data is usually not read
> from multiple disks in parallel anyway as long as requests are below
> the stripe width (which is probably true for most access patterns except
> copying files) - there's no immediate performance benefit. This holds
> true for any RAID0 with read and write patterns below the stripe size.
> Data is just more evenly distributed across devices, and your
> application will only benefit performance-wise if accesses spread
> semi-randomly across the span of the whole file. And at least the last
> time I checked, it was stated that btrfs raid0 does not submit IOs in
> parallel yet but first reads one stripe, then the next - so it doesn't
> submit IOs to different devices in parallel.
>
> Getting to RAID1, btrfs is even less optimized: the stripe decision is
> based on process pids instead of device load, read accesses won't
> distribute evenly to different stripes per single process, it's only
> just reading from the same single device - always. Write access isn't
> faster anyway: both stripes need to be written - writing RAID1 is
> single-device performance only.
>
> So I guess, at this stage there's no big difference between RAID1 and
> RAID10 in btrfs (except maybe for large file copies), not for
> single-process access patterns and neither for multi-process access
> patterns. Btrfs can only benefit from RAID1 in multi-process access
> patterns currently, as can btrfs RAID0 by design for usual small random
> access patterns (and maybe large sequential operations). But RAID1 with
> more than two disks and multi-process access patterns is more or less
> equal to RAID10, because stripes are likely to be on different devices
> anyway.
>
> In conclusion: RAID1 is simpler than RAID10 and thus it's less likely
> to contain flaws or bugs.
>
> **: Please enlighten me, I couldn't find docs on this matter.

:O

It's an eye opener - I think that this should end up on the btrfs wiki …
seriously!

Anyway, my use case for this is "storage", therefore I predominantly copy
large files.

> --
> Regards,
> Kai
>
> Replies to list-only preferred.
Re: [PATCH 19/20] xfs: run xfs_repair at the end of each test
On Thu, Jul 07, 2016 at 09:13:40AM +1000, Dave Chinner wrote:
> On Mon, Jul 04, 2016 at 09:11:34PM -0700, Darrick J. Wong wrote:
> > On Tue, Jul 05, 2016 at 11:56:17AM +0800, Eryu Guan wrote:
> > > On Thu, Jun 16, 2016 at 06:48:01PM -0700, Darrick J. Wong wrote:
> > > > Run xfs_repair twice at the end of each test -- once to rebuild
> > > > the btree indices, and again with -n to check the rebuild work.
> > > >
> > > > Signed-off-by: Darrick J. Wong
> > > > ---
> > > >  common/rc | 3 +++
> > > >  1 file changed, 3 insertions(+)
> > > >
> > > > diff --git a/common/rc b/common/rc
> > > > index 1225047..847191e 100644
> > > > --- a/common/rc
> > > > +++ b/common/rc
> > > > @@ -2225,6 +2225,9 @@ _check_xfs_filesystem()
> > > >  		ok=0
> > > >  	fi
> > > >
> > > > +	$XFS_REPAIR_PROG $extra_options $extra_log_options $extra_rt_options $device >$tmp.repair 2>&1
> > > > +	cat $tmp.repair | _fix_malloc >>$seqres.full
> > > > +
> > >
> > > Won't this hide fs corruptions? Did I miss anything?
> >
> > I could've sworn it did:
> >
> > xfs_repair -n
> > (complain if corrupt)
> >
> > xfs_repair
> >
> > xfs_repair -n
> > (complain if still corrupt)
> >
> > But that first xfs_repair -n hunk disappeared. :(
> >
> > Ok, will fix and resend.
>
> Not sure this is the best idea - when repair on an aged test device
> takes 10s, this means the test harness overhead increases by a
> factor of 3. i.e. test takes 1s to run, checking the filesystem
> between tests now takes 30s. i.e. this will badly blow out the run
> time of the test suite on aged test devices.
>
> What does this overhead actually gain us that we couldn't encode
> explicitly into a single test or two? e.g. the test itself runs
> repair on the aged test device.

I'm primarily using it as a way to expose the new rmap/refcount/rtrmap btree
rebuilding code to a wider variety of filesystems. But you're right, there's
no need to expose /everyone/ to this behavior. Shall I rework the change so
that one can turn it on or off as desired?

--D

> Cheers,
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
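One possible shape for the opt-in rework discussed above, sketched against
_check_xfs_filesystem(); the TEST_XFS_REPAIR_REBUILD variable name is
hypothetical, not an existing xfstests knob:

    # inside _check_xfs_filesystem(), guarded so the default run time
    # on aged test devices is unaffected
    if [ "$TEST_XFS_REPAIR_REBUILD" = "yes" ]; then
        # complain if already corrupt
        $XFS_REPAIR_PROG -n $extra_options $extra_log_options \
            $extra_rt_options $device >$tmp.repair 2>&1 || ok=0
        # rebuild the btree indices
        $XFS_REPAIR_PROG $extra_options $extra_log_options \
            $extra_rt_options $device >>$tmp.repair 2>&1
        # complain if the rebuild itself left corruption behind
        $XFS_REPAIR_PROG -n $extra_options $extra_log_options \
            $extra_rt_options $device >>$tmp.repair 2>&1 || ok=0
        cat $tmp.repair | _fix_malloc >>$seqres.full
    fi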
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
Am Wed, 6 Jul 2016 13:20:15 +0100
schrieb Tomasz Kusmierz:

> When I think of it, I did move this folder first when the filesystem
> was RAID 1 (or not even RAID at all), and then it was upgraded to
> RAID 1, then RAID 10. Was there a faulty balance around August 2014?
> Please remember that I'm using Ubuntu, so it was probably a kernel
> from Ubuntu 14.04 LTS.
>
> Also, I would like to hear it from the horse's mouth: dos & don'ts for
> long-term storage where you moderately care about the data: RAID10 -
> flaky? Would RAID1 give similar performance?

The current implementation of RAID0 in btrfs is probably not very
optimized. RAID0 is a special case anyways: stripes have a defined
width - I'm not sure what it is for btrfs; probably it's per chunk, so
it's 1GB, maybe it's 64k **. That means your data is usually not read
from multiple disks in parallel anyway as long as requests are below
the stripe width (which is probably true for most access patterns except
copying files) - there's no immediate performance benefit. This holds
true for any RAID0 with read and write patterns below the stripe size.
Data is just more evenly distributed across devices, and your
application will only benefit performance-wise if accesses spread
semi-randomly across the span of the whole file. And at least the last
time I checked, it was stated that btrfs raid0 does not submit IOs in
parallel yet but first reads one stripe, then the next - so it doesn't
submit IOs to different devices in parallel.

Getting to RAID1, btrfs is even less optimized: the stripe decision is
based on process pids instead of device load, read accesses won't
distribute evenly to different stripes per single process, it's only
just reading from the same single device - always. Write access isn't
faster anyway: both stripes need to be written - writing RAID1 is
single-device performance only.

So I guess, at this stage there's no big difference between RAID1 and
RAID10 in btrfs (except maybe for large file copies), not for
single-process access patterns and neither for multi-process access
patterns. Btrfs can only benefit from RAID1 in multi-process access
patterns currently, as can btrfs RAID0 by design for usual small random
access patterns (and maybe large sequential operations). But RAID1 with
more than two disks and multi-process access patterns is more or less
equal to RAID10, because stripes are likely to be on different devices
anyway.

In conclusion: RAID1 is simpler than RAID10 and thus it's less likely
to contain flaws or bugs.

**: Please enlighten me, I couldn't find docs on this matter.

--
Regards,
Kai

Replies to list-only preferred.
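To make the pid-based stripe decision above concrete: in kernels of this era,
btrfs picks the raid1 read mirror essentially as pid modulo the number of
copies, so any single process is pinned to one device. A toy illustration of
that selection rule (the modulo policy is the substance; everything else is
scaffolding):

    #!/bin/bash
    num_copies=2
    for pid in 4210 4211 4212 4213; do
        # roughly what the kernel's mirror selection computes
        echo "pid $pid reads from mirror $((pid % num_copies))"
    done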
Re: [PATCH 19/20] xfs: run xfs_repair at the end of each test
On Mon, Jul 04, 2016 at 09:11:34PM -0700, Darrick J. Wong wrote:
> On Tue, Jul 05, 2016 at 11:56:17AM +0800, Eryu Guan wrote:
> > On Thu, Jun 16, 2016 at 06:48:01PM -0700, Darrick J. Wong wrote:
> > > Run xfs_repair twice at the end of each test -- once to rebuild
> > > the btree indices, and again with -n to check the rebuild work.
> > >
> > > Signed-off-by: Darrick J. Wong
> > > ---
> > >  common/rc | 3 +++
> > >  1 file changed, 3 insertions(+)
> > >
> > > diff --git a/common/rc b/common/rc
> > > index 1225047..847191e 100644
> > > --- a/common/rc
> > > +++ b/common/rc
> > > @@ -2225,6 +2225,9 @@ _check_xfs_filesystem()
> > >  		ok=0
> > >  	fi
> > >
> > > +	$XFS_REPAIR_PROG $extra_options $extra_log_options $extra_rt_options $device >$tmp.repair 2>&1
> > > +	cat $tmp.repair | _fix_malloc >>$seqres.full
> > > +
> >
> > Won't this hide fs corruptions? Did I miss anything?
>
> I could've sworn it did:
>
> xfs_repair -n
> (complain if corrupt)
>
> xfs_repair
>
> xfs_repair -n
> (complain if still corrupt)
>
> But that first xfs_repair -n hunk disappeared. :(
>
> Ok, will fix and resend.

Not sure this is the best idea - when repair on an aged test device takes
10s, this means the test harness overhead increases by a factor of 3. i.e. a
test takes 1s to run, and checking the filesystem between tests now takes
30s. This will badly blow out the run time of the test suite on aged test
devices.

What does this overhead actually gain us that we couldn't encode explicitly
into a single test or two? e.g. the test itself runs repair on the aged test
device.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: raid1 has failing disks, but smart is clear
> On 6 Jul 2016, at 23:14, Corey Coughlin wrote:
>
> Hi all,
>    Hoping you all can help, have a strange problem, think I know what's
> going on, but could use some verification. I set up a raid1 type btrfs
> filesystem on an Ubuntu 16.04 system, here's what it looks like:
>
> btrfs fi show
> Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
>        Total devices 10 FS bytes used 3.42TiB
>        devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
>        devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
>        devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
>        devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
>        devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
>        devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
>        devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
>        devid    8 size 1.82TiB used 1.18TiB path /dev/sda
>        devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
>        devid   10 size 1.36TiB used 745.03GiB path /dev/sdh
>
> I added a couple disks, and then ran a balance operation, and that took
> about 3 days to finish. When it did finish, tried a scrub and got this
> message:
>
> scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
>    scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
>    total bytes scrubbed: 926.45GiB with 18849935 errors
>    error details: read=18849935
>    corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0
>
> So that seems bad. Took a look at the devices and a few of them have errors:
> ...
> [/dev/sdi].generation_errs  0
> [/dev/sdj].write_io_errs    289436740
> [/dev/sdj].read_io_errs     289492820
> [/dev/sdj].flush_io_errs    12411
> [/dev/sdj].corruption_errs  0
> [/dev/sdj].generation_errs  0
> [/dev/sdg].write_io_errs    0
> ...
> [/dev/sda].generation_errs  0
> [/dev/sdb].write_io_errs    3490143
> [/dev/sdb].read_io_errs     111
> [/dev/sdb].flush_io_errs    268
> [/dev/sdb].corruption_errs  0
> [/dev/sdb].generation_errs  0
> [/dev/sdh].write_io_errs    5839
> [/dev/sdh].read_io_errs     2188
> [/dev/sdh].flush_io_errs    11
> [/dev/sdh].corruption_errs  1
> [/dev/sdh].generation_errs  16373
>
> So I checked the smart data for those disks, they seem perfect, no
> reallocated sectors, no problems. But one thing I did notice is that they
> are all WD Green drives. So I'm guessing that if they power down and get
> reassigned to a new /dev/sd* letter, that could lead to data corruption. I
> used idle3ctl to turn off the shut down mode on all the green drives in
> the system, but I'm having trouble getting the filesystem working without
> the errors. I tried a 'check --repair' command on it, and it seems to find
> a lot of verification errors, but it doesn't look like things are getting
> fixed. But I have all the data on it backed up on another system, so I can
> recreate this if I need to. But here's what I want to know:
>
> 1. Am I correct about the issues with the WD Green drives, if they change
> device nodes during disk operations, will that corrupt data?

I just wanted to chip in on WD Green drives. I have a RAID10 running on
6x2TB of those; in fact I've had it for ~3 years. If a disk goes down for
spin-down and you try to access something, the kernel & FS & whole system
will wait for the drive to spin back up, and everything works OK. I've never
had a drive reassigned to a different /dev/sdX due to spin-down/up.

2 years ago I was hit by corruption due to not using ECC RAM in my system:
one of the RAM modules started producing errors that were never caught by
the CPU / MoBo.

Long story short, a guy here managed to point me in the right direction and
I started shifting my data to a hopefully new and uncorrupted FS … but I was
sceptical of an issue similar to the one you have described, AND I ran
raid1, and while mounted I did shift a disk from one SATA port to another.
The FS picked up the disk in its new location and did not even blink (as far
as I remember, there was a syslog entry saying that the disk vanished and
then that the disk was added).

Last word: you've got plenty of errors in your SMART data for
transfer-related stuff. Please be advised that this may mean:
- a faulty cable
- a faulty motherboard controller
- a faulty drive controller
- bad RAM - yes, the motherboard CAN use your RAM for storing data and
  transfer-related stuff … especially cheaper ones.

> 2. If that is the case:
>    a.) Is there any way I can stop the /dev/sd* device names from
>    changing? Or can I set up the filesystem using UUIDs or something more
>    solid? I googled about it, but found conflicting info.

Don't take it the wrong way, but I'm personally surprised that anybody still
uses raw device paths rather than UUIDs. Devices change from boot to boot
for a lot of people, and most distros moved to UUIDs (2 years ago? even swap
is mounted via UUID now).

>    b.) Or, is there something else changing my drive devices? I have most
>    of the drives on an LSI SAS 9201-16i card, is there something I need to
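For the UUID question above, a minimal sketch (the mount point is a
placeholder; the UUID is the one from the 'btrfs fi show' output quoted
earlier):

    # discover filesystem UUIDs
    blkid /dev/sdd

    # /etc/fstab entry by UUID instead of a /dev/sd* name; any member
    # device of a multi-device btrfs fs resolves to the same fs UUID
    UUID=597ee185-36ac-4b68-8961-d4adc13f95d4  /mnt/array  btrfs  defaults  0  0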
Re: [PATCH v6 00/20] xfstests: minor fixes for the reflink/dedupe tests
On Tue, Jul 05, 2016 at 12:31:30PM +0800, Eryu Guan wrote:
> Hi Darrick,
>
> On Thu, Jun 16, 2016 at 06:46:02PM -0700, Darrick J. Wong wrote:
> > Hi all,
> >
> > This is the sixth revision of a patchset that adds to xfstests
> > support for testing reverse-mappings of physical blocks to file and
> > metadata (rmap); support for testing multiple file logical blocks
> > mapped to the same physical block (reflink); and implements the
> > beginnings of online metadata scrubbing.
> >
> > The first eight patches are in Eryu Guan's pull request of 2016-06-15.
> > Those patches haven't changed, but they're not yet in the upstream
> > repo.
> >
> > If you're going to start using this mess, you probably ought to just
> > pull from my github trees for kernel[1], xfsprogs[2], and xfstests[3].
> > There are also updates for xfs-docs[4]. The kernel patches should
> > apply to dchinner's for-next; xfsprogs patches to for-next; and
> > xfstests to master. The kernel git tree already has for-next included.
> >
> > The patches have been xfstested with x64, i386, and armv7l -- arm64,
> > ppc64, and ppc64le no longer boot in qemu. All three architectures
> > pass all 'clone' group tests except xfs/128 (which is the swapext
> > test), and AFAICT don't cause any new failures for the 'auto' group.
> >
> > This is an extraordinary way to eat your data. Enjoy!
> > Comments and questions are, as always, welcome.
>
> I tested your xfstests patches with your kernel (HEAD f0b34b6 "xfs: add
> btree scrub tracepoints") and xfsprogs (HEAD 34bd754 "xfs_scrub: create
> online filesystem scrub program"), on an x86_64 host with 4k block size
> XFS.
>
> A './check -g auto' run looked fine overall. Besides the comments I
> replied to some patches with, other common minor issues are:
> - space indentation in _cleanup, not tab
> - bare 'umount $SCRATCH_MNT', not _scratch_unmount
> - whitespace issues in _test|scratch_inject_error
>
> (I can fix all these minor issues at commit time, if you don't have
> other major updates to these patches).

I don't have any major updates to any of those patches; go ahead.

FWIW I usually have unposted patches at all points in time, so if you want
to fix minor nits in things I've already posted for review and commit them
to upstream, that's fine. I pull down the latest xfstests git and rebase
prior to sending a new patch series, so I'll absorb whatever you change. :)

When I'm getting ready to do another big release, I inquire with the
maintainers whether they're about to push commits upstream, to avoid the
race: post patches -> upstream push -> rebase patches -> repost patches.

> And the review of changes to xfs/122 needs help from other XFS
> developers :) (09/20 and 10/20)

09/20 (remove rmapx cruft) should be pretty straightforward, since I
withdrew 'rmapx' and related changes from xfs.

10/20 (new log items) will probably remain outstanding for a while, since
those changes haven't really made it upstream yet.

> And besides the first 8 patches, 15/20 has been in upstream as well.

Oh, ok.

> Thanks,
> Eryu
>
> P.S.
> The failed tests I saw when testing with reflink-enabled kernel &
> xfsprogs:
>
> Failures: generic/054 generic/055 generic/108 generic/204 generic/356
> generic/357 xfs/004 xfs/096 xfs/122 xfs/293
>
> generic/108 generic/204 and xfs/004 are new failures compared to stock
> kernel and xfsprogs (kernel 4.7-rc5, xfsprogs 4.7-rc1).

I think I have fixes for some of those that will go out during the next
patchbomb. But thanks for the heads up, I'll have a look at a -g auto run
before I submit again.

--D

> Just FYI.
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 6 Jul 2016, at 22:41, Henk Slager wrote:
>
> On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz wrote:
>>
>>> On 6 Jul 2016, at 02:25, Henk Slager wrote:
>>>
>>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz wrote:
>>>>
>>>>> On 6 Jul 2016, at 00:30, Henk Slager wrote:
>>>>>
>>>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz wrote:
>>>>>
>>>>>> I did consider that, but:
>>>>>> - some files were NOT accessed by anything with 100% certainty (well,
>>>>>> if there is a rootkit on my system or something of that shape, then
>>>>>> maybe yes)
>>>>>> - the only application that could access those files is totem (well,
>>>>>> Nautilus checks the extension -> directs it to totem), so in that case
>>>>>> we would hear about an outbreak of totem killing people's files.
>>>>>> - if it was a kernel bug, then other large files would be affected.
>>>>>>
>>>>>> Maybe I'm wrong and it's actually related to the fact that all those
>>>>>> files are located in a single location on the file system (a single
>>>>>> folder) that might have a historical bug in some structure somewhere?
>>>>>
>>>>> I find it hard to imagine that this has something to do with the
>>>>> folder structure, unless maybe the folder is a subvolume with
>>>>> non-default attributes or so. How the files in that folder are created
>>>>> (at full disk transfer speed, or over a day or even a week) might give
>>>>> some hint. You could run filefrag and see if that rings a bell.
>>>>
>>>> files that are 4096 show:
>>>> 1 extent found
>>>
>>> I actually meant filefrag for the files that are not (yet) truncated
>>> to 4k. For example for virtual machine image files (CoW), one could see
>>> an MBR write.
>> 117 extents found
>> filesize 15468645003
>>
>> good / bad ?
>
> 117 extents for a ~15G file is fine; with the -v option you could see the
> fragmentation at the start, but this won't lead to any hint why you
> have the truncate issue.
>
>>>> I did forget to add that the file system was created a long time ago,
>>>> and it was created with leaf & node size = 16k.
>>>
>>> If this long time ago is >2 years then you have likely specifically
>>> set node size = 16k, otherwise with older tools it would have been 4K.
>>
>> You are right, I used -l 16K -n 16K
>>
>>> Have you created it as raid10 or has it undergone profile conversions?
>>
>> Due to lack of spare disks
>> (it may sound odd to some, but spending on more than 6 disks for home use
>> seems like overkill)
>> and due to the last issue I've had, I had to migrate all data to a new
>> file system. This played out this way:
>> 1. from the original FS I removed 2 disks
>> 2. created RAID1 on those 2 disks,
>> 3. shifted 2TB
>> 4. removed 2 disks from the source FS and added those to the destination FS
>> 5. shifted 2 further TB
>> 6. destroyed the original FS and added 2 disks to the destination FS
>> 7. converted the destination FS to RAID10
>>
>> FYI, when I convert to raid10 I use:
>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f /path/to/FS
>>
>> this filesystem has 5 subvolumes. Affected files are located in a
>> separate folder within a "victim folder" that is within one subvolume.
>>
>>> It could also be that the on-disk format is somewhat corrupted (btrfs
>>> check should find that) and that that causes the issue.
>>
>> root@noname_server:/mnt# btrfs check /dev/sdg1
>> Checking filesystem on /dev/sdg1
>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>> checking extents
>> checking free space cache
>> checking fs roots
>> checking csums
>> checking root refs
>> found 4424060642634 bytes used err is 0
>> total csum bytes: 4315954936
>> total tree bytes: 4522786816
>> total fs tree bytes: 61702144
>> total extent tree bytes: 41402368
>> btree space waste bytes: 72430813
>> file data blocks allocated: 4475917217792
>> referenced 4420407603200
>>
>> No luck there :/
>
> Indeed looks all normal.
>
>>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>>> time; it happened over a year ago with kernels recent at that time,
>>> but the fs was converted from raid5.
>>
>> Could you please elaborate on that? You also ended up with files that
>> got truncated to 4096 bytes?
>
> I did not have truncated-to-4k files, but your case lets me think of
> small-file inlining. The default max_inline mount option is 8k, and that
> means that 0 to ~3k files end up in metadata. I had size corruptions
> for several of those small-sized files that were updated quite
> frequently, also within commit time AFAIK. Btrfs check lists this as
> errors 400, although fs operation is not disturbed. I don't know what
> happens if those small files are being updated/rewritten and are just
> below or just above the max_inline limit.
>
> The only thing I was
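For reference on the max_inline behaviour mentioned above, it is an ordinary
mount option; a minimal sketch (device and mount point are placeholders, and
the remount form assumes the option is remountable on this kernel):

    # cap inline extents at 2KiB, or disable inlining entirely with 0;
    # the default in kernels of this era is 8192, and with 4k nodes the
    # practical inline limit is the ~3k figure mentioned above
    mount -o max_inline=2048 /dev/sdX /mnt
    mount -o remount,max_inline=0 /mnt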
raid1 has failing disks, but smart is clear
Hi all,
    Hoping you all can help; I have a strange problem, think I know what's
going on, but could use some verification. I set up a raid1-type btrfs
filesystem on an Ubuntu 16.04 system; here's what it looks like:

btrfs fi show
Label: none  uuid: 597ee185-36ac-4b68-8961-d4adc13f95d4
        Total devices 10 FS bytes used 3.42TiB
        devid    1 size 1.82TiB used 1.18TiB path /dev/sdd
        devid    2 size 698.64GiB used 47.00GiB path /dev/sdk
        devid    3 size 931.51GiB used 280.03GiB path /dev/sdm
        devid    4 size 931.51GiB used 280.00GiB path /dev/sdl
        devid    5 size 1.82TiB used 1.17TiB path /dev/sdi
        devid    6 size 1.82TiB used 823.03GiB path /dev/sdj
        devid    7 size 698.64GiB used 47.00GiB path /dev/sdg
        devid    8 size 1.82TiB used 1.18TiB path /dev/sda
        devid    9 size 1.82TiB used 1.18TiB path /dev/sdb
        devid   10 size 1.36TiB used 745.03GiB path /dev/sdh

I added a couple of disks and then ran a balance operation, which took about
3 days to finish. When it did finish, I tried a scrub and got this message:

scrub status for 597ee185-36ac-4b68-8961-d4adc13f95d4
        scrub started at Sun Jun 26 18:19:28 2016 and was aborted after 01:16:35
        total bytes scrubbed: 926.45GiB with 18849935 errors
        error details: read=18849935
        corrected errors: 5860, uncorrectable errors: 18844075, unverified errors: 0

So that seems bad. Took a look at the devices, and a few of them have errors:
...
[/dev/sdi].generation_errs  0
[/dev/sdj].write_io_errs    289436740
[/dev/sdj].read_io_errs     289492820
[/dev/sdj].flush_io_errs    12411
[/dev/sdj].corruption_errs  0
[/dev/sdj].generation_errs  0
[/dev/sdg].write_io_errs    0
...
[/dev/sda].generation_errs  0
[/dev/sdb].write_io_errs    3490143
[/dev/sdb].read_io_errs     111
[/dev/sdb].flush_io_errs    268
[/dev/sdb].corruption_errs  0
[/dev/sdb].generation_errs  0
[/dev/sdh].write_io_errs    5839
[/dev/sdh].read_io_errs     2188
[/dev/sdh].flush_io_errs    11
[/dev/sdh].corruption_errs  1
[/dev/sdh].generation_errs  16373

So I checked the SMART data for those disks; they seem perfect, no
reallocated sectors, no problems. But one thing I did notice is that they
are all WD Green drives. So I'm guessing that if they power down and get
reassigned to a new /dev/sd* letter, that could lead to data corruption. I
used idle3ctl to turn off the shutdown mode on all the green drives in the
system, but I'm having trouble getting the filesystem working without the
errors. I tried a 'check --repair' command on it, and it seems to find a lot
of verification errors, but it doesn't look like things are getting fixed. I
have all the data on it backed up on another system, so I can recreate this
if I need to. But here's what I want to know:

1. Am I correct about the issues with the WD Green drives? If they change
device nodes during disk operations, will that corrupt data?

2. If that is the case:
   a.) Is there any way I can stop the /dev/sd* device names from changing?
   Or can I set up the filesystem using UUIDs or something more solid? I
   googled about it, but found conflicting info.
   b.) Or, is there something else changing my drive devices? I have most of
   the drives on an LSI SAS 9201-16i card; is there something I need to do
   to make them fixed?
   c.) Or, is there a script or something I can use to figure out if the
   disks will change device names?
   d.) Or, if I wipe everything and rebuild, will the disks with the
   idle3ctl fix work now?

Regardless of whether or not it's a WD Green drive issue, should I just
wipefs all the disks and rebuild it? Is there any way to recover this?

Thanks for any help!
---
Corey
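For anyone following along, a sketch of checking for the WD Green
head-parking symptom and turning the idle timer off (device names are
placeholders; idle3ctl changes take effect after the drive is power-cycled):

    # a rapidly growing Load_Cycle_Count is the classic symptom
    smartctl -A /dev/sdX | grep -i load_cycle

    # read the current idle3 (head-park) timer, then disable it
    idle3ctl -g /dev/sdX
    idle3ctl -d /dev/sdX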
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
On Wed, Jul 6, 2016 at 2:20 PM, Tomasz Kusmierz wrote:
>
>> On 6 Jul 2016, at 02:25, Henk Slager wrote:
>>
>> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz wrote:
>>>
>>>> On 6 Jul 2016, at 00:30, Henk Slager wrote:
>>>>
>>>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz wrote:
>>>>
>>>>> I did consider that, but:
>>>>> - some files were NOT accessed by anything with 100% certainty (well,
>>>>> if there is a rootkit on my system or something of that shape, then
>>>>> maybe yes)
>>>>> - the only application that could access those files is totem (well,
>>>>> Nautilus checks the extension -> directs it to totem), so in that case
>>>>> we would hear about an outbreak of totem killing people's files.
>>>>> - if it was a kernel bug, then other large files would be affected.
>>>>>
>>>>> Maybe I'm wrong and it's actually related to the fact that all those
>>>>> files are located in a single location on the file system (a single
>>>>> folder) that might have a historical bug in some structure somewhere?
>>>>
>>>> I find it hard to imagine that this has something to do with the
>>>> folder structure, unless maybe the folder is a subvolume with
>>>> non-default attributes or so. How the files in that folder are created
>>>> (at full disk transfer speed, or over a day or even a week) might give
>>>> some hint. You could run filefrag and see if that rings a bell.
>>>
>>> files that are 4096 show:
>>> 1 extent found
>>
>> I actually meant filefrag for the files that are not (yet) truncated
>> to 4k. For example for virtual machine image files (CoW), one could see
>> an MBR write.
> 117 extents found
> filesize 15468645003
>
> good / bad ?

117 extents for a ~15G file is fine; with the -v option you could see the
fragmentation at the start, but this won't lead to any hint why you have the
truncate issue.

>>> I did forget to add that the file system was created a long time ago,
>>> and it was created with leaf & node size = 16k.
>>
>> If this long time ago is >2 years then you have likely specifically
>> set node size = 16k, otherwise with older tools it would have been 4K.
>
> You are right, I used -l 16K -n 16K
>
>> Have you created it as raid10 or has it undergone profile conversions?
>
> Due to lack of spare disks
> (it may sound odd to some, but spending on more than 6 disks for home use
> seems like overkill)
> and due to the last issue I've had, I had to migrate all data to a new
> file system. This played out this way:
> 1. from the original FS I removed 2 disks
> 2. created RAID1 on those 2 disks,
> 3. shifted 2TB
> 4. removed 2 disks from the source FS and added those to the destination FS
> 5. shifted 2 further TB
> 6. destroyed the original FS and added 2 disks to the destination FS
> 7. converted the destination FS to RAID10
>
> FYI, when I convert to raid10 I use:
> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f /path/to/FS
>
> this filesystem has 5 subvolumes. Affected files are located in a separate
> folder within a "victim folder" that is within one subvolume.
>
>> It could also be that the on-disk format is somewhat corrupted (btrfs
>> check should find that) and that that causes the issue.
>
> root@noname_server:/mnt# btrfs check /dev/sdg1
> Checking filesystem on /dev/sdg1
> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 4424060642634 bytes used err is 0
> total csum bytes: 4315954936
> total tree bytes: 4522786816
> total fs tree bytes: 61702144
> total extent tree bytes: 41402368
> btree space waste bytes: 72430813
> file data blocks allocated: 4475917217792
> referenced 4420407603200
>
> No luck there :/

Indeed looks all normal.

>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>> time; it happened over a year ago with kernels recent at that time,
>> but the fs was converted from raid5.
>
> Could you please elaborate on that? You also ended up with files that got
> truncated to 4096 bytes?

I did not have truncated-to-4k files, but your case lets me think of
small-file inlining. The default max_inline mount option is 8k, and that
means that 0 to ~3k files end up in metadata. I had size corruptions for
several of those small-sized files that were updated quite frequently, also
within commit time AFAIK. Btrfs check lists this as errors 400, although fs
operation is not disturbed. I don't know what happens if those small files
are being updated/rewritten and are just below or just above the max_inline
limit.

The only thing I was thinking of is that your files started as small, so
inline, then were extended to multi-GB. In the past, there were 'bad
extent/chunk type' issues, and it was suggested that the fs would have been
an ext4-converted one (which
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jul 6, 2016 at 1:15 PM, Austin S. Hemmelgarnwrote: > On 2016-07-06 14:45, Chris Murphy wrote: >> I think it's statistically 0 people changing this from default. It's >> people with drives that have no SCT ERC support, used in raid1+, who >> happen to stumble upon this very obscure work around to avoid link >> resets in the face of media defects. Rare. > > Not as much as you think, once someone has this issue, they usually put > preventative measures in place on any system where it applies. I'd be > willing to bet that most sysadmins at big companies like RedHat or Oracle > are setting this. SCT ERC yes. Changing the kernel's command timer? I think almost zero. >> Well they have link resets and their file system presumably face >> plants as a result of a pile of commands in the queue returning as >> unsuccessful. So they have premature death of their system, rather >> than it getting sluggish. This is a long standing indicator on Windows >> to just reinstall the OS and restore data from backups -> the user has >> an opportunity to freshen up user data backup, and the reinstallation >> and restore from backup results in freshly written sectors which is >> how bad sectors get fixed. The marginally bad sectors get new writes >> and now read fast (or fast enough), and the persistently bad sectors >> result in the drive firmware remapping to reserve sectors. >> >> The main thing in my opinion is less extension of drive life, as it is >> the user gets to use the system, albeit sluggish, to make a backup of >> their data rather than possibly losing it. > > The extension of the drive's lifetime is a nice benefit, but not what my > point was here. For people in this particular case, it will almost > certainly only make things better (although at first it may make performance > worse). I'm not sure why it makes performance worse. The options are, slower reads vs a file system that almost certainly face plants upon a link reset. >> Basically it's: >> >> For SATA and USB drives: >> >> if data redundant, then enable short SCT ERC time if supported, if not >> supported then extend SCSI command timer to 200; >> >> if data not redundant, then disable SCT ERC if supported, and extend >> SCSI command timer to 200. >> >> For SCSI (SAS most likely these days), keep things the same as now. >> But that's only because this is a rare enough configuration now I >> don't know if we really know the problems there. It may be that their >> error recovery in 7 seconds is massively better and more reliable than >> consumer drives over 180 seconds. > > I don't see why you would think this is not common. I was not clear. Single device SAS is probably not common. They're typically being used in arrays where data is redundant. Using such a drive with short error recovery as a single boot drive? Probably not that common. > Separately, USB gets _really_ complicated if you want to cover everything, > USB drives may or may not present as non-rotational, may or may not show up > as SATA or SCSI bridges (there are some of the more expensive flash drives > that actually use SSD controllers plus USB-SAT chips internally), if they do > show up as such, may or may not support the required commands (most don't, > but it's seemingly hit or miss which do). Yup. Well, do what we can instead of just ignoring the problem? They can still be polled for features including SCT ERC and if it's not supported or configurable then fallback to increasing the command timer. I'm not sure what else can be done anyway. 
The main obstacle is squaring the device capability (low level) with storage stack redundancy 0 or 1 (high level). Something has to be aware of both to get all devices ideally configured. >> Yep it's imperfect unless there's the proper cross communication >> between layers. There are some such things like hardware raid geometry >> that optionally poke through (when supported by hardware raid drivers) >> so that things like mkfs.xfs can automatically provide the right sunit >> swidth for optimized layout; which the device mapper already does >> automatically. So it could be done; it's just a matter of how big a >> problem it is to build, vs. just going with a new one-size-fits-all >> default command timer? > > The other problem though is that the existing things pass through > _read-only_ data, while this requires writable data to be passed through, > which leads to all kinds of complicated issues potentially. I'm aware. There are also plenty of bugs even if write were to pass through. I've encountered more drives than not which accept only one SCT ERC change per poweron. A 2nd change causes the drive to go offline and vanish off the bus. So no doubt this whole area is fragile enough that not even the drive, controller, and enclosure vendors are aware of where all the bodies are buried. What I think is fairly well established is that at least on Windows their lower level stuff including kernel
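To make the "poll for SCT ERC, else raise the timer" fallback concrete, here is a minimal sketch for the redundant-data case, assuming smartctl is installed, that it reports failure through its exit status when SCT ERC is unsupported, and that the per-device command timer lives at /sys/block/<dev>/device/timeout (the 7-second and 200-second values mirror the policy sketched above):

    #!/bin/bash
    # For each SATA disk: prefer a short in-drive error recovery window
    # (7.0s; smartctl takes the value in tenths of a second) when SCT ERC
    # is configurable; otherwise raise the kernel's command timer so the
    # drive's long internal recovery can finish before the link is reset.
    for dev in /dev/sd?; do
        name=${dev##*/}
        if smartctl -l scterc,70,70 "$dev" >/dev/null 2>&1; then
            echo "$name: SCT ERC set to 7 seconds"
        else
            echo 200 > "/sys/block/$name/device/timeout"
            echo "$name: no SCT ERC support; command timer raised to 200s"
        fi
    done

Per the policy above, a non-redundant single disk would instead want SCT ERC disabled (smartctl -l scterc,0,0) together with the long command timer.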
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 6, 2016 at 1:17 PM, Austin S. Hemmelgarn wrote:
> In bash or most other POSIX compliant shells, you can run this:
> echo $?
> to get the return code of the previous command.
>
> In your case though, it may be reporting the FS ready because it had already
> seen all the devices, IIUC, the flag that checks is only set once, and never
> unset, which is not a good design in this case.

Oh dear.

[root@f24s ~]# lvs
  LV         VG Attr       LSize  Pool       Origin Data%  Meta%  Move Log Cpy%Sync Convert
  1          VG Vwi---tz-- 50.00g thintastic
  2          VG Vwi---tz-- 50.00g thintastic
  3          VG Vwi-a-tz-- 50.00g thintastic        2.54
  thintastic VG twi-aotz-- 90.00g                   5.05   2.92
[root@f24s ~]# btrfs dev scan
Scanning for Btrfs filesystems
[root@f24s ~]# echo $?
0
[root@f24s ~]# btrfs device ready /dev/mapper/VG-3
[root@f24s ~]# echo $?
0
[root@f24s ~]# btrfs fi show
warning, device 2 is missing
Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
        Total devices 3 FS bytes used 2.26GiB
        devid    3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
        *** Some devices missing

Cute, device 1 is also missing but that's not mentioned. In any case, the device is still ready even after a dev scan. I guess this isn't exactly testable all that easily unless I reboot.
--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
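Since the command prints nothing either way, a script has to branch on the exit status; a minimal sketch, using the device path from the example above:

    btrfs device ready /dev/mapper/VG-3
    echo "exit status: $?"   # 0 = ready; nonzero = devices missing

    if btrfs device ready /dev/mapper/VG-3; then
        echo "kernel reports all devices present"
    else
        echo "devices still missing"
    fi

As the exchange above shows, though, the status stays 0 once the kernel has seen every device, even after some of them disappear again.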
Re: [PATCH 1/2] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
On Wed, Jul 06, 2016 at 06:37:52PM +0800, Wang Xiaoguang wrote:
> In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses
> wrong file offset for reloc_inode, it uses cluster->start and cluster->end,
> which indeed are extent's bytenr. The correct value should be
> cluster->[start|end] minus block group's start bytenr.
>
>     start bytenr   cluster->start
>     |              | extent | extent | ... | extent |
>     |-----------------------------------------------|
>     | block group                                   |
>     |              reloc_inode                      |
>
> Signed-off-by: Wang Xiaoguang
> ---
>  fs/btrfs/relocation.c | 27 +++++++++++++++------------
>  1 file changed, 15 insertions(+), 12 deletions(-)
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 0477dca..abc2f69 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3030,34 +3030,37 @@ int prealloc_file_extent_cluster(struct inode *inode,
>   u64 num_bytes;
>   int nr = 0;
>   int ret = 0;
> + u64 prealloc_start, prealloc_end;
>
>   BUG_ON(cluster->start != cluster->boundary[0]);
>   inode_lock(inode);
>
> - ret = btrfs_check_data_free_space(inode, cluster->start,
> -                                   cluster->end + 1 - cluster->start);
> + start = cluster->start - offset;
> + end = cluster->end - offset;
> + ret = btrfs_check_data_free_space(inode, start, end + 1 - start);
>   if (ret)
>           goto out;
>
>   while (nr < cluster->nr) {
> -         start = cluster->boundary[nr] - offset;
> +         prealloc_start = cluster->boundary[nr] - offset;
>           if (nr + 1 < cluster->nr)
> -                 end = cluster->boundary[nr + 1] - 1 - offset;
> +                 prealloc_end = cluster->boundary[nr + 1] - 1 - offset;
>           else
> -                 end = cluster->end - offset;
> +                 prealloc_end = cluster->end - offset;
>
> -         lock_extent(&BTRFS_I(inode)->io_tree, start, end);
> -         num_bytes = end + 1 - start;
> -         ret = btrfs_prealloc_file_range(inode, 0, start,
> +         lock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
> +                     prealloc_end);
> +         num_bytes = prealloc_end + 1 - prealloc_start;
> +         ret = btrfs_prealloc_file_range(inode, 0, prealloc_start,
>                                           num_bytes, num_bytes,
> -                                         end + 1, &alloc_hint);
> -         unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
> +                                         prealloc_end + 1, &alloc_hint);
> +         unlock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
> +                       prealloc_end);

Changing names is unnecessary, we can pick up other names for btrfs_{check/free}_data_free_space().

Thanks,

-liubo

>           if (ret)
>                   break;
>           nr++;
>   }
> - btrfs_free_reserved_data_space(inode, cluster->start,
> -                                cluster->end + 1 - cluster->start);
> + btrfs_free_reserved_data_space(inode, start, end + 1 - start);
>  out:
>   inode_unlock(inode);
>   return ret;
> --
> 2.9.0
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
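For context, prealloc_file_extent_cluster() runs as part of relocation, so the code path this patch touches can be exercised with a data balance; /mnt below is a placeholder mountpoint:

    # Balance rewrites data chunks through the relocation code,
    # including the preallocation path patched above.
    btrfs balance start -dusage=100 /mnt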
Re: 64-btrfs.rules and degraded boot
On 2016-07-06 14:23, Chris Murphy wrote: On Wed, Jul 6, 2016 at 12:04 PM, Austin S. Hemmelgarnwrote: On 2016-07-06 13:19, Chris Murphy wrote: On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkov wrote: 3) can we query btrfs whether it is mountable in degraded mode? according to documentation, "btrfs device ready" (which udev builtin follows) checks "if it has ALL of it’s devices in cache for mounting". This is required for proper systemd ordering of services. Where does udev builtin use btrfs itself? I see "btrfs ready $device" which is not a valid btrfs user space command. I never get any errors from "btrfs device ready" even when too many devices are missing. I don't know what it even does or if it's broken. This is a three device raid1 where I removed 2 devices and "btrfs device ready" does not complain, it always returns silent for me no matter what. It's been this way for years as far as I know. [root@f24s ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert 1 VG Vwi-a-tz-- 50.00g thintastic2.55 2 VG Vwi-a-tz-- 50.00g thintastic4.00 3 VG Vwi-a-tz-- 50.00g thintastic2.54 thintastic VG twi-aotz-- 90.00g 5.05 2.92 [root@f24s ~]# btrfs fi show Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 Total devices 3 FS bytes used 2.26GiB devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1 devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2 devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 [root@f24s ~]# btrfs device ready /dev/mapper/VG-1 [root@f24s ~]# [root@f24s ~]# lvchange -an VG/1 [root@f24s ~]# lvchange -an VG/2 [root@f24s ~]# btrfs dev scan Scanning for Btrfs filesystems [root@f24s ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert 1 VG Vwi---tz-- 50.00g thintastic 2 VG Vwi---tz-- 50.00g thintastic 3 VG Vwi-a-tz-- 50.00g thintastic2.54 thintastic VG twi-aotz-- 90.00g 5.05 2.92 [root@f24s ~]# btrfs fi show warning, device 2 is missing Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 Total devices 3 FS bytes used 2.26GiB devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 *** Some devices missing [root@f24s ~]# btrfs device ready /dev/mapper/VG-3 [root@f24s ~]# You won't get any output from it regardless, you have to check the return code as it's intended to be a tool for scripts and such. How do I check the return code? When I use strace, no matter what I'm getting +++ exited with 0 +++ I see both 'brfs device ready' and the udev btrfs builtin test are calling BTRFS_IOC_DEVICES_READY so, it looks like udev is not using user space tools to check but rather a btrfs ioctl. So clearly that works or I wouldn't have stalled boots when all devices aren't present. In bash or most other POSIX compliant shells, you can run this: echo $? to get the return code of the previous command. In your case though, it may be reporting the FS ready because it had already seen all the devices, IIUC, the flag that checks is only set once, and never unset, which is not a good design in this case. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adventures in btrfs raid5 disk recovery
On 2016-07-06 14:45, Chris Murphy wrote: On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarnwrote: On 2016-07-06 12:43, Chris Murphy wrote: So does it make sense to just set the default to 180? Or is there a smarter way to do this? I don't know. Just thinking about this: 1. People who are setting this somewhere will be functionally unaffected. I think it's statistically 0 people changing this from default. It's people with drives that have no SCT ERC support, used in raid1+, who happen to stumble upon this very obscure work around to avoid link resets in the face of media defects. Rare. Not as much as you think, once someone has this issue, they usually put preventative measures in place on any system where it applies. I'd be willing to bet that most sysadmins at big companies like RedHat or Oracle are setting this. 2. People using single disks which have lots of errors may or may not see an apparent degradation of performance, but will likely have the life expectancy of their device extended. Well they have link resets and their file system presumably face plants as a result of a pile of commands in the queue returning as unsuccessful. So they have premature death of their system, rather than it getting sluggish. This is a long standing indicator on Windows to just reinstall the OS and restore data from backups -> the user has an opportunity to freshen up user data backup, and the reinstallation and restore from backup results in freshly written sectors which is how bad sectors get fixed. The marginally bad sectors get new writes and now read fast (or fast enough), and the persistently bad sectors result in the drive firmware remapping to reserve sectors. The main thing in my opinion is less extension of drive life, as it is the user gets to use the system, albeit sluggish, to make a backup of their data rather than possibly losing it. The extension of the drive's lifetime is a nice benefit, but not what my point was here. For people in this particular case, it will almost certainly only make things better (although at first it may make performance worse). 3. Individuals who are not setting this but should be will on average be no worse off than before other than seeing a bigger performance hit on a disk error. 4. People with single disks which are new will see no functional change until the disk has an error. I follow. In an ideal situation, what I'd want to see is: 1. If the device supports SCT ERC, set scsi_command_timer to reasonable percentage over that (probably something like 25%, which would give roughly 10 seconds for the normal 7 second ERC timer). 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC< this is reasonable for SCSI disks). 3. Otherwise, set the timer to 200 (we need a slight buffer over the expected disk timeout to account for things like latency outside of the disk). Well if it's a non-redundant configuration, you'd want those long recoveries permitted, rather than enable SCT ERC. The drive has the ability to relocate sector data on a marginal (slow) read that's still successful. But clearly many manufacturers tolerate slow reads that don't result in immediate reallocation or overwrite or we wouldn't be in this situation in the first place. I think this auto reallocation is thwarted by enabling SCT ERC. It just flat out gives up and reports a read error. So it is still data loss in the non-redundant configuration and thus not an improvement. 
I agree, but if it's only the kernel doing this, then we can't make judgements based on userspace usage. Also, the first situation while not optimal is still better than what happens now, at least there you will get an I/O error in a reasonable amount of time (as opposed to after a really long time if ever). Basically it's: For SATA and USB drives: if data redundant, then enable short SCT ERC time if supported, if not supported then extend SCSI command timer to 200; if data not redundant, then disable SCT ERC if supported, and extend SCSI command timer to 200. For SCSI (SAS most likely these days), keep things the same as now. But that's only because this is a rare enough configuration now I don't know if we really know the problems there. It may be that their error recovery in 7 seconds is massively better and more reliable than consumer drives over 180 seconds. I don't see why you would think this is not common. If you count just by systems, then it's absolutely outnumbered at least 100 to 1 by regular ATA disks. If you look at individual disks though, the reverse is true, because people who use SCSI drives tend to use _lots_ of disks (think big data centers, NAS and SAN systems and such). OTOH, both are probably vastly outnumbered by stuff that doesn't use either standard for storage... Separately, USB gets _really_ complicated if you want to cover everything, USB drives may or may not present as non-rotational, may or may not show
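A udev rule is one place a command-timer policy could live; a hedged sketch of what such a rule might look like for rotational SATA disks follows (the file name, match keys, and 180-second value are illustrative, not an agreed-upon rule):

    # /etc/udev/rules.d/60-disk-timeout.rules (illustrative sketch)
    # Give rotational disks without fast in-drive error recovery enough
    # time to finish their own recovery before the kernel resets the link.
    ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", \
      ATTR{queue/rotational}=="1", ATTR{device/timeout}="180"

A real rule would also need to avoid clobbering drives where SCT ERC has been deliberately set short, which is exactly the layering problem this thread keeps running into.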
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 6, 2016 at 12:24 PM, Andrei Borzenkov wrote: > On Wed, Jul 6, 2016 at 8:19 PM, Chris Murphy wrote: >> >> I'm mainly concerned with rootfs. And I'm mainly concerned with a very >> simple 2 disk raid1. With a simple user opt in using >> rootflags=degraded, it should be possible to boot the system. Right >> now it's not possible. Maybe just deleting 64-btrfs.rules would fix >> this problem, I haven't tried it. >> > > While deleting this rule will fix your specific degraded 2 disk raid 1, > it will break non-degraded multi-device filesystems. The logic currently > implemented by systemd assumes that mount is called after > prerequisites have been fulfilled. Deleting this rule will call mount > as soon as the very first device is seen; such a filesystem is obviously > not mountable.

Seems like we need more granularity from the btrfs device-ready ioctl, e.g. some way to indicate:

0 - all devices ready
1 - devices not ready (don't even try to mount)
2 - minimum devices ready (degraded mount possible)

Btrfs multiple-device single and raid0 would only ever return code 0 or 1, whereas raid1, 5, and 6 could return code 2. The systemd default policy for code 2 could be to wait some amount of time to see if the state goes to 0. At the timeout, try to mount anyway. If rootflags=degraded, it mounts. If not, the mount fails, and we get a dracut prompt. That's better behavior than now.

> Equivalent of this rule is required under systemd and desired in > general to avoid polling. On systemd list I outlined possible > alternative implementation as systemd service instead of really > hackish udev rule.

I'll go read it there. Thanks.
--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
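To make the proposal concrete, here is a sketch of how an initramfs helper might consume such a three-way code, assuming a hypothetical future 'btrfs device ready' that exits 2 for "degraded mount possible" (today it only exits 0 or 1), and with the rootflags check simplified to a grep:

    btrfs device ready /dev/mapper/VG-3
    case $? in
        0)  mount /dev/mapper/VG-3 /sysroot ;;
        2)  sleep 30   # grace period for late-arriving devices
            if grep -q 'rootflags=degraded' /proc/cmdline; then
                mount -o degraded /dev/mapper/VG-3 /sysroot
            fi ;;
        *)  echo "btrfs devices not ready" >&2 ;;
    esac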
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jul 6, 2016 at 11:18 AM, Austin S. Hemmelgarnwrote: > On 2016-07-06 12:43, Chris Murphy wrote: >> So does it make sense to just set the default to 180? Or is there a >> smarter way to do this? I don't know. > > Just thinking about this: > 1. People who are setting this somewhere will be functionally unaffected. I think it's statistically 0 people changing this from default. It's people with drives that have no SCT ERC support, used in raid1+, who happen to stumble upon this very obscure work around to avoid link resets in the face of media defects. Rare. > 2. People using single disks which have lots of errors may or may not see an > apparent degradation of performance, but will likely have the life > expectancy of their device extended. Well they have link resets and their file system presumably face plants as a result of a pile of commands in the queue returning as unsuccessful. So they have premature death of their system, rather than it getting sluggish. This is a long standing indicator on Windows to just reinstall the OS and restore data from backups -> the user has an opportunity to freshen up user data backup, and the reinstallation and restore from backup results in freshly written sectors which is how bad sectors get fixed. The marginally bad sectors get new writes and now read fast (or fast enough), and the persistently bad sectors result in the drive firmware remapping to reserve sectors. The main thing in my opinion is less extension of drive life, as it is the user gets to use the system, albeit sluggish, to make a backup of their data rather than possibly losing it. > 3. Individuals who are not setting this but should be will on average be no > worse off than before other than seeing a bigger performance hit on a disk > error. > 4. People with single disks which are new will see no functional change > until the disk has an error. I follow. > > In an ideal situation, what I'd want to see is: > 1. If the device supports SCT ERC, set scsi_command_timer to reasonable > percentage over that (probably something like 25%, which would give roughly > 10 seconds for the normal 7 second ERC timer). > 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC< > this is reasonable for SCSI disks). > 3. Otherwise, set the timer to 200 (we need a slight buffer over the > expected disk timeout to account for things like latency outside of the > disk). Well if it's a non-redundant configuration, you'd want those long recoveries permitted, rather than enable SCT ERC. The drive has the ability to relocate sector data on a marginal (slow) read that's still successful. But clearly many manufacturers tolerate slow reads that don't result in immediate reallocation or overwrite or we wouldn't be in this situation in the first place. I think this auto reallocation is thwarted by enabling SCT ERC. It just flat out gives up and reports a read error. So it is still data loss in the non-redundant configuration and thus not an improvement. Basically it's: For SATA and USB drives: if data redundant, then enable short SCT ERC time if supported, if not supported then extend SCSI command timer to 200; if data not redundant, then disable SCT ERC if supported, and extend SCSI command timer to 200. For SCSI (SAS most likely these days), keep things the same as now. But that's only because this is a rare enough configuration now I don't know if we really know the problems there. 
It may be that their error recovery in 7 seconds is massively better and more reliable than consumer drives over 180 seconds. > >> >> I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer (?) So there are quite a few parties now negatively impacted by the current default behavior. >>> >>> >>> OTOH, I would not be surprised if the stance there is 'you get no support >>> if >>> your not using enterprise drives', not because of the project itself, but >>> because it's ZFS. Part of their minimum recommended hardware >>> requirements >>> is ECC RAM, so it wouldn't surprise me if enterprise storage devices are >>> there too. >> >> >> http://open-zfs.org/wiki/Hardware >> "Consistent performance requires hard drives that support error >> recovery control. " >> >> "Drives that lack such functionality can be expected to have >> arbitrarily high limits. Several minutes is not impossible. Drives >> with this functionality typically default to 7 seconds. ZFS does not >> currently adjust this setting on drives. However, it is advisable to >> write a script to set the error recovery time to a low value, such as >> 0.1 seconds until ZFS is modified to control it. This must be done on >> every boot. " >> >> They do not explicitly require enterprise drives, but they clearly >> expect SCT ERC enabled to some sane value. >> >> At least for Btrfs and ZFS, the mkfs is in a position to know all >>
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 6, 2016 at 9:23 PM, Chris Murphy wrote:
>>> [root@f24s ~]# btrfs fi show
>>> warning, device 2 is missing
>>> Label: none  uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7
>>>         Total devices 3 FS bytes used 2.26GiB
>>>         devid    3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3
>>>         *** Some devices missing
>>>
>>> [root@f24s ~]# btrfs device ready /dev/mapper/VG-3
>>> [root@f24s ~]#
>>
>> You won't get any output from it regardless, you have to check the return
>> code as it's intended to be a tool for scripts and such.
>
> How do I check the return code? When I use strace, no matter what I'm getting
>
> +++ exited with 0 +++
>
> I see both 'btrfs device ready' and the udev btrfs builtin test are
> calling BTRFS_IOC_DEVICES_READY so, it looks like udev is not using
> user space tools to check but rather a btrfs ioctl.

Correct. It is possible that the ioctl returns the correct result only the very first time; notice that in your example btrfs had already seen all the other devices at least once, while at boot it is really the case that the other devices are missing so far. Which returns us to the question - how can we reliably query the kernel about the mountability of a filesystem?

> So clearly that
> works or I wouldn't have stalled boots when all devices aren't
> present.
>
> --
> Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 6, 2016 at 8:19 PM, Chris Murphy wrote:
>
> I'm mainly concerned with rootfs. And I'm mainly concerned with a very
> simple 2 disk raid1. With a simple user opt in using
> rootflags=degraded, it should be possible to boot the system. Right
> now it's not possible. Maybe just deleting 64-btrfs.rules would fix
> this problem, I haven't tried it.
>

While deleting this rule will fix your specific degraded 2-disk raid1, it will break non-degraded multi-device filesystems. The logic currently implemented by systemd assumes that mount is called only after its prerequisites have been fulfilled. Deleting this rule will call mount as soon as the very first device is seen; such a filesystem is obviously not mountable.

An equivalent of this rule is required under systemd, and desired in general to avoid polling. On the systemd list I outlined a possible alternative implementation as a systemd service instead of the really hackish udev rule.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
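Andrei's actual proposal lives on the systemd list and is not reproduced here, but the general shape of a polling service (the thing the udev rule exists to avoid) would be something like the following sketch, where the unit name, targets, and timeout are all illustrative:

    # btrfs-ready@.service (illustrative sketch only, not the proposal)
    [Unit]
    Description=Wait for all btrfs devices backing %f
    DefaultDependencies=no
    Before=local-fs-pre.target

    [Service]
    Type=oneshot
    # Poll the kernel until it reports every member device present.
    ExecStart=/bin/sh -c 'until btrfs device ready %f; do sleep 1; done'
    TimeoutStartSec=90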
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 6, 2016 at 12:04 PM, Austin S. Hemmelgarnwrote: > On 2016-07-06 13:19, Chris Murphy wrote: >> >> On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkov >> wrote: >>> >>> 3) can we query btrfs whether it is mountable in degraded mode? >>> according to documentation, "btrfs device ready" (which udev builtin >>> follows) checks "if it has ALL of it’s devices in cache for mounting". >>> This is required for proper systemd ordering of services. >> >> >> Where does udev builtin use btrfs itself? I see "btrfs ready $device" >> which is not a valid btrfs user space command. >> >> I never get any errors from "btrfs device ready" even when too many >> devices are missing. I don't know what it even does or if it's broken. >> >> This is a three device raid1 where I removed 2 devices and "btrfs >> device ready" does not complain, it always returns silent for me no >> matter what. It's been this way for years as far as I know. >> >> [root@f24s ~]# lvs >> LV VG Attr LSize Pool Origin Data% Meta% Move >> Log Cpy%Sync Convert >> 1 VG Vwi-a-tz-- 50.00g thintastic2.55 >> 2 VG Vwi-a-tz-- 50.00g thintastic4.00 >> 3 VG Vwi-a-tz-- 50.00g thintastic2.54 >> thintastic VG twi-aotz-- 90.00g 5.05 2.92 >> [root@f24s ~]# btrfs fi show >> Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 >> Total devices 3 FS bytes used 2.26GiB >> devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1 >> devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2 >> devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 >> >> [root@f24s ~]# btrfs device ready /dev/mapper/VG-1 >> [root@f24s ~]# >> [root@f24s ~]# lvchange -an VG/1 >> [root@f24s ~]# lvchange -an VG/2 >> [root@f24s ~]# btrfs dev scan >> Scanning for Btrfs filesystems >> [root@f24s ~]# lvs >> LV VG Attr LSize Pool Origin Data% Meta% Move >> Log Cpy%Sync Convert >> 1 VG Vwi---tz-- 50.00g thintastic >> 2 VG Vwi---tz-- 50.00g thintastic >> 3 VG Vwi-a-tz-- 50.00g thintastic2.54 >> thintastic VG twi-aotz-- 90.00g 5.05 2.92 >> [root@f24s ~]# btrfs fi show >> warning, device 2 is missing >> Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 >> Total devices 3 FS bytes used 2.26GiB >> devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 >> *** Some devices missing >> >> [root@f24s ~]# btrfs device ready /dev/mapper/VG-3 >> [root@f24s ~]# > > You won't get any output from it regardless, you have to check the return > code as it's intended to be a tool for scripts and such. How do I check the return code? When I use strace, no matter what I'm getting +++ exited with 0 +++ I see both 'brfs device ready' and the udev btrfs builtin test are calling BTRFS_IOC_DEVICES_READY so, it looks like udev is not using user space tools to check but rather a btrfs ioctl. So clearly that works or I wouldn't have stalled boots when all devices aren't present. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to mount degraded RAID5
On Wed, Jul 6, 2016 at 11:12 AM, Gonzalo Gomez-Arrue Azpiazu wrote:
> Hello,
>
> I had a RAID5 with 3 disks and one failed; now the filesystem cannot be
> mounted.
>
> None of the recommendations that I found seem to work. The situation
> seems to be similar to this one:
> http://www.spinics.net/lists/linux-btrfs/msg56825.html
>
> Any suggestion on what to try next?

Basically if you are degraded *and* it runs into additional errors, then it's broken, because raid5 only protects against one device error. The main problem is that if it can't read the chunk root, it's hard for any tool to recover data, because the chunk tree mapping is vital to finding data. What do you get for:

    btrfs rescue super-recover -v /dev/sdc1

It's a problem with the chunk tree because all of your super blocks point to the same chunk tree root, so there isn't another one to try.

> sudo btrfs-find-root /dev/sdc1
> warning, device 2 is missing
> Couldn't read chunk root
> Open ctree failed

It's bad news. I'm not even sure 'btrfs restore' can help this case.
--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to mount degraded RAID5
On Wed, Jul 6, 2016 at 11:50 AM, Tomáš Hrdinawrote: > sudo mount -o ro /dev/sdc /shares > mount: wrong fs type, bad option, bad superblock on /dev/sdc, >missing codepage or helper program, or other error > >In some cases useful info is found in syslog - try >dmesg | tail or so. > > > sudo mount -o ro,recovery /dev/sdc /shares > mount: wrong fs type, bad option, bad superblock on /dev/sdc, >missing codepage or helper program, or other error > >In some cases useful info is found in syslog - try >dmesg | tail or so. [ 275.688919] BTRFS error (device sda): parent transid verify failed on 7008533413888 wanted 70175 found 70132 Looks like the generation is too far back for backup roots. Just for grins, now that all drives are present, what do you get for # btrfs rescue super-recover -v /dev/sda Next I suggest btrfs-image -c9 -t4 and optionally -s to sanitize file names. And also btrfs-debug-tree (this time no -d) redirected to a file. These two files can be big, about the size of the used amount of metadata chunks. These go in the cloud at some point, reference them in a bugzilla.kernel.org bug report by URL. Expect it to be months before a dev looks at it. So now what you want to try to do is use restore. https://btrfs.wiki.kernel.org/index.php/Restore You can use the information from btrfs-find-root to give restore a -t value to try. For example: >Found tree root at 6062830010368 gen 70182 level 1 >Well block 6062434418688(gen: 70181 level: 1) seems good, but >generation/level doesn't match, want gen: 70182 level: 1 >Well block 6062497202176(gen: 69186 level: 0) seems good, but >generation/level doesn't match, want gen: 70182 level: 1 >Well block 6062470332416(gen: 69186 level: 0) seems good, but >generation/level doesn't match, want gen: 70182 level: 1 btrfs restore -t 6062830010368 -v -i /dev/sda If that fails totally you can try the next bytenr, for the -t value, 6062434418688. And then the next. Each value down is going backward in time, so it implies some data loss. This is not the end. It's just that it's the safest since no changes to the fs have happened. If you set up some kind of overlay you can be more aggressive like going right for btrfs check --repair and seeing if it can fix things, but without the overlay it's possible to totally break the fs such that even restore won't work. Once you pretty much have everything important off the volume, you can get more aggressive with trying to fix it. OR just blow it away and start over. But I think it's valid to gather as much information about the file system and try to fix it because the autopsy is the main way to make Btrfs better. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
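The overlay Chris mentions can be built with a non-persistent device-mapper snapshot backed by a sparse file, so destructive repair attempts hit a throwaway copy-on-write layer instead of the disk. A sketch follows; the device name and the 4G size are examples, and every member of a multi-device filesystem needs its own overlay:

    # Sparse file absorbs all writes made through the overlay.
    truncate -s 4G /tmp/sda-cow
    cow=$(losetup --show -f /tmp/sda-cow)
    # Table: start length snapshot <origin> <cow-dev> N(on-persistent) <chunk>
    echo "0 $(blockdev --getsz /dev/sda) snapshot /dev/sda $cow N 8" |
        dmsetup create sda-overlay
    # Aggressive steps like 'btrfs check --repair' can now target
    # /dev/mapper/sda-overlay without touching /dev/sda itself.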
Re: 64-btrfs.rules and degraded boot
On 2016-07-06 13:19, Chris Murphy wrote: On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkovwrote: 3) can we query btrfs whether it is mountable in degraded mode? according to documentation, "btrfs device ready" (which udev builtin follows) checks "if it has ALL of it’s devices in cache for mounting". This is required for proper systemd ordering of services. Where does udev builtin use btrfs itself? I see "btrfs ready $device" which is not a valid btrfs user space command. I never get any errors from "btrfs device ready" even when too many devices are missing. I don't know what it even does or if it's broken. This is a three device raid1 where I removed 2 devices and "btrfs device ready" does not complain, it always returns silent for me no matter what. It's been this way for years as far as I know. [root@f24s ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert 1 VG Vwi-a-tz-- 50.00g thintastic2.55 2 VG Vwi-a-tz-- 50.00g thintastic4.00 3 VG Vwi-a-tz-- 50.00g thintastic2.54 thintastic VG twi-aotz-- 90.00g 5.05 2.92 [root@f24s ~]# btrfs fi show Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 Total devices 3 FS bytes used 2.26GiB devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1 devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2 devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 [root@f24s ~]# btrfs device ready /dev/mapper/VG-1 [root@f24s ~]# [root@f24s ~]# lvchange -an VG/1 [root@f24s ~]# lvchange -an VG/2 [root@f24s ~]# btrfs dev scan Scanning for Btrfs filesystems [root@f24s ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert 1 VG Vwi---tz-- 50.00g thintastic 2 VG Vwi---tz-- 50.00g thintastic 3 VG Vwi-a-tz-- 50.00g thintastic2.54 thintastic VG twi-aotz-- 90.00g 5.05 2.92 [root@f24s ~]# btrfs fi show warning, device 2 is missing Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 Total devices 3 FS bytes used 2.26GiB devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 *** Some devices missing [root@f24s ~]# btrfs device ready /dev/mapper/VG-3 [root@f24s ~]# You won't get any output from it regardless, you have to check the return code as it's intended to be a tool for scripts and such. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to mount degraded RAID5
sudo mount -o ro /dev/sdc /shares mount: wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so. sudo mount -o ro,recovery /dev/sdc /shares mount: wrong fs type, bad option, bad superblock on /dev/sdc, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so. dmesg http://sebsauvage.net/paste/?04d1162dc44d7e55#uY0kIaX66o7Kh+TZAGK2T+CKdRk2jorIWM3w5gfXp8I= Do you want any other log to see? For all 3 disks: sudo smartctl -l scterc,70,70 /dev/sdx smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build) Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org SCT Error Recovery Control set to: Read: 70 (7.0 seconds) Write: 70 (7.0 seconds) Thank you Tomas *From:* Chris Murphy *Sent:* Wednesday, July 06, 2016 6:08PM *To:* Tomáš Hrdina *Cc:* Chris Murphy, Btrfs Btrfs *Subject:* Re: Unable to mount degraded RAID5 On Wed, Jul 6, 2016 at 2:07 AM, Tomáš Hrdinawrote: > Now with 3 disks: > > sudo btrfs check /dev/sda > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > checksum verify failed on 7008807157760 found F192848C wanted 1571393A > checksum verify failed on 7008807157760 found F192848C wanted 1571393A > bytenr mismatch, want=7008807157760, have=65536 > Checking filesystem on /dev/sda > UUID: 2dab74bb-fc73-4c47-a413-a55840f6f71e > checking extents > parent transid verify failed on 7009468874752 wanted 70180 found 70133 > parent transid verify failed on 7009468874752 wanted 70180 found 70133 > checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC > checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC > bytenr mismatch, want=7009468874752, have=65536 > parent transid verify failed on 7008859045888 wanted 70175 found 70133 > parent transid verify failed on 7008859045888 wanted 70175 found 70133 > checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91 > checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91 > bytenr mismatch, want=7008859045888, have=65536 > parent transid verify failed on 7008899547136 wanted 70175 found 70133 > parent transid verify failed on 7008899547136 wanted 70175 found 70133 > checksum verify failed on 7008899547136 found 2B6F9045 wanted CF8C2DF3 > parent transid verify failed on 7008899547136 wanted 70175 found 70133 > Ignoring transid failure > leaf parent key incorrect 7008899547136 > bad block 7008899547136 > Errors found in extent allocation tree or chunk allocation > parent transid verify failed on 7009074167808 wanted 70175 found 70133 > parent transid verify failed on 7009074167808 wanted 70175 found 70133 > checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46 > checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46 > bytenr mismatch, want=7009074167808, have=65536 Ok much better than before, these all seem sane with a limited number of problems. Maybe --repair can fix it, but don't do that yet. > sudo btrfs-debug-tree -d /dev/sdc > http://sebsauvage.net/paste/?d690b2c9d130008d#cni3fnKUZ7Y/oaXm+nsOw0afoWDFXNl26eC+vbJmcRA= OK good, so now it finds the chunk tree OK. This is good news. I would try to mount it ro first, if you need to make or refresh a backup. 
So in order: mount -o ro mount -o ro,recovery If those don't work lets see what the user and kernel errors are. > >> > sudo btrfs-find-root /dev/sdc > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > Superblock thinks the generation is 70182 > Superblock thinks the level is 1 > Found tree root at 6062830010368 gen 70182 level 1 > Well block 6062434418688(gen: 70181 level: 1) seems good, but > generation/level doesn't match, want gen: 70182 level: 1 > Well block 6062497202176(gen: 69186 level: 0) seems good, but > generation/level doesn't match, want gen: 70182 level: 1 > Well block 6062470332416(gen: 69186 level: 0) seems good, but > generation/level doesn't match, want gen: 70182 level: 1 This is also a good sign that you can probably get btrfs rescue to work and point it to one of these older tree roots, if mount won't work. > >> > sudo smartctl -l scterc /dev/sda > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build) > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org > > SCT Error Recovery Control: >Read: Disabled > Write: Disabled > >> > sudo smartctl -l scterc /dev/sdb > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Re: Out of space error even though there's 100 GB unused?
On Wed, Jul 6, 2016 at 3:55 AM, Stanislaw Kaminski wrote:
> Device unallocated: 97.89GiB

There should be no problem creating any type of block group from this much space. It's a bug. I would try regression testing. Kernel 4.5.7 has some changes that may or may not relate to this (they should only relate when there is no unallocated space left) so you could try 4.5.6 and 4.5.7. And also 4.4.14. But also the kernel messages are important. There is this obscure enospc with error -28, so trying 4.6.3 both with and without the enospc_debug mount option is useful (I think it's less useful in older kernels). But do try nospace_cache first. If that works, you could then mount with clear_cache one time and see if that provides an enduring fix. It can take some time to rebuild the cache after clear_cache is used.
--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
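The suggested sequence, with the device and mountpoint as placeholders, would be roughly:

    mount -o nospace_cache /dev/sdb /mnt   # test: does the ENOSPC go away?
    umount /mnt
    mount -o clear_cache /dev/sdb /mnt     # one-time cache rebuild; can be slow
    # later mounts can return to plain defaults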
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
> On Jul 6, 2016, at 10:33 AM, Joerg Schilling wrote:
>
> "Austin S. Hemmelgarn" wrote:
>
>> On 2016-07-06 11:22, Joerg Schilling wrote:
>>>
>>> You are mistaken.
>>>
>>> stat /proc/$$/as
>>>   File: `/proc/6518/as'
>>>   Size: 2793472   Blocks: 5456   IO Block: 512   regular file
>>> Device: 544h/88342528d  Inode: 7557  Links: 1
>>> Access: (0600/-rw-------)  Uid: ( xx/ joerg)  Gid: ( xx/ bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>>
>>> stat /proc/$$/auxv
>>>   File: `/proc/6518/auxv'
>>>   Size: 168   Blocks: 1   IO Block: 512   regular file
>>> Device: 544h/88342528d  Inode: 7568  Links: 1
>>> Access: (0400/-r--------)  Uid: ( xx/ joerg)  Gid: ( xx/ bs)
>>> Access: 2016-07-06 16:33:15.660224934 +0200
>>> Modify: 2016-07-06 16:33:15.660224934 +0200
>>> Change: 2016-07-06 16:33:15.660224934 +0200
>>>
>>> Any correct implementation of /proc returns the expected numbers in
>>> st_size as well as in st_blocks.
>>
>> Odd, because I get 0 for both values on all the files in /proc/self and
>> all the top level files on all kernels I tested prior to sending that
>
> I tested this with an official PROCFS-2 implementation that was written by
> the inventor of the PROC filesystem (Roger Faulkner), who sadly passed
> away last weekend.
>
> You may have done your tests on an inofficial procfs implementation

So, what you are saying is that you don't care about star working properly on Linux, because it has an "inofficial" procfs implementation, while Solaris has an "official" implementation?

>>> Now you know why BTRFS is still an incomplete filesystem. In a few years
>>> when it turns 10, this may change. People who implement filesystems of
>>> course need to learn that they need to hide implementation details from
>>> the official user space interfaces.
>>
>> So in other words you think we should be lying about how much is
>> actually allocated on disk and thus violating the standard directly (and
>> yes, ext4 and everyone else who does this with delayed allocation _is_
>> strictly speaking violating the standard, because _nothing_ is allocated
>> yet)?
>
> If it returns 0, it would be lying or it would be wrong anyway, as it did not
> check the available space.
>
> Also note that I mentioned already that the in-principle availability of SEEK_HOLE
> does not help as there is e.g. NFS...

So, it's OK that NFS is not POSIX compliant in various ways, and star will deal with it, but you aren't willing to fix a heuristic used by star for a behaviour that is unspecified by POSIX but has caused users to lose data when archiving from several modern filesystems?

That's fine, so long as GNU tar is fixed to use the safe fallback in such cases (i.e. trying to archive data from files that are newly created, even if they report st_blocks == 0).

Cheers, Andreas
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 6, 2016 at 3:51 AM, Andrei Borzenkovwrote: > On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote: >> I started a systemd-devel@ thread since that's where most udev stuff >> gets talked about. >> >> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html >> > > Before discussing how to implement it in systemd, we need to decide > what to implement. I.e. Fair. > 1) do you always want to mount filesystem in degraded mode if not > enough devices are present or only if explicit hint is given? Right now on Btrfs, it should be explicit. The faulty device concept, handling, and notification is not mature. It's not a good idea to silently mount degraded considering Btrfs does not actively catch up the devices that are behind the next time there's a normal mount. It only fixes things passively. So the user must opt into degraded mounts rather than opt out. The problem is the current udev rule is doing its own check for device availability. So the mount command with explicit hint doesn't even get attempted. > 2) do you want to restrict degrade handling to root only or to other > filesystems as well? Note that there could be more early boot > filesystems that absolutely need same treatment (enters separate > /usr), and there are also normal filesystems that may need be mounted > even degraded. I'm mainly concerned with rootfs. And I'm mainly concerned with a very simple 2 disk raid1. With a simple user opt in using rootflags=degraded, it should be possible to boot the system. Right now it's not possible. Maybe just deleting 64-btrfs.rules would fix this problem, I haven't tried it. > 3) can we query btrfs whether it is mountable in degraded mode? > according to documentation, "btrfs device ready" (which udev builtin > follows) checks "if it has ALL of it’s devices in cache for mounting". > This is required for proper systemd ordering of services. Where does udev builtin use btrfs itself? I see "btrfs ready $device" which is not a valid btrfs user space command. I never get any errors from "btrfs device ready" even when too many devices are missing. I don't know what it even does or if it's broken. This is a three device raid1 where I removed 2 devices and "btrfs device ready" does not complain, it always returns silent for me no matter what. It's been this way for years as far as I know. 
[root@f24s ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert 1 VG Vwi-a-tz-- 50.00g thintastic2.55 2 VG Vwi-a-tz-- 50.00g thintastic4.00 3 VG Vwi-a-tz-- 50.00g thintastic2.54 thintastic VG twi-aotz-- 90.00g 5.05 2.92 [root@f24s ~]# btrfs fi show Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 Total devices 3 FS bytes used 2.26GiB devid1 size 50.00GiB used 3.00GiB path /dev/mapper/VG-1 devid2 size 50.00GiB used 2.01GiB path /dev/mapper/VG-2 devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 [root@f24s ~]# btrfs device ready /dev/mapper/VG-1 [root@f24s ~]# [root@f24s ~]# lvchange -an VG/1 [root@f24s ~]# lvchange -an VG/2 [root@f24s ~]# btrfs dev scan Scanning for Btrfs filesystems [root@f24s ~]# lvs LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert 1 VG Vwi---tz-- 50.00g thintastic 2 VG Vwi---tz-- 50.00g thintastic 3 VG Vwi-a-tz-- 50.00g thintastic2.54 thintastic VG twi-aotz-- 90.00g 5.05 2.92 [root@f24s ~]# btrfs fi show warning, device 2 is missing Label: none uuid: 96240fd9-ea76-47e7-8cf4-05d3570ccfd7 Total devices 3 FS bytes used 2.26GiB devid3 size 50.00GiB used 3.01GiB path /dev/mapper/VG-3 *** Some devices missing [root@f24s ~]# btrfs device ready /dev/mapper/VG-3 [root@f24s ~]# -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adventures in btrfs raid5 disk recovery
On 2016-07-06 12:43, Chris Murphy wrote: On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarnwrote: On 2016-07-05 19:05, Chris Murphy wrote: Related: http://www.spinics.net/lists/raid/msg52880.html Looks like there is some traction to figuring out what to do about this, whether it's a udev rule or something that happens in the kernel itself. Pretty much the only hardware setup unaffected by this are those with enterprise or NAS drives. Every configuration of a consumer drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) RAID Levels are adversely affected by this. The thing I don't get about this is that while the per-device settings on a given system are policy, the default value is not, and should be expected to work correctly (but not necessarily optimally) on as many systems as possible, so any claim that this should be fixed in udev are bogus by the regular kernel rules. Sure. But changing it in the kernel leads to what other consequences? It fixes the problem under discussion but what problem will it introduce? I think it's valid to explore this, at the least so affected parties can be informed. Also, the problem isn't instigated by Linux, rather by drive manufacturers introducing a whole new kind of error recovery, with an order of magnitude longer recovery time. Now probably most hardware in the field are such drives. Even SSDs like my Samsung 840 EVO that support SCT ERC have it disabled, therefore the top end recovery time is undiscoverable in the device itself. Maybe it's buried in a spec. So does it make sense to just set the default to 180? Or is there a smarter way to do this? I don't know. Just thinking about this: 1. People who are setting this somewhere will be functionally unaffected. 2. People using single disks which have lots of errors may or may not see an apparent degradation of performance, but will likely have the life expectancy of their device extended. 3. Individuals who are not setting this but should be will on average be no worse off than before other than seeing a bigger performance hit on a disk error. 4. People with single disks which are new will see no functional change until the disk has an error. In an ideal situation, what I'd want to see is: 1. If the device supports SCT ERC, set scsi_command_timer to reasonable percentage over that (probably something like 25%, which would give roughly 10 seconds for the normal 7 second ERC timer). 2. If the device is actually a SCSI device, keep the 30 second timer (IIRC< this is reasonable for SCSI disks). 3. Otherwise, set the timer to 200 (we need a slight buffer over the expected disk timeout to account for things like latency outside of the disk). I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer (?) So there are quite a few parties now negatively impacted by the current default behavior. OTOH, I would not be surprised if the stance there is 'you get no support if your not using enterprise drives', not because of the project itself, but because it's ZFS. Part of their minimum recommended hardware requirements is ECC RAM, so it wouldn't surprise me if enterprise storage devices are there too. http://open-zfs.org/wiki/Hardware "Consistent performance requires hard drives that support error recovery control. " "Drives that lack such functionality can be expected to have arbitrarily high limits. Several minutes is not impossible. Drives with this functionality typically default to 7 seconds. 
ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds until ZFS is modified to control it. This must be done on every boot. " They do not explicitly require enterprise drives, but they clearly expect SCT ERC enabled to some sane value. At least for Btrfs and ZFS, the mkfs is in a position to know all parameters for properly setting SCT ERC and the SCSI command timer for every device. Maybe it could create the udev rule? Single and raid0 profiles need to permit long recoveries; where raid1, 5, 6 need to set things for very short recoveries. Possibly mdadm and lvm tools do the same thing. I"m pretty certain they don't create rules, or even try to check the drive for SCT ERC support. The problem with doing this is that you can't be certain that your underlying device is actually a physical storage device or not, and thus you have to check more than just the SCT ERC commands, and many people (myself included) don't like tools doing things that modify the persistent functioning of their system that the tool itself is not intended to do (and messing with block layer settings falls into that category for a mkfs tool). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at
Fwd: Unable to mount degraded RAID5
Hello, I had a RAID5 with 3 disks and one failed; now the filesystem cannot be mounted. None of the recommendations that I found seem to work. The situation seems to be similar to this one: http://www.spinics.net/lists/linux-btrfs/msg56825.html Any suggestion on what to try next? Thanks a lot beforehand! sudo btrfs version btrfs-progs v4.4 uname -a Linux ubuntu 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux sudo btrfs fi show warning, device 2 is missing checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 bytenr mismatch, want=2339175972864, have=65536 Couldn't read chunk root Label: none uuid: 495efbc6-2f62-4cd7-962b-7ae3d0e929f1 Total devices 3 FS bytes used 1.29TiB devid1 size 2.73TiB used 674.03GiB path /dev/sdc1 devid3 size 2.73TiB used 674.03GiB path /dev/sdd1 *** Some devices missing sudo mount -t btrfs -o ro,degraded,recovery /dev/sdc1 /btrfs mount: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so. dmesg | tail [ 2440.036368] BTRFS info (device sdd1): allowing degraded mounts [ 2440.036383] BTRFS info (device sdd1): enabling auto recovery [ 2440.036390] BTRFS info (device sdd1): disk space caching is enabled [ 2440.037928] BTRFS warning (device sdd1): devid 2 uuid 0c7d7db2-6a27-4b19-937b-b6266ba81257 is missing [ 2440.652085] BTRFS info (device sdd1): bdev (null) errs: wr 1413, rd 362, flush 471, corrupt 0, gen 0 [ 2441.359066] BTRFS error (device sdd1): bad tree block start 0 833766391808 [ 2441.359306] BTRFS error (device sdd1): bad tree block start 0 833766391808 [ 2441.359330] BTRFS: Failed to read block groups: -5 [ 2441.383793] BTRFS: open_ctree failed sudo btrfs restore /dev/sdc1 /bkp warning, device 2 is missing checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 bytenr mismatch, want=2339175972864, have=65536 Couldn't read chunk root Could not open root, trying backup super warning, device 2 is missing warning, device 3 is missing checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 bytenr mismatch, want=2339175972864, have=65536 Couldn't read chunk root Could not open root, trying backup super warning, device 2 is missing warning, device 3 is missing checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 checksum verify failed on 2339175972864 found A781ADC2 wanted 43621074 bytenr mismatch, want=2339175972864, have=65536 Couldn't read chunk root Could not open root, trying backup super sudo btrfs-show-super -fa /dev/sdc1 http://sebsauvage.net/paste/?d79e9e9c385cf1a5#fNwoEj5o2aQ6T7nDl4vjrFqEJG0SHeVpmGknbbCVnd0= sudo btrfs-find-root /dev/sdc1 warning, device 2 is missing Couldn't read chunk root Open ctree failed -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adventures in btrfs raid5 disk recovery
On Wed, Jul 6, 2016 at 5:51 AM, Austin S. Hemmelgarnwrote: > On 2016-07-05 19:05, Chris Murphy wrote: >> >> Related: >> http://www.spinics.net/lists/raid/msg52880.html >> >> Looks like there is some traction to figuring out what to do about >> this, whether it's a udev rule or something that happens in the kernel >> itself. Pretty much the only hardware setup unaffected by this are >> those with enterprise or NAS drives. Every configuration of a consumer >> drive, single, linear/concat, and all software (mdadm, lvm, Btrfs) >> RAID Levels are adversely affected by this. > > The thing I don't get about this is that while the per-device settings on a > given system are policy, the default value is not, and should be expected to > work correctly (but not necessarily optimally) on as many systems as > possible, so any claim that this should be fixed in udev are bogus by the > regular kernel rules. Sure. But changing it in the kernel leads to what other consequences? It fixes the problem under discussion but what problem will it introduce? I think it's valid to explore this, at the least so affected parties can be informed. Also, the problem isn't instigated by Linux, rather by drive manufacturers introducing a whole new kind of error recovery, with an order of magnitude longer recovery time. Now probably most hardware in the field are such drives. Even SSDs like my Samsung 840 EVO that support SCT ERC have it disabled, therefore the top end recovery time is undiscoverable in the device itself. Maybe it's buried in a spec. So does it make sense to just set the default to 180? Or is there a smarter way to do this? I don't know. >> I suspect, but haven't tested, that ZFS On Linux would be equally >> affected, unless they're completely reimplementing their own block >> layer (?) So there are quite a few parties now negatively impacted by >> the current default behavior. > > OTOH, I would not be surprised if the stance there is 'you get no support if > your not using enterprise drives', not because of the project itself, but > because it's ZFS. Part of their minimum recommended hardware requirements > is ECC RAM, so it wouldn't surprise me if enterprise storage devices are > there too. http://open-zfs.org/wiki/Hardware "Consistent performance requires hard drives that support error recovery control. " "Drives that lack such functionality can be expected to have arbitrarily high limits. Several minutes is not impossible. Drives with this functionality typically default to 7 seconds. ZFS does not currently adjust this setting on drives. However, it is advisable to write a script to set the error recovery time to a low value, such as 0.1 seconds until ZFS is modified to control it. This must be done on every boot. " They do not explicitly require enterprise drives, but they clearly expect SCT ERC enabled to some sane value. At least for Btrfs and ZFS, the mkfs is in a position to know all parameters for properly setting SCT ERC and the SCSI command timer for every device. Maybe it could create the udev rule? Single and raid0 profiles need to permit long recoveries; where raid1, 5, 6 need to set things for very short recoveries. Possibly mdadm and lvm tools do the same thing. -- Chris Murphy -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
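The per-boot script the ZFS wiki text alludes to is a few lines of shell; a minimal sketch, assuming pool members sda through sdc, and noting that smartctl takes the value in tenths of a second (so 1 = 0.1s):

    #!/bin/bash
    # Re-apply short SCT ERC on every boot; on most drives the setting
    # does not persist across power cycles.
    for dev in /dev/sd[a-c]; do
        smartctl -q errorsonly -l scterc,1,1 "$dev"
    done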
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
"Austin S. Hemmelgarn"wrote: > On 2016-07-06 11:22, Joerg Schilling wrote: > > "Austin S. Hemmelgarn" wrote: > > > >>> It should be obvious that a file that offers content also has allocated > >>> blocks. > >> What you mean then is that POSIX _implies_ that this is the case, but > >> does not say whether or not it is required. There are all kinds of > >> counterexamples to this too, procfs is a POSIX compliant filesystem > >> (every POSIX certified system has it), yet does not display the behavior > >> that you expect, every single file in /proc for example reports 0 for > >> both st_blocks and st_size, and yet all of them very obviously have > >> content. > > > > You are mistaken. > > > > stat /proc/$$/as > > File: `/proc/6518/as' > > Size: 2793472 Blocks: 5456 IO Block: 512regular file > > Device: 544h/88342528d Inode: 7557Links: 1 > > Access: (0600/-rw---) Uid: ( xx/ joerg) Gid: ( xx/ bs) > > Access: 2016-07-06 16:33:15.660224934 +0200 > > Modify: 2016-07-06 16:33:15.660224934 +0200 > > Change: 2016-07-06 16:33:15.660224934 +0200 > > > > stat /proc/$$/auxv > > File: `/proc/6518/auxv' > > Size: 168 Blocks: 1 IO Block: 512regular file > > Device: 544h/88342528d Inode: 7568Links: 1 > > Access: (0400/-r) Uid: ( xx/ joerg) Gid: ( xx/ bs) > > Access: 2016-07-06 16:33:15.660224934 +0200 > > Modify: 2016-07-06 16:33:15.660224934 +0200 > > Change: 2016-07-06 16:33:15.660224934 +0200 > > > > Any correct implementation of /proc returns the expected numbers in st_size > > as > > well as in st_blocks. > Odd, because I get 0 for both values on all the files in /proc/self and > all the top level files on all kernels I tested prior to sending that I tested this with an official PROCFS-2 implementation that was written by the inventor of the PROC filesystem (Roger Faulkner) who as a sad news pased away last weekend. You may have done your tests on an inofficial procfs implementation > > Now you know why BTRFS is still an incomplete filesystem. In a few years > > when > > it turns 10, this may change. People who implement filesystems of course > > need > > to learn that they need to hide implementation details from the official > > user > > space interfaces. > So in other words you think we should be lying about how much is > actually allocated on disk and thus violating the standard directly (and > yes, ext4 and everyone else who does this with delayed allocation _is_ > strictly speaking violating the standard, because _nothing_ is allocated > yet)? If it returns 0, it would be lying or it would be wrong anyway as it did not check fpe available space. Also note that I mentioned already that the priciple availability of SEEK_HOLE does not help as there is e.g. NFS... Jörg -- EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On Wed, Jul 06, 2016 at 05:42:33PM +0200, Holger Hoffstätte wrote: > On 07/06/16 17:20, Hugo Mills wrote: > > On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote: > >> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte > >>wrote: > >>> On 07/06/16 14:25, Wang Shilong wrote: > 'btrfs file du' is a very useful tool to watch my system > file usage with snapshot aware. > > when trying to run following commands: > [root@localhost btrfs-progs]# btrfs file du / > Total Exclusive Set shared Filename > ERROR: Failed to lookup root id - Inappropriate ioctl for device > ERROR: cannot check space of '/': Unknown error -1 > > and My Filesystem looks like this: > [root@localhost btrfs-progs]# df -Th > Filesystem Type Size Used Avail Use% Mounted on > devtmpfs devtmpfs 16G 0 16G 0% /dev > tmpfs tmpfs 16G 368K 16G 1% /dev/shm > tmpfs tmpfs 16G 1.4M 16G 1% /run > tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup > /dev/sda3 btrfs 60G 19G 40G 33% / > tmpfs tmpfs 16G 332K 16G 1% /tmp > /dev/sdc btrfs 2.8T 166G 1.7T 9% /data > /dev/sda2 xfs 2.0G 452M 1.6G 23% /boot > /dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi > tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000 > > So I installed Btrfs as my root partition, but boot partition > can be other fs. > > We can Let btrfs tool aware of this is not a btrfs file or > directory and skip those files, so that someone like me > could just run 'btrfs file du /' to scan all btrfs filesystems. > > After patch, it will look like: > Total Exclusive Set shared Filename > skipping not btrfs dir/file: boot > skipping not btrfs dir/file: dev > skipping not btrfs dir/file: proc > skipping not btrfs dir/file: run > skipping not btrfs dir/file: sys > 0.00B 0.00B - //root/.bash_logout > 0.00B 0.00B - //root/.bash_profile > 0.00B 0.00B - //root/.bashrc > 0.00B 0.00B - //root/.cshrc > 0.00B 0.00B - //root/.tcshrc > > This works for me to analysis system usage and analysis > performaces. > >>> > >>> This is great, but can we please skip the "skipping .." messages? > >>> Maybe it's just me but I really don't see the value of printing them > >>> when they don't contribute to the result. > >>> They also mess up the display. :) > >> > >> I don't have a taste whether it needed or not, because it is somehow > >> useful to let users know some files/directories skipped > > When you run "find /path -type d" you don't get messages for all the > things you just didn't want to find either. No, but you do get messages about unreadable directories from find. Your example above would be "You asked for X and isn't an X". That's not what these messages are about -- what we're seeing here is "I tried to do what you asked to , but couldn't". Hugo. > >At the absolute minimum, I think that these messages should go to > > stderr (like du does when it deosn't have permissions), and should go > > away with -q. They're still irritating, but at least you can get rid > > of them easily. > > If anything this should require a --verbose, not the other way > around. Maybe instead of breaking the output just indicate the > special status via "-- --" values, or default to 0.00? > Still, we're explicitly only interested in btrfs stuff and not > anything else, so printing non-information can only yield noise. > > This is very much orthogonal to not printing anything after an > otherwise successful command execution. > > -h > > -- Hugo Mills | "There's a Martian war machine outside -- they want hugo@... carfax.org.uk | to talk to you about a cure for the common cold." 
http://carfax.org.uk/ | PGP: E2AB1DE4 | Stephen Franklin, Babylon 5
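For what it's worth, if the messages did move to stderr as suggested, hiding them would already be a one-liner; a sketch assuming the proposed behavior, not the patch as posted:

  btrfs file du / 2>/dev/null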
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
On 2016-07-06 12:05, Austin S. Hemmelgarn wrote: On 2016-07-06 11:22, Joerg Schilling wrote: "Austin S. Hemmelgarn" wrote:

It should be obvious that a file that offers content also has allocated blocks.

What you mean then is that POSIX _implies_ that this is the case, but does not say whether or not it is required. There are all kinds of counterexamples to this too, procfs is a POSIX compliant filesystem (every POSIX certified system has it), yet does not display the behavior that you expect, every single file in /proc for example reports 0 for both st_blocks and st_size, and yet all of them very obviously have content.

You are mistaken.

stat /proc/$$/as
File: `/proc/6518/as'
Size: 2793472  Blocks: 5456  IO Block: 512  regular file
Device: 544h/88342528d  Inode: 7557  Links: 1
Access: (0600/-rw-------)  Uid: ( xx/ joerg)  Gid: ( xx/ bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

stat /proc/$$/auxv
File: `/proc/6518/auxv'
Size: 168  Blocks: 1  IO Block: 512  regular file
Device: 544h/88342528d  Inode: 7568  Links: 1
Access: (0400/-r--------)  Uid: ( xx/ joerg)  Gid: ( xx/ bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

Any correct implementation of /proc returns the expected numbers in st_size as well as in st_blocks.

Odd, because I get 0 for both values on all the files in /proc/self and all the top level files on all kernels I tested prior to sending that e-mail, for reference, they include:
* A direct clone of HEAD on torvalds/linux
* 4.6.3 mainline
* 4.1.27 mainline
* 4.6.3 mainline with a small number of local patches on top
* 4.1.19+ from the Raspberry Pi foundation
* 4.4.6-gentoo (mainline with Gentoo patches on top)
* 4.5.5-linode69 (not certain about the patches on top)

Further ones I've now tested that behave like the others listed above:
* 2.4.20-8 from RedHat 9
* 2.6.18-1.2798.fc6 from Fedora Core 6
* 3.11.10-301.fc20 from Fedora 20

IOW, it looks like whatever you're running is an exception here.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
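The test is trivial to repeat; a sketch with coreutils stat, which on the mainline kernels listed above prints 0 for both fields:

  # %s = st_size, %b = st_blocks
  stat -c '%n: size=%s blocks=%b' /proc/self/status /proc/self/maps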
Re: Unable to mount degraded RAID5
On Wed, Jul 6, 2016 at 2:07 AM, Tomáš Hrdinawrote: > Now with 3 disks: > > sudo btrfs check /dev/sda > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > checksum verify failed on 7008807157760 found F192848C wanted 1571393A > checksum verify failed on 7008807157760 found F192848C wanted 1571393A > bytenr mismatch, want=7008807157760, have=65536 > Checking filesystem on /dev/sda > UUID: 2dab74bb-fc73-4c47-a413-a55840f6f71e > checking extents > parent transid verify failed on 7009468874752 wanted 70180 found 70133 > parent transid verify failed on 7009468874752 wanted 70180 found 70133 > checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC > checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC > bytenr mismatch, want=7009468874752, have=65536 > parent transid verify failed on 7008859045888 wanted 70175 found 70133 > parent transid verify failed on 7008859045888 wanted 70175 found 70133 > checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91 > checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91 > bytenr mismatch, want=7008859045888, have=65536 > parent transid verify failed on 7008899547136 wanted 70175 found 70133 > parent transid verify failed on 7008899547136 wanted 70175 found 70133 > checksum verify failed on 7008899547136 found 2B6F9045 wanted CF8C2DF3 > parent transid verify failed on 7008899547136 wanted 70175 found 70133 > Ignoring transid failure > leaf parent key incorrect 7008899547136 > bad block 7008899547136 > Errors found in extent allocation tree or chunk allocation > parent transid verify failed on 7009074167808 wanted 70175 found 70133 > parent transid verify failed on 7009074167808 wanted 70175 found 70133 > checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46 > checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46 > bytenr mismatch, want=7009074167808, have=65536 Ok much better than before, these all seem sane with a limited number of problems. Maybe --repair can fix it, but don't do that yet. > sudo btrfs-debug-tree -d /dev/sdc > http://sebsauvage.net/paste/?d690b2c9d130008d#cni3fnKUZ7Y/oaXm+nsOw0afoWDFXNl26eC+vbJmcRA= OK good, so now it finds the chunk tree OK. This is good news. I would try to mount it ro first, if you need to make or refresh a backup. So in order: mount -o ro mount -o ro,recovery If those don't work lets see what the user and kernel errors are. > > > sudo btrfs-find-root /dev/sdc > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > parent transid verify failed on 7008807157760 wanted 70175 found 70133 > Superblock thinks the generation is 70182 > Superblock thinks the level is 1 > Found tree root at 6062830010368 gen 70182 level 1 > Well block 6062434418688(gen: 70181 level: 1) seems good, but > generation/level doesn't match, want gen: 70182 level: 1 > Well block 6062497202176(gen: 69186 level: 0) seems good, but > generation/level doesn't match, want gen: 70182 level: 1 > Well block 6062470332416(gen: 69186 level: 0) seems good, but > generation/level doesn't match, want gen: 70182 level: 1 This is also a good sign that you can probably get btrfs rescue to work and point it to one of these older tree roots, if mount won't work. 
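Spelled out against the output above, the sequence would be roughly this (a sketch; the bytenr is the older tree root candidate that btrfs-find-root printed, and btrfs restore only reads from the devices):

  sudo mount -o ro /dev/sdc /mnt
  sudo mount -o ro,recovery /dev/sdc /mnt
  # if both fail, point btrfs restore at an older tree root
  sudo btrfs restore -t 6062434418688 /dev/sdc /path/to/backup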
> > > sudo smartctl -l scterc /dev/sda > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build) > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org > > SCT Error Recovery Control: >Read: Disabled > Write: Disabled > > > sudo smartctl -l scterc /dev/sdb > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build) > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org > > SCT Error Recovery Control: >Read: 70 (7.0 seconds) > Write: 70 (7.0 seconds) > > > sudo smartctl -l scterc /dev/sdc > smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build) > Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org > > SCT Error Recovery Control: >Read: Disabled > Write: Disabled There's good news and bad news. The good news is all the drives support SCT ERC. The bad news is two of the drives have the wrong setting for raid1+, including raid5. Issue: smartctl -l scterc,70,70 /dev/sdX #for each drive This is not a persistent setting. The drive being powered off (maybe even reset) will revert the setting to drive default. Some people use a udev rule to set this during startup. I think it can also be done with a systemd unit. You'd want to specify the drives by id, wwn if available, so that it's always consistent across boots. The point of this setting is to force the drive to give up on errors quickly, allowing Btrfs in this case to be informed of the exact problem (media error and what sector)
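A sketch of such a udev rule (the file name is made up, and matching KERNEL=="sd[a-z]" is the crude variant; matching by wwn/id, as suggested above, is more robust and lets raid members and single disks get different values):

  # /etc/udev/rules.d/60-scterc.rules -- hypothetical local rule
  ACTION=="add", SUBSYSTEM=="block", KERNEL=="sd[a-z]", RUN+="/usr/sbin/smartctl -l scterc,70,70 /dev/%k"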
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
On 2016-07-06 11:22, Joerg Schilling wrote: "Austin S. Hemmelgarn" wrote:

It should be obvious that a file that offers content also has allocated blocks.

What you mean then is that POSIX _implies_ that this is the case, but does not say whether or not it is required. There are all kinds of counterexamples to this too, procfs is a POSIX compliant filesystem (every POSIX certified system has it), yet does not display the behavior that you expect, every single file in /proc for example reports 0 for both st_blocks and st_size, and yet all of them very obviously have content.

You are mistaken.

stat /proc/$$/as
File: `/proc/6518/as'
Size: 2793472  Blocks: 5456  IO Block: 512  regular file
Device: 544h/88342528d  Inode: 7557  Links: 1
Access: (0600/-rw-------)  Uid: ( xx/ joerg)  Gid: ( xx/ bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

stat /proc/$$/auxv
File: `/proc/6518/auxv'
Size: 168  Blocks: 1  IO Block: 512  regular file
Device: 544h/88342528d  Inode: 7568  Links: 1
Access: (0400/-r--------)  Uid: ( xx/ joerg)  Gid: ( xx/ bs)
Access: 2016-07-06 16:33:15.660224934 +0200
Modify: 2016-07-06 16:33:15.660224934 +0200
Change: 2016-07-06 16:33:15.660224934 +0200

Any correct implementation of /proc returns the expected numbers in st_size as well as in st_blocks.

Odd, because I get 0 for both values on all the files in /proc/self and all the top level files on all kernels I tested prior to sending that e-mail, for reference, they include:
* A direct clone of HEAD on torvalds/linux
* 4.6.3 mainline
* 4.1.27 mainline
* 4.6.3 mainline with a small number of local patches on top
* 4.1.19+ from the Raspberry Pi foundation
* 4.4.6-gentoo (mainline with Gentoo patches on top)
* 4.5.5-linode69 (not certain about the patches on top)

It's probably notable that I don't see /proc/$PID/as on any of these systems, which implies you're running some significantly different kernel version to begin with, and therefore it's not unreasonable to assume that what you see is because of some misguided patch that got added to allow tar to archive /proc.

In all seriousness though, this started out because stuff wasn't cached to anywhere near the degree it is today, and there was no such thing as delayed allocation. When you said to write, the filesystem allocated the blocks, regardless of when it actually wrote the data. IOW, the behavior that GNU tar is relying on is an implementation detail, not an API. Just like df, this breaks under modern designs, not because they chose to break it, but because it wasn't designed for use with such implementations.

This seems to be a strange interpretation of what a standard is.

Except what I'm talking about is the _interpretation_ of the standard, not the standard itself. I said nothing about the standard, all it requires is that st_blocks be the number of 512 byte blocks allocated by the filesystem for the file. There is nothing in there about it having to reflect the expected size of the allocated content on disk. In fact, there's technically nothing in there about how to handle sparse files either. To further explain what I'm trying to say, here's a rough description of what happens in SVR4 UFS (and other non-delayed allocation filesystems) when you issue a write:
1. The number of new blocks needed to fulfill the write request is calculated.
2. If this number is greater than 0, that many new blocks are allocated, and st_blocks for that file is functionally updated (I don't recall if it was dynamically calculated per call or not).
3. At some indeterminate point in the future, the decision is made to flush the cache.
4. The data is written to the appropriate place in the file.

By comparison, in a delayed allocation scenario, 3 happens before 1 and 2. 1 and 2 obviously have to be strictly ordered WRT each other and 4, but based on the POSIX standard, 3 does not have to be strictly ordered with regards to any of them (although it is illogical to have it between 1 and 2 or after 4). Because it is not required by the standard to have 3 be strictly ordered and the ordering isn't part of the API itself, where it happens in the sequence is an implementation detail.

A new filesystem cannot introduce new rules just because people believe it would save time.

Saying the file has no blocks when there are no blocks allocated for it is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a way to pack file data into the inode if it was small enough. In that case, it would be perfectly reasonable to return 0 for st_blocks because the inode table in UFS is a fixed pre-allocated structure, and

Given that inode size is 128, such a change would not break things as the heuristics would not imply a sparse file here.

OK, so change the heuristic
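The reordering described in steps 1-4 is easy to observe from a shell; a sketch (file name and mount point are placeholders) showing st_blocks before and after a delayed allocation is flushed:

  f=/mnt/btrfs/testfile
  dd if=/dev/zero of="$f" bs=64K count=16
  stat -c 'before sync: %b blocks' "$f"   # often 0 while allocation is delayed
  sync
  stat -c 'after sync: %b blocks' "$f"    # now reflects what was allocated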
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On 07/06/16 17:20, Hugo Mills wrote: > On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote: >> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte >>wrote: >>> On 07/06/16 14:25, Wang Shilong wrote: 'btrfs file du' is a very useful tool to watch my system file usage with snapshot aware. when trying to run following commands: [root@localhost btrfs-progs]# btrfs file du / Total Exclusive Set shared Filename ERROR: Failed to lookup root id - Inappropriate ioctl for device ERROR: cannot check space of '/': Unknown error -1 and My Filesystem looks like this: [root@localhost btrfs-progs]# df -Th Filesystem Type Size Used Avail Use% Mounted on devtmpfs devtmpfs 16G 0 16G 0% /dev tmpfs tmpfs 16G 368K 16G 1% /dev/shm tmpfs tmpfs 16G 1.4M 16G 1% /run tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup /dev/sda3 btrfs 60G 19G 40G 33% / tmpfs tmpfs 16G 332K 16G 1% /tmp /dev/sdc btrfs 2.8T 166G 1.7T 9% /data /dev/sda2 xfs 2.0G 452M 1.6G 23% /boot /dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000 So I installed Btrfs as my root partition, but boot partition can be other fs. We can Let btrfs tool aware of this is not a btrfs file or directory and skip those files, so that someone like me could just run 'btrfs file du /' to scan all btrfs filesystems. After patch, it will look like: Total Exclusive Set shared Filename skipping not btrfs dir/file: boot skipping not btrfs dir/file: dev skipping not btrfs dir/file: proc skipping not btrfs dir/file: run skipping not btrfs dir/file: sys 0.00B 0.00B - //root/.bash_logout 0.00B 0.00B - //root/.bash_profile 0.00B 0.00B - //root/.bashrc 0.00B 0.00B - //root/.cshrc 0.00B 0.00B - //root/.tcshrc This works for me to analysis system usage and analysis performaces. >>> >>> This is great, but can we please skip the "skipping .." messages? >>> Maybe it's just me but I really don't see the value of printing them >>> when they don't contribute to the result. >>> They also mess up the display. :) >> >> I don't have a taste whether it needed or not, because it is somehow >> useful to let users know some files/directories skipped When you run "find /path -type d" you don't get messages for all the things you just didn't want to find either. >At the absolute minimum, I think that these messages should go to > stderr (like du does when it deosn't have permissions), and should go > away with -q. They're still irritating, but at least you can get rid > of them easily. If anything this should require a --verbose, not the other way around. Maybe instead of breaking the output just indicate the special status via "-- --" values, or default to 0.00? Still, we're explicitly only interested in btrfs stuff and not anything else, so printing non-information can only yield noise. This is very much orthogonal to not printing anything after an otherwise successful command execution. -h signature.asc Description: OpenPGP digital signature
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstättewrote: > On 07/06/16 14:25, Wang Shilong wrote: >> 'btrfs file du' is a very useful tool to watch my system >> file usage with snapshot aware. >> >> when trying to run following commands: >> [root@localhost btrfs-progs]# btrfs file du / >> Total Exclusive Set shared Filename >> ERROR: Failed to lookup root id - Inappropriate ioctl for device >> ERROR: cannot check space of '/': Unknown error -1 >> >> and My Filesystem looks like this: >> [root@localhost btrfs-progs]# df -Th >> Filesystem Type Size Used Avail Use% Mounted on >> devtmpfs devtmpfs 16G 0 16G 0% /dev >> tmpfs tmpfs 16G 368K 16G 1% /dev/shm >> tmpfs tmpfs 16G 1.4M 16G 1% /run >> tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup >> /dev/sda3 btrfs 60G 19G 40G 33% / >> tmpfs tmpfs 16G 332K 16G 1% /tmp >> /dev/sdc btrfs 2.8T 166G 1.7T 9% /data >> /dev/sda2 xfs 2.0G 452M 1.6G 23% /boot >> /dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi >> tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000 >> >> So I installed Btrfs as my root partition, but boot partition >> can be other fs. >> >> We can Let btrfs tool aware of this is not a btrfs file or >> directory and skip those files, so that someone like me >> could just run 'btrfs file du /' to scan all btrfs filesystems. >> >> After patch, it will look like: >>Total Exclusive Set shared Filename >> skipping not btrfs dir/file: boot >> skipping not btrfs dir/file: dev >> skipping not btrfs dir/file: proc >> skipping not btrfs dir/file: run >> skipping not btrfs dir/file: sys >> 0.00B 0.00B - //root/.bash_logout >> 0.00B 0.00B - //root/.bash_profile >> 0.00B 0.00B - //root/.bashrc >> 0.00B 0.00B - //root/.cshrc >> 0.00B 0.00B - //root/.tcshrc >> >> This works for me to analysis system usage and analysis >> performaces. > > This is great, but can we please skip the "skipping .." messages? > Maybe it's just me but I really don't see the value of printing them > when they don't contribute to the result. > They also mess up the display. :) I don't have a taste whether it needed or not, because it is somehow useful to let users know some files/directories skipped Wait some other guys opinion for this... thanks, Shilong > > thanks, > Holger > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
"Austin S. Hemmelgarn"wrote: > > It should be obvious that a file that offers content also has allocated > > blocks. > What you mean then is that POSIX _implies_ that this is the case, but > does not say whether or not it is required. There are all kinds of > counterexamples to this too, procfs is a POSIX compliant filesystem > (every POSIX certified system has it), yet does not display the behavior > that you expect, every single file in /proc for example reports 0 for > both st_blocks and st_size, and yet all of them very obviously have content. You are mistaken. stat /proc/$$/as File: `/proc/6518/as' Size: 2793472 Blocks: 5456 IO Block: 512regular file Device: 544h/88342528d Inode: 7557Links: 1 Access: (0600/-rw---) Uid: ( xx/ joerg) Gid: ( xx/ bs) Access: 2016-07-06 16:33:15.660224934 +0200 Modify: 2016-07-06 16:33:15.660224934 +0200 Change: 2016-07-06 16:33:15.660224934 +0200 stat /proc/$$/auxv File: `/proc/6518/auxv' Size: 168 Blocks: 1 IO Block: 512regular file Device: 544h/88342528d Inode: 7568Links: 1 Access: (0400/-r) Uid: ( xx/ joerg) Gid: ( xx/ bs) Access: 2016-07-06 16:33:15.660224934 +0200 Modify: 2016-07-06 16:33:15.660224934 +0200 Change: 2016-07-06 16:33:15.660224934 +0200 Any correct implementation of /proc returns the expected numbers in st_size as well as in st_blocks. > In all seriousness though, this started out because stuff wasn't cached > to anywhere near the degree it is today, and there was no such thing as > delayed allocation. When you said to write, the filesystem allocated > the blocks, regardless of when it actually wrote the data. IOW, the > behavior that GNU tar is relying on is an implementation detail, not an > API. Just like df, this breaks under modern designs, not because they > chose to break it, but because it wasn't designed for use with such > implementations. This seems to be a strange interpretation if what a standard is. > > A new filesystem cannot introduce new rules just because people believe it > > would > > save time. > Saying the file has no blocks when there are no blocks allocated for it > is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a > way to pack file data into the inode if it was small enough. In that > case, it woulod be perfectly reasonable to return 0 for st_blocks > because the inode table in UFS is a fixed pre-allocated structure, and Given that inode size is 128, such a change would not break things as the heuristics would not imply a sparse file here. > therefore nothing is allocated to the file itself except the inode. The > same applies in the case of a file packed into it's own metadata block > on BTRFS, nothing is allocated to that file beyond the metadata block it > has to have to store the inode. In the case of delayed allocation where > the file hasn't been flushed, there is nothing allocated, so st_blocks > based on a strict interpretation of it's description in POSIX _should_ > be 0, because nothing is allocated yet. Now you know why BTRFS is still an incomplete filesystem. In a few years when it turns 10, this may change. People who implement filesystems of course need to learn that they need to hide implementation details from the official user space interfaces. 
Jörg -- EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On Thu, Jul 07, 2016 at 12:16:01AM +0900, Wang Shilong wrote:
> On Wed, Jul 6, 2016 at 10:35 PM, Holger Hoffstätte wrote:
> > On 07/06/16 14:25, Wang Shilong wrote:
> >> 'btrfs file du' is a very useful tool to watch my system
> >> file usage with snapshot aware.
> >>
> >> when trying to run following commands:
> >> [root@localhost btrfs-progs]# btrfs file du /
> >> Total Exclusive Set shared Filename
> >> ERROR: Failed to lookup root id - Inappropriate ioctl for device
> >> ERROR: cannot check space of '/': Unknown error -1
> >>
> >> and My Filesystem looks like this:
> >> [root@localhost btrfs-progs]# df -Th
> >> Filesystem Type Size Used Avail Use% Mounted on
> >> devtmpfs devtmpfs 16G 0 16G 0% /dev
> >> tmpfs tmpfs 16G 368K 16G 1% /dev/shm
> >> tmpfs tmpfs 16G 1.4M 16G 1% /run
> >> tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
> >> /dev/sda3 btrfs 60G 19G 40G 33% /
> >> tmpfs tmpfs 16G 332K 16G 1% /tmp
> >> /dev/sdc btrfs 2.8T 166G 1.7T 9% /data
> >> /dev/sda2 xfs 2.0G 452M 1.6G 23% /boot
> >> /dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi
> >> tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000
> >>
> >> So I installed Btrfs as my root partition, but boot partition
> >> can be other fs.
> >>
> >> We can Let btrfs tool aware of this is not a btrfs file or
> >> directory and skip those files, so that someone like me
> >> could just run 'btrfs file du /' to scan all btrfs filesystems.
> >>
> >> After patch, it will look like:
> >> Total Exclusive Set shared Filename
> >> skipping not btrfs dir/file: boot
> >> skipping not btrfs dir/file: dev
> >> skipping not btrfs dir/file: proc
> >> skipping not btrfs dir/file: run
> >> skipping not btrfs dir/file: sys
> >> 0.00B 0.00B - //root/.bash_logout
> >> 0.00B 0.00B - //root/.bash_profile
> >> 0.00B 0.00B - //root/.bashrc
> >> 0.00B 0.00B - //root/.cshrc
> >> 0.00B 0.00B - //root/.tcshrc
> >>
> >> This works for me to analysis system usage and analysis
> >> performaces.
> >
> > This is great, but can we please skip the "skipping .." messages?
> > Maybe it's just me but I really don't see the value of printing them
> > when they don't contribute to the result.
> > They also mess up the display. :)
>
> I don't have a taste whether it needed or not, because it is somehow
> useful to let users know some files/directories skipped

At the absolute minimum, I think that these messages should go to stderr (like du does when it doesn't have permissions), and should go away with -q. They're still irritating, but at least you can get rid of them easily.

Hugo.

> Wait some other guys opinion for this...
>
> thanks,
> Shilong
>
> > thanks,
> > Holger

-- Hugo Mills | "There's a Martian war machine outside -- they want hugo@... carfax.org.uk | to talk to you about a cure for the common cold." http://carfax.org.uk/ | PGP: E2AB1DE4 | Stephen Franklin, Babylon 5
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
On 2016-07-06 10:53, Joerg Schilling wrote: Antonio Diaz Diaz wrote: Joerg Schilling wrote:

POSIX requires st_blocks to be != 0 in case that the file contains data.

Please, could you provide a reference? I can't find such requirement at http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html

blkcnt_t st_blocks Number of blocks allocated for this object.

It should be obvious that a file that offers content also has allocated blocks.

What you mean then is that POSIX _implies_ that this is the case, but does not say whether or not it is required. There are all kinds of counterexamples to this too, procfs is a POSIX compliant filesystem (every POSIX certified system has it), yet does not display the behavior that you expect, every single file in /proc for example reports 0 for both st_blocks and st_size, and yet all of them very obviously have content.

Blocks are "allocated" when the OS decides whether the new data will fit on the medium. The fact that some filesystems may have data in a cache but not yet on the medium does not matter here. This is how UNIX has worked since st_blocks was introduced nearly 40 years ago.

Tradition is the corpse of wisdom. Backwards compatibility is a problem just as much as a good thing. In all seriousness though, this started out because stuff wasn't cached to anywhere near the degree it is today, and there was no such thing as delayed allocation. When you said to write, the filesystem allocated the blocks, regardless of when it actually wrote the data. IOW, the behavior that GNU tar is relying on is an implementation detail, not an API. Just like df, this breaks under modern designs, not because they chose to break it, but because it wasn't designed for use with such implementations. In the case of tar and similar things though, I'd argue that it's not sensible to special case files that are 'sparse', it should store any long enough run of zeroes as a sparse region, then provide an option to say to not make those files sparse when restored.

A new filesystem cannot introduce new rules just because people believe it would save time.

Saying the file has no blocks when there are no blocks allocated for it is not to 'save time', it's absolutely accurate. Suppose SVR4 UFS had a way to pack file data into the inode if it was small enough. In that case, it would be perfectly reasonable to return 0 for st_blocks because the inode table in UFS is a fixed pre-allocated structure, and therefore nothing is allocated to the file itself except the inode. The same applies in the case of a file packed into its own metadata block on BTRFS, nothing is allocated to that file beyond the metadata block it has to have to store the inode. In the case of delayed allocation where the file hasn't been flushed, there is nothing allocated, so st_blocks based on a strict interpretation of its description in POSIX _should_ be 0, because nothing is allocated yet.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
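GNU tar already has the archive-time half of this; only the restore-side switch is the proposal. A sketch:

  # store long runs of zeros as sparse regions in the archive
  tar --sparse -cf archive.tar bigfile
  # a "don't make it sparse again on extract" option, as proposed
  # above, is hypothetical; current GNU tar has no such flag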
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
On 07/06/2016 05:09 PM, Joerg Schilling wrote: you concur that a delayed assignment of the "correct" value for st_blocks while the contend of the file does not change is not permitted. I'm not sure I agree even with that. A file system may undergo garbage collection and compaction, for instance, in which a file's data do not change but its internal representation does. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
Paul Eggert wrote:
> On 07/06/2016 04:53 PM, Joerg Schilling wrote:
> > Antonio Diaz Diaz wrote:
> >
> >> > Joerg Schilling wrote:
> >>> > > POSIX requires st_blocks to be != 0 in case that the file contains
> >>> > > data.
> >>
> >> > Please, could you provide a reference? I can't find such requirement at
> >> > http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html
>
> blkcnt_t st_blocks Number of blocks allocated for this object.
>
> This doesn't require that st_blocks must be nonzero if the file contains
> nonzero data, any more than it requires that st_blocks must be nonzero
> if the file contains zero data. In either case, metadata outside the
> scope of st_blocks might contain enough information for the file system
> to represent all the file's data.

In other words, you concur that a delayed assignment of the "correct" value for st_blocks while the content of the file does not change is not permitted.

Jörg
-- EMail: jo...@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
On 07/06/2016 04:53 PM, Joerg Schilling wrote:
Antonio Diaz Diaz wrote:
> Joerg Schilling wrote:
> > POSIX requires st_blocks to be != 0 in case that the file contains data.
>
> Please, could you provide a reference? I can't find such requirement at
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html

blkcnt_t st_blocks Number of blocks allocated for this object.

This doesn't require that st_blocks must be nonzero if the file contains nonzero data, any more than it requires that st_blocks must be nonzero if the file contains zero data. In either case, metadata outside the scope of st_blocks might contain enough information for the file system to represent all the file's data.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
Antonio Diaz Diaz wrote:
> Joerg Schilling wrote:
> > POSIX requires st_blocks to be != 0 in case that the file contains data.
>
> Please, could you provide a reference? I can't find such requirement at
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html

blkcnt_t st_blocks Number of blocks allocated for this object.

It should be obvious that a file that offers content also has allocated blocks. Blocks are "allocated" when the OS decides whether the new data will fit on the medium. The fact that some filesystems may have data in a cache but not yet on the medium does not matter here. This is how UNIX has worked since st_blocks was introduced nearly 40 years ago.

A new filesystem cannot introduce new rules just because people believe it would save time.

Jörg
-- EMail: jo...@schily.net (home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
Joerg Schilling wrote: POSIX requires st_blocks to be != 0 in case that the file contains data. Please, could you provide a reference? I can't find such requirement at http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html Thanks. Antonio. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
On 07/06/16 14:25, Wang Shilong wrote: > 'btrfs file du' is a very useful tool to watch my system > file usage with snapshot aware. > > when trying to run following commands: > [root@localhost btrfs-progs]# btrfs file du / > Total Exclusive Set shared Filename > ERROR: Failed to lookup root id - Inappropriate ioctl for device > ERROR: cannot check space of '/': Unknown error -1 > > and My Filesystem looks like this: > [root@localhost btrfs-progs]# df -Th > Filesystem Type Size Used Avail Use% Mounted on > devtmpfs devtmpfs 16G 0 16G 0% /dev > tmpfs tmpfs 16G 368K 16G 1% /dev/shm > tmpfs tmpfs 16G 1.4M 16G 1% /run > tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup > /dev/sda3 btrfs 60G 19G 40G 33% / > tmpfs tmpfs 16G 332K 16G 1% /tmp > /dev/sdc btrfs 2.8T 166G 1.7T 9% /data > /dev/sda2 xfs 2.0G 452M 1.6G 23% /boot > /dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi > tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000 > > So I installed Btrfs as my root partition, but boot partition > can be other fs. > > We can Let btrfs tool aware of this is not a btrfs file or > directory and skip those files, so that someone like me > could just run 'btrfs file du /' to scan all btrfs filesystems. > > After patch, it will look like: >Total Exclusive Set shared Filename > skipping not btrfs dir/file: boot > skipping not btrfs dir/file: dev > skipping not btrfs dir/file: proc > skipping not btrfs dir/file: run > skipping not btrfs dir/file: sys > 0.00B 0.00B - //root/.bash_logout > 0.00B 0.00B - //root/.bash_profile > 0.00B 0.00B - //root/.bashrc > 0.00B 0.00B - //root/.cshrc > 0.00B 0.00B - //root/.tcshrc > > This works for me to analysis system usage and analysis > performaces. This is great, but can we please skip the "skipping .." messages? Maybe it's just me but I really don't see the value of printing them when they don't contribute to the result. They also mess up the display. :) thanks, Holger -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs-progs: du: fix to skip not btrfs dir/file
'btrfs file du' is a very useful tool to watch my system file usage with snapshot awareness.

When trying to run the following commands:
[root@localhost btrfs-progs]# btrfs file du /
Total Exclusive Set shared Filename
ERROR: Failed to lookup root id - Inappropriate ioctl for device
ERROR: cannot check space of '/': Unknown error -1

and my filesystem looks like this:
[root@localhost btrfs-progs]# df -Th
Filesystem Type Size Used Avail Use% Mounted on
devtmpfs devtmpfs 16G 0 16G 0% /dev
tmpfs tmpfs 16G 368K 16G 1% /dev/shm
tmpfs tmpfs 16G 1.4M 16G 1% /run
tmpfs tmpfs 16G 0 16G 0% /sys/fs/cgroup
/dev/sda3 btrfs 60G 19G 40G 33% /
tmpfs tmpfs 16G 332K 16G 1% /tmp
/dev/sdc btrfs 2.8T 166G 1.7T 9% /data
/dev/sda2 xfs 2.0G 452M 1.6G 23% /boot
/dev/sda1 vfat 1.9G 11M 1.9G 1% /boot/efi
tmpfs tmpfs 3.2G 24K 3.2G 1% /run/user/1000

So I installed Btrfs as my root partition, but the boot partition can be another fs.

We can let the btrfs tool recognize that something is not a btrfs file or directory and skip those files, so that someone like me could just run 'btrfs file du /' to scan all btrfs filesystems.

After the patch, it will look like:
Total Exclusive Set shared Filename
skipping not btrfs dir/file: boot
skipping not btrfs dir/file: dev
skipping not btrfs dir/file: proc
skipping not btrfs dir/file: run
skipping not btrfs dir/file: sys
0.00B 0.00B - //root/.bash_logout
0.00B 0.00B - //root/.bash_profile
0.00B 0.00B - //root/.bashrc
0.00B 0.00B - //root/.cshrc
0.00B 0.00B - //root/.tcshrc

This works for me to analyze system usage and performance.

Signed-off-by: Wang Shilong
---
 cmds-fi-du.c   | 11 ++++++++++-
 cmds-inspect.c |  2 +-
 utils.c        |  8 ++++----
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/cmds-fi-du.c b/cmds-fi-du.c
index 12855a5..bf0e62c 100644
--- a/cmds-fi-du.c
+++ b/cmds-fi-du.c
@@ -389,8 +389,17 @@ static int du_walk_dir(struct du_dir_ctxt *ctxt, struct rb_root *shared_extents)
 				dirfd(dirstream), shared_extents, &tot, &shr, 0);
-		if (ret)
+		if (ret == -ENOTTY) {
+			fprintf(stdout,
+				"skipping not btrfs dir/file: %s\n",
+				entry->d_name);
+			continue;
+		} else if (ret) {
+			fprintf(stderr,
+				"failed to walk dir/file: %s :%s\n",
+				entry->d_name, strerror(-ret));
 			break;
+		}
 		ctxt->bytes_total += tot;
 		ctxt->bytes_shared += shr;

diff --git a/cmds-inspect.c b/cmds-inspect.c
index dd7b9dd..2ae44be 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -323,7 +323,7 @@ static int cmd_inspect_rootid(int argc, char **argv)
 	ret = lookup_ino_rootid(fd, &rootid);
 	if (ret) {
-		error("rootid failed with ret=%d", ret);
+		error("failed to lookup root id: %s", strerror(-ret));
 		goto out;
 	}

diff --git a/utils.c b/utils.c
index 578fdb0..f73b048 100644
--- a/utils.c
+++ b/utils.c
@@ -2815,6 +2815,8 @@ path:
 	if (fd < 0)
 		goto err;
 	ret = lookup_ino_rootid(fd, &rootid);
+	if (ret)
+		error("failed to lookup root id: %s", strerror(-ret));
 	close(fd);
 	if (ret < 0)
 		goto err;
@@ -3497,10 +3499,8 @@ int lookup_ino_rootid(int fd, u64 *rootid)
 	args.objectid = BTRFS_FIRST_FREE_OBJECTID;
 	ret = ioctl(fd, BTRFS_IOC_INO_LOOKUP, &args);
-	if (ret < 0) {
-		error("failed to lookup root id: %s", strerror(errno));
-		return ret;
-	}
+	if (ret < 0)
+		return -errno;
 	*rootid = args.treeid;
--
2.7.4
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 06, 2016 at 02:55:37PM +0300, Andrei Borzenkov wrote:
> On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn wrote:
> > On 2016-07-06 05:51, Andrei Borzenkov wrote:
> >> On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote:
> >>> I started a systemd-devel@ thread since that's where most udev stuff
> >>> gets talked about.
> >>>
> >>> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
> >>
> >> Before discussing how to implement it in systemd, we need to decide
> >> what to implement. I.e.
> >> 1) do you always want to mount filesystem in degraded mode if not
> >> enough devices are present or only if explicit hint is given?
> >> 2) do you want to restrict degrade handling to root only or to other
> >> filesystems as well? Note that there could be more early boot
> >> filesystems that absolutely need same treatment (enters separate
> >> /usr), and there are also normal filesystems that may need be mounted
> >> even degraded.
> >> 3) can we query btrfs whether it is mountable in degraded mode?
> >> according to documentation, "btrfs device ready" (which udev builtin
> >> follows) checks "if it has ALL of it's devices in cache for mounting".
> >> This is required for proper systemd ordering of services.
> >
> > To be entirely honest, if it were me, I'd want systemd to fsck off. If the
> > kernel mount(2) call succeeds, then the filesystem was ready enough to
> > mount, and if it doesn't, then it wasn't, end of story.
>
> How should user space know when to try mount? What user space is
> supposed to do during boot if mount fails? Do you suggest
>
> while true; do
>     mount /dev/foo && exit 0
> done
>
> as part of startup sequence? And note that nowhere is systemd involved so far.

Getting rid of such loops was the original motivation for the ioctl: http://www.spinics.net/lists/linux-btrfs/msg17372.html

Maybe the ioctl needs extending? Instead of returning 1/0, it could take a flag saying "return 1 as soon as degraded mount is possible"?

-- Tomasz Torcz Morality must always be based on practicality. xmpp: zdzich...@chrome.pl -- Baron Vladimir Harkonnen
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
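For comparison, this is what the existing ioctl already gives user space through btrfs-progs, as a sketch of the kind of loop an initramfs can run today (/dev/sdb1 is a placeholder); note that 'device ready' answers "are all devices present", not "is a degraded mount possible", which is exactly the gap discussed here:

  # wait until btrfs reports every member device has been scanned
  until btrfs device ready /dev/sdb1; do
      sleep 0.5
  done
  mount /dev/sdb1 /sysroot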
Re: 64-btrfs.rules and degraded boot
On 2016-07-06 08:39, Andrei Borzenkov wrote: Sent from my iPhone

On 6 July 2016, at 15:14, Austin S. Hemmelgarn wrote: On 2016-07-06 07:55, Andrei Borzenkov wrote: On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn wrote: On 2016-07-06 05:51, Andrei Borzenkov wrote: On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote:

I started a systemd-devel@ thread since that's where most udev stuff gets talked about.
https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html

Before discussing how to implement it in systemd, we need to decide what to implement. I.e.
1) do you always want to mount filesystem in degraded mode if not enough devices are present or only if explicit hint is given?
2) do you want to restrict degrade handling to root only or to other filesystems as well? Note that there could be more early boot filesystems that absolutely need same treatment (enters separate /usr), and there are also normal filesystems that may need be mounted even degraded.
3) can we query btrfs whether it is mountable in degraded mode? according to documentation, "btrfs device ready" (which udev builtin follows) checks "if it has ALL of it's devices in cache for mounting". This is required for proper systemd ordering of services.

To be entirely honest, if it were me, I'd want systemd to fsck off. If the kernel mount(2) call succeeds, then the filesystem was ready enough to mount, and if it doesn't, then it wasn't, end of story.

How should user space know when to try mount? What user space is supposed to do during boot if mount fails? Do you suggest

while true; do
    mount /dev/foo && exit 0
done

as part of startup sequence? And note that nowhere is systemd involved so far.

Nowhere there, except if you have a filesystem in fstab (or a mount unit, which I hate for other reasons that I will not go into right now), and you mount it and systemd thinks the device isn't ready, it unmounts it _immediately_. In the case of boot, it's because of systemd thinking the device isn't ready that you can't mount degraded with a missing device. In the case of the root filesystem at least, the initramfs is expected to handle this, and most of them do poll in some way, or have other methods of determining this. I occasionally have issues with it with dracut without systemd, but that's due to a separate bug there involving the device mapper.

How does this systemd bashing answer my question - how does user space know when it can call mount at startup?

You mentioned that systemd wasn't involved, which is patently false if it's being used as your init system, and I was admittedly mostly responding to that. Now, to answer the primary question which I forgot to answer: Userspace doesn't. Systemd doesn't either but assumes it does and checks in a flawed way. Dracut's polling loop assumes it does but sometimes fails in a different way. There is no way other than calling mount right now to know for sure if the mount will succeed, and that actually applies to a certain degree to any filesystem (because any number of things that are outside of even the kernel's control might happen while trying to mount the device). The whole concept of trying to track in userspace something the kernel itself tracks and knows a whole lot more about is absolutely stupid.

It need not be user space. If the kernel notifies user space when the filesystem is mountable, problem solved. It could be a udev event, netlink, whatever. Until the kernel does it, user space needs to either poll or somehow track it based on available events.

This I agree could be done better, but it absolutely should not be in userspace; the notification needs to come from the kernel, but that leads to the problem of knowing whether or not the FS can mount degraded, or only ro, or any number of other situations.

It makes some sense when dealing with LVM or MD, because that is potentially a security issue (someone could inject a bogus device node that you then mount instead of your desired target),

I do not understand it at all. MD and LVM have exactly the same problem - they need to know when they can assemble MD/VG. I miss what it has to do with security, sorry.

If you don't track whether or not the device is assembled, then someone could create an arbitrary device node with the same name and then get you to mount that, possibly causing all kinds of issues depending on any number of other factors.

Device node is created as soon as array is seen for the first time. If you imply someone may replace it, what prevents doing it at any arbitrary time in the future?

It's still possible, but it's not as easy because replacing it after it's mounted would require a remount to have any effect. The most reliable time to do something like this is during boot before the mount. LVM and/or MD may or may not replace the node properly when they start (I don't have enough
Re: 64-btrfs.rules and degraded boot
Sent from my iPhone

> On 6 July 2016, at 15:14, Austin S. Hemmelgarn wrote:
>
>> On 2016-07-06 07:55, Andrei Borzenkov wrote:
>> On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn
>> wrote:
>>> On 2016-07-06 05:51, Andrei Borzenkov wrote:
>>>> On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote:
>>>>>
>>>>> I started a systemd-devel@ thread since that's where most udev stuff
>>>>> gets talked about.
>>>>>
>>>>> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
>>>>
>>>> Before discussing how to implement it in systemd, we need to decide
>>>> what to implement. I.e.
>>>>
>>>> 1) do you always want to mount filesystem in degraded mode if not
>>>> enough devices are present or only if explicit hint is given?
>>>> 2) do you want to restrict degrade handling to root only or to other
>>>> filesystems as well? Note that there could be more early boot
>>>> filesystems that absolutely need same treatment (enters separate
>>>> /usr), and there are also normal filesystems that may need be mounted
>>>> even degraded.
>>>> 3) can we query btrfs whether it is mountable in degraded mode?
>>>> according to documentation, "btrfs device ready" (which udev builtin
>>>> follows) checks "if it has ALL of it's devices in cache for mounting".
>>>> This is required for proper systemd ordering of services.
>>>
>>> To be entirely honest, if it were me, I'd want systemd to fsck off. If the
>>> kernel mount(2) call succeeds, then the filesystem was ready enough to
>>> mount, and if it doesn't, then it wasn't, end of story.
>>
>> How should user space know when to try mount? What user space is
>> supposed to do during boot if mount fails? Do you suggest
>>
>> while true; do
>>     mount /dev/foo && exit 0
>> done
>>
>> as part of startup sequence? And note that nowhere is systemd involved so
>> far.
> Nowhere there, except if you have a filesystem in fstab (or a mount unit,
> which I hate for other reasons that I will not go into right now), and you
> mount it and systemd thinks the device isn't ready, it unmounts it
> _immediately_. In the case of boot, it's because of systemd thinking the
> device isn't ready that you can't mount degraded with a missing device. In
> the case of the root filesystem at least, the initramfs is expected to handle
> this, and most of them do poll in some way, or have other methods of
> determining this. I occasionally have issues with it with dracut without
> systemd, but that's due to a separate bug there involving the device mapper.

How does this systemd bashing answer my question - how does user space know when it can call mount at startup?

>>> The whole concept
>>> of trying to track in userspace something the kernel itself tracks and knows
>>> a whole lot more about is absolutely stupid.
>>
>> It need not be user space. If kernel notifies user space when
>> filesystem is mountable, problem solved. It could be udev event,
>> netlink, whatever. Until kernel does it, user space need to either
>> poll or somehow track it based on available events.
> This I agree could be done better, but it absolutely should not be in
> userspace, the notification needs to come from the kernel, but that leads to
> the problem of knowing whether or not the FS can mount degraded, or only ro,
> or any number of other situations.
>>
>>> It makes some sense when
>>> dealing with LVM or MD, because that is potentially a security issue
>>> (someone could inject a bogus device node that you then mount instead of
>>> your desired target),
>>
>> I do not understand it at all. MD and LVM have exactly the same problem
>> - they need to know when they can assemble MD/VG. I miss what it has
>> to do with security, sorry.
> If you don't track whether or not the device is assembled, then someone could
> create an arbitrary device node with the same name and then get you to mount
> that, possibly causing all kinds of issues depending on any number of other
> factors.

Device node is created as soon as array is seen for the first time. If you imply someone may replace it, what prevents doing it at any arbitrary time in the future?

>>
>>> but it makes no sense here, because there's no way to
>>> prevent the equivalent from happening in BTRFS.
>>>
>>> As far as the udev rules, I'm pretty certain that _we_ ship those with
>>> btrfs-progs,
>>
>> No, you do not. You ship rule to rename devices to be more
>> "user-friendly". But the rule in question has always been part of
>> udev.
> Ah, you're right, I was mistaken about this.
>>
>>> I have no idea why they're packaged with udev in CentOS (oh
>>> wait, I bet they package every single possible udev rule in that package
>>> just in case, don't they?).
>
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at
Re: [PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()
On 07/06/16 12:37, Wang Xiaoguang wrote:
> The below test script can reproduce this false ENOSPC:
>
> #!/bin/bash
> dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
> dev=$(losetup --show -f fs.img)
> mkfs.btrfs -f -M $dev
> mkdir /tmp/mntpoint
> mount $dev /tmp/mntpoint
> cd /tmp/mntpoint
> xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile
>
> The above fallocate(2) operation will fail with ENOSPC, but the fs in
> fact still has free space to satisfy the request. The reason is that
> btrfs_fallocate() does not decrease btrfs_space_info's bytes_may_use in
> time; it calls btrfs_free_reserved_data_space_noquota() at the end of
> btrfs_fallocate(), which is too late and has already added false,
> unnecessary pressure to the enospc system. See the call graph:
>
> btrfs_fallocate()
> |-> btrfs_alloc_data_chunk_ondemand()
>     It will increase btrfs_space_info's bytes_may_use accordingly.
> |-> btrfs_prealloc_file_range()
>     It will call btrfs_reserve_extent(), but note that the alloc type is
>     RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes() will only
>     increase btrfs_space_info's bytes_reserved accordingly, but will not
>     decrease btrfs_space_info's bytes_may_use. We have then obviously
>     overestimated the real needed disk space, which impacts other
>     processes doing write(2) or fallocate(2) operations and can also
>     impact metadata reservation in mixed mode; bytes_may_use is only
>     decreased at the end of btrfs_fallocate().
>
> To fix this false ENOSPC, we need to decrease btrfs_space_info's
> bytes_may_use in btrfs_prealloc_file_range() in time, as we do in
> cow_file_range(). See the call graph in:
>
> cow_file_range()
> |-> extent_clear_unlock_delalloc()
>     |-> clear_extent_bit()
>         |-> btrfs_clear_bit_hook()
>             |-> btrfs_free_reserved_data_space_noquota()
>                 This function will decrease bytes_may_use accordingly.
>
> So this patch chooses to call btrfs_free_reserved_data_space() in
> __btrfs_prealloc_file_range() for both the successful and failed paths.
>
> Also this patch removes some old and useless comments.
>
> Signed-off-by: Wang Xiaoguang

Verified that the reproducer script indeed fails (with btrfs ~4.7) and the patch (on top of 1/2) fixes it. Also ran a bunch of other fallocating things without problem. Free space also still seems sane, as far as I could tell.

So for both patches:

Tested-by: Holger Hoffstätte

cheers,
Holger
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
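For anyone replaying the reproducer on a scratch machine, a sketch of verification and teardown steps, assuming the same fs.img and /tmp/mntpoint names used in the patch description ($dev being whatever losetup printed):

# Inspect the space accounting right after the failed falloc, then
# tear the scratch setup down.
btrfs fi df /tmp/mntpoint
cd /
umount /tmp/mntpoint
losetup -d "$dev"      # $dev as printed by 'losetup --show -f fs.img'
rm -f fs.img
rmdir /tmp/mntpoint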
Re: btrfs RAID 10 truncates files over 2G to 4096 bytes.
> On 6 Jul 2016, at 02:25, Henk Slager wrote:
>
> On Wed, Jul 6, 2016 at 2:32 AM, Tomasz Kusmierz wrote:
>>
>> On 6 Jul 2016, at 00:30, Henk Slager wrote:
>>
>> On Mon, Jul 4, 2016 at 11:28 PM, Tomasz Kusmierz wrote:
>>
>> I did consider that, but:
>> - some files were NOT accessed by anything, with 100% certainty (well, if there is a rootkit on my system or something of that shape, then maybe yes)
>> - the only application that could access those files is totem (well, Nautilus checks the extension -> directs it to totem), so in that case we would hear about an outbreak of totem killing people's files.
>> - if it was a kernel bug, then other large files would be affected.
>>
>> Maybe I'm wrong and it's actually related to the fact that all those files are located in a single location on the file system (a single folder) that might have a historical bug in some structure somewhere?
>>
>> I find it hard to imagine that this has something to do with the folder structure, unless maybe the folder is a subvolume with non-default attributes or so. How the files in that folder were created (at full disk transfer speed, or over a day or even a week) might give some hint. You could run filefrag and see if that rings a bell.
>>
>> Files that are 4096 bytes show:
>> 1 extent found
>
> I actually meant filefrag for the files that are not (yet) truncated
> to 4k. For example, for virtual machine image files (CoW), one could see
> an MBR write.

117 extents found
filesize 15468645003
good / bad?

>> I forgot to add that the file system was created a long time ago, and it
>> was created with leaf & node size = 16k.
>>
>> If this long time ago is >2 years, then you have likely specifically
>> set node size = 16k; otherwise, with older tools, it would have been 4K.
>>
>> You are right, I used -l 16K -n 16K
>>
>> Have you created it as raid10, or has it undergone profile conversions?
>>
>> Due to lack of spare disks (it may sound odd to some, but spending on
>> more than 6 disks for home use seems like overkill) and due to the last
>> failure I had, I had to migrate all data to a new file system.
>> It played out this way:
>> 1. from the original FS I removed 2 disks
>> 2. created RAID1 on those 2 disks
>> 3. shifted 2TB
>> 4. removed 2 disks from the source FS and added those to the destination FS
>> 5. shifted a further 2TB
>> 6. destroyed the original FS and added 2 disks to the destination FS
>> 7. converted the destination FS to RAID10
>>
>> FYI, when I convert to RAID10 I use:
>> btrfs balance start -mconvert=raid10 -dconvert=raid10 -sconvert=raid10 -f /path/to/FS
>>
>> This filesystem has 5 subvolumes. The affected files are located in a
>> separate folder within a "victim folder" that is within one subvolume.
>>
>> It could also be that the ondisk format is somewhat corrupted (btrfs
>> check should find that) and that that causes the issue.
>>
>> root@noname_server:/mnt# btrfs check /dev/sdg1
>> Checking filesystem on /dev/sdg1
>> UUID: d4cd1d5f-92c4-4b0f-8d45-1b378eff92a1
>> checking extents
>> checking free space cache
>> checking fs roots
>> checking csums
>> checking root refs
>> found 4424060642634 bytes used err is 0
>> total csum bytes: 4315954936
>> total tree bytes: 4522786816
>> total fs tree bytes: 61702144
>> total extent tree bytes: 41402368
>> btree space waste bytes: 72430813
>> file data blocks allocated: 4475917217792
>>  referenced 4420407603200
>>
>> No luck there :/
>
> Indeed looks all normal.
>
>> In-lining on raid10 has caused me some trouble (I had 4k nodes) over
>> time; it happened over a year ago with kernels recent at that time, but
>> the fs was converted from raid5.
>>
>> Could you please elaborate on that? You also ended up with files that
>> got truncated to 4096 bytes?
>
> I did not have truncated-to-4k files, but your case makes me think of
> small-file inlining. The default max_inline mount option is 8k, and that
> means that 0 to ~3k files end up in metadata. I had size corruptions
> for several of those small files that were updated quite frequently,
> also within commit time AFAIK. Btrfs check lists this as errors 400,
> although fs operation is not disturbed. I don't know what happens if
> those small files are being updated/rewritten and are just below or just
> above the max_inline limit.
>
> The only thing I was thinking of is that your files started small, so
> inline, and were then extended to multi-GB. In the past there were
> 'bad extent/chunk type' issues, and it was suggested that the fs would
> have been an ext4-converted one (which had non-compliant mixed metadata
> and data), but for most it was not the case. So there was/is something
> unclear, but a full balance or so fixed it as far as I remember. But
> this is guessing; I do not have any failure cases like the one you see.

When I think of it, I did move this folder first when the filesystem was RAID 1 (or not
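For readers following Henk's filefrag suggestion in this exchange, a sketch of the two checks in question; the path is a placeholder for one of the affected (or not-yet-affected) files:

f=/path/to/suspect/file   # placeholder

# Verbose extent map: a truncated file shows a single small extent,
# while a healthy multi-GB file should list many extents.
filefrag -v "$f"

# Compare apparent size against allocated 512-byte blocks; a multi-GB
# st_size with a tiny block count matches the truncation symptom here.
stat -c 'size=%s blocks=%b' "$f"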
Re: 64-btrfs.rules and degraded boot
On 2016-07-06 07:55, Andrei Borzenkov wrote:

On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn wrote:

On 2016-07-06 05:51, Andrei Borzenkov wrote:

On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote:

I started a systemd-devel@ thread since that's where most udev stuff gets talked about.

https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html

Before discussing how to implement it in systemd, we need to decide what to implement. I.e.

1) do you always want to mount the filesystem in degraded mode if not enough devices are present, or only if an explicit hint is given?
2) do you want to restrict degraded handling to root only, or to other filesystems as well? Note that there could be more early-boot filesystems that absolutely need the same treatment (enter a separate /usr), and there are also normal filesystems that may need to be mounted even degraded.
3) can we query btrfs whether it is mountable in degraded mode? According to the documentation, "btrfs device ready" (which the udev builtin follows) checks "if it has ALL of its devices in cache for mounting". This is required for proper systemd ordering of services.

To be entirely honest, if it were me, I'd want systemd to fsck off. If the kernel mount(2) call succeeds, then the filesystem was ready enough to mount, and if it doesn't, then it wasn't, end of story.

How should user space know when to try the mount? What is user space supposed to do during boot if mount fails? Do you suggest

while true; do
    mount /dev/foo && exit 0
done

as part of the startup sequence? And note that nowhere is systemd involved so far.

Nowhere there, except if you have a filesystem in fstab (or a mount unit, which I hate for other reasons that I will not go into right now), and you mount it and systemd thinks the device isn't ready, it unmounts it _immediately_. In the case of boot, it's because systemd thinks the device isn't ready that you can't mount degraded with a missing device. In the case of the root filesystem at least, the initramfs is expected to handle this, and most of them do poll in some way, or have other methods of determining this. I occasionally have issues with it with dracut without systemd, but that's due to a separate bug there involving the device mapper.

The whole concept of trying to track in userspace something the kernel itself tracks and knows a whole lot more about is absolutely stupid.

It need not be user space. If the kernel notifies user space when the filesystem is mountable, problem solved. It could be a udev event, netlink, whatever. Until the kernel does that, user space needs to either poll or somehow track it based on available events.

This I agree could be done better, but it absolutely should not be in userspace; the notification needs to come from the kernel, but that leads to the problem of knowing whether or not the FS can mount degraded, or only ro, or any number of other situations.

It makes some sense when dealing with LVM or MD, because that is potentially a security issue (someone could inject a bogus device node that you then mount instead of your desired target),

I do not understand it at all. MD and LVM have exactly the same problem - they need to know when they can assemble the MD/VG. I fail to see what it has to do with security, sorry.

If you don't track whether or not the device is assembled, then someone could create an arbitrary device node with the same name and then get you to mount that, possibly causing all kinds of issues depending on any number of other factors.

but it makes no sense here, because there's no way to prevent the equivalent from happening in BTRFS.

As far as the udev rules, I'm pretty certain that _we_ ship those with btrfs-progs,

No, you do not. You ship a rule to rename devices to be more "user-friendly". But the rule in question has always been part of udev.

Ah, you're right, I was mistaken about this.

I have no idea why they're packaged with udev in CentOS (oh wait, I bet they package every single possible udev rule in that package just in case, don't they?).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 64-btrfs.rules and degraded boot
On Wed, Jul 6, 2016 at 2:45 PM, Austin S. Hemmelgarn wrote:
> On 2016-07-06 05:51, Andrei Borzenkov wrote:
>>
>> On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote:
>>>
>>> I started a systemd-devel@ thread since that's where most udev stuff
>>> gets talked about.
>>>
>>> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
>>
>> Before discussing how to implement it in systemd, we need to decide
>> what to implement. I.e.
>>
>> 1) do you always want to mount the filesystem in degraded mode if not
>> enough devices are present, or only if an explicit hint is given?
>> 2) do you want to restrict degraded handling to root only, or to other
>> filesystems as well? Note that there could be more early-boot
>> filesystems that absolutely need the same treatment (enter a separate
>> /usr), and there are also normal filesystems that may need to be
>> mounted even degraded.
>> 3) can we query btrfs whether it is mountable in degraded mode?
>> According to the documentation, "btrfs device ready" (which the udev
>> builtin follows) checks "if it has ALL of its devices in cache for
>> mounting". This is required for proper systemd ordering of services.
>
> To be entirely honest, if it were me, I'd want systemd to fsck off. If the
> kernel mount(2) call succeeds, then the filesystem was ready enough to
> mount, and if it doesn't, then it wasn't, end of story.

How should user space know when to try the mount? What is user space supposed to do during boot if mount fails? Do you suggest

while true; do
    mount /dev/foo && exit 0
done

as part of the startup sequence? And note that nowhere is systemd involved so far.

> The whole concept of trying to track in userspace something the kernel
> itself tracks and knows a whole lot more about is absolutely stupid.

It need not be user space. If the kernel notifies user space when the filesystem is mountable, problem solved. It could be a udev event, netlink, whatever. Until the kernel does that, user space needs to either poll or somehow track it based on available events.

> It makes some sense when dealing with LVM or MD, because that is
> potentially a security issue (someone could inject a bogus device node
> that you then mount instead of your desired target),

I do not understand it at all. MD and LVM have exactly the same problem - they need to know when they can assemble the MD/VG. I fail to see what it has to do with security, sorry.

> but it makes no sense here, because there's no way to prevent the
> equivalent from happening in BTRFS.
>
> As far as the udev rules, I'm pretty certain that _we_ ship those with
> btrfs-progs,

No, you do not. You ship a rule to rename devices to be more "user-friendly". But the rule in question has always been part of udev.

> I have no idea why they're packaged with udev in CentOS (oh
> wait, I bet they package every single possible udev rule in that package
> just in case, don't they?).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Adventures in btrfs raid5 disk recovery
On 2016-07-05 19:05, Chris Murphy wrote:

Related: http://www.spinics.net/lists/raid/msg52880.html

Looks like there is some traction to figuring out what to do about this, whether it's a udev rule or something that happens in the kernel itself. Pretty much the only hardware setups unaffected by this are those with enterprise or NAS drives. Every configuration of a consumer drive - single, linear/concat, and all software (mdadm, lvm, Btrfs) RAID levels - is adversely affected by this.

The thing I don't get about this is that while the per-device settings on a given system are policy, the default value is not, and should be expected to work correctly (but not necessarily optimally) on as many systems as possible, so any claim that this should be fixed in udev is bogus by the regular kernel rules.

I suspect, but haven't tested, that ZFS On Linux would be equally affected, unless they're completely reimplementing their own block layer (?) So there are quite a few parties now negatively impacted by the current default behavior.

OTOH, I would not be surprised if the stance there is 'you get no support if you're not using enterprise drives', not because of the project itself, but because it's ZFS. Part of their minimum recommended hardware requirements is ECC RAM, so it wouldn't surprise me if enterprise storage devices are there too.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
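For anyone bitten by the timeout mismatch discussed above, a sketch of the commonly recommended mitigation; /dev/sdX is a placeholder, and the 7 s / 180 s values are the usual suggestions rather than anything mandated by this thread:

# Ask the drive to cap its internal error recovery at 7.0 seconds
# (read,write). Many consumer drives reject this.
smartctl -l scterc,70,70 /dev/sdX

# If the drive cannot cap its recovery time, raise the kernel's
# per-command timeout instead, so the drive can finish recovering
# before the block layer resets the link.
echo 180 > /sys/block/sdX/device/timeout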
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
"Austin S. Hemmelgarn"wrote: > > A broken filesystem is a broken filesystem. > > > > If you try to change gtar to work around a specific problem, it may fail in > > other situations. > The problem with this is that tar is assuming things that are not > guaranteed to be true. There is absolutely nothing that says that > st_blocks has to be non-zero if there's data in the file. In fact, the This is not true: POSIX requires st_blocks to be != 0 in case that the file contains data. > behavior that BTRFS used to have of reporting st_blocks to be 0 for > files entirely inlined in the metadata is absolutely correct given the > description of the field by POSIX, because there _are_ no blocks > allocated to the file (because the metadata block is technically > equivalent to the inode, which isn't counted by st_blocks). This is yet > another example of an old interface (in this case, sparse file > detection) being short-sighted (read in this case as non-existent). The internal state of a file system is irrelevant. The only thing that counts is the user space view and if a file contains data (read succeeds in user space), it needs to report st_blocks != 0. > The proper fix for this is that tar (and anything else that handles > sparse files differently) should be parsing the file regardless. It has > to anyway for a normal sparse file to figure out where the sparse > regions are, and optimizing for a file that's completely sparse (and > therefore probably pre-allocated with fallocate) is not all that > reasonable considering that this is going to be a very rare case in > normal usage. This does not help. Even on a decent OS (e.g. Solaris since Summer 2005) and a decent tar implementation (star) that supports SEEK_HOLE since Summer 2005, this method will not work for all filesystems as there may be old filesystem implementations and as there may be NFS... For this reason, star still checks st_blocks in case that SEEK_HOLE did not work. Jörg -- EMail:jo...@schily.net(home) Jörg Schilling D-13353 Berlin joerg.schill...@fokus.fraunhofer.de (work) Blog: http://schily.blogspot.com/ URL: http://cdrecord.org/private/ http://sourceforge.net/projects/schilytools/files/' -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 64-btrfs.rules and degraded boot
On 2016-07-06 05:51, Andrei Borzenkov wrote:

On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote:

I started a systemd-devel@ thread since that's where most udev stuff gets talked about.

https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html

Before discussing how to implement it in systemd, we need to decide what to implement. I.e.

1) do you always want to mount the filesystem in degraded mode if not enough devices are present, or only if an explicit hint is given?
2) do you want to restrict degraded handling to root only, or to other filesystems as well? Note that there could be more early-boot filesystems that absolutely need the same treatment (enter a separate /usr), and there are also normal filesystems that may need to be mounted even degraded.
3) can we query btrfs whether it is mountable in degraded mode? According to the documentation, "btrfs device ready" (which the udev builtin follows) checks "if it has ALL of its devices in cache for mounting". This is required for proper systemd ordering of services.

To be entirely honest, if it were me, I'd want systemd to fsck off. If the kernel mount(2) call succeeds, then the filesystem was ready enough to mount, and if it doesn't, then it wasn't, end of story. The whole concept of trying to track in userspace something the kernel itself tracks and knows a whole lot more about is absolutely stupid. It makes some sense when dealing with LVM or MD, because that is potentially a security issue (someone could inject a bogus device node that you then mount instead of your desired target), but it makes no sense here, because there's no way to prevent the equivalent from happening in BTRFS.

As far as the udev rules, I'm pretty certain that _we_ ship those with btrfs-progs; I have no idea why they're packaged with udev in CentOS (oh wait, I bet they package every single possible udev rule in that package just in case, don't they?).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bug-tar] stat() on btrfs reports the st_blocks with delay (data loss in archivers)
On 2016-07-05 05:28, Joerg Schilling wrote:

Andreas Dilger wrote:

I think in addition to fixing btrfs (because it needs to work with existing tar/rsync/etc. tools) it makes sense to *also* fix the heuristics of tar to handle this situation more robustly. One option: if st_blocks == 0, then tar should also check whether st_mtime is less than 60s in the past, and if yes, it should call fsync() on the file to flush any unwritten data to disk, or assume the file is not sparse and read the whole file, so that it doesn't incorrectly assume that the file is sparse and skip archiving the file data.

A broken filesystem is a broken filesystem. If you try to change gtar to work around a specific problem, it may fail in other situations.

The problem with this is that tar is assuming things that are not guaranteed to be true. There is absolutely nothing that says that st_blocks has to be non-zero if there's data in the file. In fact, the behavior that BTRFS used to have of reporting st_blocks to be 0 for files entirely inlined in the metadata is absolutely correct given the description of the field by POSIX, because there _are_ no blocks allocated to the file (because the metadata block is technically equivalent to the inode, which isn't counted by st_blocks). This is yet another example of an old interface (in this case, sparse file detection) being short-sighted (read in this case as non-existent).

The proper fix for this is that tar (and anything else that handles sparse files differently) should be parsing the file regardless. It has to anyway for a normal sparse file to figure out where the sparse regions are, and optimizing for a file that's completely sparse (and therefore probably pre-allocated with fallocate) is not all that reasonable considering that this is going to be a very rare case in normal usage.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
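Andreas's mtime-based suggestion quoted above reads as roughly the following sketch; the 60-second threshold is his example, and note that 'sync FILE' syncing a single file requires a newer coreutils (an assumption worth checking on your system):

f=/path/to/file   # placeholder
blocks=$(stat -c %b "$f")
mtime=$(stat -c %Y "$f")
now=$(date +%s)

# If st_blocks is 0 but the file changed in the last 60 s, the zeros
# may just be unflushed delayed allocation: flush and re-check before
# concluding the file is sparse.
if [ "$blocks" -eq 0 ] && [ $((now - mtime)) -lt 60 ]; then
    sync "$f" 2>/dev/null || sync
    blocks=$(stat -c %b "$f")
fi
echo "blocks=$blocks"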
Re: Out of space error even though there's 100 GB unused?
Hi Hugo,
I agree that it seems to be a bug, and I'll be glad to help nail that down - if only because I have no other drive to move the data to :-)

As for your suggestion - no change:

[root@archb3 stan]# mount | grep home
/dev/sda4 on /home type btrfs (rw,relatime,nospace_cache,clear_cache,subvolid=5,subvol=/)
[root@archb3 stan]# touch test
touch: cannot touch 'test': No space left on device

Cheers,
Stan

2016-07-06 12:34 GMT+02:00 Hugo Mills:
> On Wed, Jul 06, 2016 at 11:55:42AM +0200, Stanislaw Kaminski wrote:
>> Hi,
>> I am fighting with this since at least Monday - see
>> https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left
>>
>> Here's the data:
>> # uname -a
>> Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016 armv5tel GNU/Linux
>>
>> # btrfs --version
>> btrfs-progs v4.6
>>
>> # btrfs fi show
>> Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
>>         Total devices 1 FS bytes used 1.71TiB
>>         devid    1 size 1.81TiB used 1.71TiB path /dev/sda4
>
>    In this state, you should definitely not be seeing out of space
> errors. This is, therefore, a bug you're seeing.
>
>    I've not been following things as closely as I'd like of late, but
> I think there was a bug recently involving the free space cache. It
> might be worth unmounting the FS and mounting again with the
> nospace_cache option, just to see if that helps.
>
>    Hugo.
>
>> # btrfs fi df /home
>> Data, single: total=1.71TiB, used=1.71TiB
>> System, DUP: total=32.00MiB, used=224.00KiB
>> Metadata, DUP: total=4.00GiB, used=2.07GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # btrfs f usage -T /home
>> Overall:
>>     Device size:                   1.81TiB
>>     Device allocated:              1.71TiB
>>     Device unallocated:           97.89GiB
>>     Device missing:                  0.00B
>>     Used:                          1.71TiB
>>     Free (estimated):             98.22GiB      (min: 49.27GiB)
>>     Data ratio:                       1.00
>>     Metadata ratio:                   2.00
>>     Global reserve:              512.00MiB      (used: 0.00B)
>>
>>              Data     Metadata  System
>> Id Path      single   DUP       DUP        Unallocated
>> -- --------- -------- --------- ---------- -----------
>>  1 /dev/sda4  1.71TiB  8.00GiB   64.00MiB     97.89GiB
>> -- --------- -------- --------- ---------- -----------
>>    Total      1.71TiB  4.00GiB   32.00MiB     97.89GiB
>>    Used       1.71TiB  2.07GiB  224.00KiB
>>
>> # btrfs fi du -s /home
>>      Total   Exclusive  Set shared  Filename
>>    1.60TiB     1.60TiB       0.00B  /home
>>
>> # btrfs f resize 1:+1G /home/
>> Resize '/home/' of '1:+1G'
>> ERROR: unable to resize '/home/': no enough free space
>>
>> This all is after closely following:
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>>
>> So, already did full volume rebalance, defrag, rebooted multiple times
>> - still, "Error: out of disk space".
>>
>> To sum up:
>> - my files sum to 1.6 TiB
>> - disk usage is shown to be 1.71 TiB
>> - volume size is 1.81 TiB
>> - btrfs util shows I have ~98 GiB free space on the volume
>> - I am getting "out of space" message
>>
>> Bonus:
>> - I removed 50 GB of data from the drive and I still get "out of
>> space" message after writing ~1 GB.
>>
>> Help would be very appreciated.
>>
>> Cheers,
>> Stan
>
> --
> Hugo Mills             | You can play with your friends' privates, but you
> hugo@... carfax.org.uk | can't play with your friends' childrens' privates.
> http://carfax.org.uk/  |
> PGP: E2AB1DE4          | C++ coding rule
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
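Since this thread is a common search hit: the FAQ advice already applied here amounts, roughly, to the incremental balance below (shown for completeness - Stan reports a full rebalance already, so in this case it evidently did not help). Substitute your own mount point:

# Reclaim mostly-empty chunks first; raise the usage cutoff gradually
# so the early, cheap passes get a chance to free unallocated space.
for n in 0 5 10 25 50; do
    btrfs balance start -dusage=$n -musage=$n /home
done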
[PATCH 2/2] btrfs: fix false ENOSPC for btrfs_fallocate()
The below test script can reproduce this false ENOSPC:

#!/bin/bash
dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
dev=$(losetup --show -f fs.img)
mkfs.btrfs -f -M $dev
mkdir /tmp/mntpoint
mount $dev /tmp/mntpoint
cd /tmp/mntpoint
xfs_io -f -c "falloc 0 $((40*1024*1024))" testfile

The above fallocate(2) operation will fail with ENOSPC, but the fs in fact still has free space to satisfy the request. The reason is that btrfs_fallocate() does not decrease btrfs_space_info's bytes_may_use in time; it calls btrfs_free_reserved_data_space_noquota() at the end of btrfs_fallocate(), which is too late and has already added false, unnecessary pressure to the enospc system. See the call graph:

btrfs_fallocate()
|-> btrfs_alloc_data_chunk_ondemand()
    It will increase btrfs_space_info's bytes_may_use accordingly.
|-> btrfs_prealloc_file_range()
    It will call btrfs_reserve_extent(), but note that the alloc type is
    RESERVE_ALLOC_NO_ACCOUNT, so btrfs_update_reserved_bytes() will only
    increase btrfs_space_info's bytes_reserved accordingly, but will not
    decrease btrfs_space_info's bytes_may_use. We have then obviously
    overestimated the real needed disk space, which impacts other
    processes doing write(2) or fallocate(2) operations and can also
    impact metadata reservation in mixed mode; bytes_may_use is only
    decreased at the end of btrfs_fallocate().

To fix this false ENOSPC, we need to decrease btrfs_space_info's bytes_may_use in btrfs_prealloc_file_range() in time, as we do in cow_file_range(). See the call graph in:

cow_file_range()
|-> extent_clear_unlock_delalloc()
    |-> clear_extent_bit()
        |-> btrfs_clear_bit_hook()
            |-> btrfs_free_reserved_data_space_noquota()
                This function will decrease bytes_may_use accordingly.

So this patch chooses to call btrfs_free_reserved_data_space() in __btrfs_prealloc_file_range() for both the successful and failed paths.

Also this patch removes some old and useless comments.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/extent-tree.c |  1 -
 fs/btrfs/file.c        | 23 ++++++++++++++++++++---
 fs/btrfs/inode-map.c   |  3 +--
 fs/btrfs/inode.c       | 12 ++++++++++++
 fs/btrfs/relocation.c  | 10 +---------
 5 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 82b912a..b0c86d2 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3490,7 +3490,6 @@ again:
 		dcs = BTRFS_DC_SETUP;
 	else if (ret == -ENOSPC)
 		set_bit(BTRFS_TRANS_CACHE_ENOSPC, &trans->transaction->flags);
-	btrfs_free_reserved_data_space(inode, 0, num_pages);
 out_put:
 	iput(inode);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 2234e88..f872113 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2669,6 +2669,7 @@ static long btrfs_fallocate(struct file *file, int mode,
 	alloc_start = round_down(offset, blocksize);
 	alloc_end = round_up(offset + len, blocksize);
+	cur_offset = alloc_start;
 
 	/* Make sure we aren't being give some crap mode */
 	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
@@ -2761,7 +2762,6 @@ static long btrfs_fallocate(struct file *file, int mode,
 	/* First, check if we exceed the qgroup limit */
 	INIT_LIST_HEAD(&reserve_list);
-	cur_offset = alloc_start;
 	while (1) {
 		em = btrfs_get_extent(inode, NULL, 0, cur_offset,
 				      alloc_end - cur_offset, 0);
@@ -2788,6 +2788,14 @@ static long btrfs_fallocate(struct file *file, int mode,
 					last_byte - cur_offset);
 			if (ret < 0)
 				break;
+		} else {
+			/*
+			 * Do not need to reserve an unwritten extent for this
+			 * range; free reserved data space first, otherwise
+			 * it'll result in a false ENOSPC error.
+			 */
+			btrfs_free_reserved_data_space(inode, cur_offset,
+						       last_byte - cur_offset);
 		}
 		free_extent_map(em);
 		cur_offset = last_byte;
@@ -2839,18 +2847,11 @@ out_unlock:
 	unlock_extent_cached(&BTRFS_I(inode)->io_tree, alloc_start, locked_end,
 			     &cached_state, GFP_KERNEL);
 out:
-	/*
-	 * As we waited the extent range, the data_rsv_map must be empty
-	 * in the range, as the written data range will be released from it.
-	 * And for a preallocated extent, it will also be released when
-	 * its metadata is written.
-	 * So this is completely used as cleanup.
-	 */
-	btrfs_qgroup_free_data(inode, alloc_start, alloc_end - alloc_start);
 	inode_unlock(inode);
 
 	/*
[PATCH 1/2] btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster()
In prealloc_file_extent_cluster(), btrfs_check_data_free_space() uses the wrong file offset for reloc_inode: it uses cluster->start and cluster->end, which are in fact extent bytenrs. The correct values should be cluster->[start|end] minus the block group's start bytenr:

    start bytenr    cluster->start
    |               |
    | extent | extent | ... | extent |
    |________________________________|
              block group

    reloc_inode

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/relocation.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 0477dca..abc2f69 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -3030,34 +3030,37 @@ int prealloc_file_extent_cluster(struct inode *inode,
 	u64 num_bytes;
 	int nr = 0;
 	int ret = 0;
+	u64 prealloc_start, prealloc_end;
 
 	BUG_ON(cluster->start != cluster->boundary[0]);
 	inode_lock(inode);
 
-	ret = btrfs_check_data_free_space(inode, cluster->start,
-					  cluster->end + 1 - cluster->start);
+	start = cluster->start - offset;
+	end = cluster->end - offset;
+	ret = btrfs_check_data_free_space(inode, start, end + 1 - start);
 	if (ret)
 		goto out;
 
 	while (nr < cluster->nr) {
-		start = cluster->boundary[nr] - offset;
+		prealloc_start = cluster->boundary[nr] - offset;
 		if (nr + 1 < cluster->nr)
-			end = cluster->boundary[nr + 1] - 1 - offset;
+			prealloc_end = cluster->boundary[nr + 1] - 1 - offset;
 		else
-			end = cluster->end - offset;
+			prealloc_end = cluster->end - offset;
 
-		lock_extent(&BTRFS_I(inode)->io_tree, start, end);
-		num_bytes = end + 1 - start;
-		ret = btrfs_prealloc_file_range(inode, 0, start,
+		lock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
+			    prealloc_end);
+		num_bytes = prealloc_end + 1 - prealloc_start;
+		ret = btrfs_prealloc_file_range(inode, 0, prealloc_start,
 						num_bytes, num_bytes,
-						end + 1, &alloc_hint);
-		unlock_extent(&BTRFS_I(inode)->io_tree, start, end);
+						prealloc_end + 1, &alloc_hint);
+		unlock_extent(&BTRFS_I(inode)->io_tree, prealloc_start,
+			      prealloc_end);
 		if (ret)
 			break;
 		nr++;
 	}
-	btrfs_free_reserved_data_space(inode, cluster->start,
-				       cluster->end + 1 - cluster->start);
+	btrfs_free_reserved_data_space(inode, start, end + 1 - start);
 out:
 	inode_unlock(inode);
 	return ret;
--
2.9.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Out of space error even though there's 100 GB unused?
On Wed, Jul 06, 2016 at 11:55:42AM +0200, Stanislaw Kaminski wrote:
> Hi,
> I am fighting with this since at least Monday - see
> https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left
>
> Here's the data:
> # uname -a
> Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016 armv5tel GNU/Linux
>
> # btrfs --version
> btrfs-progs v4.6
>
> # btrfs fi show
> Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
>         Total devices 1 FS bytes used 1.71TiB
>         devid    1 size 1.81TiB used 1.71TiB path /dev/sda4

   In this state, you should definitely not be seeing out of space errors. This is, therefore, a bug you're seeing.

   I've not been following things as closely as I'd like of late, but I think there was a bug recently involving the free space cache. It might be worth unmounting the FS and mounting again with the nospace_cache option, just to see if that helps.

   Hugo.

> # btrfs fi df /home
> Data, single: total=1.71TiB, used=1.71TiB
> System, DUP: total=32.00MiB, used=224.00KiB
> Metadata, DUP: total=4.00GiB, used=2.07GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> # btrfs f usage -T /home
> Overall:
>     Device size:                   1.81TiB
>     Device allocated:              1.71TiB
>     Device unallocated:           97.89GiB
>     Device missing:                  0.00B
>     Used:                          1.71TiB
>     Free (estimated):             98.22GiB      (min: 49.27GiB)
>     Data ratio:                       1.00
>     Metadata ratio:                   2.00
>     Global reserve:              512.00MiB      (used: 0.00B)
>
>              Data     Metadata  System
> Id Path      single   DUP       DUP        Unallocated
> -- --------- -------- --------- ---------- -----------
>  1 /dev/sda4  1.71TiB  8.00GiB   64.00MiB     97.89GiB
> -- --------- -------- --------- ---------- -----------
>    Total      1.71TiB  4.00GiB   32.00MiB     97.89GiB
>    Used       1.71TiB  2.07GiB  224.00KiB
>
> # btrfs fi du -s /home
>      Total   Exclusive  Set shared  Filename
>    1.60TiB     1.60TiB       0.00B  /home
>
> # btrfs f resize 1:+1G /home/
> Resize '/home/' of '1:+1G'
> ERROR: unable to resize '/home/': no enough free space
>
> This all is after closely following:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>
> So, already did full volume rebalance, defrag, rebooted multiple times
> - still, "Error: out of disk space".
>
> To sum up:
> - my files sum to 1.6 TiB
> - disk usage is shown to be 1.71 TiB
> - volume size is 1.81 TiB
> - btrfs util shows I have ~98 GiB free space on the volume
> - I am getting "out of space" message
>
> Bonus:
> - I removed 50 GB of data from the drive and I still get "out of
> space" message after writing ~1 GB.
>
> Help would be very appreciated.
>
> Cheers,
> Stan

--
Hugo Mills             | You can play with your friends' privates, but you
hugo@... carfax.org.uk | can't play with your friends' childrens' privates.
http://carfax.org.uk/  |
PGP: E2AB1DE4          | C++ coding rule
Re: Out of space error even though there's 100 GB unused?
Hi Alex,

Thanks for having a look.

"You're trying to resize a fs that is probably already fully using the block device it's on. I don't see anything incorrect happening here, but I might be missing something."
This was just to show that I can't do this; I know that it is already utilizing the entire block device.

"The unallocated space will be allocated if you start writing files to it."
That's what I would expect; unfortunately, it's kind of hard to write files to it, as I get an "Out of space" error. Tongue-in-cheek: if you know how to ignore the issue and start writing files, it would solve my problem.

Bottom line: if the disk is really full, then none of the tools shows that. If it is not (and I suspect it's not - as I mentioned, I just removed 50 GB of data from it), then why am I getting "out of space"?

As for the block device size:

# fdisk -l
Disk /dev/sda: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6070B645-D738-4730-BEF7-989210EF1DD7

Device         Start        End    Sectors  Size Type
/dev/sda1       2048     133119     131072   64M Linux filesystem
/dev/sda2     133120    2230271    2097152    1G Linux swap
/dev/sda3    2230272   19007487   16777216    8G Linux filesystem
/dev/sda4   19007488 3907029134 3888021647  1.8T Linux home

Cheers,
Stan

2016-07-06 12:10 GMT+02:00 Alexander Fougner:
> On 6 July 2016 at 12:03 PM, "Stanislaw Kaminski" wrote:
>>
>> Hi,
>> I am fighting with this since at least Monday - see
>> https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left
>>
>> Here's the data:
>> # uname -a
>> Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016 armv5tel GNU/Linux
>>
>> # btrfs --version
>> btrfs-progs v4.6
>>
>> # btrfs fi show
>> Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
>>         Total devices 1 FS bytes used 1.71TiB
>>         devid    1 size 1.81TiB used 1.71TiB path /dev/sda4
>>
>> # btrfs fi df /home
>> Data, single: total=1.71TiB, used=1.71TiB
>> System, DUP: total=32.00MiB, used=224.00KiB
>> Metadata, DUP: total=4.00GiB, used=2.07GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> # btrfs f usage -T /home
>> Overall:
>>     Device size:                   1.81TiB
>>     Device allocated:              1.71TiB
>>     Device unallocated:           97.89GiB
>>     Device missing:                  0.00B
>>     Used:                          1.71TiB
>>     Free (estimated):             98.22GiB      (min: 49.27GiB)
>>     Data ratio:                       1.00
>>     Metadata ratio:                   2.00
>>     Global reserve:              512.00MiB      (used: 0.00B)
>>
>>              Data     Metadata  System
>> Id Path      single   DUP       DUP        Unallocated
>> -- --------- -------- --------- ---------- -----------
>>  1 /dev/sda4  1.71TiB  8.00GiB   64.00MiB     97.89GiB
>> -- --------- -------- --------- ---------- -----------
>>    Total      1.71TiB  4.00GiB   32.00MiB     97.89GiB
>>    Used       1.71TiB  2.07GiB  224.00KiB
>>
>> # btrfs fi du -s /home
>>      Total   Exclusive  Set shared  Filename
>>    1.60TiB     1.60TiB       0.00B  /home
>>
>> # btrfs f resize 1:+1G /home/
>> Resize '/home/' of '1:+1G'
>> ERROR: unable to resize '/home/': no enough free space
>
> You're trying to resize a fs that is probably already fully using the block
> device it's on. I don't see anything incorrect happening here, but I might
> be missing something.
>
> The used space amounting to 1.6TiB is not as reliable as the btrfs fi df tool.
> The unallocated space will be allocated if you start writing files to it.
> What size is the parent block device?
>
>> This all is after closely following:
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
>> http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
>>
>> So, already did full volume rebalance, defrag, rebooted multiple times
>> - still, "Error: out of disk space".
>>
>> To sum up:
>> - my files sum to 1.6 TiB
>> - disk usage is shown to be 1.71 TiB
>> - volume size is 1.81 TiB
>> - btrfs util shows I have ~98 GiB free space on the volume
>> - I am getting "out of space" message
>>
>> Bonus:
>> - I removed 50 GB of data from the drive and I still get "out of
>> space" message after writing ~1 GB.
>>
>> Help would be very appreciated.
>>
>> Cheers,
>> Stan
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] btrfs: fix fsfreeze hang caused by delayed iputs deal
hello,

On 07/05/2016 01:35 AM, David Sterba wrote:

On Wed, Jun 29, 2016 at 01:15:10PM +0800, Wang Xiaoguang wrote:

When running fstests generic/068, sometimes we got the below WARNING:

xfs_io          D 8800331dbb20     0  6697   6693 0x0080
 8800331dbb20 88007acfc140 880034d895c0 8800331dc000
 880032d243e8 fffe 880032d24400 0001
 8800331dbb38 816a9045 880034d895c0 8800331dbba8
Call Trace:
 [] schedule+0x35/0x80
 [] rwsem_down_read_failed+0xf2/0x140
 [] ? __filemap_fdatawrite_range+0xd1/0x100
 [] call_rwsem_down_read_failed+0x18/0x30
 [] ? btrfs_alloc_block_rsv+0x2c/0xb0 [btrfs]
 [] percpu_down_read+0x35/0x50
 [] __sb_start_write+0x2c/0x40
 [] start_transaction+0x2a5/0x4d0 [btrfs]
 [] btrfs_join_transaction+0x17/0x20 [btrfs]
 [] btrfs_evict_inode+0x3c4/0x5d0 [btrfs]
 [] evict+0xba/0x1a0
 [] iput+0x196/0x200
 [] btrfs_run_delayed_iputs+0x70/0xc0 [btrfs]
 [] btrfs_commit_transaction+0x928/0xa80 [btrfs]
 [] btrfs_freeze+0x30/0x40 [btrfs]
 [] freeze_super+0xf0/0x190
 [] do_vfs_ioctl+0x4a5/0x5c0
 [] ? do_audit_syscall_entry+0x66/0x70
 [] ? syscall_trace_enter_phase1+0x11f/0x140
 [] SyS_ioctl+0x79/0x90
 [] do_syscall_64+0x62/0x110
 [] entry_SYSCALL64_slow_path+0x25/0x25

From this warning: freeze_super() already holds SB_FREEZE_FS, but btrfs_freeze() will call btrfs_commit_transaction() again; if btrfs_commit_transaction() finds that it has delayed iputs to handle, it'll call start_transaction(), which will try to take the SB_FREEZE_FS lock again, and a deadlock occurs.

The root cause is that in btrfs, sync_filesystem(sb) does not ensure all metadata is updated. See the below race window in freeze_super():

sync_filesystem(sb);
        |
        | race window
        | In this period, cleaner_kthread() may be scheduled to
        | run, and it calls btrfs_delete_unused_bgs(), which will
        | add some delayed iputs.
        |
sb->s_writers.frozen = SB_FREEZE_FS;
sb_wait_write(sb, SB_FREEZE_FS);
if (sb->s_op->freeze_fs) {
        /* freeze_fs will call btrfs_commit_transaction() */
        ret = sb->s_op->freeze_fs(sb);

So if btrfs is doing a freeze job, we should block btrfs_delete_unused_bgs(), to avoid adding delayed iputs.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/disk-io.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 863bf7a..fdbe0df 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1846,8 +1846,11 @@ static int cleaner_kthread(void *arg)
 	 * after acquiring fs_info->delete_unused_bgs_mutex. So we
 	 * can't hold, nor need to, fs_info->cleaner_mutex when deleting
 	 * unused block groups.
+	 *

Extra line, but I think you intended to write a comment that explains why the freeze protection is required here :)

Yes, but forgot to... :)

 	 */
+	__sb_start_write(root->fs_info->sb, SB_FREEZE_WRITE, true);

There's open-coding of an existing wrapper, sb_start_write; please use it instead.

OK, I can submit a new version using this wrapper. Also, could you please have a look at my reply to Filipe Manana in my last mail? I suggest another solution there, thanks.

Regards,
Xiaoguang Wang

 	btrfs_delete_unused_bgs(root->fs_info);
+	__sb_end_write(root->fs_info->sb, SB_FREEZE_WRITE);
 sleep:
 	if (!again) {
 		set_current_state(TASK_INTERRUPTIBLE);
--
2.9.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
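A sketch of how one might exercise the freeze path discussed above from user space (generic/068 stresses this far more thoroughly); the mount point is a placeholder:

#!/bin/bash
mnt=/mnt/scratch   # placeholder btrfs mount

# Run a background writer, then repeatedly freeze/unfreeze: the
# deadlock above involved freeze racing with the cleaner's delayed iputs.
xfs_io -f -c "pwrite 0 64m" "$mnt/testfile" &
for i in $(seq 20); do
    fsfreeze -f "$mnt"    # takes SB_FREEZE_FS, ends up in btrfs_freeze()
    sleep 1
    fsfreeze -u "$mnt"
done
wait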
Re: [PATCH 2/2] btrfs: fix free space calculation in dump_space_info()
hello,

On 07/05/2016 01:10 AM, David Sterba wrote:

On Wed, Jun 29, 2016 at 01:12:16PM +0800, Wang Xiaoguang wrote:

Can you please describe in more detail what this patch is fixing?

In the original dump_space_info(), free space is calculated as info->total_bytes - info->bytes_used - info->bytes_pinned - info->bytes_reserved - info->bytes_readonly, but I think the calculation should also subtract info->bytes_may_use :)

Regards,
Xiaoguang Wang

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/extent-tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8550a0e..520ba8f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7747,8 +7747,8 @@ static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
 	printk(KERN_INFO "BTRFS: space_info %llu has %llu free, is %sfull\n",
 	       info->flags,
 	       info->total_bytes - info->bytes_used - info->bytes_pinned -
-	       info->bytes_reserved - info->bytes_readonly,
-	       (info->full) ? "" : "not ");
+	       info->bytes_reserved - info->bytes_readonly -
+	       info->bytes_may_use, (info->full) ? "" : "not ");
 	printk(KERN_INFO "BTRFS: space_info total=%llu, used=%llu, pinned=%llu, "
 	       "reserved=%llu, may_use=%llu, readonly=%llu\n",
 	       info->total_bytes, info->bytes_used, info->bytes_pinned,
--
2.9.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Out of space error even though there's 100 GB unused?
Hi,
I am fighting with this since at least Monday - see
https://superuser.com/questions/1096658/btrfs-out-of-space-even-though-there-should-be-10-left

Here's the data:

# uname -a
Linux archb3 4.6.3-2-ARCH #1 PREEMPT Wed Jun 29 07:15:33 MDT 2016 armv5tel GNU/Linux

# btrfs --version
btrfs-progs v4.6

# btrfs fi show
Label: 'home'  uuid: 1c7e35e8-f013-4f65-9d19-eaa168ac088b
        Total devices 1 FS bytes used 1.71TiB
        devid    1 size 1.81TiB used 1.71TiB path /dev/sda4

# btrfs fi df /home
Data, single: total=1.71TiB, used=1.71TiB
System, DUP: total=32.00MiB, used=224.00KiB
Metadata, DUP: total=4.00GiB, used=2.07GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs f usage -T /home
Overall:
    Device size:                   1.81TiB
    Device allocated:              1.71TiB
    Device unallocated:           97.89GiB
    Device missing:                  0.00B
    Used:                          1.71TiB
    Free (estimated):             98.22GiB      (min: 49.27GiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB      (used: 0.00B)

             Data     Metadata  System
Id Path      single   DUP       DUP        Unallocated
-- --------- -------- --------- ---------- -----------
 1 /dev/sda4  1.71TiB  8.00GiB   64.00MiB     97.89GiB
-- --------- -------- --------- ---------- -----------
   Total      1.71TiB  4.00GiB   32.00MiB     97.89GiB
   Used       1.71TiB  2.07GiB  224.00KiB

# btrfs fi du -s /home
     Total   Exclusive  Set shared  Filename
   1.60TiB     1.60TiB       0.00B  /home

# btrfs f resize 1:+1G /home/
Resize '/home/' of '1:+1G'
ERROR: unable to resize '/home/': no enough free space

This all is after closely following:
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_I_ran_out_of_disk_space.21
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

So, already did full volume rebalance, defrag, rebooted multiple times - still, "Error: out of disk space".

To sum up:
- my files sum to 1.6 TiB
- disk usage is shown to be 1.71 TiB
- volume size is 1.81 TiB
- btrfs util shows I have ~98 GiB free space on the volume
- I am getting "out of space" message

Bonus:
- I removed 50 GB of data from the drive and I still get "out of space" message after writing ~1 GB.

Help would be very appreciated.

Cheers,
Stan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 64-btrfs.rules and degraded boot
On Tue, Jul 5, 2016 at 11:10 PM, Chris Murphy wrote:
> I started a systemd-devel@ thread since that's where most udev stuff
> gets talked about.
>
> https://lists.freedesktop.org/archives/systemd-devel/2016-July/037031.html
>

Before discussing how to implement it in systemd, we need to decide what to implement. I.e.

1) do you always want to mount the filesystem in degraded mode if not enough devices are present, or only if an explicit hint is given?
2) do you want to restrict degraded handling to root only, or to other filesystems as well? Note that there could be more early-boot filesystems that absolutely need the same treatment (enter a separate /usr), and there are also normal filesystems that may need to be mounted even degraded.
3) can we query btrfs whether it is mountable in degraded mode? According to the documentation, "btrfs device ready" (which the udev builtin follows) checks "if it has ALL of its devices in cache for mounting". This is required for proper systemd ordering of services.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to mount degraded RAID5
Now with 3 disks:

sudo btrfs check /dev/sda
parent transid verify failed on 7008807157760 wanted 70175 found 70133
parent transid verify failed on 7008807157760 wanted 70175 found 70133
checksum verify failed on 7008807157760 found F192848C wanted 1571393A
checksum verify failed on 7008807157760 found F192848C wanted 1571393A
bytenr mismatch, want=7008807157760, have=65536
Checking filesystem on /dev/sda
UUID: 2dab74bb-fc73-4c47-a413-a55840f6f71e
checking extents
parent transid verify failed on 7009468874752 wanted 70180 found 70133
parent transid verify failed on 7009468874752 wanted 70180 found 70133
checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
checksum verify failed on 7009468874752 found 2B10421A wanted CFF3FFAC
bytenr mismatch, want=7009468874752, have=65536
parent transid verify failed on 7008859045888 wanted 70175 found 70133
parent transid verify failed on 7008859045888 wanted 70175 found 70133
checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
checksum verify failed on 7008859045888 found 7313A127 wanted 97F01C91
bytenr mismatch, want=7008859045888, have=65536
parent transid verify failed on 7008899547136 wanted 70175 found 70133
parent transid verify failed on 7008899547136 wanted 70175 found 70133
checksum verify failed on 7008899547136 found 2B6F9045 wanted CF8C2DF3
parent transid verify failed on 7008899547136 wanted 70175 found 70133
Ignoring transid failure
leaf parent key incorrect 7008899547136
bad block 7008899547136
Errors found in extent allocation tree or chunk allocation
parent transid verify failed on 7009074167808 wanted 70175 found 70133
parent transid verify failed on 7009074167808 wanted 70175 found 70133
checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
checksum verify failed on 7009074167808 found FDA6D1F0 wanted 19456C46
bytenr mismatch, want=7009074167808, have=65536

sudo btrfs-debug-tree -d /dev/sdc
http://sebsauvage.net/paste/?d690b2c9d130008d#cni3fnKUZ7Y/oaXm+nsOw0afoWDFXNl26eC+vbJmcRA=

sudo btrfs-find-root /dev/sdc
parent transid verify failed on 7008807157760 wanted 70175 found 70133
parent transid verify failed on 7008807157760 wanted 70175 found 70133
Superblock thinks the generation is 70182
Superblock thinks the level is 1
Found tree root at 6062830010368 gen 70182 level 1
Well block 6062434418688(gen: 70181 level: 1) seems good, but generation/level doesn't match, want gen: 70182 level: 1
Well block 6062497202176(gen: 69186 level: 0) seems good, but generation/level doesn't match, want gen: 70182 level: 1
Well block 6062470332416(gen: 69186 level: 0) seems good, but generation/level doesn't match, want gen: 70182 level: 1

sudo smartctl -l scterc /dev/sda
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

sudo smartctl -l scterc /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)

sudo smartctl -l scterc /dev/sdc
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.4.0-24-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
SCT Error Recovery Control:
           Read: Disabled
          Write: Disabled

sudo smartctl -a /dev/sdx
http://sebsauvage.net/paste/?aab1d282ceb1e1cf#auxFRkK5GCW8j1gR7mwgzR1z92Qn9oqtc6EEC2C6sEE=

cat /sys/block/sda/device/timeout
30
cat /sys/block/sdb/device/timeout
30
cat /sys/block/sdc/device/timeout
30

Thank you
Tomas

*From:* Chris Murphy
*Sent:* Wednesday, July 06, 2016 1:19 AM
*To:* Tomáš Hrdina
*Cc:* Chris Murphy, Btrfs Btrfs
*Subject:* Re: Unable to mount degraded RAID5

btrfs check
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
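Given the transid failures above, one read-only avenue worth knowing about (a sketch; whether it recovers anything depends on how damaged the trees are) is btrfs restore, optionally pointed at one of the older roots btrfs-find-root reported:

# Copy data out without mounting the filesystem (read-only on the device).
mkdir -p /mnt/recovery
btrfs restore -v /dev/sda /mnt/recovery

# If the current root is too damaged, retry from an older tree root
# listed by btrfs-find-root, e.g. the gen-70181 block above.
btrfs restore -t 6062434418688 -v /dev/sda /mnt/recovery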