Re: RAID system with adaption to changed number of disks
OK, so we know that for raid5 data block groups there can be RMW, and because of that, any interruption results in the write hole. On Btrfs, though, the write hole is on disk only: if there's a lost strip (failed drive or UNC read), reconstruction from wrong parity results in a checksum error and EIO. That's good.

However, what happens in the metadata case? If metadata is raid5 and there's a crash or power failure during a metadata RMW, same problem: wrong parity, bad reconstruction, csum mismatch, and EIO. So what's the effect of EIO when reading metadata? And how common is RMW for metadata operations?

I wonder if this is where all of these damn strange cases come from where people can't do anything at all with a normally degraded raid5 - one device failed, and no other failures, but they can't mount due to a bunch of csum errors.

Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Unable to rescue RAID5
> That's the proper answer. In practice... all hope isn't yet lost.

I understood the proper answer; I'll take care of it in the future. Is there any step/method I can take from this situation?

Thank you
Hiroshi Honda
Re: Unable to rescue RAID5
On 2016-10-14 06:11, Hiroshi Honda wrote:
>> That's the proper answer. In practice... all hope isn't yet lost.
>
> I understood the proper answer. I'll take care of it in the future.
> Is there any step/method I can take from this situation?

You should probably look at `btrfs restore`. I'm not sure how well it works with the raid56 code, and you'll need somewhere to put the data being restored, but that's probably your best option.
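For reference, a hedged sketch of how a `btrfs restore` attempt might look; the device and destination paths below are placeholders, and the exact flags should be checked against your btrfs-progs version:

```shell
# Any surviving member of the broken filesystem, and a destination with
# enough free space on a *different*, healthy filesystem.
# Both paths are placeholders for this sketch.
DEV=/dev/sdb
DEST=/mnt/rescue

# Dry run first: lists what would be restored without writing anything.
btrfs restore -D "$DEV" "$DEST"

# Real run; -i continues past errors instead of aborting on the first one.
btrfs restore -i "$DEV" "$DEST"
```

Restoring snapshots as well needs an extra option in many btrfs-progs versions (`-s`), so check `btrfs restore --help` first.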
Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots
On 2016-10-13 17:21, Alberto Bursi wrote:
> Hi, I'm using OpenSUSE on a btrfs volume spanning 2 disks (set as raid1 for both metadata and data), no separate /home partition. The distro loves to create dozens of subvolumes for various things and makes snapshots, see:
>
> alby@openSUSE-xeon:~> sudo btrfs subvolume list /
> ID 257 gen 394 top level 5 path @
> ID 258 gen 293390 top level 257 path @/.snapshots
> ID 259 gen 293607 top level 258 path @/.snapshots/1/snapshot
> ID 260 gen 107012 top level 257 path @/boot/grub2/i386-pc
> ID 261 gen 107012 top level 257 path @/boot/grub2/x86_64-efi
> ID 262 gen 293610 top level 257 path @/home
> ID 263 gen 292439 top level 257 path @/opt
> ID 264 gen 288726 top level 257 path @/srv
> ID 265 gen 293610 top level 257 path @/tmp
> ID 266 gen 292657 top level 257 path @/usr/local
> ID 267 gen 104612 top level 257 path @/var/crash
> ID 268 gen 133454 top level 257 path @/var/lib/libvirt/images
> ID 269 gen 104612 top level 257 path @/var/lib/mailman
> ID 270 gen 104612 top level 257 path @/var/lib/mariadb
> ID 271 gen 292441 top level 257 path @/var/lib/mysql
> ID 272 gen 104612 top level 257 path @/var/lib/named
> ID 273 gen 104612 top level 257 path @/var/lib/pgsql
> ID 274 gen 293608 top level 257 path @/var/log
> ID 275 gen 104612 top level 257 path @/var/opt
> ID 276 gen 293610 top level 257 path @/var/spool
> ID 277 gen 293606 top level 257 path @/var/tmp
> ID 362 gen 228259 top level 258 path @/.snapshots/56/snapshot
> ID 364 gen 228259 top level 258 path @/.snapshots/57/snapshot
> ID 528 gen 228259 top level 258 path @/.snapshots/110/snapshot
> ID 529 gen 228259 top level 258 path @/.snapshots/111/snapshot
> ID 670 gen 228259 top level 258 path @/.snapshots/240/snapshot
> ID 671 gen 228259 top level 258 path @/.snapshots/241/snapshot
> ID 894 gen 228283 top level 258 path @/.snapshots/438/snapshot
> ID 895 gen 228283 top level 258 path @/.snapshots/439/snapshot
> ID 896 gen 228283 top level 258 path @/.snapshots/440/snapshot
> ID 897 gen 228283 top level 258 path @/.snapshots/441/snapshot
> ID 1033 gen 288965 top level 258 path @/.snapshots/554/snapshot
> ID 1034 gen 289531 top level 258 path @/.snapshots/555/snapshot
> ID 1035 gen 289726 top level 258 path @/.snapshots/556/snapshot
> ID 1036 gen 289729 top level 258 path @/.snapshots/557/snapshot
> ID 1037 gen 290297 top level 258 path @/.snapshots/558/snapshot
> ID 1038 gen 290301 top level 258 path @/.snapshots/559/snapshot
> ID 1039 gen 290336 top level 258 path @/.snapshots/560/snapshot
> ID 1041 gen 290338 top level 258 path @/.snapshots/562/snapshot
> ID 1043 gen 292047 top level 258 path @/.snapshots/563/snapshot
> ID 1044 gen 292051 top level 258 path @/.snapshots/564/snapshot
> ID 1045 gen 292531 top level 258 path @/.snapshots/565/snapshot
> ID 1046 gen 293153 top level 258 path @/.snapshots/566/snapshot
>
> I'd like to be able to clone the whole volume verbatim to another volume, for backup purposes. Now, I think I can do that with a brutal partition clone from my "recovery" system (a Debian install on another drive) and then going through the procedures for dealing with a lost drive. But I'd prefer a clone on a live system, and without doing a brutal clone, as that will keep the same UUIDs. I can (and will) script this, so even if it is a tedious process or involves writing a huge command line, that's not an issue.

I'm not sure there is any way to do this on a live system. You essentially need to either do a block-level copy and change the UUIDs (which recent versions of btrfstune can do), or use send with some manual setup to get the subvolumes across. Both options require the filesystem to be effectively read-only, which is not something that any modern Linux distro can reliably handle for more than a few minutes. If I had to do this, I'd go with the block-level copy, since it requires a lot less effort; just make sure to use btrfstune to change the UUIDs when the copy is done (though that may take a while itself, since it has to rewrite a lot of metadata).
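A minimal sketch of that block-level route, assuming the filesystems can stay unmounted for the whole procedure; every device name below is a placeholder, and the targets must be at least as large as the sources:

```shell
# The two raid1 members of the source filesystem (placeholders).
SRC1=/dev/sda2; SRC2=/dev/sdb2
# Same-size-or-larger targets for the clone (placeholders).
DST1=/dev/sdc2; DST2=/dev/sdd2

# Brutal block copy of each member while everything is unmounted.
dd if="$SRC1" of="$DST1" bs=64M conv=fsync
dd if="$SRC2" of="$DST2" bs=64M conv=fsync

# Give the clone a new filesystem UUID so the kernel never sees two
# filesystems with the same fsid. This rewrites every metadata block,
# so it can take a while; all member devices must be present.
btrfstune -u "$DST1"
```

After the UUID change, a `btrfs device scan` (or a reboot) should let the clone mount as an independent filesystem.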
Re: btrfs and numa - needing drop_caches to keep speed up
On 10/14/2016 08:28 AM, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> while running the same workload on two machines (a single Xeon and a dual
> Xeon), both with 64GB RAM, I need to run echo 3 >/proc/sys/vm/drop_caches
> every 15-30 minutes to keep the speed as good as on the non-numa system.
> I'm not sure whether this is related to numa.
>
> Is there any sysctl parameter to tune?
>
> Tested with vanilla v4.8.1
>
> Greets,
> Stefan

hi,
why do you think this is related to btrfs? The only known issue with this type of workaround that I know of is transparent huge pages. This is easy to diagnose by recording some kernel stacks during the problem with perf. If there is very high system CPU usage in a spinlock called from compaction functions during page faults, it is the synchronous memory defragmentation needed for THP. Should that be the case, the better workaround is disabling it in /sys/kernel/mm/transparent_hugepage/defrag

cheers,
Julian
Re: btrfs and numa - needing drop_caches to keep speed up
On 2016-10-14 02:28, Stefan Priebe - Profihost AG wrote:
> Hello list,
>
> while running the same workload on two machines (a single Xeon and a dual
> Xeon), both with 64GB RAM, I need to run echo 3 >/proc/sys/vm/drop_caches
> every 15-30 minutes to keep the speed as good as on the non-numa system.
> I'm not sure whether this is related to numa.
>
> Is there any sysctl parameter to tune?
>
> Tested with vanilla v4.8.1

This may sound odd, but does echoing 1 or 2 to /proc/sys/vm/drop_caches help at all, or is it just 3? The value is actually a bit-field with just two bits defined: 1 drops just the page cache, 2 drops just freeable SLAB objects, and 3 drops both. Both have an impact when dealing with filesystems (the page cache holds cached file contents, while freeable SLAB objects include cached dentries and inodes), so knowing whether one, the other, or only both helps can narrow this down further.

Regardless of that, you might try adjusting vm.vfs_cache_pressure. Increasing it makes the kernel reclaim cached dentries and inodes more aggressively, while decreasing it makes that reclaim less aggressive. It defaults to 100, and I wouldn't suggest setting it below 50 or above about 150. Keep in mind that increasing it is likely to put more load on the storage devices.
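Austin's bit-field description can be turned into a small diagnostic sketch (needs root; the value 150 is just an example within the range he suggests):

```shell
# Flush dirty pages first so drop_caches can actually free things.
sync

echo 1 > /proc/sys/vm/drop_caches   # bit 0: drop the page cache only
echo 2 > /proc/sys/vm/drop_caches   # bit 1: drop freeable slab (dentries/inodes) only
echo 3 > /proc/sys/vm/drop_caches   # both bits set (1 | 2)

# If echo 2 is what restores performance, try making slab reclaim more
# aggressive instead of dropping caches periodically:
sysctl vm.vfs_cache_pressure=150    # default is 100
```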
Re: [PATCH 1/2] btrfs: improve inode's outstanding_extents computation
Am 06.10.2016 um 04:51 schrieb Wang Xiaoguang:
> This issue was revealed by modifying BTRFS_MAX_EXTENT_SIZE (128MB) to 64KB.
> With BTRFS_MAX_EXTENT_SIZE lowered to 64KB, fsstress tests often trigger
> these warnings from btrfs_destroy_inode():
> WARN_ON(BTRFS_I(inode)->outstanding_extents);
> WARN_ON(BTRFS_I(inode)->reserved_extents);
>
> The simple test program below can reproduce this issue reliably.
> Note: you need to modify BTRFS_MAX_EXTENT_SIZE to 64KB for the test,
> otherwise there won't be such warnings.
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <string.h>
>
> int main(void)
> {
> 	int fd;
> 	char buf[68 * 1024];
>
> 	memset(buf, 0, 68 * 1024);
> 	fd = open("testfile", O_CREAT | O_EXCL | O_RDWR, 0644);
> 	pwrite(fd, buf, 68 * 1024, 64 * 1024);
> 	return 0;
> }
>
> When BTRFS_MAX_EXTENT_SIZE is 64KB, the buffered data range is:
> 64KB          128KB  132KB
>  |-------------|------|
>      64KB + 4KB
>
> 1) For the above data range, btrfs_delalloc_reserve_metadata() will reserve
> metadata and set BTRFS_I(inode)->outstanding_extents to 2:
> (68KB + 64KB - 1) / 64KB == 2
>
> Outstanding_extents: 2
>
> 2) Then btrfs_dirty_pages() will be called to dirty pages and set the
> EXTENT_DELALLOC flag. In this case, btrfs_set_bit_hook() will be called twice.
> The 1st set_bit_hook() call sets the DELALLOC flag for the first 64KB:
> 64KB          128KB
>  |-------------|
>  64KB DELALLOC
> Outstanding_extents: 2
>
> set_bit_hook() uses the FIRST_DELALLOC flag to avoid re-increasing the
> outstanding_extents counter, so the 1st set_bit_hook() call won't modify
> outstanding_extents; it's still 2.
>
> Then the FIRST_DELALLOC flag is *CLEARED*.
>
> 3) 2nd btrfs_set_bit_hook() call.
> Because FIRST_DELALLOC has been cleared by the previous set_bit_hook(),
> btrfs_set_bit_hook() will increase BTRFS_I(inode)->outstanding_extents by
> one, so now BTRFS_I(inode)->outstanding_extents is 3:
> 64KB          128KB  132KB
>  |-------------|------|
>  64KB DELALLOC  4KB DELALLOC
> Outstanding_extents: 3
>
> But the correct outstanding_extents number should be 2, not 3.
> The 2nd btrfs_set_bit_hook() call just screws this up and leads to the
> WARN_ON().
>
> Normally, we could solve it by only increasing outstanding_extents in
> set_bit_hook(). But the problem is that delalloc_reserve/release_metadata()
> only have a 'length' parameter and calculate an inaccurate
> outstanding_extents. If we relied only on set_bit_hook(),
> release_metadata() would screw things up, as it would decrease an
> inaccurate number.
>
> So the fix we use is:
> 1) Increase the *INACCURATE* outstanding_extents at delalloc_reserve_metadata(),
>    just as a placeholder.
> 2) Increase the *accurate* outstanding_extents in set_bit_hook().
>    This is the real increment.
> 3) Decrease the *INACCURATE* outstanding_extents before returning.
>    This brings outstanding_extents to the correct value.
>
> For the 128MB BTRFS_MAX_EXTENT_SIZE, due to the limitation of
> __btrfs_buffered_write(), each iteration will only handle about 2MB of
> data, so btrfs_dirty_pages() won't need to handle cases crossing 2 extents.
>
> Signed-off-by: Wang Xiaoguang

Tested-by: Stefan Priebe

Works fine for 8 days now - no ENOSPC errors anymore.
Greets,
Stefan

> ---
>  fs/btrfs/ctree.h |  2 ++
>  fs/btrfs/inode.c | 65 ++--
>  fs/btrfs/ioctl.c |  6 ++
>  3 files changed, 62 insertions(+), 11 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 33fe035..16885f6 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -3119,6 +3119,8 @@ int btrfs_start_delalloc_roots(struct btrfs_fs_info *fs_info, int delay_iput,
> 			       int nr);
>  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
> 			      struct extent_state **cached_state);
> +int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
> +			    struct extent_state **cached_state);
>  int btrfs_create_subvol_root(struct btrfs_trans_handle *trans,
> 			     struct btrfs_root *new_root,
> 			     struct btrfs_root *parent_root,
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index e6811c4..a7193b1 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -1590,6 +1590,9 @@ static void btrfs_split_extent_hook(struct inode *inode,
> 	if (!(orig->state & EXTENT_DELALLOC))
> 		return;
>
> +	if (btrfs_is_free_space_inode(inode))
> +		return;
> +
> 	size = orig->end - orig->start + 1;
> 	if (si
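The round-up division in step 1 of the quoted description can be sanity-checked with a quick shell sketch (sizes in KiB, using the 64KB value that BTRFS_MAX_EXTENT_SIZE was patched down to for the test):

```shell
# 68KiB written in one pwrite, with a 64KiB max extent size.
MAX_EXTENT=64
WRITE_LEN=68

# Round-up division, as in (68KB + 64KB - 1) / 64KB.
extents=$(( (WRITE_LEN + MAX_EXTENT - 1) / MAX_EXTENT ))
echo "$extents"   # prints 2: the reservation spans two max-size extents
```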
Re: [PATCH 2/2] btrfs: fix false enospc for compression
Am 06.10.2016 um 04:51 schrieb Wang Xiaoguang:
> When testing btrfs compression, we sometimes get ENOSPC errors even though
> the fs still has plenty of free space; xfstests generic/171, generic/172,
> generic/173, generic/174 and generic/175 can reveal this bug in my test
> environment when compression is enabled.
>
> After some debugging, we found that it's btrfs_delalloc_reserve_metadata()
> which sometimes tries to reserve plenty of metadata space, even for a very
> small data range. In btrfs_delalloc_reserve_metadata(), the number of
> metadata bytes to reserve is calculated from the difference between
> outstanding_extents and reserved_extents. See the case below for how ENOSPC
> occurs:
>
> 1, Buffered-write 128MB of data in units of 128KB; finally we'll have 1
> outstanding extent for the inode and 1024 reserved_extents. Note that it's
> btrfs_merge_extent_hook() that merges these 128KB units into one big
> outstanding extent, but it does not change reserved_extents.
>
> 2, When writing dirty pages, for compression, cow_file_range_async() will
> split the big extent above into 128KB units (the compression extent size is
> 128KB). When the first split operation finishes, we'll have 2 outstanding
> extents and 1024 reserved extents, and if right at that moment the
> just-generated ordered extent is dispatched to run and completes, then
> btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be
> called to release metadata; after that we will have 1 outstanding extent
> and 1 reserved extent (also see the logic in drop_outstanding_extent()).
> Later cow_file_range_async() continues to handle the remaining data range
> [128KB, 128MB), and if no other ordered extent is dispatched to run, there
> will be 1023 outstanding extents and 1 reserved extent.
> 3, Now if another buffered write to this file comes in,
> btrfs_delalloc_reserve_metadata() will try to reserve metadata for at least
> 1023 outstanding extents' metadata; for a 16KB node size that is
> 1023*16384*2*8, about 255MB, and for a 64KB node size it's 1023*65536*2*8,
> about 1GB of metadata, so obviously this is not sane and can easily result
> in an ENOSPC error.
>
> The root cause is that with compression the max extent size is no longer
> BTRFS_MAX_EXTENT_SIZE (128MB) but 128KB, so the current metadata
> reservation method in btrfs is not appropriate or correct. Here we introduce:
> 	enum btrfs_metadata_reserve_type {
> 		BTRFS_RESERVE_NORMAL,
> 		BTRFS_RESERVE_COMPRESS,
> 	};
> and expand btrfs_delalloc_reserve_metadata() and btrfs_delalloc_reserve_space()
> by adding a new enum btrfs_metadata_reserve_type argument. When a data range
> will go through compression, we use BTRFS_RESERVE_COMPRESS to reserve
> metadata. Meanwhile we introduce the EXTENT_COMPRESS flag to mark a data
> range that will go through the compression path.
>
> With this patch, we can fix these false ENOSPC errors for compression.
>
> Signed-off-by: Wang Xiaoguang

Tested-by: Stefan Priebe

Works fine for 8 days now - no ENOSPC errors anymore.
Greets,
Stefan

> ---
>  fs/btrfs/ctree.h             |  31 ++--
>  fs/btrfs/extent-tree.c       |  55 +
>  fs/btrfs/extent_io.c         |  59 +-
>  fs/btrfs/extent_io.h         |   2 +
>  fs/btrfs/file.c              |  26 +--
>  fs/btrfs/free-space-cache.c  |   6 +-
>  fs/btrfs/inode-map.c         |   5 +-
>  fs/btrfs/inode.c             | 181 ---
>  fs/btrfs/ioctl.c             |  12 ++-
>  fs/btrfs/relocation.c        |  14 +++-
>  fs/btrfs/tests/inode-tests.c |  15 ++--
>  11 files changed, 309 insertions(+), 97 deletions(-)
>
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 16885f6..fa6a19a 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -97,6 +97,19 @@ static const int btrfs_csum_sizes[] = { 4 };
>
>  #define BTRFS_DIRTY_METADATA_THRESH SZ_32M
>
> +/*
> + * for compression, max file extent size would be limited to 128K, so when
> + * reserving metadata for such delalloc writes, pass BTRFS_RESERVE_COMPRESS to
> + * btrfs_delalloc_reserve_metadata() or btrfs_delalloc_reserve_space() to
> + * calculate metadata, for none-compression, use BTRFS_RESERVE_NORMAL.
> + */
> +enum btrfs_metadata_reserve_type {
> +	BTRFS_RESERVE_NORMAL,
> +	BTRFS_RESERVE_COMPRESS,
> +};
> +int inode_need_compress(struct inode *inode);
> +u64 btrfs_max_extent_size(enum btrfs_metadata_reserve_type reserve_type);
> +
>  #define BTRFS_MAX_EXTENT_SIZE SZ_128M
>
>  struct btrfs_mapping_tree {
> @@ -2677,10 +2690,14 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root *root,
>  void btrfs_subvolume_release_metadata(struct btrfs_root *root,
> 				      struct btrfs_block_rsv *rsv,
> 				      u64 qgroup_reserved);
> -int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
> -void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
> -i
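The reservation sizes quoted in step 3 can be reproduced with shell arithmetic (1023 outstanding extents, bytes = extents * nodesize * 2 * 8, as in the message):

```shell
# Worst-case metadata reservation for 1023 outstanding extents at the
# two node sizes mentioned in the message.
for nodesize in 16384 65536; do
    bytes=$(( 1023 * nodesize * 2 * 8 ))
    echo "nodesize $nodesize: $(( bytes / 1024 / 1024 )) MiB"
done
# prints:
# nodesize 16384: 255 MiB
# nodesize 65536: 1023 MiB
```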
Re: btrfs and numa - needing drop_caches to keep speed up
Dear Julian,

Am 14.10.2016 um 14:26 schrieb Julian Taylor:
> On 10/14/2016 08:28 AM, Stefan Priebe - Profihost AG wrote:
>> Hello list,
>>
>> while running the same workload on two machines (a single Xeon and a dual
>> Xeon), both with 64GB RAM, I need to run echo 3 >/proc/sys/vm/drop_caches
>> every 15-30 minutes to keep the speed as good as on the non-numa system.
>> I'm not sure whether this is related to numa.
>>
>> Is there any sysctl parameter to tune?
>>
>> Tested with vanilla v4.8.1
>>
>> Greets,
>> Stefan
>
> hi,
> why do you think this is related to btrfs?

It was just an idea, as I couldn't find any other difference between those systems.

> This is easy to diagnose by recording some kernel stacks during the
> problem with perf.

Do you just mean perf top? Does it also show locking problems? I don't see much CPU usage in that case.

> The only known issue that has this type of workaround that I know of are
> transparent huge pages.

I already disabled thp by:
echo never > /sys/kernel/mm/transparent_hugepage/enabled

cat /proc/meminfo says:
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0

Greets,
Stefan

> cheers,
> Julian
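For the stack recording Julian suggests (as opposed to `perf top`), a sketch might look like this; the 30-second window is arbitrary, and it should run while the slowdown is actually happening:

```shell
# System-wide sample with call graphs for 30 seconds.
perf record -a -g -- sleep 30

# Look for heavy spinlock time under compaction/page-fault paths in the
# stacks, which would point at THP's synchronous defragmentation.
perf report --stdio | head -50
```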
Re: [PATCH 2/2] btrfs: fix false enospc for compression
On 10/06/16 04:51, Wang Xiaoguang wrote:
> When testing btrfs compression, we sometimes get ENOSPC errors even though
> the fs still has plenty of free space; xfstests generic/171, generic/172,
> generic/173, generic/174 and generic/175 can reveal this bug in my test
> environment when compression is enabled.
>
> After some debugging, we found that it's btrfs_delalloc_reserve_metadata()
> which sometimes tries to reserve plenty of metadata space, even for a very
> small data range. In btrfs_delalloc_reserve_metadata(), the number of
> metadata bytes to reserve is calculated from the difference between
> outstanding_extents and reserved_extents. See the case below for how ENOSPC
> occurs:
>
> 1, Buffered-write 128MB of data in units of 128KB; finally we'll have 1
> outstanding extent for the inode and 1024 reserved_extents. Note that it's
> btrfs_merge_extent_hook() that merges these 128KB units into one big
> outstanding extent, but it does not change reserved_extents.
>
> 2, When writing dirty pages, for compression, cow_file_range_async() will
> split the big extent above into 128KB units (the compression extent size is
> 128KB). When the first split operation finishes, we'll have 2 outstanding
> extents and 1024 reserved extents, and if right at that moment the
> just-generated ordered extent is dispatched to run and completes, then
> btrfs_delalloc_release_metadata() (see btrfs_finish_ordered_io()) will be
> called to release metadata; after that we will have 1 outstanding extent
> and 1 reserved extent (also see the logic in drop_outstanding_extent()).
> Later cow_file_range_async() continues to handle the remaining data range
> [128KB, 128MB), and if no other ordered extent is dispatched to run, there
> will be 1023 outstanding extents and 1 reserved extent.
> 3, Now if another buffered write to this file comes in,
> btrfs_delalloc_reserve_metadata() will try to reserve metadata for at least
> 1023 outstanding extents' metadata; for a 16KB node size that is
> 1023*16384*2*8, about 255MB, and for a 64KB node size it's 1023*65536*2*8,
> about 1GB of metadata, so obviously this is not sane and can easily result
> in an ENOSPC error.
>
> The root cause is that with compression the max extent size is no longer
> BTRFS_MAX_EXTENT_SIZE (128MB) but 128KB, so the current metadata
> reservation method in btrfs is not appropriate or correct. Here we introduce:
> 	enum btrfs_metadata_reserve_type {
> 		BTRFS_RESERVE_NORMAL,
> 		BTRFS_RESERVE_COMPRESS,
> 	};
> and expand btrfs_delalloc_reserve_metadata() and btrfs_delalloc_reserve_space()
> by adding a new enum btrfs_metadata_reserve_type argument. When a data range
> will go through compression, we use BTRFS_RESERVE_COMPRESS to reserve
> metadata. Meanwhile we introduce the EXTENT_COMPRESS flag to mark a data
> range that will go through the compression path.
>
> With this patch, we can fix these false ENOSPC errors for compression.
>
> Signed-off-by: Wang Xiaoguang

I took some time again to get this into my tree on top of what's in btrfs-4.9rc1 and managed to merge it after all. Both this and patch #1 seem to work fine, and they don't seem to cause any regressions; I ran a couple of both full and incremental rsync backups with >100GB on a new and now compressed subvolume without problems. Also, Stefan just reported that his ENOSPC seems to be gone as well, so it seems to be good. \o/

So for both this and patch #1, have a careful:

Tested-by: Holger Hoffstätte

Also a comment about something I found while resolving conflicts caused by the preliminary dedupe support:

[..]
>  int btrfs_set_extent_delalloc(struct inode *inode, u64 start, u64 end,
> -			      struct extent_state **cached_state);
> +			      struct extent_state **cached_state, int flag);
>  int btrfs_set_extent_defrag(struct inode *inode, u64 start, u64 end,
> -			    struct extent_state **cached_state);
> +			    struct extent_state **cached_state, int flag);

[..]

>  int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
> 		      struct page **pages, size_t num_pages,
> 		      loff_t pos, size_t write_bytes,
> -		      struct extent_state **cached);
> +		      struct extent_state **cached, int flag);

Instead of adding "int flag", why not use the already defined btrfs_metadata_reserve_type enum? I know it's just an int at the end of the day, but the dedupe support already added another "int dedupe" argument, and it's probably easy to cause confusion. Maybe later it would be beneficial to consolidate the flags into a consistent set of enum values, to prevent more "int flag" inflation and better declare the intent of the extent state change. Not sure if that makes sense.

thanks,
Holger
Re: btrfs and numa - needing drop_caches to keep speed up
Hi,

Am 14.10.2016 um 15:19 schrieb Stefan Priebe - Profihost AG:
> Dear Julian,
>
> Am 14.10.2016 um 14:26 schrieb Julian Taylor:
>> On 10/14/2016 08:28 AM, Stefan Priebe - Profihost AG wrote:
>>> Hello list,
>>>
>>> while running the same workload on two machines (a single Xeon and a
>>> dual Xeon), both with 64GB RAM, I need to run
>>> echo 3 >/proc/sys/vm/drop_caches every 15-30 minutes to keep the speed
>>> as good as on the non-numa system. I'm not sure whether this is related
>>> to numa.
>>>
>>> Is there any sysctl parameter to tune?
>>>
>>> Tested with vanilla v4.8.1
>>>
>>> Greets,
>>> Stefan
>>
>> hi,
>> why do you think this is related to btrfs?
>
> It was just an idea, as I couldn't find any other difference between
> those systems.
>
>> This is easy to diagnose by recording some kernel stacks during the
>> problem with perf.
>
> Do you just mean perf top? Does it also show locking problems? I don't
> see much CPU usage in that case.

perf top looks like this:

  5,46%  libc-2.19.so  [.] memset
  5,26%  [kernel]      [k] page_fault
  3,63%  [kernel]      [k] clear_page_c_e
  1,38%  [kernel]      [k] _raw_spin_lock
  1,06%  [kernel]      [k] get_page_from_freelist
  0,83%  [kernel]      [k] copy_user_enhanced_fast_string
  0,79%  [kernel]      [k] release_pages
  0,68%  [kernel]      [k] handle_mm_fault
  0,57%  [kernel]      [k] free_hot_cold_page
  0,55%  [kernel]      [k] handle_pte_fault
  0,54%  [kernel]      [k] __pagevec_lru_add_fn
  0,45%  [kernel]      [k] unmap_page_range
  0,45%  [kernel]      [k] __mod_zone_page_state
  0,43%  [kernel]      [k] page_add_new_anon_rmap
  0,38%  [kernel]      [k] free_pcppages_bulk

>> The only known issue that has this type of workaround that I know of are
>> transparent huge pages.
> I already disabled thp by:
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> cat /proc/meminfo says:
> HugePages_Total:       0
> HugePages_Free:        0
> HugePages_Rsvd:        0
> HugePages_Surp:        0
>
> Greets,
> Stefan
>
>> cheers,
>> Julian
Re: [PATCH v2 0/4] Btrfs-convert: Add support to copy common inode flags
On Fri, Oct 14, 2016 at 08:50:31AM +0800, Qu Wenruo wrote:
>> I can pull the branches from you, either to devel or integration. Please
>> base them on the last release point (ie. master) if there are no other
>> dependencies. Otherwise, I'll publish a branch that contains the desired
>> patches and will promise not to change it with rebases.
>
> Sorry, I'm not very familiar with when to send a pull request and when to
> send normal patches, and to which branch it should be rebased.

It's not required, it just would make a few things easier. The pull is additional to mailing-list patches, so send patches as usual. If you put the patches on a branch that I can pull, I'll do that eventually. This happens if the patches are in a state where I don't need to touch them at all or only do minor modifications.

> From my understanding, for a large patchset a pull request is preferred
> (and the patches should still be sent out for review), and for small
> changes normal mail should be good enough.

Basically yes. Single patches or very short series do not need to go through git (just mails).

> And normally I rebase all my branches/patches onto your devel branch.
> Or should I rebase them onto the master branch?

As the devel branch can be rebased, it's not suitable as a starting point. Even if you rebase onto it, I might not be able to pull if something changes in the meantime. As I don't process progs patches every day, this is likely to happen.

If you don't want to do the git way now, no big deal. It's a long-term goal, but it requires doing things right, so both parties, developers and maintainers, know what to expect. I'll also write that up in the wiki for future reference.
Re: Btrfs progs build fails for 4.8
On Thu, Oct 13, 2016 at 03:25:57PM +0800, Qu Wenruo wrote:
> At 10/13/2016 01:26 AM, David Sterba wrote:
>> On Wed, Oct 12, 2016 at 10:01:27PM +0800, Qu Wenruo wrote:
>>> On 10/12/2016 09:58 PM, Abhay Sachan wrote:
>>>> Hi,
>>>> I tried building the latest btrfs-progs on CentOS 6; "./configure"
>>>> failed with the error:
>>>>
>>>> checking for FIEMAP_EXTENT_SHARED defined in linux/fiemap.h... no
>>>> configure: error: no definition of FIEMAP_EXTENT_SHARED found
>>>>
>>>> Any ideas what am I lacking in the system?
>>>
>>> Your kernel is too old and its header doesn't support the
>>> FIEMAP_EXTENT_SHARED flag for the fiemap ioctl.
>>
>> As this is not the first time, I wonder if we could provide a definition
>> of the macro in progs, regardless of the installed headers. Note that
>> this does not mean the running kernel is old. Previously the user said
>> he was running a 4.4 kernel [1] (or it could be any other kernel
>> version). For that combination of headers and running kernel, I think
>> it's ok to add a fallback definition.
>
> Yes, a fallback definition is good for cases where the running kernel
> supports the SHARED_EXTENT flag but the kernel headers don't.
>
> But I'm a little concerned that such a fallback definition will lead to
> corrupted results for "fi du".

Well, not corrupted but wrong. The shared data will be reported as exclusive.

>>> And since that flag is very important for tools like "btrfs filesystem
>>> du", for old kernels that don't support the EXTENT_SHARED flag, we have
>>> no choice but to abort configuration.
>>
>> The check was added to configure so it's caught earlier than during
>> build. The 'fi du' command is useful, but IMO not critical enough to
>> stop the build.
>
> What about just disabling subcommands that use the SHARED_EXTENT flag?
> Like "fi du".

This would need to be checked at runtime, based only on kernel version. I think we can do that, print a warning in 'fi du' but otherwise let it continue.
Re: Btrfs progs build fails for 4.8
On 10/14/16 10:49 AM, David Sterba wrote:
> On Thu, Oct 13, 2016 at 03:25:57PM +0800, Qu Wenruo wrote:
>> At 10/13/2016 01:26 AM, David Sterba wrote:
>>> On Wed, Oct 12, 2016 at 10:01:27PM +0800, Qu Wenruo wrote:
>>>> On 10/12/2016 09:58 PM, Abhay Sachan wrote:
>>>>> Hi,
>>>>> I tried building the latest btrfs-progs on CentOS 6; "./configure"
>>>>> failed with the error:
>>>>>
>>>>> checking for FIEMAP_EXTENT_SHARED defined in linux/fiemap.h... no
>>>>> configure: error: no definition of FIEMAP_EXTENT_SHARED found
>>>>>
>>>>> Any ideas what am I lacking in the system?
>>>>
>>>> Your kernel is too old and its header doesn't support the
>>>> FIEMAP_EXTENT_SHARED flag for the fiemap ioctl.
>>>
>>> As this is not the first time, I wonder if we could provide a definition
>>> of the macro in progs, regardless of the installed headers. Note that
>>> this does not mean the running kernel is old. Previously the user said
>>> he was running a 4.4 kernel [1] (or it could be any other kernel
>>> version). For that combination of headers and running kernel, I think
>>> it's ok to add a fallback definition.
>>
>> Yes, a fallback definition is good for cases where the running kernel
>> supports the SHARED_EXTENT flag but the kernel headers don't.
>>
>> But I'm a little concerned that such a fallback definition will lead to
>> corrupted results for "fi du".
>
> Well, not corrupted but wrong. The shared data will be reported as
> exclusive.
>
>>>> And since that flag is very important for tools like "btrfs filesystem
>>>> du", for old kernels that don't support the EXTENT_SHARED flag, we
>>>> have no choice but to abort configuration.
>>>
>>> The check was added to configure so it's caught earlier than during
>>> build. The 'fi du' command is useful, but IMO not critical enough to
>>> stop the build.
>>
>> What about just disabling subcommands that use the SHARED_EXTENT flag?
>> Like "fi du".
>
> This would need to be checked at runtime, based only on kernel version.
> I think we can do that, print a warning in 'fi du' but otherwise let it
> continue.
Checking kernel version is pretty unreliable - distro kernels that backport
will quite possibly fail the version test. Sure, it's possible to patch the
check back out of userspace in the distro, but one must first know that such
things exist, and look out for them...

Maybe warn on kernel version < 2.6.33 /and/ no shared extents found.

Or, why not just:

1) #ifndef FIEMAP_EXTENT_SHARED / #define FIEMAP_EXTENT_SHARED 0x2000 / #endif
2) carry on as usual.

Worst case is that you don't get shared accounting on very old kernels.
And FIEMAP_EXTENT_SHARED has been around since 2.6.33 so it's not going to
bite too many people, right?

(If you really wanted to, you could export has_fiemap_shared in the btrfs
features sysfs file, and runtime could see if shared extents are detectable
and if not, print a warning or something ... but given that this would be a
new feature much newer than the fiemap flag I'm not sure that would do much
good.)

-Eric
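Eric's "see if shared extents are detectable at runtime" idea can be approximated from plain userspace today, without a new sysfs file: make a reflinked copy and ask fiemap (via filefrag from e2fsprogs) whether the kernel reports the extent as shared. A minimal sketch, assuming filefrag is installed and the temp dir lives on a reflink-capable filesystem (e.g. btrfs); on anything else it just skips:

```shell
# Create two reflinked copies of a file, then check whether fiemap
# (queried through filefrag -v) reports the extent as shared.
dir=$(mktemp -d)
printf 'some data\n' > "$dir/a"
if command -v filefrag >/dev/null 2>&1 \
        && cp --reflink=always "$dir/a" "$dir/b" 2>/dev/null; then
    sync
    if filefrag -v "$dir/a" | grep -q shared; then
        result="kernel reports FIEMAP_EXTENT_SHARED"
    else
        result="no shared flag reported (pre-2.6.33 kernel?)"
    fi
else
    result="reflink or filefrag unavailable, detection skipped"
fi
echo "$result"
rm -rf "$dir"
```

As Eric notes, a probe like this only tells you about the filesystem the probe ran on, so it is a sanity check rather than a substitute for the header fallback.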
[PATCH] btrfs-progs: define FIEMAP_EXTENT_SHARED locally if needed
As it stands today, btrfs-progs won't build if the installed kernel headers
don't define FIEMAP_EXTENT_SHARED. But that doesn't tell us anything about
the capabilities of the /running/ kernel - so just define it locally if not
found in the fiemap header, and allow the build to proceed. If run against
an old kernel, worst case scenario is that no shared extents will be
reported via the du command.

Signed-off-by: Eric Sandeen
---

you can take it or leave it, but I had this locally anyway, so if it's
helpful here you go :)

diff --git a/cmds-fi-du.c b/cmds-fi-du.c
index a5b66e6..fa8f421 100644
--- a/cmds-fi-du.c
+++ b/cmds-fi-du.c
@@ -29,6 +29,11 @@
 #include
 #include
 
+/* Appeared upstream in 2.6.33 */
+#ifndef FIEMAP_EXTENT_SHARED
+#define FIEMAP_EXTENT_SHARED 0x2000
+#endif
+
 #include "utils.h"
 #include "commands.h"
 #include "kerncompat.h"
diff --git a/configure.ac b/configure.ac
index 8fd8f42..9b87b24 100644
--- a/configure.ac
+++ b/configure.ac
@@ -144,8 +144,6 @@
 	AC_MSG_ERROR([no filesystems for convert, use --disable-convert instead])
 fi
 
-AX_CHECK_DEFINE([linux/fiemap.h], [FIEMAP_EXTENT_SHARED], [],
-	[AC_MSG_ERROR([no definition of FIEMAP_EXTENT_SHARED found])])
 
 dnl Define _LIBS= and _CFLAGS= by pkg-config
 dnl
[PATCH] btrfs-progs: build: detect fiemap shared flag but don't fail build
The FIEMAP_EXTENT_SHARED fiemap flag was introduced in 2.6.33. If the
headers do not provide the definition, the build will fail. Support for
fiemap sharing depends on the running kernel: there are still systems with
2.6.32 kernel headers that run newer kernels. To support such environments,
don't fail the build; provide our own definition of the flag and detect an
old running kernel in the relevant command (btrfs fi du).

Reported-by: Abhay Sachan
Signed-off-by: David Sterba
---
 cmds-fi-du.c | 14 ++
 configure.ac | 10 +-
 2 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/cmds-fi-du.c b/cmds-fi-du.c
index a5b66e6fdf1a..895c242e821c 100644
--- a/cmds-fi-du.c
+++ b/cmds-fi-du.c
@@ -27,8 +27,13 @@
 #include
 #include
+#include
 #include
 
+#if !defined(FIEMAP_EXTENT_SHARED) && (HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE == 1)
+#define FIEMAP_EXTENT_SHARED 0x2000
+#endif
+
 #include "utils.h"
 #include "commands.h"
 #include "kerncompat.h"
@@ -546,6 +551,7 @@ int cmd_filesystem_du(int argc, char **argv)
 {
 	int ret = 0, err = 0;
 	int i;
+	u32 kernel_version;
 
 	unit_mode = get_unit_mode_from_arg(&argc, argv, 1);
@@ -570,6 +576,14 @@ int cmd_filesystem_du(int argc, char **argv)
 	if (check_argc_min(argc - optind, 1))
 		usage(cmd_filesystem_du_usage);
 
+	kernel_version = get_running_kernel_version();
+
+	if (kernel_version < KERNEL_VERSION(2,6,33)) {
+		warning(
+"old kernel version detected, shared space will be reported as exclusive\n"
+"due to missing support for FIEMAP_EXTENT_SHARED flag");
+	}
+
 	printf("%10s %10s %10s %s\n", "Total", "Exclusive", "Set shared",
 			"Filename");
diff --git a/configure.ac b/configure.ac
index 8fd8f425a076..9c6df3b9d4ed 100644
--- a/configure.ac
+++ b/configure.ac
@@ -144,8 +144,16 @@
 	AC_MSG_ERROR([no filesystems for convert, use --disable-convert instead])
 fi
 
+HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE=0
 AX_CHECK_DEFINE([linux/fiemap.h], [FIEMAP_EXTENT_SHARED], [],
-	[AC_MSG_ERROR([no definition of FIEMAP_EXTENT_SHARED found])])
+	[HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE=1
+AC_MSG_WARN([no definition of FIEMAP_EXTENT_SHARED found, probably old kernel, will use own definition, 'btrfs fi du' might report wrong numbers])])
+
+if test "x$HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE" == "x1"; then
+AC_DEFINE([HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE], [1], [We defined FIEMAP_EXTENT_SHARED])
+else
+AC_DEFINE([HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE], [0], [We did not define FIEMAP_EXTENT_SHARED])
+fi
 
 dnl Define _LIBS= and _CFLAGS= by pkg-config
 dnl
-- 
2.10.0
Re: Btrfs progs build fails for 4.8
On Fri, Oct 14, 2016 at 12:00:56PM -0500, Eric Sandeen wrote:
[earlier quoted thread snipped]
> Checking kernel version is pretty unreliable - distro kernels that backport
> will quite possibly fail the version test. Sure, it's possible to patch
> the check back out of userspace in the distro, but one must first know that
> such things exist, and look out for them...
>
> Maybe warn on kernel version < 2.6.33 /and/ no shared extents found.

Yeah, version is not reliable, 'fi du' will just warn. As the warning is
printed at the beginning, we don't know if there will be any shared extent.

> Or, why not just:
>
> 1) #ifndef FIEMAP_EXTENT_SHARED / #define FIEMAP_EXTENT_SHARED 0x2000 /
>    #endif
> 2) carry on as usual.
>
> Worst case is that you don't get shared accounting on very old kernels.

Yes, implemented it that way.

> And FIEMAP_EXTENT_SHARED has been around since 2.6.33 so it's not going
> to bite too many people, right?

Right, I want to avoid another report that the build failed on centos 6.
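The runtime warning David describes boils down to comparing the running kernel's release string against KERNEL_VERSION(2,6,33), the release that added FIEMAP_EXTENT_SHARED. A userspace sketch of that comparison (the `vercode` helper is made up for illustration; btrfs-progs' own helper is get_running_kernel_version, and on a live system the input would come from `uname -r`):

```shell
# vercode mirrors the kernel's KERNEL_VERSION(a,b,c) macro:
# (a << 16) + (b << 8) + c, tolerating "-flavour" suffixes.
vercode() {
    set -- $(printf '%s\n' "$1" | tr '.-' '  ')
    echo $(( ($1 << 16) + ($2 << 8) + ${3:-0} ))
}

min=$(vercode 2.6.33)   # first release with FIEMAP_EXTENT_SHARED
for k in 2.6.32 4.4.0-42-generic; do
    if [ "$(vercode "$k")" -lt "$min" ]; then
        echo "$k: shared space will be reported as exclusive"
    else
        echo "$k: FIEMAP_EXTENT_SHARED supported by running kernel"
    fi
done
```

As Eric points out in the quoted text, distro backports make any version heuristic unreliable, which is why this only gates a warning rather than the feature itself.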
Re: Btrfs progs build fails for 4.8
On Wed, Oct 12, 2016 at 07:28:20PM +0530, Abhay Sachan wrote:
> Hi,
> I tried building latest btrfs progs on CentOS 6, "./configure" failed
> with the error:
>
> checking for FIEMAP_EXTENT_SHARED defined in linux/fiemap.h... no
> configure: error: no definition of FIEMAP_EXTENT_SHARED found
>
> Any ideas what am I lacking in the system?

Can you please test the build from current devel branch? [1] Configure will
just print a warning if the flag is not defined, and 'btrfs fi du' will
print another warning if it's running on a kernel < 2.6.33.

[1] git://github.com/kdave/btrfs-progs (branch devel)
Re: Is is possible to submit binary image as fstest test case?
On Mon, Oct 10, 2016 at 10:56:12AM +1100, Dave Chinner wrote:
> On Fri, Oct 07, 2016 at 06:05:51PM +0200, David Sterba wrote:
> > On Fri, Oct 07, 2016 at 08:18:38PM +1100, Dave Chinner wrote:
> > > On Thu, Oct 06, 2016 at 04:12:56PM +0800, Qu Wenruo wrote:
> > > > So I'm wondering if I can just upload a zipped raw image as part of
> > > > the test case?
> > >
> > > Preferably not. We've managed to avoid pre-built images in xfstests
> > > for 15 years, so there'd have to be a really good reason to start
> > > doing this, especially as once we open that floodgate we'll end up
> > > with everyone wanting to do this and it will blow out the size of
> > > the repository in no time.
> > >
> > > If the issue is just timing or being unable to trigger an error
> > > at the right time, this is what error injection frameworks or
> > > debug-only sysfs hooks are for. The XFS kernel code has both,
> > > xfstests use both, and they pretty much do away with the need for
> > > custom binary filesystem images for such testing...
> >
> > Error injection frameworks may not always be available, eg. in kernels
> > built for production. Yet I'd like to make the test possible.
>
> Why would you ever enable error injection on a production kernel?

That's not what I mean. Let me rephrase: if the test needs the injection
infrastructure, it can't be run with a production build. A crafted image
can reproduce that.

> We simply don't run the error injection tests when the
> infrastructure is not there, just like we do with all other tests
> that are dependent on optional kernel/fs features

I think this depends on the test and the bug class it's supposed to hit.
The images need some determinism, eg. a traversal of structures, or
existence (or not) of some state, or error handling.

> > Also, I find it useful to keep the exact images that are attached to a
> > report and not necessarily try to recreate it to trigger a bug. If the
> > images are small, hosting them with the sources makes testing easy.
> > Big images would probably go to their own repository and be linked.
> >
> > I don't really expect fstests to store crafted images and would advise
> > Qu to not even try to propose that, because the answer was quite
> > predictable.
>
> You say that like it's a bad thing?

From the POV of fstests, it won't work, except possibly in some very
limited number of tests. As it's a multi-filesystem project, this won't
scale. You said that earlier and I consider this obvious. What we can do
for fstests is configure an external path with filesystem images and let
tests use it for any purpose. In btrfs-progs we eg. don't have mount tests
of the fuzzed images, but I would like to make them more accessible for
wider testing, in a way that's acceptable for fstests.

> What's the guarantee that a
> specific corrupt image will always be sufficient to trigger the
> problem the test is supposed to check? i.e. we could change a
> different part of the FS code and that could mask the bug the image
> used to trigger. The test then passes, but we haven't actually fixed
> the bug that the test used to exercise. i.e. we've got a false "we
> fixed the problem" report, when all we did is prevent a specific
> vector from tripping over it.

Again, I think this depends. I can't be sure the image will always trigger
the bug that was there originally. I've seen that some fuzzed images
required adding early checks that in some cases did not reach the original
point of failure in other images. As long as the test assumptions are met,
the test will not give a false sense of fixing. In the case of the fuzzed
images, one may not cover all the assumptions, but I take them more as
statistical proof that we don't have new bugs. And it's convenient to have
the images at hand.

> Error injection and sysfs hooks into debug code are much more
> reliable ways of testing that the root cause of a problem is fixed
> and stays fixed. The reproducing trigger cannot be masked by
> changes in other code layers, so we know that if the error
> injection/sysfs hook test handles the problem correctly, we have
> actually fixed the underlying bug and not just masked it...

Agreed. This applies to the example that Qu asked about: as balance
modifies global filesystem state, a single image won't cover many variants
of structures (with snapshots, optional mkfs features). The injection
framework will give better testing coverage.
Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 01:16:05AM -0600, Chris Murphy wrote:
> OK so we know for raid5 data block groups there can be RMW. And
> because of that, any interruption results in the write hole. On Btrfs,
> though, the write hole is on disk only. If there's a lost strip
> (failed drive or UNC read), reconstruction from wrong parity results
> in a checksum error and EIO. That's good.
>
> However, what happens in the metadata case? If metadata is raid5, and
> there's a crash or power failure during metadata RMW, same problem:
> wrong parity, bad reconstruction, csum mismatch, and EIO. So what's
> the effect of EIO when reading metadata?

The effect is you can't access the page or anything referenced by the page.
If the page happens to be a root or interior node of something important,
large parts of the filesystem are inaccessible, or the filesystem is not
mountable at all. RAID device management and balance operations don't work
because they abort as soon as they find the first unreadable metadata page.

In theory it's still possible to rebuild parts of the filesystem offline
using backrefs or brute-force search. Using an old root might work too, but
in bad cases the newest viable root could be thousands of generations old
(i.e. it's more likely that no viable root exists at all).

> And how common is RMW for metadata operations?

RMW in metadata is the norm. It happens on nearly all commits--the only
exception seems to be when both ends of a commit write happen to land on
stripe boundaries accidentally, which is less than 1% of the time on
3 disks.

> I wonder if this is where all of these damn strange cases come from,
> where people can't do anything at all with a normally degraded raid5 -
> one device failed, and no other failures, but they can't mount due to
> a bunch of csum errors.

I'm *astonished* to hear about real-world successes with raid5 metadata.
The total-loss failure reports are the result I expect.

The current btrfs raid5 implementation is a thin layer of bugs on top of
code that is still missing critical pieces. There is no mechanism to
prevent RMW-related failures, combined with zero tolerance for RMW-related
failures in metadata, so I expect a btrfs filesystem using raid5 metadata
to be extremely fragile. Failure is not likely--it's *inevitable*.

The non-RMW-aware allocator almost maximizes the risk of RMW data loss.
Every transaction commit contains multiple tree root pages, which are the
most critical metadata that could be lost due to RMW failure. There is a
window at least a few milliseconds wide, and potentially several seconds
wide, where some data on disk is in an unrecoverable state due to RMW.
This happens twice a minute with the default commit interval, and 99% of
commits are affected. That's a million opportunities per machine-year to
lose metadata. If a crash lands on one of those, boom, no more filesystem.

I expect one random crash (i.e. a crash that is not strongly correlated to
RMW activity) out of 30-2000 (depending on filesystem size, workload,
rotation speed, btrfs mount parameters) will destroy a filesystem under
typical conditions. Real-world crashes tend not to be random (i.e. they
are strongly correlated to RMW activity), so filesystem loss will be much
more frequent in practice.
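The interrupted-RMW failure mode being discussed here can be seen with plain XOR arithmetic. A toy sketch (single-byte strips with made-up values; real strips are far larger): update one data strip, crash before the matching parity write, then try to reconstruct a different strip from the now-stale parity.

```shell
# Toy 4-disk raid5 stripe: three data strips and XOR parity, one byte each.
d1=170; d2=85; d3=15
p=$(( d1 ^ d2 ^ d3 ))             # parity consistent with the data

# An RMW update of d2 needs two writes: new d2, then new parity.
# Simulate a crash after the data write but before the parity write:
d2=99                             # new data landed on disk
                                  # p still holds the OLD parity -> write hole

# Later the disk holding d1 fails; reconstruct d1 from the stale parity:
recon_d1=$(( p ^ d2 ^ d3 ))
echo "original d1=$d1, reconstructed d1=$recon_d1"
# The values differ, so the rebuilt block fails its checksum -> EIO.
```

For data blocks this at least fails loudly via the csum check; for metadata, the same EIO takes everything referenced by that page with it, which is the total-loss scenario described above.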
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.9 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.9

Has some fixes from Omar and Dave Sterba for our new free space tree. This
isn't heavily used yet, but as we move toward making it the new default we
wanted to nail down an endian bug.

Omar Sandoval (5) commits (+259/-145):
    Btrfs: expand free space tree sanity tests to catch endianness bug (+96/-68)
    Btrfs: fix extent buffer bitmap tests on big-endian systems (+51/-36)
    Btrfs: fix free space tree bitmaps on big-endian systems (+76/-27)
    Btrfs: fix mount -o clear_cache,space_cache=v2 (+12/-12)
    Btrfs: catch invalid free space trees (+24/-2)

David Sterba (2) commits (+13/-12):
    btrfs: tests: uninline member definitions in free_space_extent (+2/-1)
    btrfs: tests: constify free space extent specs (+11/-11)

Total: (7) commits (+272/-157)

 fs/btrfs/ctree.h                       |   3 +-
 fs/btrfs/disk-io.c                     |  33 +++---
 fs/btrfs/extent_io.c                   |  64 +++
 fs/btrfs/extent_io.h                   |  22 
 fs/btrfs/free-space-tree.c             |  19 ++--
 fs/btrfs/tests/extent-io-tests.c       |  87 ---
 fs/btrfs/tests/free-space-tree-tests.c | 189 +++--
 include/uapi/linux/btrfs.h             |  12 ++-
 8 files changed, 272 insertions(+), 157 deletions(-)
Re: RAID system with adaption to changed number of disks
Zygo Blaxell posted on Fri, 14 Oct 2016 15:55:45 -0400 as excerpted:

> The current btrfs raid5 implementation is a thin layer of bugs on top of
> code that is still missing critical pieces. There is no mechanism to
> prevent RMW-related failures combined with zero tolerance for
> RMW-related failures in metadata, so I expect a btrfs filesystem using
> raid5 metadata to be extremely fragile. Failure is not likely--it's
> *inevitable*.

Wow, that's a signature-quality quote reflecting just how dire the
situation with btrfs parity-raid is ATM. First sentence for a short sig,
full paragraph for a longer one.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program,
he is your master." Richard Stallman
Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell wrote:
>> And how common is RMW for metadata operations?
>
> RMW in metadata is the norm. It happens on nearly all commits--the only
> exception seems to be when both ends of a commit write happen to land
> on stripe boundaries accidentally, which is less than 1% of the time on
> 3 disks.

In the interest of due diligence, and the fact I can't confirm or deny
this myself from reading the code (although I do see many comments
involving RMW in the code), I must ask Qu if he can corroborate this.

Because that basically means btrfs raid56 is not better than md raid56 -
by design. It has nothing to do with bugs. This is substantially worse
than the scrub->wrong parity bug.

Does it make sense to proscribe the raid5 profile for metadata? As in,
disallow -m raid5 at mkfs time? Maybe recommend raid1. Even raid6 seems
like it could be specious - yes there are two copies, but if there is
constant RMW, then there's no CoW and we're not really protected that well
with all of these overwrites, statistically speaking.

Basically you have to have a setup where there's no chance of torn or
misdirected writes, and no corruptions, in which case Btrfs checksums
aren't really helpful; you're using it for other reasons (snapshots and
what not).

Really seriously, the CoW part of Btrfs being violated by all of this RMW
sounds to me like it reduces the pros of Btrfs.

-- 
Chris Murphy
Re: raid6 file system in a bad state
This may be relevant and is pretty terrible.

http://www.spinics.net/lists/linux-btrfs/msg59741.html

Chris Murphy
Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots
On 10/14/2016 01:38 PM, Austin S. Hemmelgarn wrote:
> On 2016-10-13 17:21, Alberto Bursi wrote:
>> Hi, I'm using OpenSUSE on a btrfs volume spanning 2 disks (set as raid1
>> for both metadata and data), no separate /home partition.
>>
>> I'd like to be able to clone verbatim the whole volume to another
>> volume, for backup purposes.
>>
>> Now, I think I can do that with a brutal partition clone from my
>> "recovery" (a debian system installed in another drive) and then doing
>> the procedures to deal with a lost drive.
>>
>> But I'd rather prefer a clone on a live system, and without doing a
>> brutal clone as that will keep the same UUIDs.
>>
>> I can (will) script this, so even if it is a tedious process or it
>> involves writing a huge line in the commandline it's not an issue.
> I'm not sure there is any way to do this on a live system. You
> essentially need to either do a block-level copy and change the UUIDs
> (which recent versions of btrfstune can do), or use send with some
> manual setup to get the subvolumes across. Both options require the
> filesystem to be effectively read-only, which is not something that any
> modern Linux distro can reliably handle for more than a few minutes.
>
> If I had to do this, I'd go with the block-level copy, since it requires
> a lot less effort; just make sure to use btrfstune to change the UUIDs
> when the copy is done (that may take a while itself, though, since it
> has to rewrite a lot of metadata).

Heh, one of the reasons I migrated to btrfs was that I wanted to do things
like these on a live system. With my script I can already generate any
necessary folder structure and send over all subvolumes with a bunch of
btrfs send | btrfs receive, with like 5 lines of script.

I was hoping there was some neat trick with btrfs send | btrfs receive
that allowed me to send a snapshot of / to the / of the new volume. With
rsync (from a ro snapshot of /) it should be possible to use the subvolume
list as an exclusion list, but I'd have rather wanted to use a btrfs
feature instead.

-Alberto
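The "bunch of btrfs send | btrfs receive" loop Alberto describes can be sketched roughly as follows. The mount points are hypothetical, it must run as root, and it is guarded so it does nothing when no btrfs volume is present. Note that receive always creates a new subvolume under the target path; it cannot populate the target's top level in place, which is exactly the limitation discussed in this thread.

```shell
# Replicate a btrfs volume's top level and its child subvolumes to a
# second btrfs volume via send/receive. SRC and DST are made-up paths.
SRC=/mnt/source DST=/mnt/backup
if [ "$(stat -f -c %T "$SRC" 2>/dev/null)" != "btrfs" ]; then
    echo "no btrfs volume at $SRC, skipping"
else
    # send requires a read-only snapshot of each subvolume
    btrfs subvolume snapshot -r "$SRC" "$SRC/root.ro"
    btrfs send "$SRC/root.ro" | btrfs receive "$DST"
    # repeat for every subvolume below the top level; the path is the
    # last field of each `subvolume list` line
    btrfs subvolume list -o "$SRC" | awk '{ print $NF }' | while read -r sv; do
        btrfs subvolume snapshot -r "$SRC/$sv" "$SRC/$sv.ro"
        btrfs send "$SRC/$sv.ro" | btrfs receive "$DST"
    done
fi
```

Everything still lands as sibling subvolumes under $DST, so some post-processing (or the rsync approach below in the thread's sense, an exclusion-list copy) is needed to recreate the original nesting.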
Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots
It should be that -e can accept a listing of all the subvolumes you want
to send at once. And possibly an -r flag, if it existed, could
automatically populate -e. But the last time I tested -e I just got
errors.

https://bugzilla.kernel.org/show_bug.cgi?id=111221

-- 
Chris Murphy
Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 3:38 PM, Chris Murphy wrote:
> On Fri, Oct 14, 2016 at 1:55 PM, Zygo Blaxell wrote:
>>> And how common is RMW for metadata operations?
>>
>> RMW in metadata is the norm. It happens on nearly all commits--the only
>> exception seems to be when both ends of a commit write happen to land
>> on stripe boundaries accidentally, which is less than 1% of the time on
>> 3 disks.
>
> [rest of quoted message snipped]

Also, is there RMW with raid0, or even raid10? Or is that always CoW for
metadata and data, just like single and dup? If raid0 is always CoW, then
I don't think it's correct to consider raid5 minus parity to be anything
like raid0 - in a Btrfs context anyway. Outside of that context, I
understand the argument.

-- 
Chris Murphy
Re: Copy BTRFS volume to another BTRFS volume including subvolumes and snapshots
On 10/15/2016 12:17 AM, Chris Murphy wrote:
> It should be -e can accept a listing of all the subvolumes you want to
> send at once. And possibly an -r flag, if it existed, could
> automatically populate -e. But the last time I tested -e I just got
> errors.
>
> https://bugzilla.kernel.org/show_bug.cgi?id=111221

Not a problem (for me anyway), I can send all subvolumes already with my
script (one after another, but still automatically).

What I can't do with btrfs commands is send over the contents of a ro
snapshot of /, called for example "oldRootSnapshot", directly to
"/tmp/newroot" (which is where I have mounted the other drive/volume).
The only thing I can do is send over the subvolume as a subvolume, so I
end up with /tmp/newroot/oldRootSnapshot, and inside oldRootSnapshot I get
my root - not what I wanted.

The only way I found so far is using rsync to move the contents of
oldRootSnapshot into /tmp/newroot by setting an exclusion list for all
subvolumes, then running a deduplication with duperemove.

So, is there something I missed to do that?

-Alberto
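Generating that rsync exclusion list from the subvolume list can be automated. A sketch, using a hard-coded, made-up `btrfs subvolume list` output so the transformation is visible; on a live system the `list=` assignment would run the command itself:

```shell
# Made-up output of `btrfs subvolume list /`; on a real system use
#   list=$(btrfs subvolume list /)
list='ID 257 gen 100 top level 5 path home
ID 258 gen 101 top level 5 path opt
ID 259 gen 102 top level 257 path home/alberto/.cache'

# The subvolume path is the last field of each line; rsync wants each
# exclusion anchored with a leading slash.
excludes=$(printf '%s\n' "$list" | awk '{ print "/" $NF }')
printf '%s\n' "$excludes"

# Then, roughly (paths illustrative):
#   printf '%s\n' "$excludes" > /tmp/subvol.excludes
#   rsync -aHAX --exclude-from=/tmp/subvol.excludes \
#       /oldRootSnapshot/ /tmp/newroot/
```

The excluded subvolumes would then be sent separately with send/receive, as Alberto's script already does, followed by the duperemove pass he mentions.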
Re: RAID system with adaption to changed number of disks
On Fri, Oct 14, 2016 at 04:30:42PM -0600, Chris Murphy wrote:
> Also, is there RMW with raid0, or even raid10?

No. Mirroring is writing the same data in two isolated places. Striping is
writing data at different isolated places. No matter which sectors you
write through these layers, it does not affect the correctness of data in
any sector at a different logical address. None of these use RMW--you read
or write only complete sectors and act only on the specific sectors
requested. Only parity RAID does RMW.

e.g. in RAID0, when you modify block 47, you may actually modify block 93
on a different disk, but there's always a 1:1 mapping between every
logical and physical address. If there is a crash we go back to an earlier
tree that does not contain block 47/93, so we don't care if the write was
interrupted.

e.g. in RAID1, when you modify block 47, you modify physical block 47 on
two separate disks. The state of disk1-block47 may be different from the
state of disk2-block47 if the write is interrupted. If there is a crash we
go back to an earlier tree that does not contain either copy of block 47,
so we don't care about any inconsistency there.

So raid0, single, dup, raid1, and raid10 are OK--they fall into one or
both of the above cases. CoW works there. None of these properties change
in degraded mode with the mirroring profiles.

Parity RAID is writing data in non-isolated places. When you write to some
sectors, additional sectors are implicitly modified in degraded mode
(whether you are in degraded mode at the time of the writes or not). This
is different from the other cases because the other cases never modify any
sectors that were not explicitly requested by the upper layer. This is OK
if and only if the CoW layer is aware of this behavior and works around it.

> Or is that always CoW for metadata and data, just like single and dup?

It's always CoW at the higher levels, even for parity RAID. The problem is
that the CoW layer is not aware of the RMW behavior buried in the parity
RAID layer, so the combination doesn't work properly. CoW thinks it's
modifying only block 47, when in fact it's modifying an entire stripe in
degraded mode.

Let's assume 5-disk RAID5 with a strip size of one block for this example,
and say blocks 45-48 are one RAID stripe. If there is a crash, data in
blocks 45, 46, 47, and 48 may be irretrievably damaged by inconsistent
modification of parity and data blocks. When we try to go back to an
earlier tree that does not contain block 47, we will end up with a tree
that contains corruption in one of the blocks 45, 46, or 48. This
corruption will only be visible when something else goes wrong (parity
mismatch, data csum failure, disk failure, or scrub), so a damaged
filesystem that isn't degraded could appear to be healthy for a long time.

If the CoW layer is aware of this, it can arrange operations such that no
stripe is modified while it is referenced by a committed tree. Suppose the
stripe at blocks 49-52 is empty, so we write our CoW block at block 49
instead of 47. Since blocks 50-52 contain no data we care about, we don't
even have to bother reading them (just fill the other blocks with zero or
find some other data to write in the same commit), and we can eliminate
many slow RMW operations entirely[*]. If there is a crash we just fall
back to an earlier tree that does not contain block 49. This tree is not
damaged because we left blocks 45-48 alone. One way to tell this is done
right is that all data in each RAID stripe will always belong to exactly
zero or one transaction, not dozens of different transactions as stripes
have now.

The other way to fix things is to make stripe RMW atomic so that CoW works
properly. You can tell this is done right if you can find a stripe update
journal in the disk format or the code.

> If raid0 is always CoW, then I don't think it's correct to consider
> raid5 minus parity to be anything like raid0 - in a Btrfs context
> anyway. Outside of that context, I understand the argument.

[*] We'd still need parity RAID RMW for nodatacow and PREALLOC because
neither uses the CoW layer. That doesn't matter for nodatacow, because
nodatacow is how users tell us they don't want to read their data any
more, but it has interesting implications for PREALLOC. Maybe a solution
for PREALLOC is to do the first write strictly in RAID-stripe-sized units?
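Zygo's "write into an empty stripe" fix can be shown with the same kind of toy XOR arithmetic (single-byte strips, made-up values, stripe A standing in for blocks 45-48 and stripe B for 49-52): because the committed stripe is never touched, a torn write plus a disk loss cannot damage committed data.

```shell
# Stripe A holds committed, referenced data; stripe B is empty.
a1=170; a2=85; a3=15
ap=$(( a1 ^ a2 ^ a3 ))            # stripe A consistent

# CoW writes the modified block into the EMPTY stripe B as a
# full-stripe write: all data strips and parity written together, no RMW.
b1=99; b2=0; b3=0
bp=$(( b1 ^ b2 ^ b3 ))

# Crash mid-write: suppose the parity of stripe B never hit the disk.
bp=0                              # torn write, stripe B inconsistent
# No committed tree references stripe B yet, so nothing of value is lost.

# A disk dies. Committed data in stripe A still reconstructs correctly:
recon_a1=$(( ap ^ a2 ^ a3 ))
echo "a1=$a1 reconstructed=$recon_a1"
```

Contrast with the RMW case earlier in the thread, where the interrupted update and the committed data share a stripe and the stale parity silently corrupts a neighboring block.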