Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable
On Sun, Feb 7, 2016 at 6:28 PM, Benjamin Valentin wrote:
> Hi,
>
> I created a btrfs volume with 3x8TB drives (ST8000AS0002-1NA) in raid5
> configuration.
> I copied some TB of data onto it without errors (from eSATA drives, so
> rather fast - I mention that because of [1]), then set it up as a
> fileserver where it had data read and written to it over a gigabit
> ethernet connection for several days.
> This however didn't go so well because after one day, one of the drives
> dropped off the SATA bus.
>
> I don't know if that was related to [1] (I was running Linux 4.4-rc6 to
> avoid that) and by now all evidence has been eaten by logrotate :\
>
> But I was not concerned for I had set up raid5 to provide redundancy
> against one disc failure - unfortunately it did not.
>
> When trying to read a file I'd get an I/O error after some hundred MB
> (this is random across multiple files, but consistent for the same
> file) on both files written before and after the disc failure.
>
> (There was still data being written to the volume at this point.)
>
> After a reboot a couple days later the drive showed up again and SMART
> reported no errors, but the I/O errors remained.
>
> I then ran btrfs scrub (this took about 10 days) and afterwards I was
> again able to completely read all files written *before* the disc
> failure.
> However, many files written *after* the event (while only 2 drives were
> online) are still only readable up to a point:
>
> $ dd if=Dr.Strangelove.mkv of=/dev/null
> dd: error reading ‘Dr.Strangelove.mkv’: Input/output error
> 5331736+0 records in
> 5331736+0 records out
> 2729848832 bytes (2,7 GB) copied, 11,1318 s, 245 MB/s
>
> $ ls -sh
> 4,4G Dr.Strangelove.mkv
>
> [  197.321552] BTRFS warning (device sda): csum failed ino 171545 off
> 2269564928 csum 2566472073 expected csum 2434927850
> [  197.321574] BTRFS warning (device sda): csum failed ino 171545 off
> 2269569024 csum 2566472073 expected csum 212160686
> [  197.321592] BTRFS warning (device sda): csum failed ino 171545 off
> 2269573120 csum 2566472073 expected csum 2202342500
>
> I tried btrfs check --repair but to no avail, got some
>
> [ 4549.762299] BTRFS warning (device sda): failed to load free space cache
> for block group 1614937063424, rebuilding it now
> [ 4549.790389] BTRFS error (device sda): csum mismatch on free space cache
>
> and this result
>
> checking extents
> Fixed 0 roots.
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> enabling repair mode
> Checking filesystem on /dev/sda
> UUID: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
> cache and super generation don't match, space cache will be invalidated
> found 11674258875712 bytes used err is 0
> total csum bytes: 11387937220
> total tree bytes: 13011156992
> total fs tree bytes: 338083840
> total extent tree bytes: 99123200
> btree space waste bytes: 1079766991
> file data blocks allocated: 14669115838464
> referenced 14668840665088
>
> when I mount the volume with -o nospace_cache I instead get
>
> [ 6985.165421] BTRFS warning (device sda): csum failed ino 171545 off
> 2269560832 csum 2566472073 expected csum 874509527
> [ 6985.165469] BTRFS warning (device sda): csum failed ino 171545 off
> 2269564928 csum 2566472073 expected csum 2434927850
> [ 6985.165490] BTRFS warning (device sda): csum failed ino 171545 off
> 2269569024 csum 2566472073 expected csum 212160686
>
> when trying to read the file.

You could use the one-time mount option clear_cache, then mount normally and the cache will be rebuilt automatically (though it is also corrected over time even if you don't clear it).

> Do you think there is still a chance to recover those files?

You can use btrfs restore to get files off a damaged fs.

> Also am I mistaken to believe that btrfs-raid5 would continue to
> function when one disc fails?

The problem you encountered is unfortunately quite typical. The answer is yes, if you stop writing to the fs - but that's not acceptable of course. A key problem of btrfs raid (also in recent kernels like 4.4) is that when a (redundant) device goes offline (like pulling the SATA cable, or an HDD firmware crash), btrfs/the kernel does not notice it, or does not act correctly upon it, under various circumstances. So, same as in your case, writing to the disappeared device seems to continue. For just the data this might still be recoverable, but for the rest of the structures it might corrupt the fs heavily.
What should happen is that the btrfs+kernel+fs state switches to degraded mode and warns about the device failure so that the user can take action. Or, completely automatically, it starts using a spare disk that is on standby but connected. But this spare-disk method is currently only available as patches on this list; it will take time before they appear in the mainline kernel, I assume. It is possible to reproduce the issue of one device of a raid array disappearing while btrfs/the kernel still thinks it's there. I hit this problem myself twice with loop devices, it ruined things, luckily
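The recovery steps suggested above could look roughly like this (a sketch only - the device path, mount points, and restore destination are placeholders, and these commands need root and a real, damaged filesystem, so treat them as an outline rather than a tested procedure):

```shell
# Rebuild the corrupt free space cache: clear_cache is a one-time
# mount option; subsequent normal mounts use the rebuilt cache.
mount -o clear_cache /dev/sda /mnt/data
umount /mnt/data
mount /dev/sda /mnt/data

# To salvage files, btrfs restore works on the *unmounted* device:
# -i ignores errors and keeps going, -v lists each file as restored.
umount /mnt/data
btrfs restore -iv /dev/sda /mnt/recovery
```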
Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions
On Fri, Feb 5, 2016 at 12:36 PM, Mackenzie Meyer wrote:
>
> RAID 6 write holes?

I don't even understand the nature of the write hole on Btrfs. If modification is still always COW, then either an fs block, a strip, or a whole stripe write happens; I'm not sure where the hole comes from. It suggests some raid56 writes are not atomic.

If you're worried about raid56 write holes, then a.) you need a server running this raid where power failures or crashes don't happen, b.) don't use raid56, or c.) use ZFS.

> RAID 6 stability?
> Any articles I've tried looking for online seem to be from early 2014,
> I can't find anything recent discussing the stability of RAID 5 or 6.
> Are there or have there recently been any data corruption bugs which
> impact RAID 6? Would you consider RAID 6 safe/stable enough for
> production use?

It's not stable for your use case if you have to ask others whether it's stable enough for your use case. Simple as that.

Right now some raid6 users are experiencing remarkably slow balances, on the order of weeks. If device replacement rebuild times are that long, I'd say it's disqualifying for most any use case, just because there are alternatives that have better failover behavior than this. So far there's no word from any developers on what the problem might be, or where to gather more information. So chances are they're already aware of it but haven't reproduced it, isolated it, or fixed it yet.

If you're prepared to help make Btrfs better in the event you have a problem, with possibly some delay in getting that volume up and running again (including the likelihood of having to rebuild it from a backup), then it might be compatible with your use case.

> Do you still strongly recommend backups, or has stability reached a
> point where backups aren't as critical? I'm thinking from a data
> consistency standpoint, not a hardware failure standpoint.

You can't separate them.
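For what it's worth, the classic parity write hole on *conventional* in-place RAID5/6 (which may or may not match what btrfs raid56 does internally - that's exactly the open question above) can be sketched as a toy model, with a stripe reduced to single byte values:

```shell
#!/bin/bash
# Toy RAID5 stripe: two data blocks and their XOR parity, shrunk to
# one byte each (an illustration only, not btrfs code).
d0=170          # 10101010
d1=85           # 01010101
parity=$(( d0 ^ d1 ))

# Rewrite d0 in place, but "crash" before the matching parity update
# reaches the disk -- the data and parity writes were not atomic.
d0=240          # new data made it to disk
# parity still reflects the old d0: this is the write hole.

# Later the disk holding d1 dies; reconstruction from the stale
# parity silently produces garbage instead of the real d1.
reconstructed_d1=$(( d0 ^ parity ))
echo "real d1=$d1, reconstructed d1=$reconstructed_d1"
# prints: real d1=85, reconstructed d1=15
```

A full-stripe COW write avoids this because data and parity are written together to a new location; the hole only opens when a sub-stripe update modifies data and parity in place, non-atomically.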
On completely stable hardware, stem to stern, you'd have no backups, no Btrfs or ZFS; you'd just run linear/concat arrays with XFS, for example. So you can't just hand-wave the hardware part away. There are bugs in the entire storage stack, there are connectors that can become intermittent, the system could crash. All of these affect data consistency.

Stability has not reached a point where backups aren't as critical. I don't really even know what that means, though. Btrfs or not, you need to be doing backups such that if the primary stack is a 100% loss without notice, it is not a disaster. Plan on having to use it. If you don't like the sound of that, look elsewhere.

> I plan to start with a small array and add disks over time. That said,
> currently I have mostly 2TB disks and some 3TB disks. If I replace all
> 2TB disks with 3TB disks, would BTRFS then start utilizing the full
> 3TB capacity of each disk, or would I need to destroy and rebuild my
> array to benefit from the larger disks?

Btrfs, LVM raid, mdraid, and ZFS all let you grow arrays without having to recreate the file system from scratch; each has different levels of ease of doing this, and of how long it will take.

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
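On btrfs, the 2TB-to-3TB swap the question describes is typically done one disk at a time with btrfs replace, then growing the replaced device onto its new capacity (a sketch only - the device paths, devid, and mount point are placeholders, and this needs root on a real filesystem):

```shell
# Replace one old 2TB disk with a new 3TB disk; repeat per disk.
btrfs replace start /dev/old2tb /dev/new3tb /mnt/array
btrfs replace status /mnt/array

# The replacement is used at the old size until resized; grow the
# replaced device (here devid 2) to its full capacity.
btrfs filesystem resize 2:max /mnt/array
```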
Re: [PATCH 13/23] xfs: test fragmentation characteristics of copy-on-write
On Mon, Feb 08, 2016 at 05:13:09PM -0800, Darrick J. Wong wrote:
> Perform copy-on-writes at random offsets to stress the CoW allocation
> system. Assess the effectiveness of the extent size hint at
> combatting fragmentation via unshare, a rewrite, and no-op after the
> random writes.
>
> Signed-off-by: Darrick J. Wong

> +seq=`basename "$0"`
> +seqres="$RESULT_DIR/$seq"
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +    cd /
> +    #rm -rf "$tmp".* "$testdir"

Now that I've noticed it, a few tests have this line commented out. Probably should remove the tmp files, at least.

> +rm -f "$seqres.full"
> +
> +echo "Format and mount"
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1
> +
> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir

Again, something that is repeated - we just mkfs'd the scratch device, so the $testdir is guaranteed not to exist...

> +echo "Check for damage"
> +umount "$SCRATCH_MNT"

I've also noticed this in a lot of tests - the scratch device will be unmounted by the harness, so I don't think this is necessary.

> +free_blocks=$(stat -f -c '%a' "$testdir")
> +real_blksz=$(stat -f -c '%S' "$testdir")
> +space_needed=$(((blksz * nr * 3) * 5 / 4))
> +space_avail=$((free_blocks * real_blksz))
> +internal_blks=$((blksz * nr / real_blksz))
> +test $space_needed -gt $space_avail && _notrun "Not enough space. $space_avail < $space_needed"

Why not:

_require_fs_space $space_needed

At minimum, it seems to be a repeated hunk of code, so it should be factored.
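A minimal version of the factored-out helper Dave is asking for might look like this (a hypothetical sketch - the name, argument order, and byte units are assumptions, not the actual xfstests helper; it reuses the stat invocations from the quoted hunk):

```shell
#!/bin/bash
# Hypothetical _require_fs_space: print a "notrun" message unless the
# filesystem holding $1 has at least $2 bytes available.
_require_fs_space()
{
	local dir=$1 needed=$2
	local free_blocks real_blksz avail

	free_blocks=$(stat -f -c '%a' "$dir")   # free blocks for non-root
	real_blksz=$(stat -f -c '%S' "$dir")    # fundamental block size
	avail=$((free_blocks * real_blksz))
	if [ "$needed" -gt "$avail" ]; then
		echo "notrun: not enough space ($avail < $needed)"
	fi
}

# Example: require 64 KiB of free space under /tmp before running.
_require_fs_space /tmp $((64 * 1024))
```

In the harness the message would go through _notrun rather than echo; the point is only that the free-space arithmetic lives in one place.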
> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir
> +
> +echo "Create the original files"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 0" "$testdir/file1" >> "$seqres.full"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 1048576" "$testdir/file2" >> "$seqres.full"
> +_scratch_remount
> +
> +echo "Set extsz and cowextsz on zero byte file"
> +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file1" | _filter_scratch
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file1" | _filter_scratch
> +
> +echo "Set extsz and cowextsz on 1Mbyte file"
> +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file2" | _filter_scratch
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file2" | _filter_scratch
> +_scratch_remount
> +
> +fn() {
> +	"$XFS_IO_PROG" -c "$1" "$2" | sed -e 's/.\([0-9]*\).*$/\1/g'
> +}
> +echo "Check extsz and cowextsz settings on zero byte file"
> +test $(fn extsize "$testdir/file1") -eq 1048576 || echo "file1 extsize not set"
> +test $(fn cowextsize "$testdir/file1") -eq 1048576 || echo "file1 cowextsize not set"

For this sort of thing, just dump the extent size value to the golden output. i.e.

echo "Check extsz and cowextsz settings on zero byte file"
$XFS_IO_PROG -c extsize $testdir/file1
$XFS_IO_PROG -c cowextsize $testdir/file1

is all that is needed. That way, if it fails, we see what value it had instead of the expected 1MB.
This also makes the test much less verbose and easier to read.

> +
> +echo "Check extsz and cowextsz settings on 1Mbyte file"
> +test $(fn extsize "$testdir/file2") -eq 0 || echo "file2 extsize not set"
> +test $(fn cowextsize "$testdir/file2") -eq 1048576 || echo "file2 cowextsize not set"
> +
> +echo "Set cowextsize and check flag"
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file3" | _filter_scratch
> +_scratch_remount
> +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | awk '{print $4}' | grep -c 'C') -eq 1 || echo "file3 cowextsz flag not set"
> +test $(fn cowextsize "$testdir/file3") -eq 1048576 || echo "file3 cowextsize not set"
> +"$XFS_IO_PROG" -f -c "cowextsize 0" "$testdir/file3" | _filter_scratch
> +_scratch_remount
> +test $(fn cowextsize "$testdir/file3") -eq 0 || echo "file3 cowextsize not set"
> +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | awk '{print $4}' | grep -c 'C') -eq 0 || echo "file3 cowextsz flag not set"

Same with all these - just grep the output for the line you want, and the golden output matching does everything else. e.g. the flag check simply becomes:

$XFS_IO_PROG -c "stat" $testdir/file3 | grep 'fsxattr.xflags'

Again, this tells us what the wrong flags are if it fails... There are quite a few bits of these tests where the same thing applies.

-Dave.

--
Dave Chinner
da...@fromorbit.com
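The difference Dave is pointing at can be seen with a toy stand-in for xfs_io (fake_io below is made up purely for illustration; it pretends the extent size hint came out wrong):

```shell
#!/bin/bash
# fake_io stands in for xfs_io printing a wrong extent size hint
# (524288 instead of the expected 1048576).
fake_io() { echo "fsxattr.extsize = 524288"; }

# Assert style: the golden output only ever records that something
# failed, not what the bad value was.
test "$(fake_io | sed -e 's/[^0-9]*\([0-9]*\).*/\1/')" -eq 1048576 || \
	echo "extsize not set"

# Golden-output style: the diff against the .out file shows the bad
# value itself, so the failure report is self-explanatory.
fake_io
```

The first check diffs as "extsize not set"; the second diffs as "fsxattr.extsize = 524288" against the expected 1048576 line, which is immediately actionable.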
Re: [PATCH 17/23] reflink: test CoW across a mixed range of block types with cowextsize set
On Mon, Feb 08, 2016 at 05:13:35PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong
> ---
>  tests/xfs/215     |  108 ++
>  tests/xfs/215.out |   14 +
>  tests/xfs/218     |  108 ++
>  tests/xfs/218.out |   14 +
>  tests/xfs/219     |  108 ++
>  tests/xfs/219.out |   14 +
>  tests/xfs/221     |  108 ++
>  tests/xfs/221.out |   14 +
>  tests/xfs/223     |  113
>  tests/xfs/223.out |   14 +
>  tests/xfs/224     |  113
>  tests/xfs/224.out |   14 +
>  tests/xfs/225     |  108 ++
>  tests/xfs/225.out |   14 +
>  tests/xfs/226     |  108 ++
>  tests/xfs/226.out |   14 +
>  tests/xfs/228     |  137 +
>  tests/xfs/228.out |   14 +
>  tests/xfs/230     |  137 +
>  tests/xfs/230.out |   14 +
>  tests/xfs/group   |   10
>  21 files changed, 1298 insertions(+)
>  create mode 100755 tests/xfs/215
>  create mode 100644 tests/xfs/215.out
>  create mode 100755 tests/xfs/218
>  create mode 100644 tests/xfs/218.out
>  create mode 100755 tests/xfs/219
>  create mode 100644 tests/xfs/219.out
>  create mode 100755 tests/xfs/221
>  create mode 100644 tests/xfs/221.out
>  create mode 100755 tests/xfs/223
>  create mode 100644 tests/xfs/223.out
>  create mode 100755 tests/xfs/224
>  create mode 100644 tests/xfs/224.out
>  create mode 100755 tests/xfs/225
>  create mode 100644 tests/xfs/225.out
>  create mode 100755 tests/xfs/226
>  create mode 100644 tests/xfs/226.out
>  create mode 100755 tests/xfs/228
>  create mode 100644 tests/xfs/228.out
>  create mode 100755 tests/xfs/230
>  create mode 100644 tests/xfs/230.out
>
> diff --git a/tests/xfs/215 b/tests/xfs/215
> new file mode 100755
> index 000..8dd5cb5
> --- /dev/null
> +++ b/tests/xfs/215
> @@ -0,0 +1,108 @@
> +#! /bin/bash
> +# FS QA Test No. 215
> +#
> +# Ensuring that copy on write in direct-io mode works when the CoW
> +# range originally covers multiple extents, some unwritten, some not.
> +# - Set cowextsize hint.
> +# - Create a file and fallocate a second file.
> +# - Reflink the odd blocks of the first file into the second file.
> +# - directio CoW across the halfway mark, starting with the unwritten extent.
> +# - Check that the files are now different where we say they're different.
> +#
> +#---
> +# Copyright (c) 2016, Oracle and/or its affiliates.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#---
> +
> +seq=`basename "$0"`
> +seqres="$RESULT_DIR/$seq"
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +    cd /
> +    rm -rf "$tmp".*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/reflink
> +
> +# real QA test starts here
> +_supported_os Linux
> +_require_scratch_reflink
> +_require_xfs_io_command "falloc"
> +
> +rm -f "$seqres.full"
> +
> +echo "Format and mount"
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1
> +
> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir
> +
> +echo "Create the original files"
> +blksz=65536
> +nr=64
> +real_blksz=$(stat -f -c '%S' "$testdir")
> +internal_blks=$((blksz * nr / real_blksz))
> +"$XFS_IO_PROG" -c "cowextsize $((blksz * 16))" "$testdir" >> "$seqres.full"
> +_pwrite_byte 0x61 0 $((blksz * nr)) "$testdir/file1" >> "$seqres.full"
> +$XFS_IO_PROG -f -c "falloc 0 $((blksz * nr))" "$testdir/file3" >> "$seqres.full"
> +_pwrite_byte 0x00 0 $((blksz * nr)) "$testdir/file3.chk" >> "$seqres.full"
> +seq 0 2 $((nr-1)) | while read f; do
> +	_reflink_range "$testdir/file1" $((blksz * f)) "$testdir/file3" $((blksz * f)) $blksz >> "$seqres.full"
> +	_pwrite_byte 0x61 $((blksz
Re: [PATCH 06/23] dio unwritten conversion bug tests
On Tue, Feb 09, 2016 at 06:37:32PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:12:23PM -0800, Darrick J. Wong wrote:
> > Check that we don't expose old disk contents when a directio write to
> > an unwritten extent fails due to IO errors. This primarily affects
> > XFS and ext4.
> >
> > Signed-off-by: Darrick J. Wong
> .
> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -252,7 +252,9 @@
> >  247 auto quick rw
> >  248 auto quick rw
> >  249 auto quick rw
> > +250 auto quick
> >  251 ioctl trim
> > +252 auto quick
>
> Also should be in the prealloc group if we are testing unwritten
> extent behaviour, and the rw group because it's testing IO.

Done. Should the CoW tests be in 'rw' too? They're testing IO, but OTOH they (most probably) require shared blocks to have much of a point.

--D

> Cheers,
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>
> ___
> xfs mailing list
> x...@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Re: [PATCH 19/23] xfs: test rmapbt functionality
On Mon, Feb 08, 2016 at 05:13:48PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong
> ---
>  common/xfs        |  44 ++
>  tests/xfs/233     |  78 ++
>  tests/xfs/233.out |   6 +++
>  tests/xfs/234     |  89
>  tests/xfs/234.out |   6 +++
>  tests/xfs/235     | 108 +
>  tests/xfs/235.out |  14 +++
>  tests/xfs/236     |  93 ++
>  tests/xfs/236.out |   8
>  tests/xfs/group   |   4 ++
>  10 files changed, 450 insertions(+)
>  create mode 100644 common/xfs
>  create mode 100755 tests/xfs/233
>  create mode 100644 tests/xfs/233.out
>  create mode 100755 tests/xfs/234
>  create mode 100644 tests/xfs/234.out
>  create mode 100755 tests/xfs/235
>  create mode 100644 tests/xfs/235.out
>  create mode 100755 tests/xfs/236
>  create mode 100644 tests/xfs/236.out
>
> diff --git a/common/xfs b/common/xfs
> new file mode 100644
> index 000..2d1a76f
> --- /dev/null
> +++ b/common/xfs
> @@ -0,0 +1,44 @@
> +##/bin/bash
> +# Routines for handling XFS
> +#---
> +# Copyright (c) 2015 Oracle.  All Rights Reserved.
> +# This program is free software; you can redistribute it and/or modify
> +# it under the terms of the GNU General Public License as published by
> +# the Free Software Foundation; either version 2 of the License, or
> +# (at your option) any later version.
> +#
> +# This program is distributed in the hope that it will be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write to the Free Software
> +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307
> +# USA
> +#
> +# Contact information: Oracle Corporation, 500 Oracle Parkway,
> +# Redwood Shores, CA 94065, USA, or: http://www.oracle.com/
> +#---
> +
> +_require_xfs_test_rmapbt()
> +{
> +	_require_test
> +
> +	if [ "$(xfs_info "$TEST_DIR" | grep -c "rmapbt=1")" -ne 1 ]; then
> +		_notrun "rmapbt not supported by test filesystem type: $FSTYP"
> +	fi
> +}
> +
> +_require_xfs_scratch_rmapbt()
> +{
> +	_require_scratch
> +
> +	_scratch_mkfs > /dev/null
> +	_scratch_mount
> +	if [ "$(xfs_info "$SCRATCH_MNT" | grep -c "rmapbt=1")" -ne 1 ]; then
> +		_scratch_unmount
> +		_notrun "rmapbt not supported by scratch filesystem type: $FSTYP"
> +	fi
> +	_scratch_unmount
> +}

No, not yet. :) Wait until I get my "split common/rc" patchset out there, because it does not require:

> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/xfs

This. And I don't want to have to undo a bunch of stuff in tests yet. Just lump it all in common/rc for the moment.

> +
> +# real QA test starts here
> +_supported_os Linux
> +_supported_fs xfs
> +_require_xfs_scratch_rmapbt
> +
> +echo "Format and mount"
> +_scratch_mkfs -d size=$((2 * 4096 * 4096)) -l size=4194304 > "$seqres.full" 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1

_scratch_mkfs_sized?

> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +    cd /
> +    #rm -f $tmp.*

More random uncommenting needed.

> +
> +echo "Check for damage"
> +umount "$SCRATCH_MNT"
> +_check_scratch_fs
> +
> +# success, all done
> +status=0
> +exit

Cull.

-Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 12/23] xfs/122: support refcount/rmap data structures
On Mon, Feb 08, 2016 at 11:55:06PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 09, 2016 at 06:43:30PM +1100, Dave Chinner wrote:
> > On Mon, Feb 08, 2016 at 05:13:03PM -0800, Darrick J. Wong wrote:
> > > Include the refcount and rmap structures in the golden output.
> > >
> > > Signed-off-by: Darrick J. Wong
> > > ---
> > >  tests/xfs/122     |  3 +++
> > >  tests/xfs/122.out |  4
> > >  tests/xfs/group   |  2 +-
> > >  3 files changed, 8 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/tests/xfs/122 b/tests/xfs/122
> > > index e6697a2..758cb50 100755
> > > --- a/tests/xfs/122
> > > +++ b/tests/xfs/122
> > > @@ -90,6 +90,9 @@ xfs_da3_icnode_hdr
> > >  xfs_dir3_icfree_hdr
> > >  xfs_dir3_icleaf_hdr
> > >  xfs_name
> > > +xfs_owner_info
> > > +xfs_refcount_irec
> > > +xfs_rmap_irec
> > >  xfs_alloctype_t
> > >  xfs_buf_cancel_t
> > >  xfs_bmbt_rec_32_t
> >
> > So this is going to cause failures on any userspace that doesn't
> > know about these new types, right?
> >
> > Should these be conditional in some way?
>
> I wasn't sure how to handle this -- I could just keep the patch at the head of
> my stack (unreleased) until xfsprogs pulls in the appropriate libxfs pieces?
> So long as we're not dead certain of the final format of the rmapbt and
> refcountbt, there's probably not a lot of value in putting this in (yet).

Well, I'm more concerned about running on older/current distros that don't have support for them in userspace. My brain is mush right now, so I don't have any brilliant ideas (hence the question, rather than also presenting a possible solution).

I'll have a think; maybe we can make use of the configurable .out file code we have now?

Cheers,
Dave.

--
Dave Chinner
da...@fromorbit.com
USB memory sticks wear & speed: btrfs vs f2fs?
How does btrfs compare to f2fs for use on (128GByte) USB memory sticks? Particularly for wearing out certain storage blocks?

Does btrfs heavily use particular storage blocks that will prematurely "wear out"? (That is, could the whole 128GBytes be lost due to one 4kByte block having been re-written too many times because of a fixed, repeatedly used filesystem block?)

Any other comparisons/thoughts for btrfs vs f2fs?

Thanks for any comment,

Martin
Re: [PATCH 21/23] xfs: aio cow tests
On Mon, Feb 08, 2016 at 05:14:01PM -0800, Darrick J. Wong wrote:
....
> +
> +echo "Check for damage"
> +_dmerror_unmount
> +_dmerror_cleanup
> +_repair_scratch_fs >> "$seqres.full" 2>&1

Are you testing repair here? If so, why doesn't failure matter? If not, why do it? Or is _require_scratch_nocheck all that is needed here?

> +echo "CoW and unmount"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * bsz)) 1" "$testdir/file2" >> "$seqres.full"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 -b $((blksz * bsz)) 0 $((blksz * nr))" "$TEST_DIR/moo" >> "$seqres.full"

offset = block size times block size? I think some better names might be needed...

Cheers,
Dave.

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 18/23] xfs: test the automatic cowextsize extent garbage collector
On Mon, Feb 08, 2016 at 05:13:42PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong
> +
> +_cleanup()
> +{
> +    cd /
> +    echo $old_cow_lifetime > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> +    #rm -rf "$tmp".* "$testdir"

Uncomment.

> +echo "CoW and leave leftovers"
> +echo $old_cow_lifetime > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> +seq 2 2 $((nr - 1)) | while read f; do
> +	"$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" "$testdir/file2" >> "$seqres.full"
> +	"$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" "$testdir/file2.chk" >> "$seqres.full"
> +done

Ok, I just realised what was bugging me about these loops: "f" is not a typical loop iterator for a count. Normally we'd use "i" for these.

> +echo "old extents: $old_extents" >> "$seqres.full"
> +echo "new extents: $new_extents" >> "$seqres.full"
> +echo "maximum extents: $internal_blks" >> "$seqres.full"
> +test $new_extents -lt $((internal_blks / 7)) || _fail "file2 badly fragmented"

I wouldn't use _fail like this; echo is sufficient to cause the test to fail.

> +echo "Check for damage"
> +umount "$SCRATCH_MNT"
> +
> +# success, all done
> +status=0
> +exit

As would getting rid of the unmount and just setting status appropriately... /repeat

-Dave.

--
Dave Chinner
da...@fromorbit.com
Re: [PATCH 10/23] xfs: more reflink tests
On Tue, Feb 09, 2016 at 06:36:22PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:12:50PM -0800, Darrick J. Wong wrote:
> > Create a couple of XFS-specific tests -- one to check that growing
> > and shrinking the refcount btree works and a second one to check
> > what happens when we hit maximum refcount.
> >
> > Signed-off-by: Darrick J. Wong
> .
> > +# real QA test starts here
> > +_supported_os Linux
> > +_supported_fs xfs
> > +_require_scratch_reflink
> > +_require_cp_reflink
> > +
> > +test -x "$here/src/punch-alternating" || _notrun "punch-alternating not built"
>
> I suspect we need a _require rule for checking that something in
> the test src directory has been built.

Crapola, we also need punch-alternating, which doesn't appear until the next patch. Guess I'll go move it out of the next patch (or swap the order of these two, I guess.)

I added _require_test_program() which complains if src/$1 isn't built.

> > +echo "Check scratch fs"
> > +umount "$SCRATCH_MNT"
> > +echo "check refcount after removing all files" >> "$seqres.full"
> > +"$XFS_DB_PROG" -c 'agf 0' -c 'addr refcntroot' -c 'p recs[1]' "$SCRATCH_DEV" >> "$seqres.full"
> > +"$XFS_REPAIR_PROG" -o force_geometry -n "$SCRATCH_DEV" >> "$seqres.full" 2>&1
> > +res=$?
> > +if [ $res -eq 0 ]; then
> > +	# If repair succeeds then format the device so that the post-test
> > +	# check doesn't fail due to the single AG.
> > +	_scratch_mkfs >> "$seqres.full" 2>&1
> > +else
> > +	_fail "xfs_repair fails"
> > +fi
> > +
> > +# success, all done
> > +status=0
> > +exit
>
> This is what _require_scratch_nocheck avoids. i.e. do this instead:
>
> _require_scratch_nocheck
> .
>
> "$XFS_REPAIR_PROG" -o force_geometry -n "$SCRATCH_DEV" >> "$seqres.full" 2>&1
> status=$?
> exit

Ok.

> Also, we really don't need the quotes around these global
> variables. They are just noise and lots of stuff will break if
> those variables are set to something that requires them to be
> quoted.
--D

> Cheers,
> Dave.
Re: USB memory sticks wear & speed: btrfs vs f2fs?
On 2/9/2016 1:13 PM, Martin wrote:
> How does btrfs compare to f2fs for use on (128GByte) USB memory sticks?
> Particularly for wearing out certain storage blocks?
>
> Does btrfs heavily use particular storage blocks that will prematurely
> "wear out"? (That is, could the whole 128GBytes be lost due to one
> 4kByte block having been re-written excessively too many times due to a
> fixed repeatedly used filesystem block?)
>
> Any other comparisons/thoughts for btrfs vs f2fs?

Copy-on-write (CoW) designs tend naturally to work well with flash media. F2fs is *specifically* designed to work well with flash, whereas for btrfs it is a natural consequence of the copy-on-write design.

With both filesystems, if you randomly generate a 1GB file and delete it 1000 times onto a 1TB flash, you are *very* likely to get exactly one write to *every* block on the flash (possibly two writes to <1% of the blocks) rather than, as would be the case with non-CoW filesystems, 1000 writes to a small chunk of blocks.

I haven't found much reference or comparison information online wrt wear leveling - mostly performance benchmarks that don't really address your request.

Personally I will likely never bother with f2fs unless I somehow end up working on a project requiring relatively small storage in Flash (as that is what f2fs was designed for).

If someone can provide or link to some proper comparison data, that would be nice. :)

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
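Brendan's rewrite-and-delete thought experiment can be checked with a toy allocator model (an illustration only - a drastically simplified log-structured allocator, not actual btrfs or f2fs code, and the sizes are scaled down from his 1TB example):

```shell
#!/bin/bash
# A "device" of NBLK blocks; a FILE_BLK-block file is rewritten
# REWRITES times, and we track how often each block is written.
NBLK=1000 FILE_BLK=10 REWRITES=200

# In-place filesystem: the same FILE_BLK blocks absorb every rewrite.
declare -a wear=()
for ((i = 0; i < REWRITES; i++)); do
	for ((b = 0; b < FILE_BLK; b++)); do
		wear[b]=$(( ${wear[b]:-0} + 1 ))
	done
done
max_inplace=0
for ((b = 0; b < NBLK; b++)); do
	if (( ${wear[b]:-0} > max_inplace )); then max_inplace=${wear[b]:-0}; fi
done

# CoW/log-structured: each rewrite lands on the next free blocks,
# wrapping around the device, so wear spreads evenly.
declare -a wear=()
head=0
for ((i = 0; i < REWRITES; i++)); do
	for ((b = 0; b < FILE_BLK; b++)); do
		blk=$(( (head + b) % NBLK ))
		wear[blk]=$(( ${wear[blk]:-0} + 1 ))
	done
	head=$(( (head + FILE_BLK) % NBLK ))
done
max_cow=0
for ((b = 0; b < NBLK; b++)); do
	if (( ${wear[b]:-0} > max_cow )); then max_cow=${wear[b]:-0}; fi
done

echo "max writes to any single block: in-place=$max_inplace cow=$max_cow"
# prints: max writes to any single block: in-place=200 cow=2
```

The same total number of writes (2000 block writes) produces a worst-case wear of 200 on a fixed hot spot versus 2 when spread across the whole device, which is the asymmetry the paragraph above describes.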
Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions
On 05/02/16 20:36, Mackenzie Meyer wrote: Hello, I've tried checking around on google but can't find information regarding the RAM requirements of BTRFS and most of the topics on stability seem quite old. To keep my answer short: every time I've tried (offline) deduplication or raid5 pools I've ended with borked filesystems. Last attempt was about a year ago. Given that the pages you mention looked the same by then, I'd stay away of raid56 for anything but testing purposes. I haven't read anything about raid5 that increases my confidence in it recently (i.e. post 3.19 kernels). Dedup, OTOH, I don't know. What I used were third-party (I think?) things so the fault may have rested on them and not btrfs (does that makes sense?) I'm building a new small raid5 pool as we speak, though, for throw-away data, so I hope to be favourably impressed. Cheers. So first would be memory requirements, my goal is to use deduplication and compression. Approximately how many GB of RAM per TB of storage would be recommended? RAID 6 write holes? The BTRFS wiki states that parity might be inconsistent after a crash. That said, the wiki page for RAID 5/6 doesn't look like it has much recent information on there. Has this issue been addressed and if not, are there plans to address the RAID write hole issue? What would be a recommended workaround to resolve inconsistent parity, should an unexpected power down happen during write operations? RAID 6 stability? Any articles I've tried looking for online seem to be from early 2014, I can't find anything recent discussing the stability of RAID 5 or 6. Are there or have there recently been any data corruption bugs which impact RAID 6? Would you consider RAID 6 safe/stable enough for production use? Do you still strongly recommend backups, or has stability reached a point where backups aren't as critical? I'm thinking from a data consistency standpoint, not a hardware failure standpoint. I plan to start with a small array and add disks over time. 
> That said, currently I have mostly 2TB disks and some 3TB disks. If I
> replace all 2TB disks with 3TB disks, would BTRFS then start utilizing
> the full 3TB capacity of each disk, or would I need to destroy and
> rebuild my array to benefit from the larger disks?
>
> Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
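The write-hole question above comes down to plain XOR parity arithmetic, and a toy model makes it concrete (this is an illustration only, not btrfs's actual raid56 code): if a crash lands between rewriting a data block and rewriting its parity, a later rebuild of a failed disk from the stale parity silently produces wrong data.

```python
# Toy model of the RAID5/6 write hole (illustration only, not btrfs code).
# A stripe holds two data blocks and one XOR parity block.

def xor(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

# Consistent stripe: parity = d0 XOR d1, so any one block is recoverable.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)

# Crash mid-update: d0 is rewritten on disk, the parity update never lands.
d0_new = b"CCCC"
stale_parity = parity  # still derived from the old d0

# Later the disk holding d1 fails; we rebuild d1 from d0_new and parity.
rebuilt_d1 = xor(d0_new, stale_parity)

print(rebuilt_d1 == d1)                     # False: rebuild is corrupted
print(xor(d0_new, xor(d0_new, d1)) == d1)   # True once parity is rewritten
```

Checksumming (which btrfs does) detects the corrupt reconstruction, but it cannot recover the data; that is why an unclean shutdown during writes still matters on parity RAID.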
Re: USB memory sticks wear & speed: btrfs vs f2fs?
On 2016-02-09 09:08, Brendan Hide wrote:
> On 2/9/2016 1:13 PM, Martin wrote:
>> How does btrfs compare to f2fs for use on (128GByte) USB memory
>> sticks? Particularly for wearing out certain storage blocks?
>>
>> Does btrfs heavily use particular storage blocks that will prematurely
>> "wear out"? (That is, could the whole 128GBytes be lost due to one
>> 4kByte block having been re-written excessively many times due to a
>> fixed, repeatedly used filesystem block?)
>>
>> Any other comparisons/thoughts for btrfs vs f2fs?
> Copy-on-write (CoW) designs tend naturally to work well with flash
> media. F2fs is *specifically* designed to work well with flash, whereas
> for btrfs it is a natural consequence of the copy-on-write design.
>
> With both filesystems, if you randomly generate a 1GB file and delete
> it 1000 times on a 1TB flash device, you are *very* likely to get
> exactly one write to *every* block on the flash (possibly two writes
> to <1% of the blocks) rather than, as would be the case with non-CoW
> filesystems, 1000 writes to a small group of blocks.

This goes double if you're using the 'ssd' mount option on BTRFS. Also, the only blocks that are rewritten in place on BTRFS (unless you turn off COW) are the superblocks, but all filesystems rewrite those in place.

> I haven't found much reference or comparison information online wrt
> wear leveling - mostly performance benchmarks that don't really
> address your request. Personally I will likely never bother with f2fs
> unless I somehow end up working on a project requiring relatively
> small storage in flash (as that is what f2fs was designed for).

I would tend to agree, but that's largely because BTRFS is more of a known entity for me, and certain features (send/receive in particular) are important enough for my usage that I'm willing to take the performance hit.
IIRC, F2FS was developed for usage in stuff like Android devices and other compact embedded devices, where the FTL may not do a good job of wear leveling, so it should work equally well on USB flash drives (many of the cheap ones have no wear-leveling at all, and even some of the expensive ones have sub-par wear-leveling compared to good SSD's).
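Brendan's write-spreading claim can be sketched with a toy allocator (a scaled-down model, not either filesystem's real allocator): a CoW filesystem sends each rewrite of the same logical block to the next free physical block, so per-block erase counts stay nearly uniform, while an in-place filesystem hammers one block.

```python
# Toy comparison of in-place vs copy-on-write rewrites (scaled-down model:
# 1000 rewrites of a 1-block file on a 100-block device).

DEVICE_BLOCKS = 100
REWRITES = 1000

# In-place filesystem: the same physical block is erased every time.
inplace_wear = [0] * DEVICE_BLOCKS
for _ in range(REWRITES):
    inplace_wear[0] += 1

# CoW filesystem: each rewrite allocates the next free block, wrapping
# around once the device has been cycled (the old copy is freed).
cow_wear = [0] * DEVICE_BLOCKS
cursor = 0
for _ in range(REWRITES):
    cow_wear[cursor] += 1
    cursor = (cursor + 1) % DEVICE_BLOCKS

print(max(inplace_wear))  # 1000: one block absorbs every erase
print(max(cow_wear))      # 10: wear is spread evenly across the device
```

With a flash cell rated for, say, 3000 erase cycles, the in-place pattern above kills its block three orders of magnitude sooner than the CoW pattern exhausts any block.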
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
On 02/01/2016 09:52 PM, Chris Murphy wrote:
>> Would some sort of stracing or profiling of the process help to narrow
>> down where the time is currently spent and why the balancing is only
>> running single-threaded?
> This can't be straced. Someone a lot more knowledgeable than I am
> might figure out where all the waits are with just a sysrq + t, if it
> is a hold-up in, say, parity computations. Otherwise perf, which is a
> rabbit hole, but perf top is kinda cool to watch. That might give you
> an idea where most of the cpu cycles are going if you can isolate the
> workload to just the balance. Otherwise you may end up with noisy
> data.

My balance run has now been working away since the 19th of January:
"885 out of about 3492 chunks balanced (996 considered), 75% left"

So this will take several more WEEKS to finish. Is there really nothing anyone here wants me to do or analyze to help find the root cause of this? I mean, with this kind of performance there is no way a RAID6 can be used in production. Not because the code is not stable or functioning, but because regular maintenance like replacing a drive or growing an array takes WEEKS, during which another maintenance procedure could become necessary or, much worse, another drive might fail.

What I'm saying is: such a slow RAID6 balance renders the redundancy unusable, because drives might fail quicker than the potential rebuild (read "balance") completes.

Regards
Christian
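Christian's "several more WEEKS" follows directly from the progress line: 885 of 3492 chunks in the 21 days since January 19th extrapolates to roughly two more months. A back-of-the-envelope check, assuming a constant chunk rate (which a balance does not guarantee):

```python
# Back-of-the-envelope ETA from the reported balance progress, assuming
# the chunk rate stays constant (balance makes no such promise).
from datetime import date

done, total = 885, 3492
elapsed_days = (date(2016, 2, 9) - date(2016, 1, 19)).days  # 21 days

rate = done / elapsed_days          # ~42 chunks per day
remaining_days = (total - done) / rate

print(round(remaining_days))  # ~62 more days, i.e. roughly nine weeks
```

At that rate the full balance takes close to three months end to end, which is the heart of the "rebuild slower than the next failure" concern.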
Re: "layout" of a six drive raid10
On 2016-02-09 02:02, Kai Krakow wrote:
> Am Tue, 9 Feb 2016 01:42:40 +0000 (UTC)
> schrieb Duncan <1i5t5.dun...@cox.net>:
>> Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1
>> on spinning rust will in practice fully saturate the gigabit Ethernet,
>> particularly as it gets fragmented (which COW filesystems such as
>> btrfs tend to do much more than non-COW ones, unless you're using
>> something like the autodefrag mount option from the get-go, as I do
>> here, tho in that case striping won't necessarily help a lot either).
>> If you're concerned about getting the last bit of performance
>> possible, I'd say raid10, tho over the gigabit ethernet the
>> difference isn't likely to be much.
> If performance is an issue, I suggest putting an SSD and bcache into
> the equation. I have seen very nice performance improvements with
> that, especially with writeback caching (random writes go to bcache
> first, then to the harddisk in background idle time).
>
> Apparently, afaik, it's currently not possible to have native bcache
> redundancy yet - so bcache can only be one SSD. It may be possible to
> use two bcaches and assign the btrfs members to them alternately - tho
> btrfs may decide to put two mirrors on the same bcache then. On the
> other side, you could put bcache on lvm or mdraid - but I would not do
> it. On the bcache list, multiple people have had problems with that,
> including btrfs corruption beyond repair.
>
> On the other hand, you could simply go with bcache writearound caching
> (only reads become cached) or writethrough caching (writes go in
> parallel to bcache and btrfs). If the SSD dies, btrfs will still be
> perfectly safe in this case.
>
> If you are going with one of the latter options, the tuning knobs of
> bcache may help you cache not only random accesses but also linear
> accesses. It should help to saturate a gigabit link. Currently,
> SanDisk offers a pretty cheap (not top-performance) drive with 500GB
> which should perfectly cover this use case.
> Tho, I'm not sure how stable this drive works with bcache. I've only
> checked the Crucial MX100 and Samsung Evo 840 yet - both working very
> stably with the latest kernel and discard enabled, no mdraid or lvm
> involved.

FWIW, the other option if you want good performance and don't want to get an SSD is to run BTRFS in raid1 mode on top of two LVM or MD-RAID RAID0 volumes. I do this regularly for VMs and see a roughly 25-30% performance increase compared to BTRFS raid10 for my workloads, and that's with things laid out such that each block in BTRFS (16k in my case) ends up entirely on one disk in the RAID0 volume. (You could theoretically get better performance by sizing the stripes on the RAID0 volume such that a block from BTRFS gets spread across all the disks in the volume, but that is marginally less safe than forcing each block onto one disk.)
Re: Use fast device only for metadata?
On 2016-02-08 16:44, Nikolaus Rath wrote:
> On Feb 07 2016, Martin Steigerwald wrote:
>> Am Sonntag, 7. Februar 2016, 21:07:13 CET schrieb Kai Krakow:
>>> Am Sun, 07 Feb 2016 11:06:58 -0800 schrieb Nikolaus Rath:
>>>> Hello,
>>>>
>>>> I have a large home directory on a spinning disk that I regularly
>>>> synchronize between different computers using unison. That takes
>>>> ages, even though the amount of changed files is typically small.
>>>> I suspect most of the time is spent walking through the file
>>>> system and checking mtimes.
>>>>
>>>> So I was wondering if I could possibly speed up this operation by
>>>> storing all btrfs metadata on a fast SSD drive. It seems that
>>>> mkfs.btrfs allows me to put the metadata in raid1 or dup mode, and
>>>> the file contents in single mode. However, I could not find a way
>>>> to tell btrfs to use a device *only* for metadata. Is there a way
>>>> to do that?
>>>>
>>>> Also, what is the difference between using "dup" and "raid1" for
>>>> the metadata?
>>> You may want to try bcache. It will speed up random access, which
>>> is probably the main cause of your slow sync. Unfortunately it
>>> requires you to reformat your btrfs partitions to add a bcache
>>> superblock. But it's worth the effort. I use a nightly rsync to a
>>> USB3 disk, and bcache reduced it from 5+ hours to typically 1.5-3
>>> depending on how much data changed.
>> An alternative is using dm-cache; I think it doesn't need to
>> recreate the filesystem.
> Yes, I tried that already but it didn't improve things at all. I
> wrote a message to the lvm list though, so maybe someone will be able
> to help.

That's interesting. I've been using BTRFS on dm-cache for a while, and have seen measurable improvements in performance. They are not big improvements (only about 5% peak), but they're still improvements, which is somewhat impressive considering that the backing storage being cached is a RAID0 set which gets almost the same raw throughput as the SSD that's caching it.
Of course, I'm using it more for the power savings (SSDs use less power, and I've got a big enough cache that I can often spin down the traditional disks in the RAID0 set). I also re-tune my system as hardware and workloads change, and my workloads tend to be atypical (lots of sequential isochronous writes, regular long sequential reads, and some random reads and rewrites), so YMMV.

> Otherwise I'll give bcache a shot. I've avoided it so far because of
> the need to reformat and because of rumours that it doesn't work well
> with LVM or BTRFS. But it sounds as if that's not the case...

It should work fine with _just_ BTRFS, but don't put any other layers into the storage stack like LVM, dm-crypt, or mdraid; bcache still has some pretty pathological interactions with the device-mapper and md frameworks.
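Nikolaus's suspicion that the sync time is dominated by the metadata walk can be made concrete with a minimal scanner (a sketch of the access pattern, not unison's actual algorithm): a no-change sync still stats every inode, and on spinning disks that is exactly the small-random-read workload an SSD metadata cache absorbs.

```python
# Minimal sketch of the metadata walk a unison-style sync performs: stat
# every entry and compare mtimes. Even when zero files changed, every
# inode is read - small random reads that dominate on spinning disks.
import os

def scan_mtimes(root):
    """Return {relative path: mtime} for every file under root."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # One metadata read per inode; this, not data transfer,
            # is the cost of a sync where nothing changed.
            mtimes[os.path.relpath(full, root)] = os.stat(full).st_mtime
    return mtimes

def changed_paths(old, new):
    """Paths whose mtime differs, or that appeared/disappeared."""
    return {p for p in old.keys() | new.keys() if old.get(p) != new.get(p)}
```

Comparing two scans yields the (typically tiny) changed set, but only after paying the full walk cost - which is why caching metadata on an SSD (bcache/dm-cache) helps even when little data moves.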
Re: Use fast device only for metadata?
On Feb 09 2016, Kai Krakow wrote:
> You could even format a bcache superblock "just in case", and add an
> SSD later. Without SSD, bcache will just work in passthru mode.

Do the LVM concerns still apply in passthrough mode, or only when there's an actual cache?

Thanks,
-Nikolaus

--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
Re: Use fast device only for metadata?
On Feb 09 2016, Kai Krakow wrote:
> I'm myself using bcache+btrfs and it ran bullet-proof so far, even
> after unintentional resets or power outages. It's important tho to NOT
> put any storage layer between bcache and your devices, or between
> btrfs and your devices, as there are reports it becomes unstable with
> md or lvm involved.

Do you mean I should not use anything in the stack other than btrfs and bcache, or do you mean I should not put anything under bcache?

In other words, I assume bcache on LVM is a bad idea. But what about LVM on bcache? Also, btrfs on LVM on disk is working fine for me, but you seem to be saying that it should not? Or are you talking specifically about btrfs on LVM on bcache?

If there's no way to put LVM anywhere into the stack, that'd be a bummer; I very much want to use dm-crypt (and I guess that counts as lvm?).

Thanks,
-Nikolaus

--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
[PATCH 1/2] btrfs-progs: copy functionality of btrfs-debug-tree to inspect-internal subcommand
The long-term plan is to merge the features of standalone tools into the
btrfs binary, reducing the number of shipped binaries.

Signed-off-by: Alexander Fougner
---
 Makefile.in              |   2 +-
 btrfs-debug-tree.c       | 424 +---
 cmds-inspect-dump-tree.c | 451 +++
 cmds-inspect-dump-tree.h |  15 ++
 cmds-inspect.c           |   8 +
 5 files changed, 481 insertions(+), 419 deletions(-)
 create mode 100644 cmds-inspect-dump-tree.c
 create mode 100644 cmds-inspect-dump-tree.h

diff --git a/Makefile.in b/Makefile.in
index 19697ff..14dab76 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -70,7 +70,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
 	  extent-cache.o extent_io.o volumes.o utils.o repair.o \
 	  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
 	  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
-	  inode.o file.o find-root.o free-space-tree.o help.o
+	  inode.o file.o find-root.o free-space-tree.o help.o cmds-inspect-dump-tree.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
 	       cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 266176f..057a715 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -30,433 +30,21 @@
 #include "transaction.h"
 #include "volumes.h"
 #include "utils.h"
-
-static int print_usage(int ret)
-{
-	fprintf(stderr, "usage: btrfs-debug-tree [-e] [-d] [-r] [-R] [-u]\n");
-	fprintf(stderr, "[-b block_num ] device\n");
-	fprintf(stderr, "\t-e : print detailed extents info\n");
-	fprintf(stderr, "\t-d : print info of btrfs device and root tree dirs"
-		" only\n");
-	fprintf(stderr, "\t-r : print info of roots only\n");
-	fprintf(stderr, "\t-R : print info of roots and root backups\n");
-	fprintf(stderr, "\t-u : print info of uuid tree only\n");
-	fprintf(stderr, "\t-b block_num : print info of the specified block"
-		" only\n");
-	fprintf(stderr,
-		"\t-t tree_id : print only the tree with the given id\n");
-	fprintf(stderr, "%s\n", PACKAGE_STRING);
-	exit(ret);
-}
-
-static void print_extents(struct btrfs_root *root, struct extent_buffer *eb)
-{
-	int i;
-	u32 nr;
-	u32 size;
-
-	if (!eb)
-		return;
-
-	if (btrfs_is_leaf(eb)) {
-		btrfs_print_leaf(root, eb);
-		return;
-	}
-
-	size = btrfs_level_size(root, btrfs_header_level(eb) - 1);
-	nr = btrfs_header_nritems(eb);
-	for (i = 0; i < nr; i++) {
-		struct extent_buffer *next = read_tree_block(root,
-					btrfs_node_blockptr(eb, i),
-					size,
-					btrfs_node_ptr_generation(eb, i));
-		if (!extent_buffer_uptodate(next))
-			continue;
-		if (btrfs_is_leaf(next) &&
-		    btrfs_header_level(eb) != 1)
-			BUG();
-		if (btrfs_header_level(next) !=
-		    btrfs_header_level(eb) - 1)
-			BUG();
-		print_extents(root, next);
-		free_extent_buffer(next);
-	}
-}
-
-static void print_old_roots(struct btrfs_super_block *super)
-{
-	struct btrfs_root_backup *backup;
-	int i;
-
-	for (i = 0; i < BTRFS_NUM_BACKUP_ROOTS; i++) {
-		backup = super->super_roots + i;
-		printf("btrfs root backup slot %d\n", i);
-		printf("\ttree root gen %llu block %llu\n",
-		       (unsigned long long)btrfs_backup_tree_root_gen(backup),
-		       (unsigned long long)btrfs_backup_tree_root(backup));
-
-		printf("\t\textent root gen %llu block %llu\n",
-		       (unsigned long long)btrfs_backup_extent_root_gen(backup),
-		       (unsigned long long)btrfs_backup_extent_root(backup));
-
-		printf("\t\tchunk root gen %llu block %llu\n",
-		       (unsigned long long)btrfs_backup_chunk_root_gen(backup),
-		       (unsigned long long)btrfs_backup_chunk_root(backup));
-
-		printf("\t\tdevice root gen %llu block %llu\n",
-		       (unsigned long long)btrfs_backup_dev_root_gen(backup),
-		       (unsigned long long)btrfs_backup_dev_root(backup));
-
-		printf("\t\tcsum root gen %llu block %llu\n",
-		       (unsigned long long)btrfs_backup_csum_root_gen(backup),
-		       (unsigned long long)btrfs_backup_csum_root(backup));
-
[PATCH 2/2] btrfs-progs: update docs for inspect-internal dump-tree
Signed-off-by: Alexander Fougner
---
 Documentation/btrfs-debug-tree.asciidoc       |  7 +++
 Documentation/btrfs-inspect-internal.asciidoc | 26 ++
 2 files changed, 33 insertions(+)

diff --git a/Documentation/btrfs-debug-tree.asciidoc b/Documentation/btrfs-debug-tree.asciidoc
index 23fc115..6d6d884 100644
--- a/Documentation/btrfs-debug-tree.asciidoc
+++ b/Documentation/btrfs-debug-tree.asciidoc
@@ -25,8 +25,15 @@ Print detailed extents info.
 Print info of btrfs device and root tree dirs only.
 -r::
 Print info of roots only.
+-R::
+Print info of roots and root backups.
+-u::
+Print info of UUID tree only.
 -b ::
 Print info of the specified block only.
+-t ::
+Print only the tree with the specified ID.
+
 EXIT STATUS
 ---
diff --git a/Documentation/btrfs-inspect-internal.asciidoc b/Documentation/btrfs-inspect-internal.asciidoc
index 1c7c361..25e6b8b 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -67,6 +67,32 @@ inode number 2), but such subvolume does not contain any files anyway
 +
 resolve the absolute path of a the subvolume id 'subvolid'
 
+*dump-tree* [options] ::
+(needs root privileges)
++
+Dump the whole tree of the given device.
+This is useful for analyzing filesystem state or inconsistence and has
+a positive educational effect on understanding the internal structure.
+ is the device file where the filesystem is stored.
++
+`Options`
++
+-e
+Print detailed extents info.
+-d
+Print info of btrfs device and root tree dirs only.
+-r
+Print info of roots only.
+-R
+Print info of roots and root backups.
+-u
+Print info of UUID tree only.
+-b 
+Print info of the specified block only.
+-t 
+Print only the tree with the specified ID.
+
+
 EXIT STATUS
 ---
 *btrfs inspect-internal* returns a zero exit status if it succeeds.
Non zero is returned in case of failure.
--
2.7.1
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
On Tue, Feb 09, 2016 at 02:48:14PM +0100, Christian Rohmann wrote:
> On 02/01/2016 09:52 PM, Chris Murphy wrote:
> >> Would some sort of stracing or profiling of the process help to narrow
> >> down where the time is currently spent and why the balancing is only
> >> running single-threaded?
> > This can't be straced. Someone a lot more knowledgeable than I am
> > might figure out where all the waits are with just a sysrq + t, if it
> > is a hold up in say parity computations. Otherwise perf which is a
> > rabbit hole but perf top is kinda cool to watch. That might give you
> > an idea where most of the cpu cycles are going if you can isolate the
> > workload to just the balance. Otherwise you may end up with noisy
> > data.
>
> My balance run is now working away since 19th of January:
> "885 out of about 3492 chunks balanced (996 considered), 75% left"
>
> So this will take several more WEEKS to finish. Is there really nothing
> anyone here wants me to do or analyze to help finding the root cause of
> this? I mean with this kind of performance there is no way a RAID6 can
> be used in production. Not because the code is not stable or
> functioning, but because regular maintenance like replacing a drive or
> growing an array takes WEEKS in which another maintenance procedure
> could be necessary or, much worse, another drive might have failed.
>
> What I'm saying is: Such a slow RAID6 balance renders the redundancy
> unusable because drives might fail quicker than the potential rebuild
> (read "balance").

I agree, this is bad.

For what it's worth, one of my own filesystems (target for backups, many many files) has apparently become slow enough that it half hangs my system when I'm using it. I've just unmounted it to make sure my overall system performance comes back, and I may have to delete and recreate it.

Sadly, this also means that btrfs still seems to get itself in corner cases that are causing performance issues.
I'm not saying that you did hit this problem, but it is possible.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: [PATCH 21/23] xfs: aio cow tests
On Tue, Feb 09, 2016 at 07:32:15PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:14:01PM -0800, Darrick J. Wong wrote:
> ....
> > +
> > +echo "Check for damage"
> > +_dmerror_unmount
> > +_dmerror_cleanup
> > +_repair_scratch_fs >> "$seqres.full" 2>&1
>
> Are you testing repair here? If so, why doesn't failure matter?
> If not, why do it? Or is _require_scratch_nocheck all that is needed
> here?

Uggghhh, so xfs_repair dumps its regular output to stderr, so the "2>&1" pushes the output to $seqres.full.

The return codes from xfs_repair seem to be:
0: fs is ok now
1: fs is probably broken
2: log needs to be replayed

The return codes from fsck seem to be:
0: no errors found
1: errors fixed
2: errors fixed, reboot required
(etc)

So I guess the way out is to provide a better wrapper for the repair tools, so that _repair_scratch_fs always returns 0 for "fs should be ok now" and nonzero otherwise:

_repair_scratch_fs()
{
	case $FSTYP in
	xfs)
		_scratch_xfs_repair "$@" 2>&1
		res=$?
		if [ "$res" -eq 2 ]; then
			echo "xfs_repair returns $res; replay log?"
			_scratch_mount
			res=$?
			if [ "$res" -gt 0 ]; then
				echo "mount returns $res; zap log?"
				_scratch_xfs_repair -L 2>&1
				echo "log zap returns $?"
			else
				umount "$SCRATCH_MNT"
			fi
			_scratch_xfs_repair "$@" 2>&1
			res=$?
		fi
		test $res -ne 0 && >&2 echo "xfs_repair failed, err=$res"
		return $res
		;;
	*)
		# Let's hope fsck -y suffices...
		fsck -t $FSTYP -y $SCRATCH_DEV 2>&1
		res=$?
		case $res in
		0|1|2)
			res=0
			;;
		*)
			>&2 echo "fsck.$FSTYP failed, err=$res"
			;;
		esac
		return $res
		;;
	esac
}

> > +echo "CoW and unmount"
> > +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * bsz)) 1" "$testdir/file2" >> "$seqres.full"
> > +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 -b $((blksz * bsz)) 0 $((blksz * nr))" "$TEST_DIR/moo" >> "$seqres.full"
>
> offset = block size times block size?
>
> I think some better names might be needed...

Yes. It is now "bufnr", with bufsize=$((blksz * bufnr)).

--D

> Cheers,
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
>
> ___
> xfs mailing list
> x...@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
Re: USB memory sticks wear & speed: btrfs vs f2fs?
Am Tue, 9 Feb 2016 09:59:12 -0500 schrieb "Austin S. Hemmelgarn":
> > I haven't found much reference or comparison information online wrt
> > wear leveling - mostly performance benchmarks that don't really
> > address your request. Personally I will likely never bother with
> > f2fs unless I somehow end up working on a project requiring
> > relatively small storage in Flash (as that is what f2fs was
> > designed for).
> I would tend to agree, but that's largely because BTRFS is more of a
> known entity for me, and certain features (send/receive in
> particular) are important enough for my usage that I'm willing to
> take the performance hit. IIRC, F2FS was developed for usage in
> stuff like Android devices and other compact embedded devices, where
> the FTL may not do a good job of wear leveling, so it should work
> equally well on USB flash drives (many of the cheap ones have no
> wear-leveling at all, and even some of the expensive ones have
> sub-par wear-leveling compared to good SSD's).

Actually, I think most of them only do wear-levelling in the storage area where the FAT is expected - making them pretty useless for anything other than FAT formatting...

I think the expected use case for USB flash drives is only adding files and occasionally deleting them - or just deleting everything / reformatting. You're not expected to actually "work" with files on such drives, and most of them have pretty bad performance for such usage patterns anyway.

It's actually pretty easy to wear out such a drive within a few days. I tried it myself with a drive sold as "ReadyBoost-capable" - yeah, it took me two weeks to wear it out after activating ReadyBoost on it, and it took only a few days to make its performance crawl. It's just slow now and full of unusable blocks.

--
Regards,
Kai

Replies to list-only preferred.
Re: [PATCH] fstests: btrfs, test for send with clone operations
On Thu, Feb 4, 2016 at 9:21 PM, Dave Chinner wrote:
> On Thu, Feb 04, 2016 at 12:11:28AM +0000, fdman...@kernel.org wrote:
>> From: Filipe Manana
>>
>> Test that an incremental send operation which issues clone operations
>> works for files that have a full path containing more than one parent
>> directory component.
>>
>> This used to fail before the following patch for the linux kernel:
>>
>> "[PATCH] Btrfs: send, fix extent buffer tree lock assertion failure"
>>
>> Signed-off-by: Filipe Manana
>
> Looks ok, I've pulled it in. Something to think about:
>
>> +# Create a bunch of small and empty files, this is just to make sure our
>> +# subvolume's btree gets more than 1 leaf, a condition necessary to trigger a
>> +# past bug (1000 files is enough even for a leaf/node size of 64K, the largest
>> +# possible size).
>> +for ((i = 1; i <= 1000; i++)); do
>> +	echo -n > $SCRATCH_MNT/a/b/c/z_$i
>> +done
>
> We already do have a generic function for doing this called
> _populate_fs(), it's just not optimised for speed with large numbers
> of files being created.
>
> i.e. The above is simply a single directory tree with a single level
> with 1000 files of size 0:
>
> _populate_fs() -d 1 -n 1 -f 1000 -s 0 -r $SCRATCH_MNT/a/b/
>
> Can you look into optimising _populate_fs() to use multiple threads
> (say up to 4 by default) and "echo -n" to create files, and then
> convert all our open-coded "create lots of files" loops in tests to
> use it?

Sure, I'll take a look at it when I get some spare time. Thanks Dave.

> Cheers,
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
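Dave's suggestion - spread the empty-file creation over a few workers instead of one sequential loop - can be sketched outside the fstests shell helpers. This is a hypothetical stand-in for an optimized _populate_fs, not the real implementation:

```python
# Sketch of the suggested _populate_fs optimisation: create N empty files
# using a small worker pool instead of one sequential loop. Hypothetical
# stand-in for the fstests shell helper, not the actual implementation.
import os
from concurrent.futures import ThreadPoolExecutor

def populate_empty_files(root, count, workers=4):
    """Create zero-byte files z_1 .. z_count under root."""
    os.makedirs(root, exist_ok=True)

    def touch(i):
        # Equivalent of `echo -n > $root/z_$i`: create/truncate, write nothing.
        with open(os.path.join(root, "z_%d" % i), "w"):
            pass

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator, forcing completion and surfacing
        # any exception raised by a worker.
        list(pool.map(touch, range(1, count + 1)))
```

With count=1000 this recreates the test's precondition (enough items to force the subvolume btree past a single leaf even at a 64K node size); the parallelism mainly hides per-file syscall latency.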
Re: Use fast device only for metadata?
On Feb 09 2016, Kai Krakow wrote:
>> If there's no way to put LVM anywhere into the stack that'd be a
>> bummer, I very much want to use dm-crypt (and I guess that counts as
>> lvm?).
>
> Wasn't there a plan for integrating per-file encryption into btrfs
> (like there already is for ext4)? I think this could pretty well
> obsolete your plans - except if you prefer full-device encryption.

Well, it could obsolete it once the plan turns into an implementation, but not today :-).

> If you don't put encryption below the bcache caching device, everything
> going to the cache won't be encrypted - so that's probably what you
> are having to do anyways.

No, I could put separate encryption layers between bcache and the disk - for both the backing and the caching device.

> But I don't know how such a setup recovers from power outage, I'm not
> familiar with dm-crypt at all, how it integrates with maybe initrd
> etc.

Initrd is not a concern. You can put on it whatever is needed to set up the stack. As far as power outages are concerned, I think dm-crypt doesn't change anything - it's an intermediate layer with no caching. Any write gets passed through synchronously.

> The caching device is treated as dirty always. That means, it replays
> all dirty data automatically during device discovery. Backing and
> caching create a unified pair - that's why the superblock is needed.
> It saves you from accidentally using the backing without the cache.
> So even after an unclean shutdown, from the user-space view, the pair
> is always consistent. Bcache will only remove persisted data from its
> log once it has ensured it was written correctly to the backing. The
> backing on its own, however, is not guaranteed to be consistent at
> any time - except if you cleanly stop bcache and disconnect the pair
> (detach the cache).
>
> When dm-crypt comes in, I'm not sure how this is handled - given that
> the encryption key must be loaded from somewhere... Someone else may
> have a better clue here.
The encryption keys are supplied by userspace when setting up the device.

> So actually there's two questions:
>
> 1. Which order of stacking makes more sense and is more resilient to
> errors?

I think in an ideal world (i.e., no software bugs), inserting dm-crypt anywhere in the stack will not make a difference at all, even when there is a crash. Thus...

> 2. Which order of stacking is exhibiting bugs?

...indeed becomes the important question. Now if only someone had an answer :-).

Best,
-Nikolaus

--
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

»Time flies like an arrow, fruit flies like a Banana.«
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
On Tue, Feb 9, 2016 at 6:48 AM, Christian Rohmann wrote:
> On 02/01/2016 09:52 PM, Chris Murphy wrote:
>>> Would some sort of stracing or profiling of the process help to narrow
>>> down where the time is currently spent and why the balancing is only
>>> running single-threaded?
>> This can't be straced. Someone a lot more knowledgeable than I am
>> might figure out where all the waits are with just a sysrq + t, if it
>> is a hold up in say parity computations. Otherwise perf which is a
>> rabbit hole but perf top is kinda cool to watch. That might give you
>> an idea where most of the cpu cycles are going if you can isolate the
>> workload to just the balance. Otherwise you may end up with noisy
>> data.
>
> My balance run is now working away since 19th of January:
> "885 out of about 3492 chunks balanced (996 considered), 75% left"
>
> So this will take several more WEEKS to finish. Is there really nothing
> anyone here wants me to do or analyze to help finding the root cause of
> this?

Can you run 'perf top', let it run for a few minutes, then copy/paste or screenshot it somewhere? I'll definitely say in advance this is just a matter of curiosity about where the kernel is spending all of its time, given that this is going so slowly. In no way can I imagine being able to help fix it.

I'm a bit surprised there's no dev response; maybe try the IRC channel?

Weeks is just too long. My concern is, if there's a drive failure: (a) what state is the fs going to be in, and (b) will device replace be this slow too? I'd expect the code path for balance and replace to be the same, so I suspect yes.

> I mean with this kind of performance there is no way a RAID6 can
> be used in production. Not because the code is not stable or
> functioning, but because regular maintenance like replacing a drive or
> growing an array takes WEEKS in which another maintenance procedure
> could be necessary or, much worse, another drive might have failed.

That's right.
In my dummy test, which should have run slower than your setup, the other differences on my end were: elevator=noop ## because I'm running an SSD, and kernel 4.5rc0.

I could redo my test, using 'perf top' also, and see if there's any glaring difference in where the kernel is spending its time on a system pushing the block device to its max write ability, vs ones that aren't. I don't have any other ideas. I'd rather a developer say, "try this" to gather more useful information, rather than just poking things with a random stick.

-- Chris Murphy
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Use fast device only for metadata?
Am Tue, 09 Feb 2016 08:10:15 -0800 schrieb Nikolaus Rath:
> On Feb 09 2016, Kai Krakow wrote:
> > You could even format a bcache superblock "just in case", and add an SSD later. Without SSD, bcache will just work in passthru mode.
> Do the LVM concerns still apply in passthrough mode, or only when there's an actual cache?

I don't think anyone ever tried... But I think there's actually not much logic involved in passthru mode; still, it would pass through the bcache layer - where the bugs may be. It may be worth stress testing such a setup first, then do your backups (which you should do anyways when using btrfs, so this is more or less a no-op).

There may even be differences if backing is on lvm, or if caching is on lvm, and the order of layering (bcache+lvm+btrfs, or lvm+bcache+btrfs). I think you may find some more details with the search machine of your preference. I remember there were actually some posts detailing exactly this - including some mid-term experience with such a setup.

Whatever you find, passthru mode is probably the easiest path with regard to code complexity, so it may not reproduce bugs others found. You may want to try to reproduce exactly their situations but just using passthru mode and see if it works. I suspect the hardware storage stack may also play its role (SSD firmware, SATA/RAID chipset, trim support on/off, NCQ support, ...)

-- Regards, Kai
Replies to list-only preferred.
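Kai's "format a bcache superblock just in case" idea could be sketched as below. This is only an illustration under assumptions: device names are placeholders, and the run() wrapper merely echoes each command instead of executing it (the real commands need root and bcache-tools, and the attach step needs the actual cache set UUID in place of the <cset-uuid> placeholder).

```shell
# Dry-run sketch: create a bcache backing device now, attach an SSD later.
# run() only echoes; replace its body with "$@" to really execute (root required).
run() { echo "$@"; }

run make-bcache -B /dev/sdb      # backing device only; no cache attached = passthru mode
run mkfs.btrfs /dev/bcache0      # the filesystem lives on the bcache device from day one

# Later, once an SSD is available:
run make-bcache -C /dev/sdc                # format the SSD as a caching device
run bcache-super-show /dev/sdc             # read off its cset UUID, then attach it:
run sh -c 'echo <cset-uuid> > /sys/block/bcache0/bcache/attach'
```

Because the filesystem was created on /dev/bcache0 rather than on the raw partition, no reformat is needed when the cache is attached later.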
Re: Use fast device only for metadata?
Am Tue, 09 Feb 2016 08:09:20 -0800 schrieb Nikolaus Rath:
> On Feb 09 2016, Kai Krakow wrote:
> > I'm myself using bcache+btrfs and it ran bullet proof so far, even after unintentional resets or power outage. It's important tho to NOT put any storage layer between bcache and your devices or between btrfs and your device as there are reports it becomes unstable with md or lvm involved.
> Do you mean I should not use anything in the stack other than btrfs and bcache, or do you mean I should not put anything under bcache?

I never tried, I just use rawdevice+bcache+btrfs. Nothing stacked below or in between. This works for me.

> In other words, I assume bcache on LVM is a bad idea. But what about LVM on bcache? I think it makes a difference.
> Also, btrfs on LVM on disk is working fine for me, but you seem to be saying that it should not? Or are you talking specifically about btrfs on LVM on bcache?

Btrfs alone should be no problem. Any combination of all three could get you in trouble. I suggest doing your tests and keeping it as simple as it can be.

> If there's no way to put LVM anywhere into the stack that'd be a bummer, I very much want to use dm-crypt (and I guess that counts as lvm?).

Wasn't there plans for integrating per-file encryption into btrfs (like there's already for ext4)? I think this could pretty well obsolete your plans - except you prefer full-device encryption. If you don't put encryption below the bcache caching device, everything going to the cache won't be encrypted - so that's probably what you are having to do anyways. But I don't know how such a setup recovers from power outage, I'm not familiar with dm-crypt at all, how it integrates with maybe initrd etc.

But to get a bigger picture, let me explain how bcache works: The caching device is treated dirty always. That means, it replays all dirty data automatically during device discovery. Backing and caching create a unified pair - that's why the superblock is needed.
It saves you from accidentally using the backing without the cache. So even after unclean shutdown, from the user-space view, the pair is always consistent. Bcache will only remove persisted data from its log once it has ensured it was written correctly to the backing. The backing on its own, however, is not guaranteed to be consistent at any time - except if you cleanly stop bcache and disconnect the pair (detach the cache). When dm-crypt comes in, I'm not sure how this is handled - given that the encryption key must be loaded from somewhere... Someone else may have a better clue here.

So actually there's two questions:
1. Which order of stacking makes more sense and is more resilient to errors?
2. Which order of stacking is exhibiting bugs?

-- Regards, Kai
Replies to list-only preferred.
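The "cleanly stop bcache and detach the cache" step described above might look roughly like the following. This is a hedged, dry-run sketch: bcache0 is a placeholder, run() only echoes the commands, and the exact value accepted by the sysfs detach file may vary by kernel version (the paths follow the kernel's bcache documentation).

```shell
# Dry-run sketch: make the backing device self-consistent before using it
# without its cache. run() echoes instead of executing (real use needs root).
run() { echo "$@"; }

# How much dirty data is still waiting to be flushed to the backing device:
run cat /sys/block/bcache0/bcache/dirty_data

# Switch to writethrough first so no new dirty data accumulates in the cache:
run sh -c 'echo writethrough > /sys/block/bcache0/bcache/cache_mode'

# Detach the cache; bcache flushes remaining dirty data before completing:
run sh -c 'echo 1 > /sys/block/bcache0/bcache/detach'
```

Only after the detach completes (dirty_data reaches zero) is the backing device on its own in a consistent state, matching the guarantee described in the mail above.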
Re: Use fast device only for metadata?
On Tue, Feb 9, 2016 at 2:43 PM, Kai Krakow wrote:
> Wasn't there plans for integrating per-file encryption into btrfs (like there's already for ext4)? I think this could pretty well obsolete your plans - except you prefer full-device encryption.

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Encryption

I don't know whether the ZFS strategy (it would be per subvolume on Btrfs) or the per-directory strategy of ext4 is simpler. The simpler it is, the more viable it is, I feel. Maybe it's too much of a tonka toy to only encrypt file data, not metadata (?) - a question for someone more security conscious, but I'd rather have some level of integrated encryption than none.

So I wonder if encryption could be a compression option - that is, it'd fit into the compression code path and instead of compressing, it'd encrypt. I guess the bigger problem then is user space tools to manage keys. For booting, there'd need to be a libbtrfs api or ioctl for systemd+plymouth to get the passphrase from the user. And for home, it actually can't be in the startup process at all, it has to be integrated into the desktop, using the user login passphrase to unlock a KEK, and from there the DEK. The whole point of per-directory encryption is, a bunch of stuff remains encrypted.

If it were treated as a variation on compression, specifically a variant of forced compression, it means no key is needed to do balance, scrub, device replace, etc, and even inline data gets encrypted also. Open question if the metadata slot for compression is big enough to include something like a key uuid, because each dir item (at least) needs to point to the key needed to decrypt the data. Hmm, or maybe a new tree to contain and track the encryption keys meant for each dir item.

-- Chris Murphy
Re: Use fast device only for metadata?
On Tue, Feb 9, 2016 at 8:29 AM, Kai Krakow wrote:
> Am Mon, 08 Feb 2016 13:44:17 -0800 schrieb Nikolaus Rath:
>> On Feb 07 2016, Martin Steigerwald wrote:
>> > Am Sonntag, 7. Februar 2016, 21:07:13 CET schrieb Kai Krakow:
>> >> Am Sun, 07 Feb 2016 11:06:58 -0800 schrieb Nikolaus Rath:
>> >> > Hello,
>> >> > I have a large home directory on a spinning disk that I regularly synchronize between different computers using unison. That takes ages, even though the amount of changed files is typically small. I suspect most of the time is spent walking through the file system and checking mtimes.
>> >> > So I was wondering if I could possibly speed up this operation by storing all btrfs metadata on a fast SSD drive. It seems that mkfs.btrfs allows me to put the metadata in raid1 or dup mode, and the file contents in single mode. However, I could not find a way to tell btrfs to use a device *only* for metadata. Is there a way to do that?
>> >> > Also, what is the difference between using "dup" and "raid1" for the metadata?
>> >> You may want to try bcache. It will speed up random access which is probably the main cause for your slow sync. Unfortunately it requires you to reformat your btrfs partitions to add a bcache superblock. But it's worth the effort.
>> >> I use a nightly rsync to a USB3 disk, and bcache reduced it from 5+ hours to typically 1.5-3 depending on how much data changed.
>> > An alternative is using dm-cache, I think it doesn't need to recreate the filesystem.
>> Yes, I tried that already but it didn't improve things at all. I wrote a message to the lvm list though, so maybe someone will be able to help.
>> Otherwise I'll give bcache a shot. I've avoided it so far because of the need to reformat and because of rumours that it doesn't work well with LVM or BTRFS.
But it sounds as if that's not the case..

> I'm myself using bcache+btrfs and it ran bullet proof so far, even after unintentional resets or power outage. It's important tho to NOT put any storage layer between bcache and your devices or between btrfs and your device as there are reports it becomes unstable with md or lvm involved. In my setup I can even use discard/trim without problems. I'd recommend a current kernel, tho.
> Since it requires reformatting, it's a big pita but it's worth the effort. It appeared, from its design, much more effective and stable than dmcache. You could even format a bcache superblock "just in case", and add an SSD later. Without SSD, bcache will just work in passthru mode. Actually, I started to format all my storage with a bcache superblock "just in case". It is similar to having another partition table folded inside - so it doesn't hurt (except you need bcache-probe in initrd to detect the contained filesystems).

Same positive bcache+BTRFS experience for me, I have been using it since kernel 4.1.6 and now just the latest 4.4. Especially now it is possible to use VM images in normal CoW mode with speed/performance comparable to the image on SSD. This is with 50G images consisting of about 50k extents, raid10 btrfs with mount options noatime,nossd,autodefrag and writeback on. The initial number of extents was on the order of 100 or so, but later small writes inside the VM almost all end up in the bcache. Nightly incremental send|receive is just a few minutes. Kernel compile from a local git repo clone almost works like from SSD.

When both the RAM cache is invalidated and bcache is detached / stopped / not there, filesystem finds or operations that have to deal with fragmentation or a lot of seeks clearly take way more time. From there, after starting and using an OS in a VM for let's say 10 minutes for common tasks, speed is 'SSD like' and not 'HDD like' anymore and stays that way (until eviction of blocks of course).
The 'reformatting' might be avoided by using this: https://github.com/g2p/blocks

I haven't used it myself as one fs was just a full harddisk and my python installations had some issues. I wanted to keep the same UUID (due to a long-term incremental send|receive cloning setup), so I shrank the filesystem to almost its smallest possible size and then used an extra device (4TB) to dd_rescue the fs image onto, and then as a 2nd step dd_rescue'd it back to the original disk (to a partition that is bcache'd). A btrfs replace would have also been an option. Or some 2-step add-remove action or tricks with raid1.

For another disk I did not have a spare disk, so I made a script to do an 'in-place' filesystem image replace. I have browsed the superblocks (don't remember the size, but it's a few kB AFAIK), so 1G copyblocksize is huge enough, and keeping at least 2 copyblocks readahead stored on intermediate storage worked fine. Same can be used for LUKS header
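The two-step move with a spare disk described above could be sketched as follows. All device paths, mount points, and the 3500G shrink target are hypothetical placeholders, and run() only echoes the commands instead of executing them; a real run would of course be done unmounted where required and with good backups.

```shell
# Dry-run sketch: shrink, copy out, add bcache superblock, copy back, grow.
# run() echoes instead of executing (the real steps need root).
run() { echo "$@"; }

# 1. Shrink the filesystem close to its minimum so the image to move is small:
run btrfs filesystem resize 3500G /mnt/data
# 2. Copy the shrunken filesystem image onto a spare disk:
run dd_rescue /dev/sdX1 /mnt/spare/fs.img
# 3. Re-create the original partition as a bcache backing device:
run make-bcache -B /dev/sdX1
# 4. Copy the image back onto the new bcache device (same fs UUID preserved):
run dd_rescue /mnt/spare/fs.img /dev/bcache0
# 5. Mount and grow the filesystem back to the full device:
run btrfs filesystem resize max /mnt/data
```

The UUID survives because the filesystem image is copied byte-for-byte; only the container underneath it changes.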
Re: [PATCH 18/23] xfs: test the automatic cowextsize extent garbage collector
On Tue, Feb 09, 2016 at 07:15:47PM +1100, Dave Chinner wrote: > On Mon, Feb 08, 2016 at 05:13:42PM -0800, Darrick J. Wong wrote: > > Signed-off-by: Darrick J. Wong> > + > > +_cleanup() > > +{ > > +cd / > > +echo $old_cow_lifetime > > > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime > > +#rm -rf "$tmp".* "$testdir" > > uncomment. > > > +echo "CoW and leave leftovers" > > +echo $old_cow_lifetime > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime > > +seq 2 2 $((nr - 1)) | while read f; do > > + "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" "$testdir/file2" > > >> "$seqres.full" > > + "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" > > "$testdir/file2.chk" >> "$seqres.full" > > +done > > Ok, I just realised what was bugging me about these loops: "f" is > not a typical loop iterator for a count. Normally we'd use "i" for > these > > > +echo "old extents: $old_extents" >> "$seqres.full" > > +echo "new extents: $new_extents" >> "$seqres.full" > > +echo "maximum extents: $internal_blks" >> "$seqres.full" > > +test $new_extents -lt $((internal_blks / 7)) || _fail "file2 badly > > fragmented" > > I wouldn't use _fail like this, echo is sufficient to cause the test > to fail. Ok, fixed. --D > > +echo "Check for damage" > > +umount "$SCRATCH_MNT" > > + > > +# success, all done > > +status=0 > > +exit > > As would getting rid of the unmount and just setting status > appropriately... > > /repeat > > -Dave. > -- > Dave Chinner > da...@fromorbit.com > > ___ > xfs mailing list > x...@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs
Re: [PATCH 19/23] xfs: test rmapbt functionality
On Tue, Feb 09, 2016 at 07:26:40PM +1100, Dave Chinner wrote: > On Mon, Feb 08, 2016 at 05:13:48PM -0800, Darrick J. Wong wrote: > > Signed-off-by: Darrick J. Wong> > --- > > common/xfs| 44 ++ > > tests/xfs/233 | 78 ++ > > tests/xfs/233.out |6 +++ > > tests/xfs/234 | 89 > > tests/xfs/234.out |6 +++ > > tests/xfs/235 | 108 > > + > > tests/xfs/235.out | 14 +++ > > tests/xfs/236 | 93 ++ > > tests/xfs/236.out |8 > > tests/xfs/group |4 ++ > > 10 files changed, 450 insertions(+) > > create mode 100644 common/xfs > > create mode 100755 tests/xfs/233 > > create mode 100644 tests/xfs/233.out > > create mode 100755 tests/xfs/234 > > create mode 100644 tests/xfs/234.out > > create mode 100755 tests/xfs/235 > > create mode 100644 tests/xfs/235.out > > create mode 100755 tests/xfs/236 > > create mode 100644 tests/xfs/236.out > > > > > > diff --git a/common/xfs b/common/xfs > > new file mode 100644 > > index 000..2d1a76f > > --- /dev/null > > +++ b/common/xfs > > @@ -0,0 +1,44 @@ > > +##/bin/bash > > +# Routines for handling XFS > > +#--- > > +# Copyright (c) 2015 Oracle. All Rights Reserved. > > +# This program is free software; you can redistribute it and/or modify > > +# it under the terms of the GNU General Public License as published by > > +# the Free Software Foundation; either version 2 of the License, or > > +# (at your option) any later version. > > +# > > +# This program is distributed in the hope that it will be useful, > > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > +# GNU General Public License for more details. 
> > +# > > +# You should have received a copy of the GNU General Public License > > +# along with this program; if not, write to the Free Software > > +# Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 > > +# USA > > +# > > +# Contact information: Oracle Corporation, 500 Oracle Parkway, > > +# Redwood Shores, CA 94065, USA, or: http://www.oracle.com/ > > +#--- > > + > > +_require_xfs_test_rmapbt() > > +{ > > + _require_test > > + > > + if [ "$(xfs_info "$TEST_DIR" | grep -c "rmapbt=1")" -ne 1 ]; then > > + _notrun "rmapbt not supported by test filesystem type: $FSTYP" > > + fi > > +} > > + > > +_require_xfs_scratch_rmapbt() > > +{ > > + _require_scratch > > + > > + _scratch_mkfs > /dev/null > > + _scratch_mount > > + if [ "$(xfs_info "$SCRATCH_MNT" | grep -c "rmapbt=1")" -ne 1 ]; then > > + _scratch_unmount > > + _notrun "rmapbt not supported by scratch filesystem type: > > $FSTYP" > > + fi > > + _scratch_unmount > > +} > > No, not yet. :) > > Wait until I get my "split common/rc" patchset out there, because it > does not require: Ok, I moved all the common/xfs stuff back to common/rc. > > > +# get standard environment, filters and checks > > +. ./common/rc > > +. ./common/filter > > +. ./common/xfs > > This. > > And i don't want to have to undo a bunch of stuff in tests yet. Just > lump it all in common/rc for the moment. > > > + > > +# real QA test starts here > > +_supported_os Linux > > +_supported_fs xfs > > +_require_xfs_scratch_rmapbt > > + > > +echo "Format and mount" > > +_scratch_mkfs -d size=$((2 * 4096 * 4096)) -l size=4194304 > > > "$seqres.full" 2>&1 > > +_scratch_mount >> "$seqres.full" 2>&1 > > _scratch_mkfs_sized ? Done. > > > +here=`pwd` > > +tmp=/tmp/$$ > > +status=1 # failure is the default! > > +trap "_cleanup; exit \$status" 0 1 2 3 15 > > + > > +_cleanup() > > +{ > > +cd / > > +#rm -f $tmp.* > > More random uncommenting needed. 
> > > + > > +echo "Check for damage" > > +umount "$SCRATCH_MNT" > > +_check_scratch_fs > > + > > +# success, all done > > +status=0 > > +exit > > Cull. Done --D > > -Dave. > -- > Dave Chinner > da...@fromorbit.com > > ___ > xfs mailing list > x...@oss.sgi.com > http://oss.sgi.com/mailman/listinfo/xfs
Re: [PATCH 17/23] reflink: test CoW across a mixed range of block types with cowextsize set
On Tue, Feb 09, 2016 at 07:09:23PM +1100, Dave Chinner wrote: > On Mon, Feb 08, 2016 at 05:13:35PM -0800, Darrick J. Wong wrote: > > Signed-off-by: Darrick J. Wong> > --- > > tests/xfs/215 | 108 ++ > > tests/xfs/215.out | 14 + > > tests/xfs/218 | 108 ++ > > tests/xfs/218.out | 14 + > > tests/xfs/219 | 108 ++ > > tests/xfs/219.out | 14 + > > tests/xfs/221 | 108 ++ > > tests/xfs/221.out | 14 + > > tests/xfs/223 | 113 > > tests/xfs/223.out | 14 + > > tests/xfs/224 | 113 > > tests/xfs/224.out | 14 + > > tests/xfs/225 | 108 ++ > > tests/xfs/225.out | 14 + > > tests/xfs/226 | 108 ++ > > tests/xfs/226.out | 14 + > > tests/xfs/228 | 137 > > + > > tests/xfs/228.out | 14 + > > tests/xfs/230 | 137 > > + > > tests/xfs/230.out | 14 + > > tests/xfs/group | 10 > > 21 files changed, 1298 insertions(+) > > create mode 100755 tests/xfs/215 > > create mode 100644 tests/xfs/215.out > > create mode 100755 tests/xfs/218 > > create mode 100644 tests/xfs/218.out > > create mode 100755 tests/xfs/219 > > create mode 100644 tests/xfs/219.out > > create mode 100755 tests/xfs/221 > > create mode 100644 tests/xfs/221.out > > create mode 100755 tests/xfs/223 > > create mode 100644 tests/xfs/223.out > > create mode 100755 tests/xfs/224 > > create mode 100644 tests/xfs/224.out > > create mode 100755 tests/xfs/225 > > create mode 100644 tests/xfs/225.out > > create mode 100755 tests/xfs/226 > > create mode 100644 tests/xfs/226.out > > create mode 100755 tests/xfs/228 > > create mode 100644 tests/xfs/228.out > > create mode 100755 tests/xfs/230 > > create mode 100644 tests/xfs/230.out > > > > > > diff --git a/tests/xfs/215 b/tests/xfs/215 > > new file mode 100755 > > index 000..8dd5cb5 > > --- /dev/null > > +++ b/tests/xfs/215 > > @@ -0,0 +1,108 @@ > > +#! /bin/bash > > +# FS QA Test No. 215 > > +# > > +# Ensuring that copy on write in direct-io mode works when the CoW > > +# range originally covers multiple extents, some unwritten, some not. > > +# - Set cowextsize hint. 
> > +# - Create a file and fallocate a second file. > > +# - Reflink the odd blocks of the first file into the second file. > > +# - directio CoW across the halfway mark, starting with the unwritten > > extent. > > +# - Check that the files are now different where we say they're > > different. > > +# > > +#--- > > +# Copyright (c) 2016, Oracle and/or its affiliates. All Rights Reserved. > > +# > > +# This program is free software; you can redistribute it and/or > > +# modify it under the terms of the GNU General Public License as > > +# published by the Free Software Foundation. > > +# > > +# This program is distributed in the hope that it would be useful, > > +# but WITHOUT ANY WARRANTY; without even the implied warranty of > > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > > +# GNU General Public License for more details. > > +# > > +# You should have received a copy of the GNU General Public License > > +# along with this program; if not, write the Free Software Foundation, > > +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA > > +#--- > > + > > +seq=`basename "$0"` > > +seqres="$RESULT_DIR/$seq" > > +echo "QA output created by $seq" > > + > > +here=`pwd` > > +tmp=/tmp/$$ > > +status=1# failure is the default! > > +trap "_cleanup; exit \$status" 0 1 2 3 15 > > + > > +_cleanup() > > +{ > > +cd / > > +rm -rf "$tmp".* > > +} > > + > > +# get standard environment, filters and checks > > +. ./common/rc > > +. ./common/filter > > +. 
./common/reflink > > + > > +# real QA test starts here > > +_supported_os Linux > > +_require_scratch_reflink > > +_require_xfs_io_command "falloc" > > + > > +rm -f "$seqres.full" > > + > > +echo "Format and mount" > > +_scratch_mkfs > "$seqres.full" 2>&1 > > +_scratch_mount >> "$seqres.full" 2>&1 > > + > > +testdir="$SCRATCH_MNT/test-$seq" > > +rm -rf $testdir > > +mkdir $testdir > > + > > +echo "Create the original files" > > +blksz=65536 > > +nr=64 > > +real_blksz=$(stat -f -c '%S' "$testdir") > > +internal_blks=$((blksz * nr / real_blksz)) > > +"$XFS_IO_PROG" -c "cowextsize $((blksz * 16))" "$testdir" >> "$seqres.full" > > +_pwrite_byte 0x61 0 $((blksz * nr)) "$testdir/file1" >> "$seqres.full" > > +$XFS_IO_PROG -f -c "falloc 0
Re: [PATCH 13/23] xfs: test fragmentation characteristics of copy-on-write
On Tue, Feb 09, 2016 at 07:01:44PM +1100, Dave Chinner wrote: > On Mon, Feb 08, 2016 at 05:13:09PM -0800, Darrick J. Wong wrote: > > Perform copy-on-writes at random offsets to stress the CoW allocation > > system. Assess the effectiveness of the extent size hint at > > combatting fragmentation via unshare, a rewrite, and no-op after the > > random writes. > > > > Signed-off-by: Darrick J. Wong> > > +seq=`basename "$0"` > > +seqres="$RESULT_DIR/$seq" > > +echo "QA output created by $seq" > > + > > +here=`pwd` > > +tmp=/tmp/$$ > > +status=1# failure is the default! > > +trap "_cleanup; exit \$status" 0 1 2 3 15 > > + > > +_cleanup() > > +{ > > +cd / > > +#rm -rf "$tmp".* "$testdir" > > Now that I've noticed it, a few tests have this line commented out. > Probably should remove the tmp files, at least. Done. > > +rm -f "$seqres.full" > > + > > +echo "Format and mount" > > +_scratch_mkfs > "$seqres.full" 2>&1 > > +_scratch_mount >> "$seqres.full" 2>&1 > > + > > +testdir="$SCRATCH_MNT/test-$seq" > > +rm -rf $testdir > > +mkdir $testdir > > Again, something that is repeated - we just mkfs'd the scratch > device, so the $testdir is guaranteed not to exist... I've done that to the new tests, will do to the existing ones. > > +echo "Check for damage" > > +umount "$SCRATCH_MNT" > > I've also noticed this in a lot of tests - the scratch device will > be unmounted by the harness, so I don't think this is necessary Done. > > +free_blocks=$(stat -f -c '%a' "$testdir") > > +real_blksz=$(stat -f -c '%S' "$testdir") > > +space_needed=$(((blksz * nr * 3) * 5 / 4)) > > +space_avail=$((free_blocks * real_blksz)) > > +internal_blks=$((blksz * nr / real_blksz)) > > +test $space_needed -gt $space_avail && _notrun "Not enough space. > > $space_avail < $space_needed" > > Why not: > > _require_fs_space $space_needed > > At minimum, it seems to be a repeated hunk of code, so it should be > factored. Ok, done.
> > +testdir="$SCRATCH_MNT/test-$seq" > > +rm -rf $testdir > > +mkdir $testdir > > + > > +echo "Create the original files" > > +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 0" "$testdir/file1" >> > > "$seqres.full" > > +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 1048576" "$testdir/file2" >> > > "$seqres.full" > > +_scratch_remount > > + > > +echo "Set extsz and cowextsz on zero byte file" > > +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file1" | _filter_scratch > > +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file1" | > > _filter_scratch > > + > > +echo "Set extsz and cowextsz on 1Mbyte file" > > +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file2" | _filter_scratch > > +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file2" | > > _filter_scratch > > +_scratch_remount > > + > > +fn() { > > + "$XFS_IO_PROG" -c "$1" "$2" | sed -e 's/.\([0-9]*\).*$/\1/g' > > +} > > +echo "Check extsz and cowextsz settings on zero byte file" > > +test $(fn extsize "$testdir/file1") -eq 1048576 || echo "file1 extsize not > > set" > > +test $(fn cowextsize "$testdir/file1") -eq 1048576 || echo "file1 > > cowextsize not set" > > For this sort of thing, just dump the extent size value to the > golden output. i.e. > > echo "Check extsz and cowextsz settings on zero byte file" > $XFS_IO_PROG -c extsize $testdir/file1 > $XFS_IO_PROG -c cowextsize $testdir/file1 > > is all that is needed. that way if it fails, we see what value it > had instead of the expected 1MB. This also makes the test much less > verbose and easier to read Done. 
> > + > > +echo "Check extsz and cowextsz settings on 1Mbyte file" > > +test $(fn extsize "$testdir/file2") -eq 0 || echo "file2 extsize not set" > > +test $(fn cowextsize "$testdir/file2") -eq 1048576 || echo "file2 > > cowextsize not set" > > + > > +echo "Set cowextsize and check flag" > > +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file3" | > > _filter_scratch > > +_scratch_remount > > +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | > > awk '{print $4}' | grep -c 'C') -eq 1 || echo "file3 cowextsz flag not set" > > +test $(fn cowextsize "$testdir/file3") -eq 1048576 || echo "file3 > > cowextsize not set" > > +"$XFS_IO_PROG" -f -c "cowextsize 0" "$testdir/file3" | _filter_scratch > > +_scratch_remount > > +test $(fn cowextsize "$testdir/file3") -eq 0 || echo "file3 cowextsize not > > set" > > +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | > > awk '{print $4}' | grep -c 'C') -eq 0 || echo "file3 cowextsz flag not set" > > Same with all these - just grep the output for the line you want, > and the golden output matching does everything else. e.g. the flag > check simply becomes: > > $XFS_IO_PROG -c "stat" $testdir/file3 | grep 'fsxattr.xflags' > > Again, this tells us what the wrong flags are if it fails... Done. It'll probably break whenever we add new flags, but that can be fixed. --D > > There are quite a few bits of these tests where the same thing >
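The `_require_fs_space` factoring suggested in the review above might look roughly like this sketch. It is an illustration, not the actual fstests helper: the real helper's name, units, and `_notrun` integration may well differ. Here a shortage just prints a message and returns nonzero.

```shell
# Sketch of a factored free-space check in the style of an fstests helper.
# Assumed interface: _require_fs_space <dir> <needed KiB>.
_require_fs_space() {
    local dir="$1" needed_kb="$2"
    local avail_kb
    # df -kP prints one portable-format line per filesystem; field 4 is
    # the available space in 1 KiB blocks.
    avail_kb=$(df -kP "$dir" | awk 'NR==2 {print $4}')
    if [ "$avail_kb" -lt "$needed_kb" ]; then
        echo "not enough space on $dir: $avail_kb < $needed_kb KiB"
        return 1    # the real helper would call _notrun here
    fi
}

# Usage as in the review: compute space_needed, then one call replaces
# the repeated stat/arithmetic/test hunk.
_require_fs_space /tmp 1 && echo "space check passed"
```

The point of the factoring is exactly what the review says: the stat/arithmetic/test hunk is repeated across tests, so one helper call documents the intent and keeps the unit handling in one place.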
Re: Use fast device only for metadata?
On Tue, Feb 9, 2016 at 11:38 PM, Nikolaus Rath wrote:
> On Feb 09 2016, Kai Krakow wrote:
>>> If there's no way to put LVM anywhere into the stack that'd be a bummer, I very much want to use dm-crypt (and I guess that counts as lvm?).
>> Wasn't there plans for integrating per-file encryption into btrfs (like there's already for ext4)? I think this could pretty well obsolete your plans - except you prefer full-device encryption.
> Well, it could obsolete it once the plan turns into an implementation, but not today :-).
>> If you don't put encryption below the bcache caching device, everything going to the cache won't be encrypted - so that's probably what you are having to do anyways.
> No, I could put separate encryption layers between bcache and the disk - for both the backing and the caching device.
>> But I don't know how such a setup recovers from power outage, I'm not familiar with dm-crypt at all, how it integrates with maybe initrd etc.
> Initrd is not a concern. You can put on it whatever is needed to set up the stack.
> As far as power outages are concerned, I think dm-crypt doesn't change anything - it's an intermediate layer with no caching. Any write gets passed through synchronously.
>> The caching device is treated dirty always. That means, it replays all dirty data automatically during device discovery. Backing and caching create a unified pair - that's why the superblock is needed. It saves you from accidentally using the backing without the cache. So even after unclean shutdown, from the user-space view, the pair is always consistent. Bcache will only remove persisted data from its log if it ensured it was written correctly to the backing. The backing on its own, however, is not guaranteed to be consistent at any time - except you cleanly stop bcache and disconnect the pair (detach the cache).
>> When dm-crypt comes in, I'm not sure how this is handled - given that the encryption key must be loaded from somewhere... Someone else may have a better clue here.
> The encryption keys are supplied by userspace when setting up the device.
>> So actually there's two questions:
>> 1. Which order of stacking makes more sense and is more resilient to errors?
> I think in an ideal world (i.e., no software bugs), inserting dm-crypt anywhere in the stack will not make a difference at all even when there is a crash. Thus...

Most sense to me made dm-crypt between bcache and btrfs. And that works fine, I can say. Actually, I have been using the following since kernel 4.4.0-rc4 was there:

rawdevice + bcache + iscsi + dm-crypt + btrfs

This way, the IP link transports encrypted data (although it is only a local short ethernet cable + switch). It works fine, scrubs are still with 0 errors and the last btrfs check did not report any errors. (It also works well with AoE, top performance after I put MTUs to 9000.)

>> 2. Which order of stacking is exhibiting bugs?
> ..indeed becomes the important question. Now if only someone had an answer :-).
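The bcache-below-dm-crypt ordering reported to work above can be sketched as follows. This is a dry-run illustration under assumptions: device names and the "cryptdata" mapping name are placeholders, the iSCSI/AoE transport layer is omitted, and run() only echoes the commands (real use needs root and would prompt for a passphrase at the cryptsetup steps).

```shell
# Dry-run sketch of: rawdevice + bcache + dm-crypt + btrfs.
# run() echoes instead of executing (root required for the real thing).
run() { echo "$@"; }

run make-bcache -B /dev/sdb -C /dev/sdc   # create backing + cache as a pair
run cryptsetup luksFormat /dev/bcache0    # encrypt the cached block device
run cryptsetup open /dev/bcache0 cryptdata  # key supplied from userspace here
run mkfs.btrfs /dev/mapper/cryptdata
run mount /dev/mapper/cryptdata /mnt/data
```

With this ordering everything that lands in the bcache cache is below the encryption layer, i.e. the cache holds ciphertext, which matches the point made earlier in the thread about where to put encryption relative to the caching device.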
Re: Use fast device only for metadata?
On Feb 08 2016, Nikolaus Rath wrote:
> Otherwise I'll give bcache a shot. I've avoided it so far because of the need to reformat and because of rumours that it doesn't work well with LVM or BTRFS. But it sounds as if that's not the case..

I now have the following stack: btrfs on LUKS on LVM on bcache

The VG contains two bcache PVs with backing devices on different spinning disks, and a shared cache device on SSD. I'm using kernel 4.3. I'm super happy with the performance: boot times went from 1:30 minutes to X11 and 2:00 to Firefox down to roughly 0:10 to X11 and 0:30 to Firefox. Time will tell if it also keeps my data intact, but I hope btrfs would at least detect any corruption.

Best, -Nikolaus

-- GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
»Time flies like an arrow, fruit flies like a Banana.«
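Assembling a stack like the one described (two bcache'd spinning disks sharing one SSD cache set, LVM on top, LUKS on the LV, btrfs on the mapping) could look roughly like this. All device, VG, LV, and mapping names are placeholders, and run() only echoes each command instead of executing it.

```shell
# Dry-run sketch of: btrfs on LUKS on LVM on bcache.
# run() echoes instead of executing (the real commands need root).
run() { echo "$@"; }

# Two backing devices and one caching SSD, paired in a single invocation
# (make-bcache accepts multiple devices on one command line):
run make-bcache -B /dev/sda2 /dev/sdb2 -C /dev/sdc1

# LVM physical volumes on the resulting bcache devices:
run pvcreate /dev/bcache0 /dev/bcache1
run vgcreate vg0 /dev/bcache0 /dev/bcache1
run lvcreate -n data -l 100%FREE vg0

# LUKS on the logical volume, btrfs on the opened mapping:
run cryptsetup luksFormat /dev/vg0/data
run cryptsetup open /dev/vg0/data cryptdata
run mkfs.btrfs /dev/mapper/cryptdata
```

Note this is exactly the layering the earlier mails warned might be less battle-tested than rawdevice+bcache+btrfs, so treat it as an experiment with backups in place, as Nikolaus does.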
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
# perf stat -e 'btrfs:*' -a sleep 10

This is a single-device HDD; a balance of the root fs was started
before these 10 seconds of sampling. There are some differences in the
statistics depending on whether the balance is predominately reading
or writing at the time, so clearly balance does predominately reads,
then predominately writes. Unsurprising, but the three tries I did
were largely in agreement (orders-of-magnitude wise).
http://fpaste.org/320551/06921614/

# perf record -e block:block_rq_issue -ag
^C   ## after ~30 seconds
# perf report

Single-device HDD, balance of the root fs started before perf record.
There's a lot of data, collapsed by default; I expanded a few items at
random just as an example. I suspect the write of the perf.data file
is a non-factor because it was just under 2MiB.
http://fpaste.org/320555/14550698/raw/

# perf top

Single-device HDD, balance of the root fs started before issuing this
command, then left to run for about 20 seconds. This is actually not
as interesting as I thought it might be, but I don't really know what
I'm looking for; I'd need something else to compare it to.
http://fpaste.org/320559/55070873/

Anyway, all of these are single device, so it's not an apples-to-apples
comparison, but it is a working (full speed for the block device)
balance.

Chris Murphy
Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?
This could also be interesting. It means cancelling the balance in
progress, waiting some time, and then cancelling it again to get the
results to return.

# perf stat -B btrfs balance start /

Again a single-device example, balancing at the expected performance.
http://fpaste.org/320562/55071438/

I didn't try this, but it looks like it'd be a variation on the above,
attaching to a running balance:

# perf stat -B -p sleep 60

Anyway...

Chris Murphy
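The -p form above needs the PID of the running `btrfs balance start`
process inserted between -p and the sleep. A sketch of one way to find
it (`pidof` is an assumption here; any process lookup would do), again
printing the command rather than running it since it needs a balance
actually in flight:

```shell
# Sketch: locate the userspace 'btrfs' process driving the balance;
# fall back to a placeholder if none is running.
pid="$(pidof -s btrfs 2>/dev/null || echo '<pid>')"
echo "perf stat -B -p $pid sleep 60"
```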
Re: How to show current profile?
On Tue, Feb 09, 2016 at 11:36:49PM -0800, Ian Kelling wrote:
> I searched the man pages, can't seem to find it.
> btrfs-balance can change profiles, but not show
> the current profile... seems odd.

   btrfs fi df /mountpoint

   Hugo.

--
Hugo Mills             | Gentlemen! You can't fight here! This is the War
hugo@... carfax.org.uk | Room!
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                   Dr Strangelove
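For scripting, the profile names can be pulled out of that output. A
small sketch over an illustrative sample (not output from a real
filesystem):

```shell
# Illustrative sample of 'btrfs fi df' output; not from a real filesystem.
sample='Data, RAID6: total=5.52TiB, used=5.37TiB
System, RAID6: total=96.00MiB, used=480.00KiB
Metadata, RAID6: total=17.53GiB, used=11.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B'

# Splitting on ", " or ": ", the second field is the profile name.
printf '%s\n' "$sample" | awk -F'[,:] ' '{print $1 ": " $2}'
# first line prints: Data: RAID6
```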
RAID5 Unable to remove Failing HD
Hi,

This morning I woke up to a failing disk:

[230743.953079] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45648, flush 503, corrupt 0, gen 0
[230743.953970] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45649, flush 503, corrupt 0, gen 0
[230744.106443] BTRFS: lost page write due to I/O error on /dev/sdc
[230744.180412] BTRFS: lost page write due to I/O error on /dev/sdc
[230760.116173] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
[230760.116176] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45651, flush 503, corrupt 0, gen 0
[230760.726244] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45652, flush 503, corrupt 0, gen 0
[230761.392939] btrfs_end_buffer_write_sync: 2 callbacks suppressed
[230761.392947] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.392953] BTRFS: bdev /dev/sdc errs: wr 1578, rd 45652, flush 503, corrupt 0, gen 0
[230761.393813] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.393818] BTRFS: bdev /dev/sdc errs: wr 1579, rd 45652, flush 503, corrupt 0, gen 0
[230761.394843] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.394849] BTRFS: bdev /dev/sdc errs: wr 1580, rd 45652, flush 503, corrupt 0, gen 0
[230802.000425] nfsd: last server has exited, flushing export cache
[230898.791862] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.791873] BTRFS: bdev /dev/sdc errs: wr 1581, rd 45652, flush 503, corrupt 0, gen 0
[230898.792746] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.792752] BTRFS: bdev /dev/sdc errs: wr 1582, rd 45652, flush 503, corrupt 0, gen 0
[230898.793723] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.793728] BTRFS: bdev /dev/sdc errs: wr 1583, rd 45652, flush 503, corrupt 0, gen 0
[230898.830893] BTRFS info (device sdd): allowing degraded mounts
[230898.830902] BTRFS info (device sdd): disk space caching is enabled

Eventually I remounted it as degraded, hoping to prevent any loss of
data.
It seems that the btrfs filesystem still hasn't noticed that the disk
has failed:

$ btrfs fi show
Label: 'RenesData'  uuid: ee80dae2-7c86-43ea-a253-c8f04589b496
        Total devices 5 FS bytes used 5.38TiB
        devid    1 size 2.73TiB used 1.84TiB path /dev/sdb
        devid    2 size 2.73TiB used 1.84TiB path /dev/sde
        devid    3 size 3.64TiB used 1.84TiB path /dev/sdf
        devid    4 size 2.73TiB used 1.84TiB path /dev/sdd
        devid    5 size 3.64TiB used 1.84TiB path /dev/sdc

I tried deleting the device:

# btrfs device delete /dev/sdc /mnt2/RenesData/
ERROR: error removing device '/dev/sdc': Invalid argument

I have been unlucky and already had a failure last Friday, when a
RAID5 array failed after a disk failure. I rebooted, and the data was
unrecoverable. Fortunately that was only temp data, so the failure
wasn't a real issue.

Can somebody give me some advice on how to delete the failing disk? I
plan on replacing it, but unfortunately the system doesn't have
hotplug, so I will need to shut down to swap the disk without losing
any of the data stored on these devices.
Regards,
Rene Castberg

# uname -a
Linux midgard 4.3.3-1.el7.elrepo.x86_64 #1 SMP Tue Dec 15 11:18:19 EST 2015 x86_64 x86_64 x86_64 GNU/Linux

[root@midgard ~]# btrfs --version
btrfs-progs v4.3.1

[root@midgard ~]# btrfs fi df /mnt2/RenesData/
Data, RAID6: total=5.52TiB, used=5.37TiB
System, RAID6: total=96.00MiB, used=480.00KiB
Metadata, RAID6: total=17.53GiB, used=11.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

# btrfs device stats /mnt2/RenesData/
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs    0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sde].write_io_errs   0
[/dev/sde].read_io_errs    0
[/dev/sde].flush_io_errs   0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs   0
[/dev/sdf].read_io_errs    0
[/dev/sdf].flush_io_errs   0
[/dev/sdf].corruption_errs 0
[/dev/sdf].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs    0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[/dev/sdc].write_io_errs   1583
[/dev/sdc].read_io_errs    45652
[/dev/sdc].flush_io_errs   503
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0
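One possible way forward, assuming a kernel new enough to support
replace on raid5/6: `btrfs replace` rebuilds onto the new device
directly instead of first shrinking the array the way `device delete`
does. This is a sketch, not advice verified against this exact setup;
/dev/sdg is a hypothetical replacement disk, and the commands are
printed rather than executed since they need root and the real devices:

```shell
# Sketch only: /dev/sdg is a hypothetical replacement disk.
BAD=/dev/sdc          # the failing disk from the thread
NEW=/dev/sdg          # hypothetical replacement
MNT=/mnt2/RenesData

# -r avoids reading from the failing source device where possible,
# reconstructing from the other stripes instead.
echo "btrfs replace start -r $BAD $NEW $MNT"
echo "btrfs replace status $MNT"
```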
How to show current profile?
I searched the man pages, can't seem to find it. btrfs-balance can change profiles, but not show the current profile... seems odd.