Re: One disc of 3-disc btrfs-raid5 failed - files only partially readable

2016-02-09 Thread Henk Slager
On Sun, Feb 7, 2016 at 6:28 PM, Benjamin Valentin
 wrote:
> Hi,
>
> I created a btrfs volume with 3x8TB drives (ST8000AS0002-1NA) in raid5
> configuration.
> I copied some TB of data onto it without errors (from eSATA drives, so
> rather fast - I mention that because of [1]), then set it up as a
> fileserver where it had data read and written to it over a gigabit
> ethernet connection for several days.
> This however didn't go so well because after one day, one of the drives
> dropped off the SATA bus.
>
> I don't know if that was related to [1] (I was running Linux 4.4-rc6 to
> avoid that) and by now all evidence has been eaten by logrotate :\
>
> But I was not concerned for I had set up raid5 to provide redundancy
> against one disc failure - unfortunately it did not.
>
> When trying to read a file I'd get an I/O error after some hundred MB
> (this is random across multiple files, but consistent for the same
> file) on both files written before and after the disc failure.
>
> (There was still data being written to the volume at this point.)
>
> After a reboot a couple days later the drive showed up again and SMART
> reported no errors, but the I/O errors remained.
>
> I then ran btrfs scrub (this took about 10 days) and afterwards I was
> again able to completely read all files written *before* the disc
> failure.
>
> However, many files written *after* the event (while only 2 drives were
> online) are still only readable up to a point:
>
> $ dd if=Dr.Strangelove.mkv of=/dev/null
> dd: error reading ‘Dr.Strangelove.mkv’:
> Input/output error
> 5331736+0 records in
> 5331736+0 records out
> 2729848832 bytes (2,7 GB) copied, 11,1318 s, 245 MB/s
>
> $ ls -sh
> 4,4G Dr.Strangelove.mkv
>
> [  197.321552] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269564928 csum 2566472073 expected csum 2434927850
> [  197.321574] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269569024 csum 566472073 expected csum 212160686
> [  197.321592] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269573120 csum 2566472073 expected csum 2202342500
>
> I tried btrfs check --repair but to no avail, got some
>
> [ 4549.762299] BTRFS warning (device sda): failed to load free space cache 
> for block group 1614937063424, rebuilding it now
> [ 4549.790389] BTRFS error (device sda): csum mismatch on free space cache
>
> and this result
>
> checking extents
> Fixed 0 roots.
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> enabling repair mode
> Checking filesystem on /dev/sda
> UUID: ed263a9a-f65c-4bb6-8ee7-0df42b7fbfb8
> cache and super generation don't match, space cache will be invalidated
> found 11674258875712 bytes used err is 0
> total csum bytes: 11387937220
> total tree bytes: 13011156992
> total fs tree bytes: 338083840
> total extent tree bytes: 99123200
> btree space waste bytes: 1079766991
> file data blocks allocated: 14669115838464
>  referenced 14668840665088
>
> when I mount the volume with -o nospace_cache I instead get
>
> [ 6985.165421] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269560832 csum 2566472073 expected csum 874509527
> [ 6985.165469] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269564928 csum 566472073 expected csum 2434927850
> [ 6985.165490] BTRFS warning (device sda): csum failed ino 171545 off 
> 2269569024 csum 2566472073 expected csum 212160686
>
> when trying to read the file.

You could mount once with the clear_cache option, then mount normally
and the space cache will be rebuilt automatically (it is also corrected
if you don't clear it).
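
(Roughly, assuming the filesystem is on /dev/sda and gets mounted at
/mnt - adjust to your setup:)

  mount -o clear_cache /dev/sda /mnt   # one-time mount: drop and rebuild the space cache
  umount /mnt
  mount /dev/sda /mnt                  # subsequent mounts as usual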

> Do you think there is still a chance to recover those files?

You can use  btrfs restore  to get files off a damaged fs.
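
Something along these lines, assuming /dev/sda is one of the array
members and /mnt/recovery is an empty directory on a separate, healthy
filesystem:

  mkdir -p /mnt/recovery
  btrfs restore -v /dev/sda /mnt/recovery   # add -i to ignore errors, -D for a dry run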

> Also am I mistaken to believe that btrfs-raid5 would continue to
> function when one disc fails?

The problem you encountered is unfortunately quite typical. The answer
is yes, provided you stop writing to the fs - but that's not acceptable,
of course. A key problem of btrfs raid (also in recent kernels like 4.4)
is that when a (redundant) device goes offline (like a pulled SATA cable
or an HDD firmware crash), btrfs/the kernel does not notice it, or does
not act correctly upon it, under various circumstances. So, same as in
your case, writes to the disappeared device seem to continue. For just
the data, this might then still be recoverable, but for the rest of
the structures, it might corrupt the fs heavily.

What should happen is that the btrfs+kernel+fs state switches to
degraded mode and warns about the device failure so that the user can
take action - or completely automatically starts using a spare disk that
is on standby but connected. But that hot-spare method currently exists
only as patches on this list; I assume it will take time before they
appear in the mainline kernel.
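
For reference, the manual recovery path today looks roughly like this
(device names and the devid are placeholders; check 'btrfs filesystem
show' for the real devid of the vanished disk):

  mount -o degraded /dev/sdb /mnt
  btrfs replace start 3 /dev/sdd /mnt   # 3 = devid of the vanished disk
  # or, if no equal-size replacement is at hand:
  # btrfs device add /dev/sdd /mnt && btrfs device delete missing /mnt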

It is possible to reproduce the issue of one device of a raid array
disappearing while btrfs/the kernel still thinks it's there. I hit this
problem myself twice with loop devices; it ruined things, luckily 

Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions

2016-02-09 Thread Chris Murphy
On Fri, Feb 5, 2016 at 12:36 PM, Mackenzie Meyer  wrote:

>
> RAID 6 write holes?

I don't even understand the nature of the write hole on Btrfs. If
modification is still always COW, then either an fs block, a strip, or a
whole stripe write happens, so I'm not sure where the hole comes from. It
suggests some raid56 writes are not atomic.

If you're worried about raid56 write holes, then a.) you need a server
running this raid where power failures or crashes don't happen, b.)
don't use raid56, or c.) use ZFS.


> RAID 6 stability?
> Any articles I've tried looking for online seem to be from early 2014,
> I can't find anything recent discussing the stability of RAID 5 or 6.
> Are there or have there recently been any data corruption bugs which
> impact RAID 6? Would you consider RAID 6 safe/stable enough for
> production use?

It's not stable for your use case if you have to ask others whether it's
stable enough for your use case. Simple as that. Right now some raid6
users are experiencing remarkably slow balances, on the order of
weeks. If device replacement rebuild times are that long, I'd say it's
disqualifying for most any use case, simply because there are
alternatives with better failover behavior than this. So far
there's no word from any developers on what the problem might be, or
where to gather more information. So chances are they're already aware
of it but haven't reproduced it, isolated it, or found a fix for it
yet.

If you're prepared to make Btrfs better in the event you have a
problem, with possibly some delay in getting that volume up and
running again (including the likelihood of having to rebuild it from a
backup), then it might be compatible with your use case.

> Do you still strongly recommend backups, or has stability reached a
> point where backups aren't as critical? I'm thinking from a data
> consistency standpoint, not a hardware failure standpoint.

You can't separate them. On completely stable hardware, stem to stern,
you'd need no backups and no Btrfs or ZFS; you'd just run linear/concat
arrays with XFS, for example. So you can't just hand-wave the hardware
part away. There are bugs in the entire storage stack, there are
connectors that can become intermittent, the system could crash. All
of these affect data consistency.

Stability has not reached a point where backups aren't as critical. I
don't really even know what that means, though. Btrfs or not,
you need to be doing backups such that a sudden 100% loss of the
primary stack is not a disaster. Plan on having to use them. If
you don't like the sound of that, look elsewhere.


> I plan to start with a small array and add disks over time. That said,
> currently I have mostly 2TB disks and some 3TB disks. If I replace all
> 2TB disks with 3TB disks, would BTRFS then start utilizing the full
> 3TB capacity of each disk, or would I need to destroy and rebuild my
> array to benefit from the larger disks?

Btrfs, LVM raid, mdraid, and ZFS all let you grow arrays without having
to recreate the file system from scratch; they differ in how easy this
is and how long it takes.
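
For Btrfs specifically, the disk-by-disk upgrade is roughly (device
names and the devid are placeholders):

  btrfs replace start /dev/sdb /dev/sdf /mnt   # copy an old 2TB disk onto a new 3TB one
  btrfs filesystem resize 2:max /mnt           # grow that device to its full size (2 = its devid)
  # repeat per disk; a final balance spreads data across the new capacity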


-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 13/23] xfs: test fragmentation characteristics of copy-on-write

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:09PM -0800, Darrick J. Wong wrote:
> Perform copy-on-writes at random offsets to stress the CoW allocation
> system.  Assess the effectiveness of the extent size hint at
> combatting fragmentation via unshare, a rewrite, and no-op after the
> random writes.
> 
> Signed-off-by: Darrick J. Wong 

> +seq=`basename "$0"`
> +seqres="$RESULT_DIR/$seq"
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1    # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +cd /
> +#rm -rf "$tmp".* "$testdir"

Now that I've noticed it, a few tests have this line commented out.
Probably should remove the tmp files, at least.

> +rm -f "$seqres.full"
> +
> +echo "Format and mount"
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1
> +
> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir

Again, something that is repeated - we just mkfs'd the scratch
device, so the $testdir is guaranteed not to exist...

> +echo "Check for damage"
> +umount "$SCRATCH_MNT"

I've also noticed this in a lot of tests - the scratch device will
be unmounted by the harness, so I don't think this is necessary

> +free_blocks=$(stat -f -c '%a' "$testdir")
> +real_blksz=$(stat -f -c '%S' "$testdir")
> +space_needed=$(((blksz * nr * 3) * 5 / 4))
> +space_avail=$((free_blocks * real_blksz))
> +internal_blks=$((blksz * nr / real_blksz))
> +test $space_needed -gt $space_avail && _notrun "Not enough space. 
> $space_avail < $space_needed"

Why not:

_require_fs_space $space_needed

At minimum, it seems to be a repeated hunk of code, so it should be
factored.
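
Something like this, as a sketch only - the argument list and units here
are made up, not an existing xfstests API:

  _require_fs_space()
  {
      local mnt=$1
      local needed=$2    # bytes

      local avail=$(( $(stat -f -c '%a' "$mnt") * $(stat -f -c '%S' "$mnt") ))
      [ "$avail" -ge "$needed" ] || \
          _notrun "Not enough space on $mnt: $avail < $needed"
  }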

> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir
> +
> +echo "Create the original files"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 0" "$testdir/file1" >> "$seqres.full"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 1048576" "$testdir/file2" >> 
> "$seqres.full"
> +_scratch_remount
> +
> +echo "Set extsz and cowextsz on zero byte file"
> +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file1" | _filter_scratch
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file1" | _filter_scratch
> +
> +echo "Set extsz and cowextsz on 1Mbyte file"
> +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file2" | _filter_scratch
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file2" | _filter_scratch
> +_scratch_remount
> +
> +fn() {
> + "$XFS_IO_PROG" -c "$1" "$2" | sed -e 's/.\([0-9]*\).*$/\1/g'
> +}
> +echo "Check extsz and cowextsz settings on zero byte file"
> +test $(fn extsize "$testdir/file1") -eq 1048576 || echo "file1 extsize not 
> set"
> +test $(fn cowextsize "$testdir/file1") -eq 1048576 || echo "file1 cowextsize 
> not set" 

For this sort of thing, just dump the extent size value to the
golden output. i.e.

echo "Check extsz and cowextsz settings on zero byte file"
$XFS_IO_PROG -c extsize $testdir/file1
$XFS_IO_PROG -c cowextsize $testdir/file1

is all that is needed. That way, if it fails, we see what value it
had instead of the expected 1MB. This also makes the test much less
verbose and easier to read.

> +
> +echo "Check extsz and cowextsz settings on 1Mbyte file"
> +test $(fn extsize "$testdir/file2") -eq 0 || echo "file2 extsize not set"
> +test $(fn cowextsize "$testdir/file2") -eq 1048576 || echo "file2 cowextsize 
> not set" 
> +
> +echo "Set cowextsize and check flag"
> +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file3" | _filter_scratch
> +_scratch_remount
> +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | 
> awk '{print $4}' | grep -c 'C') -eq 1 || echo "file3 cowextsz flag not set"
> +test $(fn cowextsize "$testdir/file3") -eq 1048576 || echo "file3 cowextsize 
> not set"
> +"$XFS_IO_PROG" -f -c "cowextsize 0" "$testdir/file3" | _filter_scratch
> +_scratch_remount
> +test $(fn cowextsize "$testdir/file3") -eq 0 || echo "file3 cowextsize not 
> set"
> +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | 
> awk '{print $4}' | grep -c 'C') -eq 0 || echo "file3 cowextsz flag not set"

Same with all these - just grep the output for the line you want,
and the golden output matching does everything else. e.g. the flag
check simply becomes:

$XFS_IO_PROG -c "stat" $testdir/file3 | grep 'fsxattr.xflags'

Again, this tells us what the wrong flags are if it fails...

There are quite a few bits of these tests where the same thing
applies

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 17/23] reflink: test CoW across a mixed range of block types with cowextsize set

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:35PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong 
> ---
>  tests/xfs/215 |  108 ++
>  tests/xfs/215.out |   14 +
>  tests/xfs/218 |  108 ++
>  tests/xfs/218.out |   14 +
>  tests/xfs/219 |  108 ++
>  tests/xfs/219.out |   14 +
>  tests/xfs/221 |  108 ++
>  tests/xfs/221.out |   14 +
>  tests/xfs/223 |  113 
>  tests/xfs/223.out |   14 +
>  tests/xfs/224 |  113 
>  tests/xfs/224.out |   14 +
>  tests/xfs/225 |  108 ++
>  tests/xfs/225.out |   14 +
>  tests/xfs/226 |  108 ++
>  tests/xfs/226.out |   14 +
>  tests/xfs/228 |  137 
> +
>  tests/xfs/228.out |   14 +
>  tests/xfs/230 |  137 
> +
>  tests/xfs/230.out |   14 +
>  tests/xfs/group   |   10 
>  21 files changed, 1298 insertions(+)
>  create mode 100755 tests/xfs/215
>  create mode 100644 tests/xfs/215.out
>  create mode 100755 tests/xfs/218
>  create mode 100644 tests/xfs/218.out
>  create mode 100755 tests/xfs/219
>  create mode 100644 tests/xfs/219.out
>  create mode 100755 tests/xfs/221
>  create mode 100644 tests/xfs/221.out
>  create mode 100755 tests/xfs/223
>  create mode 100644 tests/xfs/223.out
>  create mode 100755 tests/xfs/224
>  create mode 100644 tests/xfs/224.out
>  create mode 100755 tests/xfs/225
>  create mode 100644 tests/xfs/225.out
>  create mode 100755 tests/xfs/226
>  create mode 100644 tests/xfs/226.out
>  create mode 100755 tests/xfs/228
>  create mode 100644 tests/xfs/228.out
>  create mode 100755 tests/xfs/230
>  create mode 100644 tests/xfs/230.out
> 
> 
> diff --git a/tests/xfs/215 b/tests/xfs/215
> new file mode 100755
> index 000..8dd5cb5
> --- /dev/null
> +++ b/tests/xfs/215
> @@ -0,0 +1,108 @@
> +#! /bin/bash
> +# FS QA Test No. 215
> +#
> +# Ensuring that copy on write in direct-io mode works when the CoW
> +# range originally covers multiple extents, some unwritten, some not.
> +#   - Set cowextsize hint.
> +#   - Create a file and fallocate a second file.
> +#   - Reflink the odd blocks of the first file into the second file.
> +#   - directio CoW across the halfway mark, starting with the unwritten 
> extent.
> +#   - Check that the files are now different where we say they're different.
> +#
> +#---
> +# Copyright (c) 2016, Oracle and/or its affiliates.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +
> +seq=`basename "$0"`
> +seqres="$RESULT_DIR/$seq"
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1    # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +cd /
> +rm -rf "$tmp".*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/reflink
> +
> +# real QA test starts here
> +_supported_os Linux
> +_require_scratch_reflink
> +_require_xfs_io_command "falloc"
> +
> +rm -f "$seqres.full"
> +
> +echo "Format and mount"
> +_scratch_mkfs > "$seqres.full" 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1
> +
> +testdir="$SCRATCH_MNT/test-$seq"
> +rm -rf $testdir
> +mkdir $testdir
> +
> +echo "Create the original files"
> +blksz=65536
> +nr=64
> +real_blksz=$(stat -f -c '%S' "$testdir")
> +internal_blks=$((blksz * nr / real_blksz))
> +"$XFS_IO_PROG" -c "cowextsize $((blksz * 16))" "$testdir" >> "$seqres.full"
> +_pwrite_byte 0x61 0 $((blksz * nr)) "$testdir/file1" >> "$seqres.full"
> +$XFS_IO_PROG -f -c "falloc 0 $((blksz * nr))" "$testdir/file3" >> 
> "$seqres.full"
> +_pwrite_byte 0x00 0 $((blksz * nr)) "$testdir/file3.chk" >> "$seqres.full"
> +seq 0 2 $((nr-1)) | while read f; do
> + _reflink_range "$testdir/file1" $((blksz * f)) "$testdir/file3" 
> $((blksz * f)) $blksz >> "$seqres.full"
> + _pwrite_byte 0x61 $((blksz 

Re: [PATCH 06/23] dio unwritten conversion bug tests

2016-02-09 Thread Darrick J. Wong
On Tue, Feb 09, 2016 at 06:37:32PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:12:23PM -0800, Darrick J. Wong wrote:
> > Check that we don't expose old disk contents when a directio write to
> > an unwritten extent fails due to IO errors.  This primarily affects
> > XFS and ext4.
> > 
> > Signed-off-by: Darrick J. Wong 
> .
> > --- a/tests/generic/group
> > +++ b/tests/generic/group
> > @@ -252,7 +252,9 @@
> >  247 auto quick rw
> >  248 auto quick rw
> >  249 auto quick rw
> > +250 auto quick
> >  251 ioctl trim
> > +252 auto quick
> 
> Also should be in the prealloc group if we are testing unwritten
> extent behaviour and the rw group because it's testing IO.

Done.

Should the CoW tests be in 'rw' too?  They're testing IO, but otoh they
(most probably) require shared blocks to have much of a point.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> ___
> xfs mailing list
> x...@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 19/23] xfs: test rmapbt functionality

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:48PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong 
> ---
>  common/xfs|   44 ++
>  tests/xfs/233 |   78 ++
>  tests/xfs/233.out |6 +++
>  tests/xfs/234 |   89 
>  tests/xfs/234.out |6 +++
>  tests/xfs/235 |  108 
> +
>  tests/xfs/235.out |   14 +++
>  tests/xfs/236 |   93 ++
>  tests/xfs/236.out |8 
>  tests/xfs/group   |4 ++
>  10 files changed, 450 insertions(+)
>  create mode 100644 common/xfs
>  create mode 100755 tests/xfs/233
>  create mode 100644 tests/xfs/233.out
>  create mode 100755 tests/xfs/234
>  create mode 100644 tests/xfs/234.out
>  create mode 100755 tests/xfs/235
>  create mode 100644 tests/xfs/235.out
>  create mode 100755 tests/xfs/236
>  create mode 100644 tests/xfs/236.out
> 
> 
> diff --git a/common/xfs b/common/xfs
> new file mode 100644
> index 000..2d1a76f
> --- /dev/null
> +++ b/common/xfs
> @@ -0,0 +1,44 @@
> +##/bin/bash
> +# Routines for handling XFS
> +#---
> +#  Copyright (c) 2015 Oracle.  All Rights Reserved.
> +#  This program is free software; you can redistribute it and/or modify
> +#  it under the terms of the GNU General Public License as published by
> +#  the Free Software Foundation; either version 2 of the License, or
> +#  (at your option) any later version.
> +#
> +#  This program is distributed in the hope that it will be useful,
> +#  but WITHOUT ANY WARRANTY; without even the implied warranty of
> +#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +#  GNU General Public License for more details.
> +#
> +#  You should have received a copy of the GNU General Public License
> +#  along with this program; if not, write to the Free Software
> +#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
> +#  USA
> +#
> +#  Contact information: Oracle Corporation, 500 Oracle Parkway,
> +#  Redwood Shores, CA 94065, USA, or: http://www.oracle.com/
> +#---
> +
> +_require_xfs_test_rmapbt()
> +{
> + _require_test
> +
> + if [ "$(xfs_info "$TEST_DIR" | grep -c "rmapbt=1")" -ne 1 ]; then
> + _notrun "rmapbt not supported by test filesystem type: $FSTYP"
> + fi
> +}
> +
> +_require_xfs_scratch_rmapbt()
> +{
> + _require_scratch
> +
> + _scratch_mkfs > /dev/null
> + _scratch_mount
> + if [ "$(xfs_info "$SCRATCH_MNT" | grep -c "rmapbt=1")" -ne 1 ]; then
> + _scratch_unmount
> + _notrun "rmapbt not supported by scratch filesystem type: 
> $FSTYP"
> + fi
> + _scratch_unmount
> +}

No, not yet. :)

Wait until I get my "split common/rc" patchset out there, because it
does not require:

> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +. ./common/xfs

This.

And I don't want to have to undo a bunch of stuff in tests yet. Just
lump it all in common/rc for the moment.

> +
> +# real QA test starts here
> +_supported_os Linux
> +_supported_fs xfs
> +_require_xfs_scratch_rmapbt
> +
> +echo "Format and mount"
> +_scratch_mkfs -d size=$((2 * 4096 * 4096)) -l size=4194304 > "$seqres.full" 
> 2>&1
> +_scratch_mount >> "$seqres.full" 2>&1

_scratch_mkfs_sized ?
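
i.e. something like the following (sketch only; keep passing the -l size
override separately if the small log still matters for the test):

  _scratch_mkfs_sized $((2 * 4096 * 4096)) >> "$seqres.full" 2>&1
  _scratch_mount >> "$seqres.full" 2>&1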

> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +cd /
> +#rm -f $tmp.*

More random uncommenting needed.

> +
> +echo "Check for damage"
> +umount "$SCRATCH_MNT"
> +_check_scratch_fs
> +
> +# success, all done
> +status=0
> +exit

Cull.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/23] xfs/122: support refcount/rmap data structures

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 11:55:06PM -0800, Darrick J. Wong wrote:
> On Tue, Feb 09, 2016 at 06:43:30PM +1100, Dave Chinner wrote:
> > On Mon, Feb 08, 2016 at 05:13:03PM -0800, Darrick J. Wong wrote:
> > > Include the refcount and rmap structures in the golden output.
> > > 
> > > Signed-off-by: Darrick J. Wong 
> > > ---
> > >  tests/xfs/122 |3 +++
> > >  tests/xfs/122.out |4 
> > >  tests/xfs/group   |2 +-
> > >  3 files changed, 8 insertions(+), 1 deletion(-)
> > > 
> > > 
> > > diff --git a/tests/xfs/122 b/tests/xfs/122
> > > index e6697a2..758cb50 100755
> > > --- a/tests/xfs/122
> > > +++ b/tests/xfs/122
> > > @@ -90,6 +90,9 @@ xfs_da3_icnode_hdr
> > >  xfs_dir3_icfree_hdr
> > >  xfs_dir3_icleaf_hdr
> > >  xfs_name
> > > +xfs_owner_info
> > > +xfs_refcount_irec
> > > +xfs_rmap_irec
> > >  xfs_alloctype_t
> > >  xfs_buf_cancel_t
> > >  xfs_bmbt_rec_32_t
> > 
> > So this is going to cause failures on any userspace that doesn't
> > know about these new types, right?
> > 
> > Should these be conditional in some way?
> 
> I wasn't sure how to handle this -- I could just keep the patch at the head of
> my stack (unreleased) until xfsprogs pulls in the appropriate libxfs pieces?
> So long as we're not dead certain of the final format of the rmapbt and
> refcountbt, there's probably not a lot of value in putting this in (yet).

Well, I'm more concerned about running on older/current distros that
don't have support for them in userspace. My brain is mush right
now, so I don't have any brilliant ideas (hence the question, rather
than also presenting a possible solution). I'll have a think; maybe
we can make use of the configurable .out file code we have now?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


USB memory sticks wear & speed: btrfs vs f2fs?

2016-02-09 Thread Martin
How does btrfs compare to f2fs for use on (128GByte) USB memory sticks?

Particularly for wearing out certain storage blocks?

Does btrfs heavily use particular storage blocks that will prematurely
"wear out"?

(That is, could the whole 128GBytes be lost due to one 4kByte block
having been re-written excessively too many times due to a fixed
repeatedly used filesystem block?)

Any other comparisons/thoughts for btrfs vs f2fs?


Thanks for any comment,
Martin

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 21/23] xfs: aio cow tests

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:14:01PM -0800, Darrick J. Wong wrote:
.,,,
> +
> +echo "Check for damage"
> +_dmerror_unmount
> +_dmerror_cleanup
> +_repair_scratch_fs >> "$seqres.full" 2>&1

Are you testing repair here? If so, why doesn't failure matter?
If not, why do it? Or is _require_scratch_nocheck all that is needed
here?

> +echo "CoW and unmount"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * bsz)) 1" "$testdir/file2" >> 
> "$seqres.full"
> +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 -b $((blksz * bsz)) 0 $((blksz * nr))" 
> "$TEST_DIR/moo" >> "$seqres.full"

offset = block size times block size?

I think some better names might be needed...

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 18/23] xfs: test the automatic cowextsize extent garbage collector

2016-02-09 Thread Dave Chinner
On Mon, Feb 08, 2016 at 05:13:42PM -0800, Darrick J. Wong wrote:
> Signed-off-by: Darrick J. Wong 
> +
> +_cleanup()
> +{
> +cd /
> +echo $old_cow_lifetime > 
> /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> +#rm -rf "$tmp".* "$testdir"

uncomment.

> +echo "CoW and leave leftovers"
> +echo $old_cow_lifetime > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> +seq 2 2 $((nr - 1)) | while read f; do
> + "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" "$testdir/file2" 
> >> "$seqres.full"
> + "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" 
> "$testdir/file2.chk" >> "$seqres.full"
> +done

Ok, I just realised what was bugging me about these loops: "f" is
not a typical loop iterator for a count. Normally we'd use "i" for
these

> +echo "old extents: $old_extents" >> "$seqres.full"
> +echo "new extents: $new_extents" >> "$seqres.full"
> +echo "maximum extents: $internal_blks" >> "$seqres.full"
> +test $new_extents -lt $((internal_blks / 7)) || _fail "file2 badly 
> fragmented"

I wouldn't use _fail like this, echo is sufficient to cause the test
to fail.

> +echo "Check for damage"
> +umount "$SCRATCH_MNT"
> +
> +# success, all done
> +status=0
> +exit

As would getting rid of the unmount and just setting status
appropriately...

/repeat

-Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 10/23] xfs: more reflink tests

2016-02-09 Thread Darrick J. Wong
On Tue, Feb 09, 2016 at 06:36:22PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:12:50PM -0800, Darrick J. Wong wrote:
> > Create a couple of XFS-specific tests -- one to check that growing
> > and shrinking the refcount btree works and a second one to check
> > what happens when we hit maximum refcount.
> > 
> > Signed-off-by: Darrick J. Wong 
> .
> > +# real QA test starts here
> > +_supported_os Linux
> > +_supported_fs xfs
> > +_require_scratch_reflink
> > +_require_cp_reflink
> 
> > +
> > +test -x "$here/src/punch-alternating" || _notrun "punch-alternating not 
> > built"
> 
> I suspect we need a _require rule for checking that something in
> the test src directory has been built.

Crapola, we also need punch-alternating, which doesn't appear until the next
patch.  Guess I'll go move it out of the next patch (or swap the order of
these two I guess.)

I added _require_test_program() which complains if src/$1 isn't built.
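
The helper is essentially just an executable check; a minimal sketch
(the version in the actual patch may differ):

  _require_test_program()
  {
      local prog="$here/src/$1"

      [ -x "$prog" ] || _notrun "$prog not built"
  }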

> > +echo "Check scratch fs"
> > +umount "$SCRATCH_MNT"
> > +echo "check refcount after removing all files" >> "$seqres.full"
> > +"$XFS_DB_PROG" -c 'agf 0' -c 'addr refcntroot' -c 'p recs[1]' 
> > "$SCRATCH_DEV" >> "$seqres.full"
> > +"$XFS_REPAIR_PROG" -o force_geometry -n "$SCRATCH_DEV" >> "$seqres.full" 
> > 2>&1
> > +res=$?
> > +if [ $res -eq 0 ]; then
> > +   # If repair succeeds then format the device so that the post-test
> > +   # check doesn't fail due to the single AG.
> > +   _scratch_mkfs >> "$seqres.full" 2>&1
> > +else
> > +   _fail "xfs_repair fails"
> > +fi
> > +
> > +# success, all done
> > +status=0
> > +exit
> 
> This is what _require_scratch_nocheck avoids.
> 
> i.e. do this instead:
> 
> _require_scratch_nocheck
> .
> 
> "$XFS_REPAIR_PROG" -o force_geometry -n "$SCRATCH_DEV" >> "$seqres.full" 2>&1 
> status=$?
> exit

Ok.

> Also, we really don't need the quotes around these global
> variables.  They are just noise and lots of stuff will break if
> those variables are set to something that requires them to be
> quoted.



--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 
> ___
> xfs mailing list
> x...@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB memory sticks wear & speed: btrfs vs f2fs?

2016-02-09 Thread Brendan Hide

On 2/9/2016 1:13 PM, Martin wrote:

How does btrfs compare to f2fs for use on (128GByte) USB memory sticks?

Particularly for wearing out certain storage blocks?

Does btrfs heavily use particular storage blocks that will prematurely
"wear out"?

(That is, could the whole 128GBytes be lost due to one 4kByte block
having been re-written excessively too many times due to a fixed
repeatedly used filesystem block?)

Any other comparisons/thoughts for btrfs vs f2fs?
Copy-on-write (CoW) designs tend naturally to work well with flash 
media. F2fs is *specifically* designed to work well with flash, whereas 
for btrfs it is a natural consequence of the copy-on-write design. With 
both filesystems, if you randomly generate a 1GB file and delete it 1000 
times, onto a 1TB flash, you are *very* likely to get exactly one write 
to *every* block on the flash (possibly two writes to <1% of the blocks) 
rather than, as would be the case with non-CoW filesystems, 1000 writes 
to a small chunk of blocks.


I haven't found much reference or comparison information online wrt wear 
leveling - mostly performance benchmarks that don't really address your 
request. Personally I will likely never bother with f2fs unless I 
somehow end up working on a project requiring relatively small storage 
in Flash (as that is what f2fs was designed for).


If someone can provide or link to some proper comparison data, that 
would be nice. :)


--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS RAM requirements, RAID 6 stability/write holes and expansion questions

2016-02-09 Thread Psalle



On 05/02/16 20:36, Mackenzie Meyer wrote:

Hello,

I've tried checking around on google but can't find information
regarding the RAM requirements of BTRFS and most of the topics on
stability seem quite old.


To keep my answer short: every time I've tried (offline) deduplication
or raid5 pools I've ended up with borked filesystems. The last attempt
was about a year ago. Given that the pages you mention looked the same
by then, I'd stay away from raid56 for anything but testing purposes. I
haven't read anything about raid5 recently (i.e. post-3.19 kernels) that
increases my confidence in it. Dedup, OTOH, I don't know. What I
used were third-party (I think?) tools, so the fault may have rested on
them and not btrfs (does that make sense?)


I'm building a new small raid5 pool as we speak, though, for throw-away 
data, so I hope to be favourably impressed.


Cheers.


So first would be memory requirements, my goal is to use deduplication
and compression. Approximately how many GB of RAM per TB of storage
would be recommended?

RAID 6 write holes?
The BTRFS wiki states that parity might be inconsistent after a crash.
That said, the wiki page for RAID 5/6 doesn't look like it has much
recent information on there. Has this issue been addressed and if not,
are there plans to address the RAID write hole issue? What would be a
recommended workaround to resolve inconsistent parity, should an
unexpected power down happen during write operations?

RAID 6 stability?
Any articles I've tried looking for online seem to be from early 2014,
I can't find anything recent discussing the stability of RAID 5 or 6.
Are there or have there recently been any data corruption bugs which
impact RAID 6? Would you consider RAID 6 safe/stable enough for
production use?

Do you still strongly recommend backups, or has stability reached a
point where backups aren't as critical? I'm thinking from a data
consistency standpoint, not a hardware failure standpoint.

I plan to start with a small array and add disks over time. That said,
currently I have mostly 2TB disks and some 3TB disks. If I replace all
2TB disks with 3TB disks, would BTRFS then start utilizing the full
3TB capacity of each disk, or would I need to destroy and rebuild my
array to benefit from the larger disks?


Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: USB memory sticks wear & speed: btrfs vs f2fs?

2016-02-09 Thread Austin S. Hemmelgarn

On 2016-02-09 09:08, Brendan Hide wrote:

On 2/9/2016 1:13 PM, Martin wrote:

How does btrfs compare to f2fs for use on (128GByte) USB memory sticks?

Particularly for wearing out certain storage blocks?

Does btrfs heavily use particular storage blocks that will prematurely
"wear out"?

(That is, could the whole 128GBytes be lost due to one 4kByte block
having been re-written excessively too many times due to a fixed
repeatedly used filesystem block?)

Any other comparisons/thoughts for btrfs vs f2fs?

Copy-on-write (CoW) designs tend naturally to work well with flash
media. F2fs is *specifically* designed to work well with flash, whereas
for btrfs it is a natural consequence of the copy-on-write design. With
both filesystems, if you randomly generate a 1GB file and delete it 1000
times, onto a 1TB flash, you are *very* likely to get exactly one write
to *every* block on the flash (possibly two writes to <1% of the blocks)
rather than, as would be the case with non-CoW filesystems, 1000 writes
to a small chunk of blocks.
This goes double if you're using the 'ssd' mount option on BTRFS.  Also, 
the only blocks that are rewritten in place on BTRFS (unless you turn 
off COW) are the superblocks, but all filesystems rewrite those in-place.


I haven't found much reference or comparison information online wrt wear
leveling - mostly performance benchmarks that don't really address your
request. Personally I will likely never bother with f2fs unless I
somehow end up working on a project requiring relatively small storage
in Flash (as that is what f2fs was designed for).
I would tend to agree, but that's largely because BTRFS is more of a 
known entity for me, and certain features (send/receive in particular) 
are important enough for my usage that I'm willing to take the 
performance hit.  IIRC, F2FS was developed for usage in stuff like 
Android devices and other compact embedded devices, where the FTL may 
not do a good job of wear leveling, so it should work equally well on 
USB flash drives (many of the cheap ones have no wear-leveling at all, 
and even some of the expensive ones have sub-par wear-leveling compared 
to good SSD's).

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-09 Thread Christian Rohmann


On 02/01/2016 09:52 PM, Chris Murphy wrote:
>> Would some sort of stracing or profiling of the process help to narrow
>> > down where the time is currently spent and why the balancing is only
>> > running single-threaded?
> This can't be straced. Someone a lot more knowledgeable than I am
> might figure out where all the waits are with just a sysrq + t, if it
> is a hold up in say parity computations. Otherwise perf which is a
> rabbit hole but perf top is kinda cool to watch. That might give you
> an idea where most of the cpu cycles are going if you can isolate the
> workload to just the balance. Otherwise you may end up with noisy
> data.

My balance run is now working away since 19th of January:
 "885 out of about 3492 chunks balanced (996 considered),  75% left"

So this will take several more WEEKS to finish. Is there really nothing
anyone here wants me to do or analyze to help finding the root cause of
this? I mean with this kind of performance there is no way a RAID6 can
be used in production. Not because the code is not stable or
functioning, but because regular maintenance like replacing a drive or
growing an array takes WEEKS in which another maintenance procedure
could be necessary or, much worse, another drive might have failed.

What I'm saying is: Such a slow RAID6 balance renders the redundancy
unusable because drives might fail quicker than the potential rebuild
(read "balance").



Regards

Christian
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: "layout" of a six drive raid10

2016-02-09 Thread Austin S. Hemmelgarn

On 2016-02-09 02:02, Kai Krakow wrote:

Am Tue, 9 Feb 2016 01:42:40 + (UTC)
schrieb Duncan <1i5t5.dun...@cox.net>:


Tho I'd consider benchmarking or testing, as I'm not sure btrfs raid1
on spinning rust will in practice fully saturate the gigabit
Ethernet, particularly as it gets fragmented (which COW filesystems
such as btrfs tend to do much more so than non-COW, unless you're
using something like the autodefrag mount option from the get-go, as
I do here, tho in that case, striping won't necessarily help a lot
either).

If you're concerned about getting the last bit of performance
possible, I'd say raid10, tho over the gigabit ethernet, the
difference isn't likely to be much.


If performance is an issue, I suggest putting an SSD and bcache into
the equation. I have very nice performance improvements with that,
especially with writeback caching (random write go to bcache first,
then to harddisk in background idle time).

Apparently, afaik it's currently not possible to have native bcache
redundancy yet - so bcache can only be one SSD. It may be possible
to use two bcaches and assign the btrfs members to them alternately - tho
btrfs may decide to put two mirrors on the same bcache then. Alternatively,
you could put bcache on lvm or mdraid - but I would not
do it. On the bcache list, multiple people had problems with that,
including btrfs corruption beyond repair.

On the other hand, you could simply go with bcache writearound caching
(only reads become cached) or writethrough caching (writes go in
parallel to bcache and btrfs). If the SSD dies, btrfs will still be
perfectly safe in this case.

If you are going with one of the latter options, the tuning knobs of
bcache may help you cache not only random accesses but also linear
accesses. That should help to saturate a gigabit link.
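
(For reference, those knobs live in sysfs; a sketch, assuming the bcache
device shows up as bcache0:)

  echo writearound > /sys/block/bcache0/bcache/cache_mode
  echo 0 > /sys/block/bcache0/bcache/sequential_cutoff   # 0 = also cache sequential I/O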

Currently, SanDisk offers a pretty cheap (not top-performance) 500GB
drive which should cover this use case perfectly. Tho I'm not sure
how stable this drive is with bcache. I have only tried the Crucial
MX100 and Samsung Evo 840 so far - both working very stably with the
latest kernel and discard enabled, no mdraid or lvm involved.

FWIW, the other option if you want good performance and don't want to
get an SSD is to run BTRFS in raid1 mode on top of two LVM or MD-RAID
RAID0 volumes.  I do this regularly for VM's and see a roughly 25-30%
performance increase compared to BTRFS raid10 for my workloads, and
that's with things laid out such that each block in BTRFS (16k in my
case) ends up entirely on one disk in the RAID0 volume (you could
theoretically get better performance by sizing the RAID0 stripes such
that a BTRFS block gets spread across all the disks in the volume, but
that is marginally less safe than keeping each block on one disk).
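
A rough sketch of that layout, with placeholder device names and default
chunk size:

  mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
  mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
  mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
  mount /dev/md0 /mnt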

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Use fast device only for metadata?

2016-02-09 Thread Austin S. Hemmelgarn

On 2016-02-08 16:44, Nikolaus Rath wrote:

On Feb 07 2016, Martin Steigerwald  wrote:

Am Sonntag, 7. Februar 2016, 21:07:13 CET schrieb Kai Krakow:

Am Sun, 07 Feb 2016 11:06:58 -0800

schrieb Nikolaus Rath :

Hello,

I have a large home directory on a spinning disk that I regularly
synchronize between different computers using unison. That takes ages,
even though the amount of changed files is typically small. I suspect
most if the time is spend walking through the file system and checking
mtimes.

So I was wondering if I could possibly speed-up this operation by
storing all btrfs metadata on a fast, SSD drive. It seems that
mkfs.btrfs allows me to put the metadata in raid1 or dup mode, and the
file contents in single mode. However, I could not find a way to tell
btrfs to use a device *only* for metadata. Is there a way to do that?

Also, what is the difference between using "dup" and "raid1" for the
metadata?


You may want to try bcache. It will speedup random access which is
probably the main cause for your slow sync. Unfortunately it requires
you to reformat your btrfs partitions to add a bcache superblock. But
it's worth the efforts.

I use a nightly rsync to USB3 disk, and bcache reduced it from 5+ hours
to typically 1.5-3 depending on how much data changed.


An alternative is using dm-cache, I think it doesn´t need to recreate the
filesystem.


Yes, I tried that already but it didn't improve things at all. I wrote a
message to the lvm list though, so maybe someone will be able to help.
That's interesting.  I've been using BTRFS on dm-cache for a while, and 
have seen measurable improvements in performance.  They are not big 
improvements (only about 5% peak), but they're still improvements, which 
is somewhat impressive considering that the backing storage that's being 
cached is a RAID0 set which gets almost the same raw throughput as the 
SSD that's caching it.  Of course, I'm using it more for the power 
savings (SSD's use less power, and I've got a big enough cache I can 
often spin down the traditional disks in the RAID0 set), and I also 
re-tune my system as hardware and workloads change, and my workloads 
tend to be atypical (lots of sequential isochronous writes, regular long 
sequential reads, and some random reads and rewrites), so YMMV.


Otherwise I'll give bcache a shot. I've avoided it so far because of the
need to reformat and because of rumours that it doesn't work well with
LVM or BTRFS. But it sounds as if that's not the case..
It should work fine with _just_ BTRFS, but don't put any other layers
into the storage system like LVM, dmcrypt, or mdraid; bcache still has
some pretty pathological interactions with the device mapper and md
frameworks.
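
For completeness, a bare-bones bcache-plus-btrfs setup sketch
(placeholder device names; sdb is the backing HDD, sdc the caching SSD):

  make-bcache -B /dev/sdb    # format the backing device
  make-bcache -C /dev/sdc    # format the cache device
  # attach the cache set (UUID printed by make-bcache -C) to the backing device:
  echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
  mkfs.btrfs /dev/bcache0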

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Use fast device only for metadata?

2016-02-09 Thread Nikolaus Rath
On Feb 09 2016, Kai Krakow  wrote:
> You could even format a bcache superblock "just in case",
> and add an SSD later. Without SSD, bcache will just work in passthru
> mode.

Do the LVM concerns still apply in passthrough mode, or only when
there's an actual cache?

Thanks,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Use fast device only for metadata?

2016-02-09 Thread Nikolaus Rath
On Feb 09 2016, Kai Krakow  wrote:
> I'm myself using bcache+btrfs and it ran bullet proof so far, even
> after unintentional resets or power outage. It's important tho to NOT
> put any storage layer between bcache and your devices or between btrfs
> and your device as there are reports it becomes unstable with md or lvm
> involved.

Do you mean I should not use anything in the stack other than btrfs and
bcache, or do you mean I should not put anything under bcache?

In other words, I assume bcache on LVM is a bad idea. But what about LVM
on bcache?

Also, btrfs on LVM on disk is working fine for me, but you seem to be
saying that it should not? Or are you talking specifically about btrfs
on LVM on bcache?


If there's no way to put LVM anywhere into the stack that'd be a bummer,
I very much want to use dm-crypt (and I guess that counts as lvm?).

Thanks,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs-progs: copy functionality of btrfs-debug-tree to inspect-internal subcommand

2016-02-09 Thread Alexander Fougner
The long-term plan is to merge the features of standalone tools
into the btrfs binary, reducing the number of shipped binaries.

Signed-off-by: Alexander Fougner 
---
 Makefile.in  |   2 +-
 btrfs-debug-tree.c   | 424 +---
 cmds-inspect-dump-tree.c | 451 +++
 cmds-inspect-dump-tree.h |  15 ++
 cmds-inspect.c   |   8 +
 5 files changed, 481 insertions(+), 419 deletions(-)
 create mode 100644 cmds-inspect-dump-tree.c
 create mode 100644 cmds-inspect-dump-tree.h

diff --git a/Makefile.in b/Makefile.in
index 19697ff..14dab76 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -70,7 +70,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o 
print-tree.o \
  extent-cache.o extent_io.o volumes.o utils.o repair.o \
  qgroup.o raid6.o free-space-cache.o list_sort.o props.o \
  ulist.o qgroup-verify.o backref.o string-table.o task-utils.o \
- inode.o file.o find-root.o free-space-tree.o help.o
+ inode.o file.o find-root.o free-space-tree.o help.o 
cmds-inspect-dump-tree.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \
   cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \
diff --git a/btrfs-debug-tree.c b/btrfs-debug-tree.c
index 266176f..057a715 100644
--- a/btrfs-debug-tree.c
+++ b/btrfs-debug-tree.c
@@ -30,433 +30,21 @@
 #include "transaction.h"
 #include "volumes.h"
 #include "utils.h"
-
-static int print_usage(int ret)
-{
-   fprintf(stderr, "usage: btrfs-debug-tree [-e] [-d] [-r] [-R] [-u]\n");
-   fprintf(stderr, "[-b block_num ] device\n");
-   fprintf(stderr, "\t-e : print detailed extents info\n");
-   fprintf(stderr, "\t-d : print info of btrfs device and root tree dirs"
-" only\n");
-   fprintf(stderr, "\t-r : print info of roots only\n");
-   fprintf(stderr, "\t-R : print info of roots and root backups\n");
-   fprintf(stderr, "\t-u : print info of uuid tree only\n");
-   fprintf(stderr, "\t-b block_num : print info of the specified block"
-" only\n");
-   fprintf(stderr,
-   "\t-t tree_id : print only the tree with the given id\n");
-   fprintf(stderr, "%s\n", PACKAGE_STRING);
-   exit(ret);
-}
-
-static void print_extents(struct btrfs_root *root, struct extent_buffer *eb)
-{
-   int i;
-   u32 nr;
-   u32 size;
-
-   if (!eb)
-   return;
-
-   if (btrfs_is_leaf(eb)) {
-   btrfs_print_leaf(root, eb);
-   return;
-   }
-
-   size = btrfs_level_size(root, btrfs_header_level(eb) - 1);
-   nr = btrfs_header_nritems(eb);
-   for (i = 0; i < nr; i++) {
-   struct extent_buffer *next = read_tree_block(root,
-btrfs_node_blockptr(eb, i),
-size,
-btrfs_node_ptr_generation(eb, i));
-   if (!extent_buffer_uptodate(next))
-   continue;
-   if (btrfs_is_leaf(next) &&
-   btrfs_header_level(eb) != 1)
-   BUG();
-   if (btrfs_header_level(next) !=
-   btrfs_header_level(eb) - 1)
-   BUG();
-   print_extents(root, next);
-   free_extent_buffer(next);
-   }
-}
-
-static void print_old_roots(struct btrfs_super_block *super)
-{
-   struct btrfs_root_backup *backup;
-   int i;
-
-   for (i = 0; i < BTRFS_NUM_BACKUP_ROOTS; i++) {
-   backup = super->super_roots + i;
-   printf("btrfs root backup slot %d\n", i);
-   printf("\ttree root gen %llu block %llu\n",
-  (unsigned long long)btrfs_backup_tree_root_gen(backup),
-  (unsigned long long)btrfs_backup_tree_root(backup));
-
-   printf("\t\textent root gen %llu block %llu\n",
-  (unsigned long long)btrfs_backup_extent_root_gen(backup),
-  (unsigned long long)btrfs_backup_extent_root(backup));
-
-   printf("\t\tchunk root gen %llu block %llu\n",
-  (unsigned long long)btrfs_backup_chunk_root_gen(backup),
-  (unsigned long long)btrfs_backup_chunk_root(backup));
-
-   printf("\t\tdevice root gen %llu block %llu\n",
-  (unsigned long long)btrfs_backup_dev_root_gen(backup),
-  (unsigned long long)btrfs_backup_dev_root(backup));
-
-   printf("\t\tcsum root gen %llu block %llu\n",
-  (unsigned long long)btrfs_backup_csum_root_gen(backup),
-  (unsigned long long)btrfs_backup_csum_root(backup));
-
-   

[PATCH 2/2] btrfs-progs: update docs for inspect-internal dump-tree

2016-02-09 Thread Alexander Fougner
Signed-off-by: Alexander Fougner 
---
 Documentation/btrfs-debug-tree.asciidoc   |  7 +++
 Documentation/btrfs-inspect-internal.asciidoc | 26 ++
 2 files changed, 33 insertions(+)

diff --git a/Documentation/btrfs-debug-tree.asciidoc 
b/Documentation/btrfs-debug-tree.asciidoc
index 23fc115..6d6d884 100644
--- a/Documentation/btrfs-debug-tree.asciidoc
+++ b/Documentation/btrfs-debug-tree.asciidoc
@@ -25,8 +25,15 @@ Print detailed extents info.
 Print info of btrfs device and root tree dirs only.
 -r::
 Print info of roots only.
+-R::
+Print info of roots and root backups.
+-u::
+Print info of UUID tree only.
-b <block_num>::
 Print info of the specified block only.
+-t <tree_id>::
+Print only the tree with the specified ID.
+
 
 EXIT STATUS
 ---
diff --git a/Documentation/btrfs-inspect-internal.asciidoc 
b/Documentation/btrfs-inspect-internal.asciidoc
index 1c7c361..25e6b8b 100644
--- a/Documentation/btrfs-inspect-internal.asciidoc
+++ b/Documentation/btrfs-inspect-internal.asciidoc
@@ -67,6 +67,32 @@ inode number 2), but such subvolume does not contain any 
files anyway
 +
 resolve the absolute path of a the subvolume id 'subvolid'
 
*dump-tree* [options] <device>::
+(needs root privileges)
++
+Dump the whole tree of the given device.
+This is useful for analyzing filesystem state or inconsistence and has
+a positive educational effect on understanding the internal structure.
<device> is the device file where the filesystem is stored.
++
+`Options`
++
+-e
+Print detailed extents info.
+-d
+Print info of btrfs device and root tree dirs only.
+-r
+Print info of roots only.
+-R
+Print info of roots and root backups.
+-u
+Print info of UUID tree only.
-b <block_num>
+Print info of the specified block only.
-t <tree_id>
+Print only the tree with the specified ID.
+
+
 EXIT STATUS
 ---
 *btrfs inspect-internal* returns a zero exit status if it succeeds. Non zero is
-- 
2.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-09 Thread Marc MERLIN
On Tue, Feb 09, 2016 at 02:48:14PM +0100, Christian Rohmann wrote:
> 
> 
> On 02/01/2016 09:52 PM, Chris Murphy wrote:
> >> Would some sort of stracing or profiling of the process help to narrow
> >> > down where the time is currently spent and why the balancing is only
> >> > running single-threaded?
> > This can't be straced. Someone a lot more knowledgeable than I am
> > might figure out where all the waits are with just a sysrq + t, if it
> > is a hold up in say parity computations. Otherwise perf which is a
> > rabbit hole but perf top is kinda cool to watch. That might give you
> > an idea where most of the cpu cycles are going if you can isolate the
> > workload to just the balance. Otherwise you may end up with noisy
> > data.
> 
> My balance run is now working away since 19th of January:
>  "885 out of about 3492 chunks balanced (996 considered),  75% left"
> 
> So this will take several more WEEKS to finish. Is there really nothing
> anyone here wants me to do or analyze to help finding the root cause of
> this? I mean with this kind of performance there is no way a RAID6 can
> be used in production. Not because the code is not stable or
> functioning, but because regular maintenance like replacing a drive or
> growing an array takes WEEKS in which another maintenance procedure
> could be necessary or, much worse, another drive might have failed.
> 
> What I'm saying is: Such a slow RAID6 balance renders the redundancy
> unusable because drives might fail quicker than the potential rebuild
> (read "balance").

I agree, this is bad.
For what it's worth, one of my own filesystems (target for backups, many
many files) has apparently become slow enough that it half hangs my
system when I'm using it.
I've just unmounted it to make sure my overall system performance comes
back, and I may have to delete and recreate it.

Sadly, this also means that btrfs still seems to get itself in corner
cases that are causing performance issues.
I'm not saying that you did hit this problem, but it is possible.

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 21/23] xfs: aio cow tests

2016-02-09 Thread Darrick J. Wong
On Tue, Feb 09, 2016 at 07:32:15PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:14:01PM -0800, Darrick J. Wong wrote:
> .,,,
> > +
> > +echo "Check for damage"
> > +_dmerror_unmount
> > +_dmerror_cleanup
> > +_repair_scratch_fs >> "$seqres.full" 2>&1
> 
> Are you testing repair here? If so, why doesn't failure matter.
> If not, why do it? Or is _require_scratch_nocheck all that is needed
> here?

Uggghhh so xfs_repair dumps its regular output to stderr, so the "2>&1"
pushes the output to $seqres.full.

The return codes from xfs_repair seem to be:
0: fs is ok now
1: fs is probably broken
2: log needs to be replayed

The return codes from fsck seem to be:
0: no errors found
1: errors fixed
2: errors fixed, reboot required
(etc)

So I guess the way out is to provide a better wrapper to the repair tools
so that _repair_scratch_fs always returns 0 for "fs should be ok now" and
nonzero otherwise:

_repair_scratch_fs()
{
    case $FSTYP in
    xfs)
        _scratch_xfs_repair "$@" 2>&1
        res=$?
        if [ "$res" -eq 2 ]; then
            # exit code 2 == dirty log; try to replay it by mounting
            echo "xfs_repair returns $res; replay log?"
            _scratch_mount
            res=$?
            if [ "$res" -gt 0 ]; then
                # mount failed too, so zap the log as a last resort
                echo "mount returns $res; zap log?"
                _scratch_xfs_repair -L 2>&1
                echo "log zap returns $?"
            else
                umount "$SCRATCH_MNT"
            fi
            _scratch_xfs_repair "$@" 2>&1
            res=$?
        fi
        test $res -ne 0 && >&2 echo "xfs_repair failed, err=$res"
        return $res
        ;;
    *)
        # Let's hope fsck -y suffices...
        fsck -t $FSTYP -y $SCRATCH_DEV 2>&1
        res=$?
        case $res in
        0|1|2)
            # 1 == errors fixed, 2 == fixed but reboot wanted; both count as repaired
            res=0
            ;;
        *)
            >&2 echo "fsck.$FSTYP failed, err=$res"
            ;;
        esac
        return $res
        ;;
    esac
}
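
Callers would then only need to check for non-zero; roughly (a sketch, using
plain echo so that a failure breaks the golden output):

_repair_scratch_fs >> "$seqres.full" 2>&1 || \
    echo "scratch fs could not be repaired, see $seqres.full"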

> > +echo "CoW and unmount"
> > +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * bsz)) 1" "$testdir/file2" 
> > >> "$seqres.full"
> > +"$XFS_IO_PROG" -f -c "pwrite -S 0x63 -b $((blksz * bsz)) 0 $((blksz * 
> > nr))" "$TEST_DIR/moo" >> "$seqres.full"
> 
> offset = block size times block size?
> 
> I think some better names might be needed...

Yes.  Is now "bufnr" and bufsize=$((blksz * bufnr)).

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 


Re: USB memory sticks wear & speed: btrfs vs f2fs?

2016-02-09 Thread Kai Krakow
On Tue, 9 Feb 2016 09:59:12 -0500,
"Austin S. Hemmelgarn" wrote:

> > I haven't found much reference or comparison information online wrt
> > wear leveling - mostly performance benchmarks that don't really
> > address your request. Personally I will likely never bother with
> > f2fs unless I somehow end up working on a project requiring
> > relatively small storage in Flash (as that is what f2fs was
> > designed for).  
> I would tend to agree, but that's largely because BTRFS is more of a 
> known entity for me, and certain features (send/receive in
> particular) are important enough for my usage that I'm willing to
> take the performance hit.  IIRC, F2FS was developed for usage in
> stuff like Android devices and other compact embedded devices, where
> the FTL may not do a good job of wear leveling, so it should work
> equally well on USB flash drives (many of the cheap ones have no
> wear-leveling at all, and even some of the expensive ones have
> sub-par wear-leveling compared to good SSD's).

Actually, I think most of them only do wear-levelling in the storage
area where the FAT is expected - making them pretty useless for
anything other than FAT formatting...

I think the expected use-case for USB flash drives is only adding
files and occasionally deleting them - or just deleting everything /
reformatting. They're not really meant for actually "working" with files.
Most of them perform pretty badly under such usage patterns anyway.
It's actually pretty easy to wear out such a drive within a few days.
I've tried myself with a drive called "ReadyBoost-capable" - yeah, it
took me 2 weeks to wear it out after activating "ReadyBoost" on it, and
it took only a few days to make its performance crawl. It's just slow
now and full of unusable blocks.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [PATCH] fstests: btrfs, test for send with clone operations

2016-02-09 Thread Filipe Manana
On Thu, Feb 4, 2016 at 9:21 PM, Dave Chinner  wrote:
> On Thu, Feb 04, 2016 at 12:11:28AM +, fdman...@kernel.org wrote:
>> From: Filipe Manana 
>>
>> Test that an incremental send operation which issues clone operations
>> works for files that have a full path containing more than one parent
>> directory component.
>>
>> This used to fail before the following patch for the linux kernel:
>>
>>   "[PATCH] Btrfs: send, fix extent buffer tree lock assertion failure"
>>
>> Signed-off-by: Filipe Manana 
>
> Looks ok, I've pulled it in. Something to think about:
>
>> +# Create a bunch of small and empty files, this is just to make sure our
>> +# subvolume's btree gets more than 1 leaf, a condition necessary to trigger 
>> a
>> +# past bug (1000 files is enough even for a leaf/node size of 64K, the 
>> largest
>> +# possible size).
>> +for ((i = 1; i <= 1000; i++)); do
>> + echo -n > $SCRATCH_MNT/a/b/c/z_$i
>> +done
>
> We already do have a generic function for doing this called
> _populate_fs(), it's just not optimised for speed with large numbers
> of files being created.
>
> i.e. The above is simple a single directory tree with a single level
> with 1000 files of size 0:
>
> _populate_fs -d 1 -n 1 -f 1000 -s 0 -r $SCRATCH_MNT/a/b/
>
> Can you look into optimising _populate_fs() to use multiple threads
> (say up to 4 by default) and "echo -n" to create files, and then
> convert all our open coded "create lots of files" loops in tests to
> use it?

Sure, I'll take a look at it when I get some spare time.
Thanks Dave.
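
For what it's worth, a stride-based sketch of how the parallel part could
look (names and interface are illustrative, not the final _populate_fs API):

_populate_create_files()
{
    local dir=$1 count=$2 threads=${3:-4}
    local t i

    for ((t = 0; t < threads; t++)); do
        (
            # each worker creates every $threads-th file
            for ((i = t + 1; i <= count; i += threads)); do
                echo -n > "$dir/file_$i"
            done
        ) &
    done
    wait
}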

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com


Re: Use fast device only for metadata?

2016-02-09 Thread Nikolaus Rath
On Feb 09 2016, Kai Krakow  wrote:
>> If there's no way to put LVM anywhere into the stack that'd be a
>> bummer, I very much want to use dm-crypt (and I guess that counts as
>> lvm?).
>
> Wasn't there plans for integrating per-file encryption into btrfs (like
> there's already for ext4)? I think this could pretty well obsolete your
> plans - except you prefer full-device encryption.

Well, it could obsolete it once the plan turns into an implementation,
but not today :-).

> If you don't put encryption below the bcache caching device, everything
> going to the cache won't be encrypted - so that's probably what you are
> having to do anyways.

No, I could put separate encryption layers between bcache and the
disk - for both the backing and the caching device.

> But I don't know how such a setup recovers from power outage, I'm not
> familiar with dm-crypt at all, how it integrates with maybe initrd
> etc.

Initrd is not a concern. You can put on it whatever is needed to set up
the stack.

As far as power outages are concerned, I think dm-crypt doesn't change
anything - it's an intermediate layer with no caching. Any write gets
passed through synchronously.

> The caching device is treated dirty always. That means, it replays all
> dirty data automatically during device discovery. Backing and caching
> create a unified pair - that's why the superblock is needed. It saves
> you from accidentally using the backing without the cache. So even after
> unclean shutdown, from the user-space view, the pair is always
> consistent. Bcache will only remove persisted data from its log if it
> ensured it was written correctly to the backing. The backing on its
> own, however, is not guaranteed to be consistent at any time - except
> you cleanly stop bcache and disconnect the pair (detach the cache).
>
> When dm-crypt comes in, I'm not sure how this is handled - given that
> the encryption key must be loaded from somewhere... Someone else may
> have a better clue here.

The encryption keys are supplied by userspace when setting up the
device. 
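
E.g. with LUKS it is roughly (device names illustrative):

cryptsetup luksFormat /dev/bcache0       # one-time, sets the passphrase
cryptsetup open /dev/bcache0 crypt0      # passphrase/keyfile supplied here at each boot
mkfs.btrfs /dev/mapper/crypt0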


> So actually there's two questions:
>
> 1. Which order of stacking makes more sense and is more resilient to
> errors?

I think in an ideal world (i.e., no software bugs), inserting dm-crypt
anywhere in the stack will not make a difference at all even when there
is a crash. Thus...

> 2. Which order of stacking is exhibiting bugs?

..indeed becomes the important question. Now if only someone had an
answer :-).


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-09 Thread Chris Murphy
On Tue, Feb 9, 2016 at 6:48 AM, Christian Rohmann
 wrote:
>
>
> On 02/01/2016 09:52 PM, Chris Murphy wrote:
>>> Would some sort of stracing or profiling of the process help to narrow
>>> > down where the time is currently spent and why the balancing is only
>>> > running single-threaded?
>> This can't be straced. Someone a lot more knowledgeable than I am
>> might figure out where all the waits are with just a sysrq + t, if it
>> is a hold up in say parity computations. Otherwise perf which is a
>> rabbit hole but perf top is kinda cool to watch. That might give you
>> an idea where most of the cpu cycles are going if you can isolate the
>> workload to just the balance. Otherwise you may end up with noisy
>> data.
>
> My balance run is now working away since 19th of January:
>  "885 out of about 3492 chunks balanced (996 considered),  75% left"
>
> So this will take several more WEEKS to finish. Is there really nothing
> anyone here wants me to do or analyze to help finding the root cause of
> this?

Can you run 'perf top' and let it run for a few minutes, then
copy/paste or screenshot it somewhere? I'll definitely say in advance
this is just a matter of curiosity where the kernel is spending all of
its time, that this is going so slowly. In no way can I imagine being
able to help fix it. I'm a bit surprised there's no dev response,
maybe try the IRC channel? Weeks is just too long. My concern is if
there's a drive failure, a.) what state is the fs going to be in and
b.) will device replace be this slow too? I'd expect the code path for
balance and replace to be the same, so I suspect yes.
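
If interactive 'perf top' output is hard to capture, a recording gives a
text profile that can be pasted somewhere (durations illustrative):

# perf record -a -g -o balance.data -- sleep 120   ## sample while the balance runs
# perf report --stdio -i balance.data | head -60   ## top consumers, text form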


> I mean with this kind of performance there is no way a RAID6 can
> be used in production. Not because the code is not stable or
> functioning, but because regular maintenance like replacing a drive or
> growing an array takes WEEKS in which another maintenance procedure
> could be necessary or, much worse, another drive might have failed.

That's right.

In my dummy test, which should have run slower than your setup, the
other differences on my end:

elevator=noop    ## because I'm running an SSD
kernel 4.5rc0

I could redo my test, using 'perf top' also and see if there's any
glaring difference in where the kernel is spending its time on a
system pushing the block device to its max write ability, vs ones that
aren't. I don't have any other ideas. I'd rather a developer say, "try
this" to gather more useful information, rather than just poking
things with a random stick.



-- 
Chris Murphy


Re: Use fast device only for metadata?

2016-02-09 Thread Kai Krakow
On Tue, 09 Feb 2016 08:10:15 -0800,
Nikolaus Rath wrote:

> On Feb 09 2016, Kai Krakow  wrote:
> > You could even format a bcache superblock "just in case",
> > and add an SSD later. Without SSD, bcache will just work in passthru
> > mode.
> 
> Do the LVM concerns still apply in passthrough mode, or only when
> there's an actual cache?

I don't think anyone ever tried... But I think there's actually not
much logic involved in passthru mode; still, it would pass through the
bcache layer - where the bugs may be. It may be worth stress-testing
such a setup first, and then doing your backups (which you should do
anyway when using btrfs, so this is more or less a no-op).

There may even be differences if backing is on lvm, or if caching is on
lvm, and the order of layering (bcache+lvm+btrfs, or lvm+bcache+btrfs).
I think you may find some more details with the search machine of your
preference. I remember there were actually some posts detailing exactly
about this - including some mid-term experience with such a setup.

Whatever you find, passthru mode is probably the easiest path in terms
of code complexity, so it may not reproduce bugs others found. You may
want to try to reproduce exactly their situations, but just using
passthru mode, and see if it works.

I suspect the hardware storage stack may also play its role (SSD
firmware, SATA/RAID chipset, trim support on/off, NCQ support, ...)


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Use fast device only for metadata?

2016-02-09 Thread Kai Krakow
On Tue, 09 Feb 2016 08:09:20 -0800,
Nikolaus Rath wrote:

> On Feb 09 2016, Kai Krakow  wrote:
> > I'm myself using bcache+btrfs and it ran bullet proof so far, even
> > after unintentional resets or power outage. It's important tho to
> > NOT put any storage layer between bcache and your devices or
> > between btrfs and your device as there are reports it becomes
> > unstable with md or lvm involved.
> 
> Do you mean I should not use anything in the stack other than btrfs
> and bcache, or do you mean I should not put anything under bcache?

I never tried; I just use rawdevice+bcache+btrfs, nothing stacked
below or in between. This works for me.
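
For reference, that stack is basically just this (device names illustrative,
and note that make-bcache wipes the given devices):

make-bcache -C /dev/sdd1 -B /dev/sdb    # cache + backing in one go, auto-attached
mkfs.btrfs /dev/bcache0
mount /dev/bcache0 /mnt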

> In other words, I assume bcache on LVM is a bad idea. But what about
> LVM on bcache?

I think it makes a difference.

> Also, btrfs on LVM on disk is working fine for me, but you seem to be
> saying that it should not? Or are you talking specifically about btrfs
> on LVM on bcache?

Btrfs alone should be no problem. Any combination of all three could
get you in trouble. I suggest doing your own tests and keeping it as
simple as it can be.

> If there's no way to put LVM anywhere into the stack that'd be a
> bummer, I very much want to use dm-crypt (and I guess that counts as
> lvm?).

Wasn't there plans for integrating per-file encryption into btrfs (like
there's already for ext4)? I think this could pretty well obsolete your
plans - except you prefer full-device encryption.

If you don't put encryption below the bcache caching device, everything
going to the cache won't be encrypted - so that's probably what you are
having to do anyways.

But I don't know how such a setup recovers from power outage, I'm not
familiar with dm-crypt at all, how it integrates with maybe initrd etc.

But to get a bigger picture, let me explain how bcache works:

The caching device is treated dirty always. That means, it replays all
dirty data automatically during device discovery. Backing and caching
create a unified pair - that's why the superblock is needed. It saves
you from accidentally using the backing without the cache. So even after
unclean shutdown, from the user-space view, the pair is always
consistent. Bcache will only remove persisted data from its log if it
ensured it was written correctly to the backing. The backing on its
own, however, is not guaranteed to be consistent at any time - except
you cleanly stop bcache and disconnect the pair (detach the cache).
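
You can watch this from sysfs; roughly (bcache0 illustrative):

cat /sys/block/bcache0/bcache/state        # "clean", "dirty" or "no cache"
cat /sys/block/bcache0/bcache/dirty_data   # data not yet written to the backing
echo writethrough > /sys/block/bcache0/bcache/cache_mode   # stop producing new dirty data
echo 1 > /sys/block/bcache0/bcache/detach  # detach once dirty_data has drained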

When dm-crypt comes in, I'm not sure how this is handled - given that
the encryption key must be loaded from somewhere... Someone else may
have a better clue here.

So actually there's two questions:

1. Which order of stacking makes more sense and is more resilient to
errors?

2. Which order of stacking is exhibiting bugs?


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Use fast device only for metadata?

2016-02-09 Thread Chris Murphy
On Tue, Feb 9, 2016 at 2:43 PM, Kai Krakow  wrote:

> Wasn't there plans for integrating per-file encryption into btrfs (like
> there's already for ext4)? I think this could pretty well obsolete your
> plans - except you prefer full-device encryption.

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Encryption

I don't know whether the ZFS strategy (it would be per subvolume on
Btrfs) or the per directory strategy of ext4 is simpler. The simpler
it is, the more viable it is, I feel.

Maybe it's too much of a tonka toy to only encrypt file data and not
metadata (a question for someone more security-conscious), but I'd
rather have some level of integrated encryption than none. So I
wonder if encryption could be a compression option - that is, it'd fit
into the compression code path and instead of compressing, it'd
encrypt. I guess the bigger problem then is user space tools to manage
keys. For booting, there'd need to be a libbtrfs api or ioctl for
systemd+plymouth to get the passphrase from the user. And for home, it
actually can't be in the startup process at all, it has to be
integrated into the desktop, using the user login passphrase to unlock
a KEK, and from there the DEK. The whole point of per-directory
encryption is that a bunch of stuff remains encrypted.

If it were treated as a variation on compression, specifically a
variant of forced compression, it would mean no key is needed to do
balance, scrub, device replace, etc., and even inline data gets
encrypted too. An open question is whether the metadata slot for
compression is big enough to include something like a key uuid, because
each dir item (at least) needs to point to the key needed to decrypt
the data. Hmm, or maybe a new tree to contain and track the encryption
keys meant for each dir item.

-- 
Chris Murphy


Re: Use fast device only for metadata?

2016-02-09 Thread Henk Slager
On Tue, Feb 9, 2016 at 8:29 AM, Kai Krakow  wrote:
> On Mon, 08 Feb 2016 13:44:17 -0800,
> Nikolaus Rath wrote:
>
>> On Feb 07 2016, Martin Steigerwald  wrote:
>> > On Sunday, 7 February 2016, 21:07:13 CET, Kai Krakow wrote:
>> >> On Sun, 07 Feb 2016 11:06:58 -0800,
>> >>
>> >> Nikolaus Rath wrote:
>> >> > Hello,
>> >> >
>> >> > I have a large home directory on a spinning disk that I regularly
>> >> > synchronize between different computers using unison. That takes
>> >> > ages, even though the amount of changed files is typically
>> >> > small. I suspect most if the time is spend walking through the
>> >> > file system and checking mtimes.
>> >> >
>> >> > So I was wondering if I could possibly speed-up this operation by
>> >> > storing all btrfs metadata on a fast, SSD drive. It seems that
>> >> > mkfs.btrfs allows me to put the metadata in raid1 or dup mode,
>> >> > and the file contents in single mode. However, I could not find
>> >> > a way to tell btrfs to use a device *only* for metadata. Is
>> >> > there a way to do that?
>> >> >
>> >> > Also, what is the difference between using "dup" and "raid1" for
>> >> > the metadata?
>> >>
>> >> You may want to try bcache. It will speedup random access which is
>> >> probably the main cause for your slow sync. Unfortunately it
>> >> requires you to reformat your btrfs partitions to add a bcache
>> >> superblock. But it's worth the efforts.
>> >>
>> >> I use a nightly rsync to USB3 disk, and bcache reduced it from 5+
>> >> hours to typically 1.5-3 depending on how much data changed.
>> >
>> > An alternative is using dm-cache, I think it doesn't need to
>> > recreate the filesystem.
>>
>> Yes, I tried that already but it didn't improve things at all. I
>> wrote a message to the lvm list though, so maybe someone will be able
>> to help.
>>
>> Otherwise I'll give bcache a shot. I've avoided it so far because of
>> the need to reformat and because of rumours that it doesn't work well
>> with LVM or BTRFS. But it sounds as if that's not the case..
>
> I'm myself using bcache+btrfs and it ran bullet proof so far, even
> after unintentional resets or power outage. It's important tho to NOT
> put any storage layer between bcache and your devices or between btrfs
> and your device as there are reports it becomes unstable with md or lvm
> involved. In my setup I can even use discard/trim without problems. I'd
> recommend a current kernel, tho.
>
> Since it requires reformatting, it's a big pita but it's worth the
> efforts. It appeared, from its design, much more effective and stable
> than dmcache. You could even format a bcache superblock "just in case",
> and add an SSD later. Without SSD, bcache will just work in passthru
> mode. Actually, I started to format all my storage with bcache
> superblock "just in case". It is similar to having another partition
> table folded inside - so it doesn't hurt (except you need bcache-probe
> in initrd to detect the contained filesystems).

Same positive bcache+BTRFS experience for me; I have been using it since
kernel 4.1.6 and am now on the latest 4.4. Especially, it is now possible
to use VM images in normal CoW mode with speed/performance comparable to
having the image on SSD. This is with 50G images consisting of about 50k
extents, raid10 btrfs with mount options noatime,nossd,autodefrag and
writeback on. The initial number of extents was on the order of 100 or so,
but later small writes inside the VM almost all end up in the bcache.
A nightly incremental send|receive takes just a few minutes, and a kernel
compile from a local git repo clone almost works like from SSD.

When both the RAM cache is invalidated and bcache is detached / stopped /
not there, filesystem finds or other operations that have to deal with
fragmentation or a lot of seeks clearly take far more time. From
there, after starting and using an OS in a VM for, let's say, 10 minutes
of common tasks, speed is 'SSD like' rather than 'HDD like' and
stays that way (until blocks get evicted, of course).

The 'reformatting' might be avoided by using this:
https://github.com/g2p/blocks

I haven't used it myself, as one fs covered the full harddisk and my
Python installations had some issues. I wanted to keep the same UUID (due
to a long-term incremental send|receive cloning setup), so I shrank
the filesystem to almost its smallest possible size, then used an extra
device (4TB) to dd_rescue the fs image onto, and in a second step
dd_rescue'd it back to the original disk (to a partition that is
bcache'd). A btrfs replace would also have been an option, or some
2-step add/remove action or tricks with raid1.
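
For reference, the replace variant would have been a single command
(device names illustrative):

btrfs replace start -B /dev/sdX /dev/bcache0 /mnt   # -B stays in the foreground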

For another disk I did not have a spare, so I made a script to do
an 'in-place' filesystem image replace. I have browsed the superblocks
(I don't remember the size, but it's a few kB AFAIK), so a 1G copy-block
size is more than enough, and keeping at least 2 copy-blocks of readahead
stored on intermediate storage worked fine. The same can be used for the
LUKS header

Re: [PATCH 18/23] xfs: test the automatic cowextsize extent garbage collector

2016-02-09 Thread Darrick J. Wong
On Tue, Feb 09, 2016 at 07:15:47PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:13:42PM -0800, Darrick J. Wong wrote:
> > Signed-off-by: Darrick J. Wong 
> > +
> > +_cleanup()
> > +{
> > +cd /
> > +echo $old_cow_lifetime > 
> > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> > +#rm -rf "$tmp".* "$testdir"
> 
> uncomment.
> 
> > +echo "CoW and leave leftovers"
> > +echo $old_cow_lifetime > /proc/sys/fs/xfs/speculative_cow_prealloc_lifetime
> > +seq 2 2 $((nr - 1)) | while read f; do
> > +   "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" "$testdir/file2" 
> > >> "$seqres.full"
> > +   "$XFS_IO_PROG" -f -c "pwrite -S 0x63 $((blksz * f)) 1" 
> > "$testdir/file2.chk" >> "$seqres.full"
> > +done
> 
> Ok, I just realised what was bugging me about these loops: "f" is
> not a typical loop iterator for a count. Normally we'd use "i" for
> these
> 
> > +echo "old extents: $old_extents" >> "$seqres.full"
> > +echo "new extents: $new_extents" >> "$seqres.full"
> > +echo "maximum extents: $internal_blks" >> "$seqres.full"
> > +test $new_extents -lt $((internal_blks / 7)) || _fail "file2 badly 
> > fragmented"
> 
> I wouldn't use _fail like this, echo is sufficient to cause the test
> to fail.

Ok, fixed.

--D

> > +echo "Check for damage"
> > +umount "$SCRATCH_MNT"
> > +
> > +# success, all done
> > +status=0
> > +exit
> 
> As would getting rid of the unmount and just setting status
> appropriately...
> 
> /repeat
> 
> -Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 


Re: [PATCH 19/23] xfs: test rmapbt functionality

2016-02-09 Thread Darrick J. Wong
On Tue, Feb 09, 2016 at 07:26:40PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:13:48PM -0800, Darrick J. Wong wrote:
> > Signed-off-by: Darrick J. Wong 
> > ---
> >  common/xfs|   44 ++
> >  tests/xfs/233 |   78 ++
> >  tests/xfs/233.out |6 +++
> >  tests/xfs/234 |   89 
> >  tests/xfs/234.out |6 +++
> >  tests/xfs/235 |  108 
> > +
> >  tests/xfs/235.out |   14 +++
> >  tests/xfs/236 |   93 ++
> >  tests/xfs/236.out |8 
> >  tests/xfs/group   |4 ++
> >  10 files changed, 450 insertions(+)
> >  create mode 100644 common/xfs
> >  create mode 100755 tests/xfs/233
> >  create mode 100644 tests/xfs/233.out
> >  create mode 100755 tests/xfs/234
> >  create mode 100644 tests/xfs/234.out
> >  create mode 100755 tests/xfs/235
> >  create mode 100644 tests/xfs/235.out
> >  create mode 100755 tests/xfs/236
> >  create mode 100644 tests/xfs/236.out
> > 
> > 
> > diff --git a/common/xfs b/common/xfs
> > new file mode 100644
> > index 000..2d1a76f
> > --- /dev/null
> > +++ b/common/xfs
> > @@ -0,0 +1,44 @@
> > +##/bin/bash
> > +# Routines for handling XFS
> > +#---
> > +#  Copyright (c) 2015 Oracle.  All Rights Reserved.
> > +#  This program is free software; you can redistribute it and/or modify
> > +#  it under the terms of the GNU General Public License as published by
> > +#  the Free Software Foundation; either version 2 of the License, or
> > +#  (at your option) any later version.
> > +#
> > +#  This program is distributed in the hope that it will be useful,
> > +#  but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +#  GNU General Public License for more details.
> > +#
> > +#  You should have received a copy of the GNU General Public License
> > +#  along with this program; if not, write to the Free Software
> > +#  Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307
> > +#  USA
> > +#
> > +#  Contact information: Oracle Corporation, 500 Oracle Parkway,
> > +#  Redwood Shores, CA 94065, USA, or: http://www.oracle.com/
> > +#---
> > +
> > +_require_xfs_test_rmapbt()
> > +{
> > +   _require_test
> > +
> > +   if [ "$(xfs_info "$TEST_DIR" | grep -c "rmapbt=1")" -ne 1 ]; then
> > +   _notrun "rmapbt not supported by test filesystem type: $FSTYP"
> > +   fi
> > +}
> > +
> > +_require_xfs_scratch_rmapbt()
> > +{
> > +   _require_scratch
> > +
> > +   _scratch_mkfs > /dev/null
> > +   _scratch_mount
> > +   if [ "$(xfs_info "$SCRATCH_MNT" | grep -c "rmapbt=1")" -ne 1 ]; then
> > +   _scratch_unmount
> > +   _notrun "rmapbt not supported by scratch filesystem type: 
> > $FSTYP"
> > +   fi
> > +   _scratch_unmount
> > +}
> 
> No, not yet. :)
> 
> Wait until I get my "split common/rc" patchset out there, because it
> does not require:

Ok, I moved all the common/xfs stuff back to common/rc.

> 
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +. ./common/xfs
> 
> This.
> 
> And i don't want to have to undo a bunch of stuff in tests yet. Just
> lump it all in common/rc for the moment.
> 
> > +
> > +# real QA test starts here
> > +_supported_os Linux
> > +_supported_fs xfs
> > +_require_xfs_scratch_rmapbt
> > +
> > +echo "Format and mount"
> > +_scratch_mkfs -d size=$((2 * 4096 * 4096)) -l size=4194304 > 
> > "$seqres.full" 2>&1
> > +_scratch_mount >> "$seqres.full" 2>&1
> 
> _scratch_mkfs_sized ?

Done.
> 
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1   # failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +cd /
> > +#rm -f $tmp.*
> 
> More random uncommenting needed.
> 
> > +
> > +echo "Check for damage"
> > +umount "$SCRATCH_MNT"
> > +_check_scratch_fs
> > +
> > +# success, all done
> > +status=0
> > +exit
> 
> Cull.

Done

--D

> 
> -Dave.
> -- 
> Dave Chinner
> da...@fromorbit.com
> 


Re: [PATCH 17/23] reflink: test CoW across a mixed range of block types with cowextsize set

2016-02-09 Thread Darrick J. Wong
On Tue, Feb 09, 2016 at 07:09:23PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:13:35PM -0800, Darrick J. Wong wrote:
> > Signed-off-by: Darrick J. Wong 
> > ---
> >  tests/xfs/215 |  108 ++
> >  tests/xfs/215.out |   14 +
> >  tests/xfs/218 |  108 ++
> >  tests/xfs/218.out |   14 +
> >  tests/xfs/219 |  108 ++
> >  tests/xfs/219.out |   14 +
> >  tests/xfs/221 |  108 ++
> >  tests/xfs/221.out |   14 +
> >  tests/xfs/223 |  113 
> >  tests/xfs/223.out |   14 +
> >  tests/xfs/224 |  113 
> >  tests/xfs/224.out |   14 +
> >  tests/xfs/225 |  108 ++
> >  tests/xfs/225.out |   14 +
> >  tests/xfs/226 |  108 ++
> >  tests/xfs/226.out |   14 +
> >  tests/xfs/228 |  137 
> > +
> >  tests/xfs/228.out |   14 +
> >  tests/xfs/230 |  137 
> > +
> >  tests/xfs/230.out |   14 +
> >  tests/xfs/group   |   10 
> >  21 files changed, 1298 insertions(+)
> >  create mode 100755 tests/xfs/215
> >  create mode 100644 tests/xfs/215.out
> >  create mode 100755 tests/xfs/218
> >  create mode 100644 tests/xfs/218.out
> >  create mode 100755 tests/xfs/219
> >  create mode 100644 tests/xfs/219.out
> >  create mode 100755 tests/xfs/221
> >  create mode 100644 tests/xfs/221.out
> >  create mode 100755 tests/xfs/223
> >  create mode 100644 tests/xfs/223.out
> >  create mode 100755 tests/xfs/224
> >  create mode 100644 tests/xfs/224.out
> >  create mode 100755 tests/xfs/225
> >  create mode 100644 tests/xfs/225.out
> >  create mode 100755 tests/xfs/226
> >  create mode 100644 tests/xfs/226.out
> >  create mode 100755 tests/xfs/228
> >  create mode 100644 tests/xfs/228.out
> >  create mode 100755 tests/xfs/230
> >  create mode 100644 tests/xfs/230.out
> > 
> > 
> > diff --git a/tests/xfs/215 b/tests/xfs/215
> > new file mode 100755
> > index 000..8dd5cb5
> > --- /dev/null
> > +++ b/tests/xfs/215
> > @@ -0,0 +1,108 @@
> > +#! /bin/bash
> > +# FS QA Test No. 215
> > +#
> > +# Ensuring that copy on write in direct-io mode works when the CoW
> > +# range originally covers multiple extents, some unwritten, some not.
> > +#   - Set cowextsize hint.
> > +#   - Create a file and fallocate a second file.
> > +#   - Reflink the odd blocks of the first file into the second file.
> > +#   - directio CoW across the halfway mark, starting with the unwritten 
> > extent.
> > +#   - Check that the files are now different where we say they're 
> > different.
> > +#
> > +#---
> > +# Copyright (c) 2016, Oracle and/or its affiliates.  All Rights Reserved.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#---
> > +
> > +seq=`basename "$0"`
> > +seqres="$RESULT_DIR/$seq"
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +cd /
> > +rm -rf "$tmp".*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +. ./common/reflink
> > +
> > +# real QA test starts here
> > +_supported_os Linux
> > +_require_scratch_reflink
> > +_require_xfs_io_command "falloc"
> > +
> > +rm -f "$seqres.full"
> > +
> > +echo "Format and mount"
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount >> "$seqres.full" 2>&1
> > +
> > +testdir="$SCRATCH_MNT/test-$seq"
> > +rm -rf $testdir
> > +mkdir $testdir
> > +
> > +echo "Create the original files"
> > +blksz=65536
> > +nr=64
> > +real_blksz=$(stat -f -c '%S' "$testdir")
> > +internal_blks=$((blksz * nr / real_blksz))
> > +"$XFS_IO_PROG" -c "cowextsize $((blksz * 16))" "$testdir" >> "$seqres.full"
> > +_pwrite_byte 0x61 0 $((blksz * nr)) "$testdir/file1" >> "$seqres.full"
> > +$XFS_IO_PROG -f -c "falloc 0 

Re: [PATCH 13/23] xfs: test fragmentation characteristics of copy-on-write

2016-02-09 Thread Darrick J. Wong
On Tue, Feb 09, 2016 at 07:01:44PM +1100, Dave Chinner wrote:
> On Mon, Feb 08, 2016 at 05:13:09PM -0800, Darrick J. Wong wrote:
> > Perform copy-on-writes at random offsets to stress the CoW allocation
> > system.  Assess the effectiveness of the extent size hint at
> > combatting fragmentation via unshare, a rewrite, and no-op after the
> > random writes.
> > 
> > Signed-off-by: Darrick J. Wong 
> 
> > +seq=`basename "$0"`
> > +seqres="$RESULT_DIR/$seq"
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +cd /
> > +#rm -rf "$tmp".* "$testdir"
> 
> Now that I've noticed it, a few tests have this line commented out.
> Probably should remove the tmp files, at least.

Done.

> > +rm -f "$seqres.full"
> > +
> > +echo "Format and mount"
> > +_scratch_mkfs > "$seqres.full" 2>&1
> > +_scratch_mount >> "$seqres.full" 2>&1
> > +
> > +testdir="$SCRATCH_MNT/test-$seq"
> > +rm -rf $testdir
> > +mkdir $testdir
> 
> Again, somthing that is repeated - we just mkfs'd the scratch
> device, so the $testdir is guaranteed not to exist...

I've done that to the new tests, will do to the existing ones.

> > +echo "Check for damage"
> > +umount "$SCRATCH_MNT"
> 
> I've also noticed this in a lot of tests - the scratch device will
> be unmounted by the harness, so I don't think this is necessary

Done.

> > +free_blocks=$(stat -f -c '%a' "$testdir")
> > +real_blksz=$(stat -f -c '%S' "$testdir")
> > +space_needed=$(((blksz * nr * 3) * 5 / 4))
> > +space_avail=$((free_blocks * real_blksz))
> > +internal_blks=$((blksz * nr / real_blksz))
> > +test $space_needed -gt $space_avail && _notrun "Not enough space. 
> > $space_avail < $space_needed"
> 
> Why not:
> 
> _require_fs_space $space_needed
> 
> At minimum, it seems to be a repeated hunk of code, so it shoul dbe
> factored.

Ok, done.

> > +testdir="$SCRATCH_MNT/test-$seq"
> > +rm -rf $testdir
> > +mkdir $testdir
> > +
> > +echo "Create the original files"
> > +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 0" "$testdir/file1" >> 
> > "$seqres.full"
> > +"$XFS_IO_PROG" -f -c "pwrite -S 0x61 0 1048576" "$testdir/file2" >> 
> > "$seqres.full"
> > +_scratch_remount
> > +
> > +echo "Set extsz and cowextsz on zero byte file"
> > +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file1" | _filter_scratch
> > +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file1" | 
> > _filter_scratch
> > +
> > +echo "Set extsz and cowextsz on 1Mbyte file"
> > +"$XFS_IO_PROG" -f -c "extsize 1048576" "$testdir/file2" | _filter_scratch
> > +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file2" | 
> > _filter_scratch
> > +_scratch_remount
> > +
> > +fn() {
> > +   "$XFS_IO_PROG" -c "$1" "$2" | sed -e 's/.\([0-9]*\).*$/\1/g'
> > +}
> > +echo "Check extsz and cowextsz settings on zero byte file"
> > +test $(fn extsize "$testdir/file1") -eq 1048576 || echo "file1 extsize not 
> > set"
> > +test $(fn cowextsize "$testdir/file1") -eq 1048576 || echo "file1 
> > cowextsize not set" 
> 
> For this sort of thing, just dump the extent size value to the
> golden output. i.e.
> 
> echo "Check extsz and cowextsz settings on zero byte file"
> $XFS_IO_PROG -c extsize $testdir/file1
> $XFS_IO_PROG -c cowextsize $testdir/file1
> 
> is all that is needed. that way if it fails, we see what value it
> had instead of the expected 1MB. This also makes the test much less
> verbose and easier to read

Done.

> > +
> > +echo "Check extsz and cowextsz settings on 1Mbyte file"
> > +test $(fn extsize "$testdir/file2") -eq 0 || echo "file2 extsize not set"
> > +test $(fn cowextsize "$testdir/file2") -eq 1048576 || echo "file2 
> > cowextsize not set" 
> > +
> > +echo "Set cowextsize and check flag"
> > +"$XFS_IO_PROG" -f -c "cowextsize 1048576" "$testdir/file3" | 
> > _filter_scratch
> > +_scratch_remount
> > +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | 
> > awk '{print $4}' | grep -c 'C') -eq 1 || echo "file3 cowextsz flag not set"
> > +test $(fn cowextsize "$testdir/file3") -eq 1048576 || echo "file3 
> > cowextsize not set"
> > +"$XFS_IO_PROG" -f -c "cowextsize 0" "$testdir/file3" | _filter_scratch
> > +_scratch_remount
> > +test $(fn cowextsize "$testdir/file3") -eq 0 || echo "file3 cowextsize not 
> > set"
> > +test $("$XFS_IO_PROG" -c "stat" "$testdir/file3" | grep 'fsxattr.xflags' | 
> > awk '{print $4}' | grep -c 'C') -eq 0 || echo "file3 cowextsz flag not set"
> 
> Same with all these - just grep the output for the line you want,
> and the golden output matching does everything else. e.g. the flag
> check simply becomes:
> 
> $XFS_IO_PROG -c "stat" $testdir/file3 | grep 'fsxattr.xflags'
> 
> Again, this tells us what the wrong flags are if it fails...

Done.  It'll probably break whenever we add new flags, but that can be fixed.

--D

> 
> There are quite a few bits of these tests where the same thing
> 

Re: Use fast device only for metadata?

2016-02-09 Thread Henk Slager
On Tue, Feb 9, 2016 at 11:38 PM, Nikolaus Rath  wrote:
> On Feb 09 2016, Kai Krakow  wrote:
>>> If there's no way to put LVM anywhere into the stack that'd be a
>>> bummer, I very much want to use dm-crypt (and I guess that counts as
>>> lvm?).
>>
>> Wasn't there plans for integrating per-file encryption into btrfs (like
>> there's already for ext4)? I think this could pretty well obsolete your
>> plans - except you prefer full-device encryption.
>
> Well, it could obsolete it once the plan turns into an implementation,
> but not today :-).
>
>> If you don't put encryption below the bcache caching device, everything
>> going to the cache won't be encrypted - so that's probably what you are
>> having to do anyways.
>
> No, I could put separate encryption layers between bcache and the
> disk - for both the backing and the caching device.
>
>> But I don't know how such a setup recovers from power outage, I'm not
>> familiar with dm-crypt at all, how it integrates with maybe initrd
>> etc.
>
> Initrd is not a concern. You can put on it whatever is needed to set up
> the stack.
>
> As far as power outages are concerned, I think dm-crypt doesn't change
> anything - it's an intermediate layer with no caching. Any write gets
> passed through synchronously.
>
>> The caching device is treated dirty always. That means, it replays all
>> dirty data automatically during device discovery. Backing and caching
>> create a unified pair - that's why the superblock is needed. It saves
>> you from accidentally using the backing without the cache. So even after
>> unclean shutdown, from the user-space view, the pair is always
>> consistent. Bcache will only remove persisted data from its log if it
>> ensured it was written correctly to the backing. The backing on its
>> own, however, is not guaranteed to be consistent at any time - except
>> you cleanly stop bcache and disconnect the pair (detach the cache).
>>
>> When dm-crypt comes in, I'm not sure how this is handled - given that
>> the encryption key must be loaded from somewhere... Someone else may
>> have a better clue here.
>
> The encryption keys are supplied by userspace when setting up the
> device.
>
>
>> So actually there's two questions:
>>
>> 1. Which order of stacking makes more sense and is more resilient to
>> errors?
>
> I think in an ideal world (i.e., no software bugs), inserting dm-crypt
> anywhere in the stack will not make a difference at all even when there
> is a crash. Thus...

To me, dm-crypt between bcache and btrfs made the most sense, and I
can say that works fine. Actually, I have been using the following since
kernel 4.4.0-rc4 came out:
rawdevice + bcache + iscsi + dm-crypt + btrfs

This way, the IP link transports encrypted data (although it is only a
short local ethernet cable + switch). It works fine; scrubs still
complete with 0 errors and the last btrfs check did not report any errors.
(It also works well with AoE; top performance after I set the MTUs to 9000.)

>> 2. Which order of stacking is exhibiting bugs?
>
> ..indeed becomes the important question. Now if only someone had an
> answer :-).


Re: Use fast device only for metadata?

2016-02-09 Thread Nikolaus Rath
On Feb 08 2016, Nikolaus Rath  wrote:
> Otherwise I'll give bcache a shot. I've avoided it so far because of the
> need to reformat and because of rumours that it doesn't work well with
> LVM or BTRFS. But it sounds as if that's not the case..

I now have the following stack:

btrfs on LUKS on LVM on bcache

The VG contains two bcache PVs with backing devices on different
spinning disks, and a shared cache device on SSD. I'm using Kernel 4.3.
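
In case anyone wants to reproduce the stack, it is roughly built like this
(device names and sizes illustrative):

make-bcache -C /dev/sdc1 -B /dev/sda2 /dev/sdb2   # shared cache + two backings
pvcreate /dev/bcache0 /dev/bcache1
vgcreate vg0 /dev/bcache0 /dev/bcache1
lvcreate -L 200G -n root vg0
cryptsetup luksFormat /dev/vg0/root
cryptsetup open /dev/vg0/root root_crypt
mkfs.btrfs /dev/mapper/root_crypt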

I'm super happy with the performance: boot times improved from 1:30
(to X11) and 2:00 (to Firefox) to roughly 0:10 to X11 and 0:30 to
Firefox.


Time will tell if it also keeps my data intact, but I hope btrfs would
at least detect any corruption. 


Best,
-Nikolaus

-- 
GPG encrypted emails preferred. Key id: 0xD113FCAC3C4E599F
Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F

 »Time flies like an arrow, fruit flies like a Banana.«


Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-09 Thread Chris Murphy
# perf stat -e 'btrfs:*' -a sleep 10


## This is single device HDD, balance of a root fs was started before
these 10 seconds of sampling. There are some differences in the
statistics depending on whether there are predominately reads or
writes for the balance, so clearly balance does predominately reads,
then predominately writes. Unsurprising but the three tries I did were
largely in agreement (orders of magnitude wise).
http://fpaste.org/320551/06921614/


# perf record -e block:block_rq_issue -ag
^C   ## after ~30 seconds
# perf report

## Single device HDD, balance of root fs started before perf record.
There's a lot of data, collapsed by default. I expanded a few items at
random just as an example. I suspect the write of the perf.data file
is a non-factor because it was just under 2MiB.
http://fpaste.org/320555/14550698/raw/


# perf top

## Single device HDD, balance of root fs started before issuing this
command, and let it run for about 20 seconds. This is actually not as
interesting as I thought it might be, but I don't really know what I'm
looking for. I'd need something else to compare it to.
http://fpaste.org/320559/55070873/
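
One way to get a comparison would be perf diff between a recording taken
during a healthy balance and one taken during the slow raid6 balance
(a sketch; durations illustrative):

# perf record -a -g -o fast.data -- sleep 30   ## while the fast balance runs
# perf record -a -g -o slow.data -- sleep 30   ## while the slow raid6 balance runs
# perf diff fast.data slow.data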


Anyway, all of these are single device, so it's not apples/apples
comparison, but it is a working (full speed for the block device)
balance.


Chris Murphy


Re: btrfs-progs 4.4 re-balance of RAID6 is very slow / limited to one cpu core?

2016-02-09 Thread Chris Murphy
This could also be interesting. It means cancelling the balance in
progress, waiting some time, and then cancelling it again to get
the results to return.

# perf stat -B btrfs balance start /

## Again, single device example, balancing at expected performance.
http://fpaste.org/320562/55071438/

I didn't try this, but it looks like it'd be a variation on the above,
attaching to a running balance:

# perf stat -B -p <pid-of-balance> sleep 60

Anyway...

Chris Murphy


Re: How to show current profile?

2016-02-09 Thread Hugo Mills
On Tue, Feb 09, 2016 at 11:36:49PM -0800, Ian Kelling wrote:
> I searched the man pages, can't seem to find it. 
> btrfs-balance can change profiles, but not show
> the current profile... seems odd.

   btrfs fi df /mountpoint
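
   (With reasonably recent btrfs-progs, btrfs fi usage /mountpoint also
   shows the profiles, along with per-device allocation.)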

   Hugo.

-- 
Hugo Mills | Gentlemen! You can't fight here! This is the War
hugo@... carfax.org.uk | Room!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Dr Strangelove




RAID5 Unable to remove Failing HD

2016-02-09 Thread Rene Castberg
Hi,

This morning i woke up to a failing disk:

[230743.953079] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45648, flush
503, corrupt 0, gen 0
[230743.953970] BTRFS: bdev /dev/sdc errs: wr 1573, rd 45649, flush
503, corrupt 0, gen 0
[230744.106443] BTRFS: lost page write due to I/O error on /dev/sdc
[230744.180412] BTRFS: lost page write due to I/O error on /dev/sdc
[230760.116173] btrfs_dev_stat_print_on_error: 5 callbacks suppressed
[230760.116176] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45651, flush
503, corrupt 0, gen 0
[230760.726244] BTRFS: bdev /dev/sdc errs: wr 1577, rd 45652, flush
503, corrupt 0, gen 0
[230761.392939] btrfs_end_buffer_write_sync: 2 callbacks suppressed
[230761.392947] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.392953] BTRFS: bdev /dev/sdc errs: wr 1578, rd 45652, flush
503, corrupt 0, gen 0
[230761.393813] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.393818] BTRFS: bdev /dev/sdc errs: wr 1579, rd 45652, flush
503, corrupt 0, gen 0
[230761.394843] BTRFS: lost page write due to I/O error on /dev/sdc
[230761.394849] BTRFS: bdev /dev/sdc errs: wr 1580, rd 45652, flush
503, corrupt 0, gen 0
[230802.000425] nfsd: last server has exited, flushing export cache
[230898.791862] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.791873] BTRFS: bdev /dev/sdc errs: wr 1581, rd 45652, flush
503, corrupt 0, gen 0
[230898.792746] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.792752] BTRFS: bdev /dev/sdc errs: wr 1582, rd 45652, flush
503, corrupt 0, gen 0
[230898.793723] BTRFS: lost page write due to I/O error on /dev/sdc
[230898.793728] BTRFS: bdev /dev/sdc errs: wr 1583, rd 45652, flush
503, corrupt 0, gen 0
[230898.830893] BTRFS info (device sdd): allowing degraded mounts
[230898.830902] BTRFS info (device sdd): disk space caching is enabled

Eventually I remounted it as degraded, hopefully to prevent any loss of data.

It seems that the btrfs filesystem still hasn't noticed that the disk
has failed:
$btrfs fi show
Label: 'RenesData'  uuid: ee80dae2-7c86-43ea-a253-c8f04589b496
Total devices 5 FS bytes used 5.38TiB
devid1 size 2.73TiB used 1.84TiB path /dev/sdb
devid2 size 2.73TiB used 1.84TiB path /dev/sde
devid3 size 3.64TiB used 1.84TiB path /dev/sdf
devid4 size 2.73TiB used 1.84TiB path /dev/sdd
devid5 size 3.64TiB used 1.84TiB path /dev/sdc

I tried deleting the device:
# btrfs device delete /dev/sdc /mnt2/RenesData/
ERROR: error removing device '/dev/sdc': Invalid argument

I have been unlucky and already had a failure last Friday, where a
RAID5 array failed after a disk failure.  I rebooted, and the data was
unrecoverable. Fortunately this was only temp data so the failure
wasn't a real issue.

Can somebody give me some advice on how to delete the failing disk? I
plan on replacing the disk, but unfortunately the system doesn't have
hotplug, so I will need to shut down to replace the disk without
losing any of the data stored on these devices.

Regards

Rene Castberg

# uname -a
Linux midgard 4.3.3-1.el7.elrepo.x86_64 #1 SMP Tue Dec 15 11:18:19 EST
2015 x86_64 x86_64 x86_64 GNU/Linux
[root@midgard ~]# btrfs --version
btrfs-progs v4.3.1
[root@midgard ~]# btrfs fi df  /mnt2/RenesData/
Data, RAID6: total=5.52TiB, used=5.37TiB
System, RAID6: total=96.00MiB, used=480.00KiB
Metadata, RAID6: total=17.53GiB, used=11.86GiB
GlobalReserve, single: total=512.00MiB, used=0.00B


# btrfs device stats /mnt2/RenesData/
[/dev/sdb].write_io_errs   0
[/dev/sdb].read_io_errs0
[/dev/sdb].flush_io_errs   0
[/dev/sdb].corruption_errs 0
[/dev/sdb].generation_errs 0
[/dev/sde].write_io_errs   0
[/dev/sde].read_io_errs0
[/dev/sde].flush_io_errs   0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdf].write_io_errs   0
[/dev/sdf].read_io_errs0
[/dev/sdf].flush_io_errs   0
[/dev/sdf].corruption_errs 0
[/dev/sdf].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[/dev/sdc].write_io_errs   1583
[/dev/sdc].read_io_errs45652
[/dev/sdc].flush_io_errs   503
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0


How to show current profile?

2016-02-09 Thread Ian Kelling
I searched the man pages, can't seem to find it. 
btrfs-balance can change profiles, but not show
the current profile... seems odd.