Re: [PATCH v3] xfstests: btrfs: add test case for qgroup account on shared extents
On Thu, Dec 18, 2014 at 11:05:30AM +1100, Dave Chinner wrote:
> On Wed, Dec 17, 2014 at 04:30:47PM +0800, Liu Bo wrote:
> > This is a regression test of 'commit fcebe4562dec (Btrfs: rework qgroup
> > accounting)'
> > It can produce qgroup related warnings.
> >
> > The fix is https://patchwork.kernel.org/patch/5499981/
> > Btrfs: fix a warning of qgroup account on shared extents
>
> > +#! /bin/bash
> > +# FS QA Test No. 017
> > +#
> > +# Regression of 'commit fcebe4562dec (Btrfs: rework qgroup accounting)',
> > +# this will throw a warning into dmesg.
> > +#
> > +# For more details, the fix is https://patchwork.kernel.org/patch/5499981/
> > +# Btrfs: fix a warning of qgroup account on shared extents
>
> Please describe the test directly.
>
> > +
> > +_need_to_be_root
> > +_supported_fs btrfs
> > +_supported_os Linux
> > +_require_scratch
> > +_require_cloner
> > +
> > +run_check _scratch_mkfs --nodesize 4096
> > +run_check _scratch_mount
>
> No, please don't use run_check like this. Errors will end up in the
> output file, and that will cause the test to fail.
>
> > +run_check $XFS_IO_PROG -f -d -c "pwrite 0 8K" $SCRATCH_MNT/foo
>
> Same - filter the output, and errors will be verbose and cause a
> failure.
>
> > +
> > +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
> > +
> > +run_check $CLONER_PROG -s 0 -d 0 -l 8192 $SCRATCH_MNT/foo $SCRATCH_MNT/foo-reflink
> > +run_check $CLONER_PROG -s 0 -d 0 -l 8192 $SCRATCH_MNT/foo $SCRATCH_MNT/snap/foo-reflink
> > +run_check $CLONER_PROG -s 0 -d 0 -l 8192 $SCRATCH_MNT/foo $SCRATCH_MNT/snap/foo-reflink2
>
> Filter the output, not run_check. If CLONER_PROG is silent when it
> fails, then it is broken and needs fixing because users need to know
> that something failed and they don't check exit codes.
>
> > +_run_btrfs_util_prog quota enable $SCRATCH_MNT
> > +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
> > +
> > +rm -fr $SCRATCH_MNT/* >/dev/null 2>&1
>
> Don't redirect the output. If an unlink fails, we want to know about
> it.
>
> > +_run_btrfs_util_prog filesystem sync $SCRATCH_MNT
>
> What's wrong with sync?
>
> > +$BTRFS_UTIL_PROG qgroup show $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' | $AWK_PROG '{print $2" "$3}'
>
> You can do regex matches with awk.

Thanks for reviewing this.

Thanks,

-liubo

> Cheers,
>
> Dave.
> --
> Dave Chinner
> da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4] xfstests: btrfs: add test case for qgroup account on shared extents
This is a regression test of 'commit fcebe4562dec (Btrfs: rework qgroup
accounting)'. It is used to verify that removing shared extents can end up
with incorrect qgroup accounting. It can produce qgroup related warnings.

The fix is https://patchwork.kernel.org/patch/5499981/
Btrfs: fix a warning of qgroup account on shared extents

Signed-off-by: Liu Bo <bo.li@oracle.com>
Tested-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>
Reviewed-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>
Reviewed-by: Eryu Guan <eg...@redhat.com>
---
v4:
- remove improper run_check macro and add filter macro for xfs_io
- use awk's regexp directly
- add test case description

v3:
- remove trailing whitespace.
- add the fix link for more details of the problem.

v2:
- use new seq number 017 instead 080
- use 'cloner' to get shared extents
- use XFS_IO_PROG instead

 tests/btrfs/017     | 82 +
 tests/btrfs/017.out |  5
 tests/btrfs/group   |  1 +
 3 files changed, 88 insertions(+)
 create mode 100755 tests/btrfs/017
 create mode 100644 tests/btrfs/017.out

diff --git a/tests/btrfs/017 b/tests/btrfs/017
new file mode 100755
index 000..7937607
--- /dev/null
+++ b/tests/btrfs/017
@@ -0,0 +1,82 @@
+#! /bin/bash
+# FS QA Test No. 017
+#
+# Verify that removing shared extents can end up with incorrect qgroup accounting.
+#
+# Regression of 'commit fcebe4562dec (Btrfs: rework qgroup accounting)',
+# this will throw a warning into dmesg.
+#
+# The issue is fixed by https://patchwork.kernel.org/patch/5499981/
+# Btrfs: fix a warning of qgroup account on shared extents
+#
+#---
+# Copyright (c) 2014 Liu Bo. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_cloner
+
+rm -f $seqres.full
+
+_scratch_mkfs --nodesize 4096
+_scratch_mount
+
+$XFS_IO_PROG -f -d -c "pwrite 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
+
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT $SCRATCH_MNT/snap
+
+$CLONER_PROG -s 0 -d 0 -l 8192 $SCRATCH_MNT/foo $SCRATCH_MNT/foo-reflink
+$CLONER_PROG -s 0 -d 0 -l 8192 $SCRATCH_MNT/foo $SCRATCH_MNT/snap/foo-reflink
+$CLONER_PROG -s 0 -d 0 -l 8192 $SCRATCH_MNT/foo $SCRATCH_MNT/snap/foo-reflink2
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+rm -fr $SCRATCH_MNT/foo*
+rm -fr $SCRATCH_MNT/snap/foo*
+
+sync
+
+$BTRFS_UTIL_PROG qgroup show $SCRATCH_MNT | $AWK_PROG '/[0-9]/ {print $2" "$3}'
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/017.out b/tests/btrfs/017.out
new file mode 100644
index 000..7658e2e
--- /dev/null
+++ b/tests/btrfs/017.out
@@ -0,0 +1,5 @@
+QA output created by 017
+wrote 8192/8192 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+4096 4096
+4096 4096
diff --git a/tests/btrfs/group b/tests/btrfs/group
index abb2fe4..e2b 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -19,6 +19,7 @@
 014 auto balance
 015 auto quick snapshot
 016 auto quick send
+017 auto quick qgroup
 018 auto quick subvol
 019 auto quick send
 020 auto quick replace
-- 
1.8.2.1
Re: Oddly slow read performance with near-full largish FS
Hi,

Sorry for the late reply. Let me ask some questions.

On 2014/12/17 11:42, Charles Cazabon wrote:
> Hi,
>
> I've been running btrfs for various filesystems for a few years now, and
> have recently run into problems with a large filesystem becoming *really*
> slow for basic reading. None of the debugging/testing suggestions I've come
> across in the wiki or in the mailing list archives seems to have helped.
>
> Background: this particular filesystem holds backups for various other
> machines on the network, a mix of rdiff-backup data (so lots of small files)
> and rsync copies of larger files (everything from ~5MB data files to ~60GB VM
> HD images). There's roughly 16TB of data in this filesystem (the filesystem
> is ~17TB). The btrfs filesystem is a simple single volume, no snapshots,
> multiple devices, or anything like that. It's an LVM logical volume on top of
> dmcrypt on top of an mdadm RAID set (8 disks in RAID 6).

Q1. Do you mean your Btrfs file system sits on top of the following deep
stack of layers?

+---------------+
|Btrfs(single)  |
+---------------+
|LVM(non RAID?) |
+---------------+
|dmcrypt        |
+---------------+
|mdadm RAID set |
+---------------+

# Unfortunately, I don't know how Btrfs works in conjunction
# with such deep layers.

Q2. If Q1 is true, is it possible to reduce those layers as follows?

+---------------+
|Btrfs(*1)      |
+---------------+
|dmcrypt        |
+---------------+

I ask because there are many layers with the same or similar features, and
a heavily layered storage stack tends to cause more trouble than a thinner
one, regardless of file system type.

*1) Currently I don't recommend using Btrfs RAID56.
    So, if RAID6 is mandatory, the mdadm RAID6 layer is also necessary.

> The performance: trying to copy the data off this filesystem to another
> (non-btrfs) filesystem with rsync or just cp was taking ages - I found one
> suggestion that it could be because updating the atimes required a COW of the
> metadata in btrfs, so I mounted the filesystem noatime, but this doesn't
> appear to have made any difference.
>
> The speeds I'm seeing (with iotop) fluctuate a lot. They spend most of the
> time in the range of 1-3 MB/s, with large periods of time where no IO seems to
> happen at all, and occasional short spikes to ~25-30 MB/s. System load seems
> to sit around 10-12 (with only 2 processes reported as running, everything
> else sleeping) while this happens. The server is doing nothing other than
> this copy at the time. The only processes using any noticeable CPU are rsync
> (source and destination processes, around 3% CPU each, plus an md0:raid6
> process around 2-3%), and a handful of kworker processes, perhaps one per CPU
> (there are 8 physical cores in the server, plus hyperthreading).
>
> Other filesystems on the same physical disks have no trouble exceeding 100MB/s
> reads. The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used).

Q3. Do they also consist of the following layers?

+---------------+
|XFS/ext4       |
+---------------+
|LVM(non RAID?) |
+---------------+
|dmcrypt        |
+---------------+
|mdadm RAID set |
+---------------+

Q4. Are the other filesystems also nearly full?

Q5. Are there any error/warning messages about
    Btrfs/LVM/dmcrypt/mdadm/hardware?

Thanks,
Satoru

> Is there something obvious I'm missing here? Is there a reason I can only
> average ~3MB/s reads from a btrfs filesystem?
>
> kernel is x86_64 linux-stable 3.17.6. btrfs-progs is v3.17.3-3-g8cb0438.
> Output of the various info commands is:
>
> $ sudo btrfs fi df /media/backup/
> Data, single: total=16.24TiB, used=15.73TiB
> System, DUP: total=8.00MiB, used=1.75MiB
> System, single: total=4.00MiB, used=0.00
> Metadata, DUP: total=35.50GiB, used=34.05GiB
> Metadata, single: total=8.00MiB, used=0.00
> unknown, single: total=512.00MiB, used=0.00
>
> $ btrfs --version
> Btrfs v3.17.3-3-g8cb0438
>
> $ sudo btrfs fi show
> Label: 'backup' uuid: c18dfd04-d931-4269-b999-e94df3b1918c
>         Total devices 1 FS bytes used 15.76TiB
>         devid    1 size 16.37TiB used 16.31TiB path /dev/mapper/vg-backup
>
> Thanks in advance for any suggestions.
>
> Charles
Re: [PATCH v4] xfstests: btrfs: add test case for qgroup account on shared extents
Hi Liu,

On 2014/12/19 17:31, Liu Bo wrote:
> This is a regression test of 'commit fcebe4562dec (Btrfs: rework qgroup
> accounting)', it's used to verify that removing shared extents can end up
> incorrect qgroup accounting. It can produce qgroup related warnings.
>
> The fix is https://patchwork.kernel.org/patch/5499981/
> Btrfs: fix a warning of qgroup account on shared extents
>
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> Tested-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>

V4 also passed my test,

Thanks,
Satoru

> Reviewed-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>
> Reviewed-by: Eryu Guan <eg...@redhat.com>
> ---
[...]
Re: [PATCH 0/5] Cleanup warnings from clang
Hi,

On 2014/12/19 15:13, Qu Wenruo wrote:
> Cleanup warning when compile btrfs-progs with clang.
>
> Clang analyser also reports about 60+ errors, but it will be another
> patchset fixing it later.
>
> Qu Wenruo (5):
>   btrfs-progs: Makefile: Move linker only option to LDFLAGS
>   btrfs-progs: Fix a clang dead-judgement warning in disk-io.c.
>   btrfs-progs: Remove a unused function root_gtp_mask().
>   btrfs-progs: Remove a unused function offset_to_bitmap()
>   btrfs-progs: Remove deprecated _BSD_SOURCE macro.

All these patches look good to me.

Reviewed-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>

Thanks,
Satoru

>  Makefile           |  3 ++-
>  cmds-receive.c     |  2 +-
>  disk-io.c          |  9 ++---
>  free-space-cache.c | 16
>  radix-tree.c       |  5 -
>  5 files changed, 9 insertions(+), 26 deletions(-)
Re: [PATCH v4] xfstests: btrfs: add test case for qgroup account on shared extents
Hi Satoru san,

On Fri, Dec 19, 2014 at 06:21:30PM +0900, Satoru Takeuchi wrote:
> Hi Liu,
>
> On 2014/12/19 17:31, Liu Bo wrote:
> > This is a regression test of 'commit fcebe4562dec (Btrfs: rework qgroup
> > accounting)', it's used to verify that removing shared extents can end up
> > incorrect qgroup accounting. It can produce qgroup related warnings.
> >
> > The fix is https://patchwork.kernel.org/patch/5499981/
> > Btrfs: fix a warning of qgroup account on shared extents
> >
> > Signed-off-by: Liu Bo <bo.li@oracle.com>
> > Tested-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>
>
> V4 also passed my test,

Thanks for your active testing!

Thanks,

-liubo

> Thanks,
> Satoru
>
> > Reviewed-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>
> > Reviewed-by: Eryu Guan <eg...@redhat.com>
> > ---
[...]
Re: [PATCH 3/6] btrfs-progs: fi usage, update manpage
Hi David,

On 2014/12/18 23:27, David Sterba wrote:
> Signed-off-by: David Sterba <dste...@suse.cz>
> ---
>  Documentation/btrfs-filesystem.txt | 28
>  1 file changed, 28 insertions(+)
>
> diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
> index a8f2972a0e1a..85a94eb52569 100644
> --- a/Documentation/btrfs-filesystem.txt
> +++ b/Documentation/btrfs-filesystem.txt
> @@ -123,6 +123,34 @@ Show or update the label of a filesystem.
>  If a newlabel optional argument is passed, the label is changed.
>  NOTE: the maximum allowable length shall be less than 256 chars
>
> +*usage* [options] path [path...]::
> +Show detailed information about internal filesystem usage.

The options from -b to -t are exactly the same as those of btrfs fi df.
So how about pointing to df's options as follows?

===
...
+
`Options`
+
-T
Show data in tabular format
+
There are some options to set the unit. See the description of *df*'s
options from '-b' to '-t'.
+
If conflicting options are passed, the last one takes precedence.
...
===

I think this can prevent mistakes caused by future changes to either copy.

This patch already seems to be in devel/integration-20141218.
Here is an example patch.

---
From: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>
Subject: [PATCH] btrfs-progs: Cleanup: Fix the redundancy of btrfs-filesystem.

Signed-off-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>
---
 Documentation/btrfs-filesystem.txt | 21 +++--
 1 file changed, 3 insertions(+), 18 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
index 85a94eb..a8e7431 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -128,27 +128,12 @@ Show detailed information about internal filesystem usage.
 +
 `Options`
 +
--b|--raw
-raw numbers in bytes, without the 'B' suffix
--h
-print human friendly numbers, base 1024, this is the default
--H
-print human friendly numbers, base 1000
---iec
-select the 1024 base for the following options, according to the IEC standard
---si
-select the 1000 base for the following options, according to the SI standard
--k|--kbytes
-show sizes in KiB, or kB with --si
--m|--mbytes
-show sizes in MiB, or MB with --si
--g|--gbytes
-show sizes in GiB, or GB with --si
--t|--tbytes
-show sizes in TiB, or TB with --si
 -T
 show data in tabular format
+There are some options to set unit. See the description of *df* subcommand
+from '-b' option to '-t' option.
+
 If conflicting options are passed, the last one takes precedence.

 EXIT STATUS
-- 
1.8.3.1
[PATCH] Btrfs: generic checksum framework
This changes the original crc32c specific checksum functions into more
generic ones, so that converting to a new checksum algorithm can be
transparent to btrfs internal code.

Note that file names' lookup and extent_data_ref's hashing use crc32c with
their own seed instead of the default ~0, so they remain unchanged.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/check-integrity.c  |  10 +++--
 fs/btrfs/compression.c      |  30 ++---
 fs/btrfs/ctree.h            |   5 ++-
 fs/btrfs/disk-io.c          | 100 +++-
 fs/btrfs/disk-io.h          |   2 -
 fs/btrfs/file-item.c        |  25 +--
 fs/btrfs/free-space-cache.c |   8 ++--
 fs/btrfs/hash.c             |  49 ++
 fs/btrfs/hash.h             |   9 +++-
 fs/btrfs/inode.c            |  21 ++
 fs/btrfs/ioctl.c            |   1 +
 fs/btrfs/ordered-data.c     |  10 +++--
 fs/btrfs/ordered-data.h     |   9 ++--
 fs/btrfs/scrub.c            |  55 +---
 14 files changed, 216 insertions(+), 118 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index d897ef8..4c26cfa 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1790,8 +1790,8 @@ static int btrfsic_test_for_metadata(struct btrfsic_state *state,
 {
 	struct btrfs_header *h;
 	u8 csum[BTRFS_CSUM_SIZE];
-	u32 crc = ~(u32)0;
 	unsigned int i;
+	SHASH_DESC_ON_STACK(shash, state->root->fs_info->csum_tfm);
 
 	if (num_pages * PAGE_CACHE_SIZE < state->metablock_size)
 		return 1; /* not metadata */
@@ -1801,14 +1801,18 @@ static int btrfsic_test_for_metadata(struct btrfsic_state *state,
 	if (memcmp(h->fsid, state->root->fs_info->fsid, BTRFS_UUID_SIZE))
 		return 1;
 
+	shash->tfm = state->root->fs_info->csum_tfm;
+	shash->flags = 0;
+	crypto_shash_init(shash);
+
 	for (i = 0; i < num_pages; i++) {
 		u8 *data = i ? datav[i] : (datav[i] + BTRFS_CSUM_SIZE);
 		size_t sublen = i ? PAGE_CACHE_SIZE :
 				    (PAGE_CACHE_SIZE - BTRFS_CSUM_SIZE);
 
-		crc = btrfs_crc32c(crc, data, sublen);
+		crypto_shash_update(shash, data, sublen);
 	}
-	btrfs_csum_final(crc, csum);
+	crypto_shash_final(shash, csum);
 	if (memcmp(csum, h->csum, state->csum_size))
 		return 1;
 
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index e9df886..42c2435 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -78,7 +78,7 @@ struct compressed_bio {
 	 * the start of a variable length array of checksums only
 	 * used by reads
 	 */
-	u32 sums;
+	u8 sums[];
 };
 
 static int btrfs_decompress_biovec(int type, struct page **pages_in,
@@ -111,31 +111,29 @@ static int check_compressed_csum(struct inode *inode,
 	struct page *page;
 	unsigned long i;
 	char *kaddr;
-	u32 csum;
-	u32 *cb_sum = &cb->sums;
+	u8 csum[BTRFS_CSUM_SIZE];
+	u8 *cb_sum = cb->sums;
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
+	u16 csum_size = btrfs_super_csum_size(fs_info->super_copy);
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		return 0;
 
 	for (i = 0; i < cb->nr_pages; i++) {
 		page = cb->compressed_pages[i];
-		csum = ~(u32)0;
 
 		kaddr = kmap_atomic(page);
-		csum = btrfs_csum_data(kaddr, csum, PAGE_CACHE_SIZE);
-		btrfs_csum_final(csum, (char *)&csum);
+		btrfs_csum(fs_info, kaddr, PAGE_CACHE_SIZE, csum);
 		kunmap_atomic(kaddr);
 
-		if (csum != *cb_sum) {
+		if (memcmp(csum, cb_sum, csum_size)) {
 			btrfs_info(BTRFS_I(inode)->root->fs_info,
-			   "csum failed ino %llu extent %llu csum %u wanted %u mirror %d",
-			   btrfs_ino(inode), disk_start, csum, *cb_sum,
-			   cb->mirror_num);
+			   "csum failed ino %llu extent %llu mirror %d",
+			   btrfs_ino(inode), disk_start, cb->mirror_num);
 			ret = -EIO;
 			goto fail;
 		}
-		cb_sum++;
-
+		cb_sum += csum_size;
 	}
 	ret = 0;
 fail:
@@ -584,7 +582,8 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	struct extent_map *em;
 	int ret = -ENOMEM;
 	int faili = 0;
-	u32 *sums;
+	u8 *sums;
+	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
 
 	tree = &BTRFS_I(inode)->io_tree;
 	em_tree = &BTRFS_I(inode)->extent_tree;
@@ -607,7 +606,7 @@ int btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 	cb->errors = 0;
 	cb->inode = inode;
 	cb->mirror_num = mirror_num;
-	sums =
Re: [PATCH 3/6] btrfs-progs: fi usage, update manpage
On Fri, Dec 19, 2014 at 06:56:43PM +0900, Satoru Takeuchi wrote:
> +There are some options to set unit. See the description of *df* subcommand
> +from '-b' option to '-t' option.

The unit options exist only for very few subcommands, so I found it more
convenient to list all of them next to the command itself rather than
pointing somewhere else.
Re: btrfs-progs: integration-20141218 possible corruption test regression
On Fri, Dec 19, 2014 at 02:23:12PM +0800, Qu Wenruo wrote:
> In fact, it's not a regression.
>
> The 013 testcase is a special case that uses a script to corrupt the image
> and then do the btrfsck test.
> There is a patch before the commit, to allow btrfs-progs test script call
> corruption script.
>
> But since there is still some discussion about the corruption script and
> maybe later verify script, the previous patch is not picked.

Yes, I'm waiting for the final version of the patch, but wanted to add all
the test images that were available.
Re: [PATCH 0/5] Cleanup warnings from clang
On Fri, Dec 19, 2014 at 06:27:36PM +0900, Satoru Takeuchi wrote:
> > Qu Wenruo (5):
> >   btrfs-progs: Makefile: Move linker only option to LDFLAGS
> >   btrfs-progs: Fix a clang dead-judgement warning in disk-io.c.
> >   btrfs-progs: Remove a unused function root_gtp_mask().
> >   btrfs-progs: Remove a unused function offset_to_bitmap()
> >   btrfs-progs: Remove deprecated _BSD_SOURCE macro.
>
> All these patches looks good to me.
>
> Reviewed-by: Satoru Takeuchi <takeuchi_sat...@jp.fujitsu.com>

All added.
[GIT PULL] Btrfs pull part two
Hi Linus,

Please pull my for-linus branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus

It has part two of our merge window patches. These are all from Filipe,
and fix some really hard to find races that can cause corruptions. Most
of them involved block group removal (balance) or discard.

Filipe Manana (4) commits (+20/-25):
    Btrfs: fix fs corruption on transaction abort if device supports discard (+6/-10)
    Btrfs: always clear a block group node when removing it from the tree (+3/-0)
    Btrfs: remove non-sense btrfs_error_discard_extent() function (+6/-12)
    Btrfs: ensure deletion from pinned_chunks list is protected (+5/-3)

Total: (4) commits (+20/-25)

 fs/btrfs/ctree.h            |  4 ++--
 fs/btrfs/disk-io.c          |  6 --
 fs/btrfs/extent-tree.c      | 23 +++
 fs/btrfs/free-space-cache.c | 12 +++-
 4 files changed, 20 insertions(+), 25 deletions(-)
[PATCH 0/6] Btrfs progs, coverity fixes for 3.18
A few straightforward fixes.

David Sterba (6):
  btrfs-progs: corrupt block, add missing break to option I
  btrfs-progs: corrupt block, add break after option U
  btrfs-progs: fragments, close output file on error
  btrfs-progs: check result of first_cache_extent
  btrfs-progs: check allocation result in add_clone_source
  btrfs-progs: let btrfs_free_path accept NULL

 btrfs-corrupt-block.c |  2 ++
 btrfs-fragments.c     |  7 +--
 cmds-check.c          |  2 ++
 cmds-send.c           | 25 +
 ctree.c               |  2 ++
 5 files changed, 32 insertions(+), 6 deletions(-)

-- 
2.1.3
[PATCH 1/6] btrfs-progs: corrupt block, add missing break to option I
Because of the missing break, using -I would fall through and imply -d.

Resolves-Coverity-CID: 1258792
Signed-off-by: David Sterba <dste...@suse.cz>
---
 btrfs-corrupt-block.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index af9ae4d4047c..aeeb1b298f66 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -1096,6 +1096,7 @@ int main(int ac, char **av)
 				break;
 			case 'I':
 				corrupt_item = 1;
+				break;
 			case 'd':
 				delete = 1;
 				break;
-- 
2.1.3
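
For readers who want to see the fall-through in isolation, here is a minimal sketch. It is not btrfs-progs code: `struct opts`, `parse_buggy()` and `parse_fixed()` are hypothetical names, and the flag `delete_flag` stands in for the `delete` variable in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option flags; names only mirror the patch for illustration. */
struct opts {
	int corrupt_item;
	int delete_flag;
};

/* Buggy variant: case 'I' has no break, so the 'd' branch also runs. */
static void parse_buggy(char opt, struct opts *o)
{
	assert(o != NULL);
	switch (opt) {
	case 'I':
		o->corrupt_item = 1;
		/* falls through - the break is missing, so 'd' is set too */
	case 'd':
		o->delete_flag = 1;
		break;
	}
}

/* Fixed variant: each option sets only its own flag. */
static void parse_fixed(char opt, struct opts *o)
{
	assert(o != NULL);
	switch (opt) {
	case 'I':
		o->corrupt_item = 1;
		break;
	case 'd':
		o->delete_flag = 1;
		break;
	}
}
```

Calling `parse_buggy('I', &o)` on a zeroed `struct opts` sets both flags, while `parse_fixed('I', &o)` sets only `corrupt_item`. The same pattern applies to the 'U' option fixed in patch 2/6.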
[PATCH 2/6] btrfs-progs: corrupt block, add break after option U
Resolves-Coverity-CID: 1258793
Signed-off-by: David Sterba <dste...@suse.cz>
---
 btrfs-corrupt-block.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
index aeeb1b298f66..b477e878376b 100644
--- a/btrfs-corrupt-block.c
+++ b/btrfs-corrupt-block.c
@@ -1068,6 +1068,7 @@ int main(int ac, char **av)
 				break;
 			case 'U':
 				chunk_tree = 1;
+				break;
 			case 'i':
 				inode = arg_strtou64(optarg);
 				break;
-- 
2.1.3
[PATCH 4/6] btrfs-progs: check result of first_cache_extent
If the tree's empty, we'll get NULL and dereference it.

Resolves-Coverity-CID: 1248828
Signed-off-by: David Sterba dste...@suse.cz
---
 cmds-check.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 6eea36c2f52c..3e7a4ebdce44 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -8075,6 +8075,8 @@ static void free_roots_info_cache(void)
 		struct root_item_info *rii;
 
 		entry = first_cache_extent(roots_info_cache);
+		if (!entry)
+			break;
 		remove_cache_extent(roots_info_cache, entry);
 		rii = container_of(entry, struct root_item_info, cache_extent);
 		free(rii);
--
2.1.3
[PATCH 6/6] btrfs-progs: let btrfs_free_path accept NULL
Same in kernel and matches semantics of free().

Resolves-Coverity-CID: 1054881
Signed-off-by: David Sterba dste...@suse.cz
---
 ctree.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ctree.c b/ctree.c
index bd6cb125b2a2..589efa3db17e 100644
--- a/ctree.c
+++ b/ctree.c
@@ -48,6 +48,8 @@ struct btrfs_path *btrfs_alloc_path(void)
 
 void btrfs_free_path(struct btrfs_path *p)
 {
+	if (!p)
+		return;
 	btrfs_release_path(p);
 	kfree(p);
 }
--
2.1.3
[PATCH 5/6] btrfs-progs: check allocation result in add_clone_source
Resolves-Coverity-CID: 1054894
Signed-off-by: David Sterba dste...@suse.cz
---
 cmds-send.c | 25 +++++++++++++++++++++----
 1 file changed, 21 insertions(+), 4 deletions(-)

diff --git a/cmds-send.c b/cmds-send.c
index b17b5e2ca666..9b32c1f0e624 100644
--- a/cmds-send.c
+++ b/cmds-send.c
@@ -172,11 +172,16 @@ out:
 	return ret;
 }
 
-static void add_clone_source(struct btrfs_send *s, u64 root_id)
+static int add_clone_source(struct btrfs_send *s, u64 root_id)
 {
 	s->clone_sources = realloc(s->clone_sources,
 		sizeof(*s->clone_sources) * (s->clone_sources_count + 1));
+
+	if (!s->clone_sources)
+		return -ENOMEM;
 	s->clone_sources[s->clone_sources_count++] = root_id;
+
+	return 0;
 }
 
 static int write_buf(int fd, const void *buf, int size)
@@ -475,7 +480,11 @@ int cmd_send(int argc, char **argv)
 				goto out;
 			}
 
-			add_clone_source(&send, root_id);
+			ret = add_clone_source(&send, root_id);
+			if (ret < 0) {
+				fprintf(stderr, "ERROR: not enough memory\n");
+				goto out;
+			}
 			subvol_uuid_search_finit(&send.sus);
 			free(subvol);
 			subvol = NULL;
@@ -575,7 +584,11 @@ int cmd_send(int argc, char **argv)
 			goto out;
 		}
 
-		add_clone_source(&send, parent_root_id);
+		ret = add_clone_source(&send, parent_root_id);
+		if (ret < 0) {
+			fprintf(stderr, "ERROR: not enough memory\n");
+			goto out;
+		}
 	}
 
 	for (i = optind; i < argc; i++) {
@@ -671,7 +684,11 @@ int cmd_send(int argc, char **argv)
 			goto out;
 
 		/* done with this subvol, so add it to the clone sources */
-		add_clone_source(&send, root_id);
+		ret = add_clone_source(&send, root_id);
+		if (ret < 0) {
+			fprintf(stderr, "ERROR: not enough memory\n");
+			goto out;
+		}
 
 		parent_root_id = 0;
 		full_send = 0;
--
2.1.3
[PATCH 3/6] btrfs-progs: fragments, close output file on error
Resolves-Coverity-CID: 1258794
Signed-off-by: David Sterba dste...@suse.cz
---
 btrfs-fragments.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/btrfs-fragments.c b/btrfs-fragments.c
index d03c2c3e7319..360f10f87bfa 100644
--- a/btrfs-fragments.c
+++ b/btrfs-fragments.c
@@ -233,7 +233,7 @@ list_fragments(int fd, u64 flags, char *dir)
 		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
 		if (ret < 0) {
 			fprintf(stderr, "ERROR: can't perform the search\n");
-			return ret;
+			goto out_close;
 		}
 		/* the ioctl returns the number of item it found in nr_items */
 		if (sk->nr_items == 0)
@@ -373,7 +373,10 @@ skip:;
 		fprintf(html, "</p>");
 	}
 	fprintf(html, "</body></html>\n");
-
+
+out_close:
+	fclose(html);
+
 	return ret;
 }
--
2.1.3
Re: btrfs receive being very slow
Hello.

So I split the job into 2 tasks as per your suggestion. I create the differential snapshot with btrfs send and save it on the SSD - so far this is very efficient and the sending happens at almost full SSD speed. When I try to receive the snapshot on the HDD, the speed is just as low as before (as with the ionice'd pipe). No ionice is used.

The HDD raw speed is, according to hdparm:

 Timing cached reads: 15848 MB in 2.00 seconds = 7928.82 MB/sec
 Timing buffered disk reads: 310 MB in 3.01 seconds = 103.02 MB/sec

And I have more than 100GB of free space on it, but the speed is still low. So, as you mentioned, I might be dealing with a very fragmented system.

Now there are some conclusions and questions:

1. The btrfs send is not the problem - it works great with or without ionice.
2. The receive is slow no matter what I do, even if run alone. (As for what kind of data is being sent: I sent the snapshots of / and /home, and both are slow for btrfs receive.)
3. How do I check how fragmented the filesystem is? (I.e. I want to know if this is the real cause.)
4. How do I defragment all those read-only snapshots without breaking compatibility with differential btrfs send? (If I understand it correctly, the parent snapshot must be the same on source and destination, is this correct?)
5. Will making those snapshots writable, defragmenting them and re-snapshotting them as read-only break compatibility with differential btrfs send? E.g. will I still be able to btrfs receive a differential snapshot after defragmentation?

Also, regarding your suggestion to do it during a break - I would have done it, but it sometimes takes hours to sync; that's why I tried to ionice it, so I can work while it runs.

Thank you a lot for your explanations and effort!

On 15.12.2014 10:49, Robert White wrote:
> On 12/14/2014 11:41 PM, Nick Dimov wrote:
> > Hi, thanks for the answer, I will answer between the lines.
> >
> > On 15.12.2014 08:45, Robert White wrote:
> > > On 12/14/2014 08:50 PM, Nick Dimov wrote:
> > > > Hello everyone!
> > > >
> > > > First, thanks for the amazing work on the btrfs filesystem!
> > > >
> > > > Now the problem:
> > > > I use an SSD as my system drive (/dev/sda2) and use daily snapshots on it. Then, from time to time, I sync those to the HDD (/dev/sdb4) by using btrfs send / receive like this:
> > > >
> > > > ionice -c3 btrfs send -p /ssd/previously_synced_snapshot /ssd/snapshot-X | pv | btrfs receive /hdd/snapshots
> > > >
> > > > I use pv to measure speed and I get ridiculous speeds like 5-200KiB/s! (Rarely it goes over 1MiB/s.) However, if I replace the btrfs receive with cat > /dev/null, the speed is 400-500MiB/s (almost full SSD speed), so I understand that the problem is the fs on the HDD... Do you have any idea of how to trace this problem down?
> > >
> > > You have _lots_ of problems with that above...
> > >
> > > (1) your ionice is causing the SSD to stall the send every time the receiver does _anything_.
> >
> > I will try to remove ionice completely - but then my system becomes unresponsive :(
>
> Yep, see below.
>
> Then again if it only goes bad for a minute or two, then just launch the backup right as you go for a break.
>
> > > (1a) The ionice doesn't apply to the pipeline, it only applies to the command it precedes. So it's ionice -c btrfs send... then pipeline then btrfs receive at the default io scheduling class. You need to specify it twice, or wrap it all in a script.
> > >
> > > ionice -c 3 btrfs send -p /ssd/parent /ssd/snapshot-X | ionice -c 3 btrfs receive /hdd/snapshots
> >
> > This is usually what I do, but I wanted to show that there is no throttle on the receiver. (I tested it with and without - the result is the same.)
> >
> > > (1b) Your comparison case is flawed because cat > /dev/null results in no actual IO (e.g. writing to dev-null doesn't transfer any data anywhere, it just gets rubber-stamped okay at the kernel method level).
> >
> > This was only intended to show that the sender itself is OK.
>
> I understood why you did it, I was just trying to point out that since there was no other IO competing with the btrfs send, it would give you a really outrageously false positive. Particularly if you always used ionice.
>
> > > (2) You probably get negative-to-no value from using ionice on the sending side, particularly since SSDs don't have physical heads to seek around.
> >
> > Yeah, in theory it should be like this, but in practice on my system - when I use no ionice my system becomes very unresponsive (Ubuntu 14.10).
>
> What all is in the snapshot? Is it your whole system or just /home or what? e.g. what are your subvolume boundaries if any?
>
> btrfs send is very efficient, but that efficiency means that it can rifle through a heck of a lot of the parent snapshot and decide it doesn't need sending, and it can do so very fast, and that can be a huge hit on other activities. If most of your system doesn't change between snapshots the send will plow through your disk yelling "nope" and "skip this" like a shopper in a Black Friday riot.
>
> > > (2a) The value of nicing your IO is trivial on the
Re: [PATCH 2/6] btrfs-progs: corrupt block, add break after option U
On 12/19/14 10:06 AM, David Sterba wrote:
> Resolves-Coverity-CID: 1258793
> Signed-off-by: David Sterba dste...@suse.cz

Reviewed-by: Eric Sandeen sand...@redhat.com

> ---
>  btrfs-corrupt-block.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
> index aeeb1b298f66..b477e878376b 100644
> --- a/btrfs-corrupt-block.c
> +++ b/btrfs-corrupt-block.c
> @@ -1068,6 +1068,7 @@ int main(int ac, char **av)
>  				break;
>  			case 'U':
>  				chunk_tree = 1;
> +				break;
>  			case 'i':
>  				inode = arg_strtou64(optarg);
>  				break;
Re: [PATCH 1/6] btrfs-progs: corrupt block, add missing break to option I
On 12/19/14 10:06 AM, David Sterba wrote:
> Using -I would imply -d.
>
> Resolves-Coverity-CID: 1258792
> Signed-off-by: David Sterba dste...@suse.cz

Reviewed-by: Eric Sandeen sand...@redhat.com

> ---
>  btrfs-corrupt-block.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/btrfs-corrupt-block.c b/btrfs-corrupt-block.c
> index af9ae4d4047c..aeeb1b298f66 100644
> --- a/btrfs-corrupt-block.c
> +++ b/btrfs-corrupt-block.c
> @@ -1096,6 +1096,7 @@ int main(int ac, char **av)
>  				break;
>  			case 'I':
>  				corrupt_item = 1;
> +				break;
>  			case 'd':
>  				delete = 1;
>  				break;
Re: [PATCH 3/6] btrfs-progs: fragments, close output file on error
On 12/19/14 10:06 AM, David Sterba wrote:
> Resolves-Coverity-CID: 1258794
> Signed-off-by: David Sterba dste...@suse.cz

Reviewed-by: Eric Sandeen sand...@redhat.com

> ---
>  btrfs-fragments.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/btrfs-fragments.c b/btrfs-fragments.c
> index d03c2c3e7319..360f10f87bfa 100644
> --- a/btrfs-fragments.c
> +++ b/btrfs-fragments.c
> @@ -233,7 +233,7 @@ list_fragments(int fd, u64 flags, char *dir)
>  		ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
>  		if (ret < 0) {
>  			fprintf(stderr, "ERROR: can't perform the search\n");
> -			return ret;
> +			goto out_close;
>  		}
>  		/* the ioctl returns the number of item it found in nr_items */
>  		if (sk->nr_items == 0)
> @@ -373,7 +373,10 @@ skip:;
>  		fprintf(html, "</p>");
>  	}
>  	fprintf(html, "</body></html>\n");
> -
> +
> +out_close:
> +	fclose(html);
> +
>  	return ret;
>  }
Re: Oddly slow read performance with near-full largish FS
Satoru Takeuchi takeuchi_sat...@jp.fujitsu.com wrote:
>
> Let me ask some questions.

Sure - thanks for taking an interest.

> On 2014/12/17 11:42, Charles Cazabon wrote:
> > There's roughly 16TB of data in this filesystem (the filesystem is ~17TB). The btrfs filesystem is a simple single volume, no snapshots, multiple devices, or anything like that. It's an LVM logical volume on top of dmcrypt on top of an mdadm RAID set (8 disks in RAID 6).
>
> Q1. You mean your Btrfs file system exists on the top of the following deep layers?
>
> +---------------+
> |Btrfs(single)  |
> +---------------+
> |LVM(non RAID?) |
> +---------------+
> |dmcrypt        |
> +---------------+
> |mdadm RAID set |
> +---------------+

Yes, precisely. mdadm is used to make a large RAID6 device, which is encrypted with LUKS, on top of which is layered LVM (for ease of management), and the btrfs filesystem sits on that.

> Q2. If Q1 is true, is it possible to reduce those layers as follows?
>
> +-----------+
> |Btrfs(*1)  |
> +-----------+
> |dmcrypt    |
> +-----------+

I don't see how I could do that - I simply have far too much data for a single disk (not to mention I don't want to risk loss of data from a single disk failing). This filesystem has 16.x TB of data in it at present.

> It's because there are too many layers and these have the same/similar features and heavy layered file system tends to cause more trouble than thinner layered ones regardless of file system type.

This configuration is one I've been using for many years. It's only recently that I've noticed it being particularly slow with btrfs -- I don't know if that's because the filesystem has filled up past some critical point, or due to something else entirely. That's why I'm trying to figure this out.

> *1) Currently I don't recommend you to use RAID56 of Btrfs. So, if RAID6 is mandatory, mdadm RAID6 is also necessary.

Yes, exactly. That's why I use mdadm.

> > The speeds I'm seeing (with iotop) fluctuate a lot. They spend most of the time in the range of 1-3 MB/s, with large periods of time where no IO seems to happen at all, and occasional short spikes to ~25-30 MB/s. System load seems to sit around 10-12 (with only 2 processes reported as running, everything else sleeping) while this happens.
> [...]
> > Other filesystems on the same physical disks have no trouble exceeding 100MB/s reads. The machine is not swapping (16GB RAM, ~8GB swap with 0 swap used).
>
> Q3. Do they also consist of the following layers?

Yes, exactly the same configuration. The fact that I don't see any speed problems with other filesystems (even in the same LVM volume group) leads me in the direction of suspecting something to do with btrfs.

> Q4. Are other filesystems also near-full?

No, not particularly. Now, the btrfs volume in question isn't exactly close to full - there's more than 500 GB free. It's just *relatively* full.

> Q5. Is there any error/warning message about Btrfs/LVM/dmcrypt/mdadm/hardware?

No, no errors or warnings in logs related to the disks, LVM, or btrfs. I have historically, with previous kernels, gotten the "task blocked for more than 120 seconds" warnings fairly often, but I haven't seen those lately.

Is there any other info I can collect on this that would help?

Thanks,

Charles
--
---
Charles Cazabon
GPL'ed software available at: http://pyropus.ca/software/
---
Re: [PATCH 4/6] btrfs-progs: check result of first_cache_extent
On 12/19/14 10:06 AM, David Sterba wrote:
> If the tree's empty, we'll get NULL and dereference it.

Hm, but this is under an explicit check for not empty:

	while (!cache_tree_empty(roots_info_cache)) {

sooo? Maybe it's just defensive? Nothing really wrong with being defensive, I suppose, so:

Reviewed-by: Eric Sandeen sand...@redhat.com

> Resolves-Coverity-CID: 1248828
> Signed-off-by: David Sterba dste...@suse.cz
> ---
>  cmds-check.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/cmds-check.c b/cmds-check.c
> index 6eea36c2f52c..3e7a4ebdce44 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -8075,6 +8075,8 @@ static void free_roots_info_cache(void)
>  		struct root_item_info *rii;
>
>  		entry = first_cache_extent(roots_info_cache);
> +		if (!entry)
> +			break;
>  		remove_cache_extent(roots_info_cache, entry);
>  		rii = container_of(entry, struct root_item_info, cache_extent);
>  		free(rii);
Re: [PATCH 5/6] btrfs-progs: check allocation result in add_clone_source
On 12/19/14 10:06 AM, David Sterba wrote:
> Resolves-Coverity-CID: 1054894
> Signed-off-by: David Sterba dste...@suse.cz

Reviewed-by: Eric Sandeen sand...@redhat.com

> ---
>  cmds-send.c | 25 +++++++++++++++++++++----
>  1 file changed, 21 insertions(+), 4 deletions(-)
>
> diff --git a/cmds-send.c b/cmds-send.c
> index b17b5e2ca666..9b32c1f0e624 100644
> --- a/cmds-send.c
> +++ b/cmds-send.c
> @@ -172,11 +172,16 @@ out:
>  	return ret;
>  }
>
> -static void add_clone_source(struct btrfs_send *s, u64 root_id)
> +static int add_clone_source(struct btrfs_send *s, u64 root_id)
>  {
>  	s->clone_sources = realloc(s->clone_sources,
>  		sizeof(*s->clone_sources) * (s->clone_sources_count + 1));
> +
> +	if (!s->clone_sources)
> +		return -ENOMEM;
>  	s->clone_sources[s->clone_sources_count++] = root_id;
> +
> +	return 0;
>  }
>
>  static int write_buf(int fd, const void *buf, int size)
> @@ -475,7 +480,11 @@ int cmd_send(int argc, char **argv)
>  				goto out;
>  			}
>
> -			add_clone_source(&send, root_id);
> +			ret = add_clone_source(&send, root_id);
> +			if (ret < 0) {
> +				fprintf(stderr, "ERROR: not enough memory\n");
> +				goto out;
> +			}
>  			subvol_uuid_search_finit(&send.sus);
>  			free(subvol);
>  			subvol = NULL;
> @@ -575,7 +584,11 @@ int cmd_send(int argc, char **argv)
>  			goto out;
>  		}
>
> -		add_clone_source(&send, parent_root_id);
> +		ret = add_clone_source(&send, parent_root_id);
> +		if (ret < 0) {
> +			fprintf(stderr, "ERROR: not enough memory\n");
> +			goto out;
> +		}
>  	}
>
>  	for (i = optind; i < argc; i++) {
> @@ -671,7 +684,11 @@ int cmd_send(int argc, char **argv)
>  			goto out;
>
>  		/* done with this subvol, so add it to the clone sources */
> -		add_clone_source(&send, root_id);
> +		ret = add_clone_source(&send, root_id);
> +		if (ret < 0) {
> +			fprintf(stderr, "ERROR: not enough memory\n");
> +			goto out;
> +		}
>
>  		parent_root_id = 0;
>  		full_send = 0;
Re: [PATCH 6/6] btrfs-progs: let btrfs_free_path accept NULL
On 12/19/14 10:06 AM, David Sterba wrote:
> Same in kernel and matches semantics of free().
>
> Resolves-Coverity-CID: 1054881
> Signed-off-by: David Sterba dste...@suse.cz

Reviewed-by: Eric Sandeen sand...@redhat.com

> ---
>  ctree.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/ctree.c b/ctree.c
> index bd6cb125b2a2..589efa3db17e 100644
> --- a/ctree.c
> +++ b/ctree.c
> @@ -48,6 +48,8 @@ struct btrfs_path *btrfs_alloc_path(void)
>
>  void btrfs_free_path(struct btrfs_path *p)
>  {
> +	if (!p)
> +		return;
>  	btrfs_release_path(p);
>  	kfree(p);
>  }
Re: btrfs-progs: integration-20141218 possible corruption test regression
On 19 December 2014 at 13:48, David Sterba dste...@suse.cz wrote:
> On Fri, Dec 19, 2014 at 02:23:12PM +0800, Qu Wenruo wrote:
> > In fact, it's not a regression.
> >
> > The 013 testcase is a special case that uses a script to corrupt the image and then do the btrfsck test. There is a patch before the commit, to allow the btrfs-progs test script to call the corruption script. But since there is still some discussion about the corruption script and maybe a later verify script, the previous patch is not picked.
>
> Yes, I'm waiting for the final version of the patch, but wanted to add all test images that were available.

Thanks for the clarification, Qu, David. I've disabled test 013 in my local builds in the meantime.
Re: [PATCH 4/6] btrfs-progs: check result of first_cache_extent
On Fri, Dec 19, 2014 at 10:56:41AM -0600, Eric Sandeen wrote:
> On 12/19/14 10:06 AM, David Sterba wrote:
> > If the tree's empty, we'll get NULL and dereference it.
>
> Hm, but this is under an explicit check for not empty:
>
> 	while (!cache_tree_empty(roots_info_cache)) {
>
> sooo? Maybe it's just defensive? Nothing really wrong with being defensive, I suppose, so:

Well, mostly to shut up the warning with a minimal change. It could be rewritten to

	while ((entry = ...)) { ... }

as in other places.
Re: Oddly slow read performance with near-full largish FS
Charles Cazabon posted on Fri, 19 Dec 2014 10:58:49 -0600 as excerpted:

> This configuration is one I've been using for many years. It's only recently that I've noticed it being particularly slow with btrfs -- I don't know if that's because the filesystem has filled up past some critical point, or due to something else entirely. That's why I'm trying to figure this out.

Not recommending at this point, just saying these are options...

Btrfs raid56 mode should, I believe, be pretty close to done with the latest patches. That would be 3.19, however, which isn't out yet of course.

There's also raid10, if you have enough devices or little enough data to do it. That's much more mature than raid56 mode and should be about as mature and stable as btrfs in single-device mode, which is what you are using now. But it'll require more devices than a raid56 would...

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
[PULL] [PATCH 0/4] Updates in message levels
This is motivated by the ERR level used for the message printed when skinny metadata are in use; this has been reported several times. Patch tagged for stable.

The rest is taken from a SLES patch that I forgot to forward upstream.

You can pull the branch from

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git fix/message-levels

based on current master (d790be3863b28fd22e0)

David Sterba (4):
  btrfs: update message levels for errors
  btrfs: update message levels during failed mount
  btrfs: update message levels after checksum errors
  btrfs: set proper message level for skinny metadata

 fs/btrfs/disk-io.c | 29 +++++++++++++++--------------
 fs/btrfs/inode.c   |  4 ++--
 2 files changed, 17 insertions(+), 16 deletions(-)

--
2.1.3
[PATCH 4/4] btrfs: set proper message level for skinny metadata
This has been confusing people for too long, the message is really just informative.

CC: sta...@vger.kernel.org # 3.10+
Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/disk-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 6e986a34f9a1..d5e95ec60e12 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2499,7 +2499,7 @@ int open_ctree(struct super_block *sb,
 		features |= BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO;
 
 	if (features & BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA)
-		printk(KERN_ERR "BTRFS: has skinny extents\n");
+		printk(KERN_INFO "BTRFS: has skinny extents\n");
 
 	/*
 	 * flag our filesystem as having big metadata blocks if
--
2.1.3
[PATCH 3/4] btrfs: update message levels after checksum errors
The errors are worth noting and might get missed with INFO level.

Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/disk-io.c | 2 +-
 fs/btrfs/inode.c   | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 3b694192a4a9..6e986a34f9a1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -318,7 +318,7 @@ static int csum_tree_block(struct btrfs_root *root, struct extent_buffer *buf,
 			memcpy(&found, result, csum_size);
 
 			read_extent_buffer(buf, &val, 0, csum_size);
-			printk_ratelimited(KERN_INFO
+			printk_ratelimited(KERN_WARNING
 				"BTRFS: %s checksum verify failed on %llu wanted %X found %X "
 				"level %d\n",
 				root->fs_info->sb->s_id, buf->start,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ff6d98d8dc20..a91d9ff3293b 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2945,7 +2945,7 @@ static int __readpage_endio_check(struct inode *inode,
 	return 0;
 zeroit:
 	if (__ratelimit(&_rs))
-		btrfs_info(BTRFS_I(inode)->root->fs_info,
+		btrfs_warn(BTRFS_I(inode)->root->fs_info,
 			   "csum failed ino %llu off %llu csum %u expected csum %u",
 			   btrfs_ino(inode), start, csum, csum_expected);
 	memset(kaddr + pgoff, 1, len);
--
2.1.3
[PATCH 1/4] btrfs: update message levels for errors
Several messages point to some internal problem; level INFO is wrong here.

Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/disk-io.c | 9 +++++----
 fs/btrfs/inode.c   | 2 +-
 2 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 30965120772b..8beb74ffb075 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -367,7 +367,8 @@ static int verify_parent_transid(struct extent_io_tree *io_tree,
 		ret = 0;
 		goto out;
 	}
-	printk_ratelimited(KERN_INFO "BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n",
+	printk_ratelimited(KERN_ERR
+	    "BTRFS (device %s): parent transid verify failed on %llu wanted %llu found %llu\n",
 			eb->fs_info->sb->s_id, eb->start,
 			parent_transid, btrfs_header_generation(eb));
 	ret = 1;
@@ -633,21 +634,21 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
 
 	found_start = btrfs_header_bytenr(eb);
 	if (found_start != eb->start) {
-		printk_ratelimited(KERN_INFO "BTRFS (device %s): bad tree block start "
+		printk_ratelimited(KERN_ERR "BTRFS (device %s): bad tree block start "
 			       "%llu %llu\n",
 			       eb->fs_info->sb->s_id, found_start, eb->start);
 		ret = -EIO;
 		goto err;
 	}
 	if (check_tree_block_fsid(root, eb)) {
-		printk_ratelimited(KERN_INFO "BTRFS (device %s): bad fsid on block %llu\n",
+		printk_ratelimited(KERN_ERR "BTRFS (device %s): bad fsid on block %llu\n",
 			       eb->fs_info->sb->s_id, eb->start);
 		ret = -EIO;
 		goto err;
 	}
 	found_level = btrfs_header_level(eb);
 	if (found_level >= BTRFS_MAX_LEVEL) {
-		btrfs_info(root->fs_info, "bad tree block level %d",
+		btrfs_err(root->fs_info, "bad tree block level %d",
 			   (int)btrfs_header_level(eb));
 		ret = -EIO;
 		goto err;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e687bb0dc73a..ff6d98d8dc20 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3407,7 +3407,7 @@ int btrfs_orphan_cleanup(struct btrfs_root *root)
 
 out:
 	if (ret)
-		btrfs_crit(root->fs_info,
+		btrfs_err(root->fs_info,
 			"could not do orphan cleanup %d", ret);
 	btrfs_free_path(path);
 	return ret;
--
2.1.3
[PATCH 2/4] btrfs: update message levels during failed mount
All error conditions from open_ctree shall be ERR. Warning would suggest that something's wrong and we can continue.

Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/disk-io.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 8beb74ffb075..3b694192a4a9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2523,7 +2523,7 @@ int open_ctree(struct super_block *sb,
 	 */
 	if ((features & BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS) &&
 	    (sectorsize != nodesize)) {
-		printk(KERN_WARNING "BTRFS: unequal leaf/node/sector sizes "
+		printk(KERN_ERR "BTRFS: unequal leaf/node/sector sizes "
 				"are not allowed for mixed block groups on %s\n",
 				sb->s_id);
 		goto fail_alloc;
@@ -2631,12 +2631,12 @@ int open_ctree(struct super_block *sb,
 	sb->s_blocksize_bits = blksize_bits(sectorsize);
 
 	if (btrfs_super_magic(disk_super) != BTRFS_MAGIC) {
-		printk(KERN_INFO "BTRFS: valid FS not found on %s\n", sb->s_id);
+		printk(KERN_ERR "BTRFS: valid FS not found on %s\n", sb->s_id);
 		goto fail_sb_buffer;
 	}
 
 	if (sectorsize != PAGE_SIZE) {
-		printk(KERN_WARNING "BTRFS: Incompatible sector size(%lu) "
+		printk(KERN_ERR "BTRFS: incompatible sector size (%lu) "
 		       "found on %s\n", (unsigned long)sectorsize, sb->s_id);
 		goto fail_sb_buffer;
 	}
@@ -2645,7 +2645,7 @@ int open_ctree(struct super_block *sb,
 	ret = btrfs_read_sys_array(tree_root);
 	mutex_unlock(&fs_info->chunk_mutex);
 	if (ret) {
-		printk(KERN_WARNING "BTRFS: failed to read the system "
+		printk(KERN_ERR "BTRFS: failed to read the system "
 		       "array on %s\n", sb->s_id);
 		goto fail_sb_buffer;
 	}
@@ -2660,7 +2660,7 @@ int open_ctree(struct super_block *sb,
 					   generation);
 	if (!chunk_root->node ||
 	    !test_bit(EXTENT_BUFFER_UPTODATE, &chunk_root->node->bflags)) {
-		printk(KERN_WARNING "BTRFS: failed to read chunk root on %s\n",
+		printk(KERN_ERR "BTRFS: failed to read chunk root on %s\n",
 		       sb->s_id);
 		goto fail_tree_roots;
 	}
@@ -2672,7 +2672,7 @@ int open_ctree(struct super_block *sb,
 
 	ret = btrfs_read_chunk_tree(chunk_root);
 	if (ret) {
-		printk(KERN_WARNING "BTRFS: failed to read chunk tree on %s\n",
+		printk(KERN_ERR "BTRFS: failed to read chunk tree on %s\n",
 		       sb->s_id);
 		goto fail_tree_roots;
 	}
@@ -2684,7 +2684,7 @@ int open_ctree(struct super_block *sb,
 	btrfs_close_extra_devices(fs_info, fs_devices, 0);
 
 	if (!fs_devices->latest_bdev) {
-		printk(KERN_CRIT "BTRFS: failed to read devices on %s\n",
+		printk(KERN_ERR "BTRFS: failed to read devices on %s\n",
 		       sb->s_id);
 		goto fail_tree_roots;
 	}
@@ -2768,7 +2768,7 @@ retry_root_backup:
 
 	ret = btrfs_recover_balance(fs_info);
 	if (ret) {
-		printk(KERN_WARNING "BTRFS: failed to recover balance\n");
+		printk(KERN_ERR "BTRFS: failed to recover balance\n");
 		goto fail_block_groups;
 	}
 
--
2.1.3
Btrfs progs pre-release 3.18-rc2
Hi,

another step towards 3.18. The changes are limited in scope and are mostly cleanups or docs. I'd like to see more test images for the new fsck code; there are 2 new ones that I missed earlier, so they don't count.

The timing with the end of the year is not good, so if I'm not confident that the release is in good shape I'd rather postpone it. Alternatively, I can do a release on Monday and then use the following weeks to prepare a .1 bugfix release.

Known problems: the image 013 fails
[PATCH] Btrfs: track dirty block groups on their own list V2
Currently any time we try to update the block groups on disk we will walk _all_
block groups and check for the ->dirty flag to see if it is set. This function
can get called several times during a commit. So if you have several terabytes
of data you will be a very sad panda as we will loop through _all_ of the block
groups several times, which makes the commit take a while which slows down the
rest of the file system operations.

This patch introduces a dirty list for the block groups that we get added to
when we dirty the block group for the first time. Then we simply update any
block groups that have been dirtied since the last time we called
btrfs_write_dirty_block_groups. This allows us to clean up how we write the
free space cache out so it is much cleaner. Thanks,

Signed-off-by: Josef Bacik <jba...@fb.com>
---
V1->V2: Don't unconditionally take the dirty bg list lock in update_block_group,
only do it if our dirty_bg list is empty.

 fs/btrfs/ctree.h            |   5 +-
 fs/btrfs/extent-tree.c      | 169 ++--
 fs/btrfs/free-space-cache.c |   8 ++-
 fs/btrfs/transaction.c      |  14 ++--
 fs/btrfs/transaction.h      |   2 +
 5 files changed, 74 insertions(+), 124 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b62315b..e5bc509 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1237,7 +1237,6 @@ enum btrfs_disk_cache_state {
 	BTRFS_DC_ERROR		= 1,
 	BTRFS_DC_CLEAR		= 2,
 	BTRFS_DC_SETUP		= 3,
-	BTRFS_DC_NEED_WRITE	= 4,
 };
 
 struct btrfs_caching_control {
@@ -1275,7 +1274,6 @@ struct btrfs_block_group_cache {
 	unsigned long full_stripe_len;
 
 	unsigned int ro:1;
-	unsigned int dirty:1;
 	unsigned int iref:1;
 
 	int disk_cache_state;
@@ -1309,6 +1307,9 @@ struct btrfs_block_group_cache {
 
 	/* For read-only block groups */
 	struct list_head ro_list;
+
+	/* For dirty block groups */
+	struct list_head dirty_list;
 };
 
 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 74eb29d..71a9752 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -74,8 +74,9 @@ enum {
 	RESERVE_ALLOC_NO_ACCOUNT = 2,
 };
 
-static int update_block_group(struct btrfs_root *root,
-			      u64 bytenr, u64 num_bytes, int alloc);
+static int update_block_group(struct btrfs_trans_handle *trans,
+			      struct btrfs_root *root, u64 bytenr,
+			      u64 num_bytes, int alloc);
 static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
 				u64 bytenr, u64 num_bytes, u64 parent,
@@ -3315,120 +3316,42 @@ int btrfs_write_dirty_block_groups(struct btrfs_trans_handle *trans,
 				   struct btrfs_root *root)
 {
 	struct btrfs_block_group_cache *cache;
-	int err = 0;
+	struct btrfs_transaction *cur_trans = trans->transaction;
+	int ret = 0;
 	struct btrfs_path *path;
-	u64 last = 0;
+
+	if (list_empty(&cur_trans->dirty_bgs))
+		return 0;
 
 	path = btrfs_alloc_path();
 	if (!path)
 		return -ENOMEM;
 
-again:
-	while (1) {
-		cache = btrfs_lookup_first_block_group(root->fs_info, last);
-		while (cache) {
-			if (cache->disk_cache_state == BTRFS_DC_CLEAR)
-				break;
-			cache = next_block_group(root, cache);
-		}
-		if (!cache) {
-			if (last == 0)
-				break;
-			last = 0;
-			continue;
-		}
-		err = cache_save_setup(cache, trans, path);
-		last = cache->key.objectid + cache->key.offset;
-		btrfs_put_block_group(cache);
-	}
-
-	while (1) {
-		if (last == 0) {
-			err = btrfs_run_delayed_refs(trans, root,
-						     (unsigned long)-1);
-			if (err) /* File system offline */
-				goto out;
-		}
-
-		cache = btrfs_lookup_first_block_group(root->fs_info, last);
-		while (cache) {
-			if (cache->disk_cache_state == BTRFS_DC_CLEAR) {
-				btrfs_put_block_group(cache);
-				goto again;
-			}
-
-			if (cache->dirty)
-				break;
-			cache = next_block_group(root, cache);
-		}
-		if (!cache) {
-			if (last == 0)
-				break;
-			last = 0;
-			continue;
-		}
-
-
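
The idea in the commit message above (visit only the block groups dirtied since the last write-out, instead of scanning all of them) can be sketched as a toy model. This is not the kernel code: the class and method names below are invented for illustration, and locking and the actual on-disk update are left out.

```python
class BlockGroup:
    """Hypothetical stand-in for a block group; not the kernel struct."""

    def __init__(self, start):
        self.start = start
        self.on_dirty_list = False


class Transaction:
    """Hypothetical stand-in for a transaction that owns a dirty list."""

    def __init__(self, block_groups):
        self.block_groups = block_groups  # every block group in the fs
        self.dirty_bgs = []               # only groups dirtied since last write-out

    def mark_dirty(self, bg):
        # Queue the group only the first time it is dirtied.
        if not bg.on_dirty_list:
            bg.on_dirty_list = True
            self.dirty_bgs.append(bg)

    def write_dirty_block_groups(self):
        """Drain the dirty list; return how many groups were visited."""
        visited = 0
        while self.dirty_bgs:
            bg = self.dirty_bgs.pop(0)
            bg.on_dirty_list = False
            visited += 1  # real code would update the on-disk item here
        return visited


groups = [BlockGroup(i) for i in range(10_000)]
trans = Transaction(groups)
trans.mark_dirty(groups[3])
trans.mark_dirty(groups[3])   # dirtying the same group again is a no-op
trans.mark_dirty(groups[42])
visited = trans.write_dirty_block_groups()
print(visited, "of", len(groups), "block groups visited")
```

In this model the cost of a write-out depends on how many groups were dirtied, not on how many exist, which is the effect the patch description is after.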
Re: btrfs is using 25% more disk than it should
On 12/18/2014 9:59 AM, Daniele Testa wrote:
> As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
> that partition, I have one single sparse file, taking 302GB of space
> (max 315GB). The snapshots directory is completely empty.

So you don't have any snapshots or other subvolumes?

> However, for some weird reason, btrfs seems to think it takes 404GB.
> The big file is a disk that I use in a virtual server and when I
> write stuff inside that virtual server, the disk-usage of the btrfs
> partition on the host keeps increasing even if the sparse-file is
> constant at 302GB. I even have 100GB of free disk-space inside
> that virtual disk-file. Writing 1GB inside the virtual disk-file
> seems to increase the usage about 4-5GB on the outside.

Did you flag the file as nodatacow?

> Does anyone have a clue on what is going on? How can the difference
> and behaviour be like this when I just have one single file? Is it
> also normal to have 672MB of metadata for a single file?

You probably have the data checksums enabled and that isn't
unreasonable for checksums on 302g of data.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs is using 25% more disk than it should
No, I don't have any snapshots or subvolumes. Only that single file.

The file has both checksums and datacow on it. I will do chattr +C on
the parent dir and re-create the file to make sure all files are
marked as nodatacow.

Should I also turn off checksums with the mount-flags if this
filesystem only contains big VM-files? Or is it not needed if I put +C
on the parent dir?

2014-12-20 2:53 GMT+08:00 Phillip Susi <ps...@ubuntu.com>:
> On 12/18/2014 9:59 AM, Daniele Testa wrote:
>> As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
>> that partition, I have one single sparse file, taking 302GB of space
>> (max 315GB). The snapshots directory is completely empty.
>
> So you don't have any snapshots or other subvolumes?
>
>> However, for some weird reason, btrfs seems to think it takes 404GB.
>> The big file is a disk that I use in a virtual server and when I
>> write stuff inside that virtual server, the disk-usage of the btrfs
>> partition on the host keeps increasing even if the sparse-file is
>> constant at 302GB. I even have 100GB of free disk-space inside
>> that virtual disk-file. Writing 1GB inside the virtual disk-file
>> seems to increase the usage about 4-5GB on the outside.
>
> Did you flag the file as nodatacow?
>
>> Does anyone have a clue on what is going on? How can the difference
>> and behaviour be like this when I just have one single file? Is it
>> also normal to have 672MB of metadata for a single file?
>
> You probably have the data checksums enabled and that isn't
> unreasonable for checksums on 302g of data.
Re: btrfs is using 25% more disk than it should
On 12/19/2014 2:59 PM, Daniele Testa wrote:
> No, I don't have any snapshots or subvolumes. Only that single
> file.
>
> The file has both checksums and datacow on it. I will do chattr +C
> on the parent dir and re-create the file to make sure all files
> are marked as nodatacow.
>
> Should I also turn off checksums with the mount-flags if this
> filesystem only contain big VM-files? Or is it not needed if I put
> +C on the parent dir?

If you don't want the overhead of those checksums, then yea. Also I
would question why you are using btrfs to hold only big vm files in
the first place. You would be better off using lvm thinp volumes
instead of files, though personally I prefer to just use regular lvm
volumes and manually allocate enough space. It avoids the
fragmentation you get from thin provisioning (or qcow2) at the cost
of a bit of overallocated space and the need to do some manual
resizing to add more if and when it is needed.
Re: btrfs is using 25% more disk than it should
On 12/18/2014 09:59 AM, Daniele Testa wrote:
> Hey,
>
> I am hoping you guys can shed some light on my issue. I know that it's
> a common question that people see differences in the disk used when
> running different calculations, but I still think that my issue is
> weird.
>
> root@s4 / # mount
> /dev/md3 on /opt/drives/ssd type btrfs
> (rw,noatime,compress=zlib,discard,nospace_cache)
>
> root@s4 / # btrfs filesystem df /opt/drives/ssd
> Data: total=407.97GB, used=404.08GB
> System, DUP: total=8.00MB, used=52.00KB
> System: total=4.00MB, used=0.00
> Metadata, DUP: total=1.25GB, used=672.21MB
> Metadata: total=8.00MB, used=0.00
>
> root@s4 /opt/drives/ssd # ls -alhs
> total 302G
> 4.0K drwxr-xr-x 1 root         root           42 Dec 18 14:34 .
> 4.0K drwxr-xr-x 4 libvirt-qemu libvirt-qemu 4.0K Dec 18 14:31 ..
> 302G -rw-r--r-- 1 libvirt-qemu libvirt-qemu 315G Dec 18 14:49 disk_208.img
>    0 drwxr-xr-x 1 libvirt-qemu libvirt-qemu    0 Dec 18 10:08 snapshots
>
> root@s4 /opt/drives/ssd # du -h
> 0       ./snapshots
> 302G    .
>
> As seen above, I have a 410GB SSD mounted at /opt/drives/ssd. On
> that partition, I have one single sparse file, taking 302GB of space
> (max 315GB). The snapshots directory is completely empty.
>
> However, for some weird reason, btrfs seems to think it takes 404GB.
> The big file is a disk that I use in a virtual server and when I write
> stuff inside that virtual server, the disk-usage of the btrfs
> partition on the host keeps increasing even if the sparse-file is
> constant at 302GB. I even have 100GB of free disk-space inside that
> virtual disk-file. Writing 1GB inside the virtual disk-file seems to
> increase the usage about 4-5GB on the outside.
>
> Does anyone have a clue on what is going on? How can the difference
> and behaviour be like this when I just have one single file? Is it
> also normal to have 672MB of metadata for a single file?

Hello and welcome to the wonderful world of btrfs, where COW can really suck
hard without being super clear why!
It's 4pm on a Friday right before I'm gone for 2 weeks so I'm a bit happy
and drunk so I'm going to use pretty pictures. You have this case to start
with

file offset 0                                               offset 302g
[-------------------------prealloced 302g extent----------------------]

(man it's impressive I got all that lined up right)

On disk you have 2 things. First your file, which has file extents which say

inode 256, file offset 0, size 302g, offset 0, disk bytenr 123, disklen 302g

and then the extent tree, which keeps track of actual allocated space, has
this

extent bytenr 123, len 302g, refs 1

Now say you boot up your virt image and it writes one 4k block to offset 0.
Now you have this

[4k][--------------------302g-4k--------------------------------------]

And for your inode you now have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g), disklen 4k
inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123, disklen 302g

and in your extent tree you have

extent bytenr 123, len 302g, refs 1
extent bytenr whatever, len 4k, refs 1

See that? Your file is still the same size, it is still 302g. If you cp'ed
it right now it would copy 302g of information. But what have you actually
allocated on disk? Well that's now 302g + 4k. Now let's say your virt thing
decides to write to the middle, let's say at offset 12k, now you have this

inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g), disklen 4k
inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever, disklen 4k
inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123, disklen 302g

and in the extent tree you have this

extent bytenr 123, len 302g, refs 2
extent bytenr whatever, len 4k, refs 1
extent bytenr notimportant, len 4k, refs 1

See that refs 2 change? We split the original extent, so we have 2 file
extents pointing to the same physical extent, so we bumped the ref count.
This will happen over and over again until we have completely overwritten
the original extent, at which point your space usage will go back down to
~302g.

We split big extents with cow, so unless you've got lots of space to spare
or are going to use nodatacow you should probably not pre-allocate virt
images. Thanks,

Josef
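
The accounting walked through above can be sketched as a toy model in Python. It is not btrfs code: the function name and the eight-block stand-in file are made up for illustration, and the model assumes (as in the explanation) that the original extent stays fully allocated while any part of it is still referenced, and that each rewritten block lands in its own new 4k extent.

```python
BLOCK = 4096  # bytes per block, matching the 4k writes in the example


def allocated_bytes(file_blocks, overwritten_blocks):
    """Bytes allocated on disk for a file that began as one big extent.

    Toy model: the original extent is freed only once every block of it
    has been overwritten; each overwritten block costs one new block.
    """
    overwritten = {b for b in overwritten_blocks if 0 <= b < file_blocks}
    new_extents = len(overwritten) * BLOCK
    original = file_blocks * BLOCK if len(overwritten) < file_blocks else 0
    return original + new_extents


FILE_BLOCKS = 8  # tiny stand-in for the 302g image
for written in ([], [0], [0, 3], list(range(FILE_BLOCKS))):
    used = allocated_bytes(FILE_BLOCKS, written)
    print(f"{len(written)} blocks rewritten -> {used // BLOCK} blocks allocated")
```

With the eight-block stand-in, allocation goes from 8 blocks to 9, then 10, and drops back to 8 only once every block has been rewritten, while the file itself stays 8 blocks long throughout.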
Re: btrfs is using 25% more disk than it should
On 12/19/2014 02:59 PM, Daniele Testa wrote:
> No, I don't have any snapshots or subvolumes. Only that single file.
>
> The file has both checksums and datacow on it. I will do chattr +C on
> the parent dir and re-create the file to make sure all files are
> marked as nodatacow.
>
> Should I also turn off checksums with the mount-flags if this
> filesystem only contain big VM-files? Or is it not needed if I put +C
> on the parent dir?

Please God don't turn off checksums. Checksums are tracked in metadata
anyway, they won't show up in the data accounting. Our csums are 8 bytes per
block, so basic math says you are going to max out at 604 megabytes for that
big of a file.

Please people try to only take advice from people who know what they are
talking about. So unless it's from somebody who has commits in
btrfs/btrfs-progs take their feedback with a grain of salt. Thanks,

Josef
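
The "basic math" above can be checked in a couple of lines. This is only a sketch: the 8 bytes per block figure is taken from the message as stated, and the 4 KiB block size is an assumption based on the 4k blocks used elsewhere in the thread.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def csum_bytes(data_bytes, block_size=4096, csum_size=8):
    """Checksum bytes needed for data_bytes of file data.

    csum_size=8 is the per-block figure quoted in the message;
    block_size=4096 is an assumption.
    """
    blocks = -(-data_bytes // block_size)  # ceiling division
    return blocks * csum_size


total = csum_bytes(302 * GIB)
print(f"{total / MIB:.0f} MiB of checksums for 302 GiB of data")
```

Under those assumptions the result is 604 MiB, which matches the figure in the message and sits just below the 672.21MB of metadata reported by btrfs filesystem df.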
Re: btrfs is using 25% more disk than it should
On 12/19/2014 04:10 PM, Josef Bacik wrote:
> [...]
> This will happen over and over again until we have completely overwritten
> the original extent, at which point your space usage will go back down to
> ~302g.
>
> We split big extents with cow, so unless you've got lots of space to spare
> or are going to use nodatacow you should probably not pre-allocate virt
> images. Thanks,

Sorry, I should have added a tl;dr: CoW means that in the worst case you can
end up using 2 * filesize - blocksize of data on disk while the file still
appears to be filesize. Thanks,

Josef
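
The tl;dr bound can be sanity-checked with a small brute-force sketch. The model is an assumption, not btrfs code: a file starts as one extent of n blocks, each rewritten block takes one new block, and the original extent is released only when all n blocks have been rewritten. The function names are invented for illustration.

```python
GIB = 1024 ** 3


def worst_case_blocks(file_blocks):
    """Peak allocation, in blocks, over every possible number of rewrites."""

    def allocated(rewritten):
        original = file_blocks if rewritten < file_blocks else 0
        return original + rewritten

    return max(allocated(k) for k in range(file_blocks + 1))


def worst_case_bytes(file_size, block_size=4096):
    """The bound from the tl;dr: 2 * filesize - blocksize."""
    return 2 * file_size - block_size


# The brute-force peak agrees with 2n - 1 blocks for small files.
assert all(worst_case_blocks(n) == 2 * n - 1 for n in range(1, 200))

bound = worst_case_bytes(302 * GIB)
print(f"worst case for a 302 GiB file: {bound / GIB:.2f} GiB")
```

The peak is reached when all but one block has been rewritten. For the 302g image the bound is just under 604g, so the roughly 404GB of data usage reported earlier in the thread is well within what this model allows.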
Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!
On 12/12/2014 09:37 AM, Tomasz Chmielewski wrote:
> FYI, still seeing this with 3.18 (scrub passes fine on this filesystem).
>
> # time btrfs balance start /mnt/lxc2
> Segmentation fault

Ok, now I remember why I haven't fixed this yet: the images you gave me
restore, but then they don't mount because the extent tree is corrupted for
some reason. Could you re-image this fs and send it to me? I promise to
spend all of my time on the problem until it's fixed. Thanks,

Josef
Re: btrfs is using 25% more disk than it should
On 12/19/2014 4:15 PM, Josef Bacik wrote:
> Please God don't turn off of checksums. Checksums are tracked in
> metadata anyway, they won't show up in the data accounting. Our
> csums are 8 bytes per block, so basic math says you are going to
> max out at 604 megabytes for that big of a file.

Yes, and it is exactly that metadata space he is complaining about.
So if you don't want to use up all of that space (and have no use
for the checksums), then you turn them off.

> Please people try to only take advice from people who know what
> they are talking about. So unless it's from somebody who has
> commits in btrfs/btrfs-progs take their feedback with a grain of
> salt. Thanks,

Well that is rather arrogant and rude. For that matter, I *do* have
commits in btrfs-progs.
Re: btrfs is using 25% more disk than it should
On 12/19/2014 04:53 PM, Phillip Susi wrote:
> On 12/19/2014 4:15 PM, Josef Bacik wrote:
>> Please God don't turn off of checksums. Checksums are tracked in
>> metadata anyway, they won't show up in the data accounting. Our
>> csums are 8 bytes per block, so basic math says you are going to
>> max out at 604 megabytes for that big of a file.
>
> Yes, and it is exactly that metadata space he is complaining about.
> So if you don't want to use up all of that space (and have no use
> for the checksums), then you turn them off.
>
>> Please people try to only take advice from people who know what
>> they are talking about. So unless it's from somebody who has
>> commits in btrfs/btrfs-progs take their feedback with a grain of
>> salt. Thanks,
>
> Well that is rather arrogant and rude. For that matter, I *do* have
> commits in btrfs-progs.

root@destiny ~/btrfs-progs# git log --oneline --author="Phillip Susi"
c65345d btrfs-progs: document --rootdir mkfs switch
f6b6e93 btrfs-progs: removed extraneous whitespace from mkfs man page

Sorry I should have qualified that statement better.

So unless it's from somebody who has had commits to meaningful portions of
btrfs/btrfs-progs take their feedback with a grain of salt.

There are too many people on this list who give random horribly wrong advice
to users that can result in data loss or corruption. Now I'll admit I read
her question wrong so what you said wasn't incorrect, I'm sorry for that.
I've seen a lot of people responding to questions recently that I don't
recognize that have been completely full of crap, I just assumed you were in
that camp as well. Thanks,

Josef
Re: 3.18.0: kernel BUG at fs/btrfs/relocation.c:242!
On 2014-12-19 22:47, Josef Bacik wrote:
> On 12/12/2014 09:37 AM, Tomasz Chmielewski wrote:
>> FYI, still seeing this with 3.18 (scrub passes fine on this filesystem).
>>
>> # time btrfs balance start /mnt/lxc2
>> Segmentation fault
>
> Ok now I remember why I haven't fix this yet, the images you gave me
> restore but then they don't mount because the extent tree is corrupted
> for some reason. Could you re-image this fs and send it to me and I
> promise to spend all of my time on the problem until its fixed.

(un)fortunately one filesystem stopped crashing on balance with some
kernel update, and the other one I had crashing on balance was fixed
with btrfs - so I'm not able to reproduce anymore / produce an image
which is crashing.

--
Tomasz Chmielewski
http://www.sslrack.com
kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!
I got this BUG with 3.18.1 (pasted at the bottom of the email).

Below are all the actions from creating the fs to the BUG. I did not attempt
to reproduce it.

# mkfs.btrfs /dev/vdb
Btrfs v3.17.3
See http://btrfs.wiki.kernel.org for more information.

Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdb
        nodesize 16384 leafsize 16384 sectorsize 4096 size 256.00GiB

# mount -o noatime /dev/vdb /mnt/test/
# cd /mnt/test
# btrfs sub cre subvolume
Create subvolume './subvolume'

# dd if=/dev/urandom of=bigfile.img bs=64k
^C91758+0 records in
91757+0 records out
6013386752 bytes (6.0 GB) copied, 374.777 s, 16.0 MB/s

# btrfs sub list /mnt/test/
ID 257 gen 16 top level 5 path subvolume

# btrfs quota enable /mnt/test

# btrfs qgroup show /mnt/test
qgroupid rfer         excl
0/5      16384        16384
0/257    6013403136   6013403136

# dd if=/dev/urandom of=bigfile2.img bs=64k
^C47721+0 records in
47720+0 records out
3127377920 bytes (3.1 GB) copied, 194.641 s, 16.1 MB/s

# btrfs qgroup show /mnt/test
qgroupid rfer         excl
0/5      16384        16384
0/257    8704049152   8704049152

root@srv2:/mnt/test/subvolume# sync
root@srv2:/mnt/test/subvolume# btrfs qgroup show /mnt/test
qgroupid rfer         excl
0/5      16384        16384
0/257    9140781056   9140781056

# dd if=/dev/urandom of=bigfile3.img bs=64k
^C3617580+0 records in
3617579+0 records out
237081657344 bytes (237 GB) copied, 14796 s, 16.0 MB/s

# df -h
Filesystem      Size  Used Avail Use% Mounted on
(...)
/dev/vdb        256G  230G   25G  91% /mnt/test

# btrfs qgroup show /mnt/test
qgroupid rfer         excl
0/5      16384        16384
0/257    245960245248 245960245248

# ls -l
total 240451584
-rw-r--r-- 1 root root   3127377920 Dec 19 20:06 bigfile2.img
-rw-r--r-- 1 root root 237081657344 Dec 20 00:15 bigfile3.img
-rw-r--r-- 1 root root   6013386752 Dec 19 20:02 bigfile.img

# rm bigfile3.img
# sync

# dmesg
(...)
[   95.055420] BTRFS: device fsid 97f98279-21e7-4822-89be-3aed9dc05f2c devid 1 transid 3 /dev/vdb
[  118.446509] BTRFS info (device vdb): disk space caching is enabled
[  118.446518] BTRFS: flagging fs with big metadata feature
[  118.452176] BTRFS: creating UUID tree
[  575.189412] BTRFS info (device vdb): qgroup scan completed
[15948.234826] ------------[ cut here ]------------
[15948.234883] kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!
[15948.234906] invalid opcode: [#1] SMP
[15948.234925] Modules linked in: nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dm_crypt btrfs xor crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ppdev aes_x86_64 lrw raid6_pq gf128mul glue_helper ablk_helper cryptd serio_raw mac_hid pvpanic 8250_fintek parport_pc i2c_piix4 lp parport psmouse qxl ttm floppy drm_kms_helper drm
[15948.235172] CPU: 0 PID: 3274 Comm: btrfs-cleaner Not tainted 3.18.1-031801-generic #201412170637
[15948.235193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[15948.235222] task: 880036708a00 ti: 88007b97c000 task.ti: 88007b97c000
[15948.235240] RIP: 0010:[c0458ec9] [c0458ec9] btrfs_orphan_add+0x1a9/0x1c0 [btrfs]
[15948.235305] RSP: 0018:88007b97fc98 EFLAGS: 00010286
[15948.235318] RAX: ffe4 RBX: 88007b80a800 RCX:
[15948.235333] RDX: 219e RSI: 0004 RDI: 880079418138
[15948.235349] RBP: 88007b97fcd8 R08: 88007fc1cae0 R09: 88007ad272d0
[15948.235366] R10: R11: 0010 R12: 88007a2d9500
[15948.235381] R13: 8800027d60e0 R14: 88007b80ac58 R15: 0001
[15948.235401] FS: () GS:88007fc0() knlGS:
[15948.235418] CS: 0010 DS: ES: CR0: 80050033
[15948.235432] CR2: 7f0489ff CR3: 7a5e CR4: 001407f0
[15948.235464] Stack:
[15948.235473] 88007b97fcd8 c0497acf 88007b809800 88003c207400
[15948.235498] 88007b809800 88007ad272d0 88007a2d9500 0001
[15948.235521] 88007b97fd58 c04412e0 880079418000 0004c0427fea
[15948.235551] Call Trace:
[15948.235601] [c0497acf] ? lookup_free_space_inode+0x4f/0x100 [btrfs]
[15948.235642] [c04412e0] btrfs_remove_block_group+0x140/0x490 [btrfs]
[15948.235693] [c047bde5] btrfs_remove_chunk+0x245/0x380 [btrfs]
[15948.235731] [c0441866] btrfs_delete_unused_bgs+0x236/0x270 [btrfs]
[15948.235771] [c044ad6c] cleaner_kthread+0x12c/0x190 [btrfs]
[15948.235806]
Re: btrfs is using 25% more disk than it should
Daniele Testa posted on Sat, 20 Dec 2014 03:59:42 +0800 as excerpted:

> The file has both checksums and datacow on it. I will do chattr +C on
> the parent dir and re-create the file to make sure all files are marked
> as nodatacow.
>
> Should I also turn off checksums with the mount-flags if this filesystem
> only contain big VM-files? Or is it not needed if I put +C on the parent
> dir?

FWIW...

Turning off datacow, whether by chattr +C on the parent dir before
creating the file, or via mount option, turns off checksumming as well.
(For completeness, it also turns off compression, but I don't think that
applies in your case.)

In general, active VM images (and database files) with default flags tend
to get very highly fragmented very fast, due to btrfs' default COW on a
file with a heavy internal rewrite pattern (as opposed to append-only or
full rename/replace on rewrite). For relatively small files with this
rewrite pattern, think typical desktop firefox sqlite database files of a
quarter GiB or less, the btrfs autodefrag mount option can be helpful.
But because it triggers a rewrite of the entire file, as filesize goes
up, the viability of autodefrag goes down, and at somewhere around half a
gig, autodefrag doesn't work so well any more, particularly on very
active files where the incoming rewrite stream may be faster than btrfs
can rewrite the entire file.

Making heavy-internal-rewrite pattern files of over say half a GiB in
size nocow is one suggested solution. However, snapshots lock in place
the existing version, causing a one-time COW after a snapshot. If people
are doing frequent automated snapshots (say once an hour), this can be a
big problem, as the file ends up fragmenting pretty badly with these
one-time cow writes as well. That's where snapshots come into the picture.
There are ways to work around the problem (put the files in question on a
subvolume and don't snapshot it as often as the parent, setup a cron job
to do say weekly defrag on the files in question, etc), but since you
don't have snapshots going anyway, that's not a concern for you except as
a preventative -- consider it if you /do/ start doing snapshots.

So anyway, as I said, creating the file nocow (whether by mount option or
chattr) will turn off checksumming too. But on something that's that
frequently internally rewritten, where corruption will very likely
corrupt the VM anyway and there are already mechanisms in place to deal
with that (either VM integrity mechanisms, or backups, or simply
disposable VMs, fire up a new one when necessary), at least with btrfs
single-mode-data where there's no second copy to restore from if the
checksum /does/ fail, turning off checksumming isn't necessarily as bad
as it may seem anyway.

And it /should/ save you some on the metadata... tho I'd not consider
that savings worth turning off checksumming if that were the /only/
reason, on its own. The metadata difference is more a nice side-effect
of an already commonly recommended practice for large VM image files,
than something you'd turn off checksumming for in the first place.
Certainly, on most files I'd prefer the checksums, and in fact am running
btrfs raid1 mode here specifically to get the benefit of having a second
copy to retrieve from if the first attempted copy fails checksum. But VM
images and database files are a bit of an exception.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: btrfs is using 25% more disk than it should
Josef Bacik posted on Fri, 19 Dec 2014 16:17:08 -0500 as excerpted:

> tl;dr: Cow means you can in the worst case end up using 2 * filesize -
> blocksize of data on disk and the file will appear to be filesize.

Thanks for the tl;dr /and/ the very sensible longer explanation. That's
a very nice thing to know and to file away for further reference. =:^)

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Can BTRFS handle XATTRs larger than 4K?
Hi folks,

I need a Linux file system that supports XATTRs up to 64K. Can BTRFS
support that, or is XFS the only Linux file system with such support?

--
Regards,
Richard Sharpe
(何以解憂?唯有杜康。--曹操)
Re: btrfs is using 25% more disk than it should
On Fri, Dec 19, 2014 at 04:17:08PM -0500, Josef Bacik wrote:
> And for your inode you now have this
>
> inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g), disklen 4k
> inode 256, file offset 4k, size 302g-4k, offset 4k, disk bytenr 123, disklen 302g
>
> and in your extent tree you have
>
> extent bytenr 123, len 302g, refs 1
> extent bytenr whatever, len 4k, refs 1
>
> See that? Your file is still the same size, it is still 302g. If you
> cp'ed it right now it would copy 302g of information. But what you have
> actually allocated on disk? Well that's now 302g + 4k. Now lets say your
> virt thing decides to write to the middle, lets say at offset 12k, now you
> have this
>
> inode 256, file offset 0, size 4k, offset 0, disk bytenr (123+302g), disklen 4k
> inode 256, file offset 4k, size 8k, offset 4k, disk bytenr 123, disklen 302g
> inode 256, file offset 12k, size 4k, offset 0, disk bytenr whatever, disklen 4k
> inode 256, file offset 16k, size 302g - 16k, offset 16k, disk bytenr 123, disklen 302g
>
> and in the extent tree you have this
>
> extent bytenr 123, len 302g, refs 2
> extent bytenr whatever, len 4k, refs 1
> extent bytenr notimportant, len 4k, refs 1
>
> See that refs 2 change? We split the original extent, so we have 2 file
> extents pointing to the same physical extents, so we bumped the ref count.
> This will happen over and over again until we have completely overwritten
> the original extent, at which point your space usage will go back down to
> ~302g.

Wait, *what*?

OK, I did a small experiment, and found that btrfs actually does do
something like this. Can't argue with fact, though it would be nice if
btrfs could be smarter and drop unused portions of the original extent
sooner. :-P

The above quoted scenario is a little oversimplified. Chances are that
302G file is made of much smaller extents (128M..256M). If the VM is
writing 4K randomly everywhere then those 128M+ extents are not going
away any time soon.
Even the extents that are dropped stick around for a few btrfs transaction commits before they go away. I couldn't reproduce this behavior until I realized the extents I was overwriting in my tests were exactly the same size and position of the extents on disk. I changed the offset slightly and found that partially-overwritten extents do in fact stick around in their entirety. There seems to be an unexpected benefit for compression here: compression keeps the extents small, so many small updates will be less likely to leave big mostly-unused extents lying around the filesystem. signature.asc Description: Digital signature
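The accounting described in this message can be illustrated with a toy model. This is only a sketch under stated assumptions, not btrfs code: it assumes a single original extent of 8 blocks of 4K, assumes every overwrite of a not-yet-overwritten block lands in a fresh 4K extent, and uses made-up names (`ORIG_BLOCKS`, `overwrite_blocks`, `allocated_kib`). The original extent is counted as fully allocated as long as at least one block of the file still points into it.

```shell
#!/bin/sh
# Toy model (assumption-laden, not btrfs internals): an extent stays
# allocated in full until no part of the file references it any more.

BLOCK_KIB=4
ORIG_BLOCKS=8        # hypothetical original extent: 8 x 4K = 32K
overwritten=0        # distinct 4K blocks of the original overwritten so far

# Rewrite $1 blocks that still pointed into the original extent.
overwrite_blocks() {
    overwritten=$((overwritten + $1))
}

# Space charged on disk: the whole original extent while any block of it
# is still referenced, plus one new 4K extent per overwritten block.
allocated_kib() {
    if [ "$overwritten" -lt "$ORIG_BLOCKS" ]; then
        pinned=$((ORIG_BLOCKS * BLOCK_KIB))
    else
        pinned=0
    fi
    echo $((pinned + overwritten * BLOCK_KIB))
}

initial=$(allocated_kib)
overwrite_blocks 1; after_one=$(allocated_kib)
overwrite_blocks 6; peak=$(allocated_kib)
overwrite_blocks 1; final=$(allocated_kib)

echo "initial=${initial}K after_one=${after_one}K peak=${peak}K final=${final}K"
```

In this model usage climbs from 32K to 60K while the file itself never grows, and drops back to 32K only when the last block of the original extent has been overwritten, which mirrors the "302g + 4k" and "back down to ~302g" steps in the quoted explanation.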
Re: btrfs is using 25% more disk than it should
But I read somewhere that compression should be turned off on mounts that just store large VM images. Is that wrong?

Btw, I am not pre-allocating space for the images. I use sparse files, created with:

    dd if=/dev/zero of=drive.img bs=1 count=1 seek=300G

It creates the file in a few ms. Is it better to use fallocate with btrfs?

Sparse files give me a benefit when I want to copy or move the image file to another server: if the 300GB sparse file has just 10GB of data in it, I only need to copy 10GB. Would the same be true with fallocate?

Anyways, would disabling CoW (by putting +C on the parent dir) prevent the performance issues and 2*filesize issue?

2014-12-20 13:52 GMT+08:00 Zygo Blaxell ce3g8...@umail.furryterror.org:
> There seems to be an unexpected benefit for compression here: compression keeps the extents small, so many small updates will be less likely to leave big mostly-unused extents lying around the filesystem.
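One way to look at the sparse-versus-fallocate question on a given filesystem is to create both kinds of file and compare apparent size with allocated space. The sketch below is illustrative only: it uses a 16 MiB stand-in for the 300G image, placeholder file names in a temporary directory, and assumes GNU `stat`, `du`, and the util-linux `fallocate` tool are available. It does not answer how any particular copy tool treats preallocated space.

```shell
#!/bin/sh
# Compare a sparse file with a preallocated file of (almost) the same size.
# Sizes and file names are placeholders chosen for the example.

workdir=$(mktemp -d)
size=$((16 * 1024 * 1024))     # 16 MiB stands in for 300G

# Sparse: write a single byte at offset $size, same pattern as the dd
# command in the message, so the file is $size + 1 bytes long.
dd if=/dev/zero of="$workdir/sparse.img" bs=1 count=1 seek="$size" 2>/dev/null

sparse_bytes=$(stat -c %s "$workdir/sparse.img")
sparse_kib=$(du -k "$workdir/sparse.img" | cut -f1)
echo "sparse:   apparent=${sparse_bytes} bytes, allocated=${sparse_kib} KiB"

# Preallocated: fallocate may fail on filesystems that lack support,
# so treat failure as "cannot compare here" rather than as an error.
if fallocate -l "$size" "$workdir/prealloc.img" 2>/dev/null; then
    prealloc_ok=yes
    prealloc_bytes=$(stat -c %s "$workdir/prealloc.img")
    prealloc_kib=$(du -k "$workdir/prealloc.img" | cut -f1)
    echo "prealloc: apparent=${prealloc_bytes} bytes, allocated=${prealloc_kib} KiB"
else
    prealloc_ok=no
    echo "prealloc: fallocate not supported on this filesystem"
fi

rm -rf "$workdir"
```

The "allocated" column is the number to watch: it shows how much space each file actually occupies, as opposed to the size that `ls -l` reports.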
Re: btrfs is using 25% more disk than it should
Daniele Testa posted on Sat, 20 Dec 2014 14:18:31 +0800 as excerpted:

> Anyways, would disabling CoW (by putting +C on the parent dir) prevent the performance issues and 2*filesize issue?

It should, provided you don't then start snapshotting the file (which I don't believe you intend to do but just in case...).

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
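For reference, putting +C on the parent directory could look roughly like the sketch below. The directory name `vm-images` is a placeholder, and the script falls back to a message when the attribute cannot be set (for example when the directory is not on btrfs). Per the chattr documentation, the flag set on a directory is inherited by files created in it afterwards; files that already contain data are not converted.

```shell
#!/bin/sh
# Mark a directory No_COW so that files created in it later inherit the flag.
# "vm-images" is a hypothetical directory name used only for this example.

base=$(mktemp -d)
dir="$base/vm-images"
mkdir -p "$dir"

if chattr +C "$dir" 2>/dev/null; then
    nocow=set
    lsattr -d "$dir"           # the attribute list should now include 'C'
else
    nocow=unsupported
    echo "chattr +C is not supported for $dir (not on btrfs?)"
fi
echo "nocow=$nocow"

rm -rf "$base"
```

As noted in the reply above, snapshotting files in such a directory reintroduces shared extents, so the setup only helps as long as the images are not snapshotted.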