Re: How btrfs-find-root knows that the block is actually a root?
Hi Qu,

On Tue, Dec 23, 2014 at 7:27 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
>> Subject: How btrfs-find-root knows that the block is actually a root?
>> From: Alex Lyakas alex.bt...@zadarastorage.com
>> To: linux-btrfs linux-btrfs@vger.kernel.org
>> Date: Dec 22, 2014, 22:57
>>
>> Greetings,
>> I am looking at the code of search_iobuf() in btrfs-find-root.c (3.17.3).
>> I see that we probe nodesize blocks one by one, and for each block we
>> check that:
>> - its owner is the one we are looking for
>> - its header->bytenr matches the logical address we are currently scanning
>> - its level is not too small
>> - it has a valid checksum
>> - it has the desired generation
>> If all those conditions hold, we declare this block a root and end the
>> program. How do we actually know that it is a root and not a leaf or an
>> intermediate node? What if we are searching for the root of the root
>> tree, which has one node and two leaves (all with the same highest
>> transid), and one of the leaves has a logical address lower than the
>> actual root, i.e., it comes first in our scan? Then we would declare
>> that leaf the root, wouldn't we? Or does the root always have the
>> lowest logical address?
>
> You can refer to this patch: https://patchwork.kernel.org/patch/5285521/

I see that this has not been applied to any of David's branches. Do you
have a repo where I can look at this code in its entirety?

> Your questions are mostly right. The best method would be to search
> through all the metadata; for a given generation, only the header with
> the highest level can be the root for that generation. But that method
> still has some problems:
>
> 1) Overwritten old nodes/leaves
> As btrfs metadata CoW happens, an old node or leaf may be overwritten
> and become incomplete, so the above method won't always work as expected.
>
> 2) Corrupted fs
> That makes everything stop working as expected. But sadly, when someone
> needs to use btrfs-find-root, there is a high probability the fs is
> already corrupted.
> 3) Slow speed
> It needs to scan all the sectors of the metadata chunks, which may vary
> from megabytes to terabytes, making a complete scan impractical.
>
> So the current find-root uses a trade-off: if it finds a header at the
> position the superblock points to, and the generation matches, it just
> considers it the desired root and exits.

I think this is a bit optimistic. What if the root tree has several
leaves with the same generation as the root? Then we might declare a
leaf to be the root and exit, and further recovery based on that output
will get us into trouble.

>> Also, I am confused by this line:
>>     level = h_level;
>> This means that if we encounter a block that seems good, we will skip
>> all other blocks that have a lower level. Is this intended?
>
> This is intended, for the case where the user already knows the root's
> level, so it will skip any header whose level is below it.

But this line is executed before the generation check. Say the user did
not specify any level (so search_level == 0). Then assume we encounter a
block that has a lower generation than the one we need, but a higher
level. At this point we do level = h_level; and we will skip any blocks
of lower level from now on. What if the root tree got shrunk (due to
subvolume deletion, for example), and the good root has a lower level?
We will then skip it and never find the root.

> Thanks for your comments, Alex.
>
> Thanks,
> Qu

Thanks,
Alex.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
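The level-skip hazard Alex describes can be sketched in a few lines. This is a toy Python model of the loop in search_iobuf(), not the real C code; the field names and the block list are invented for illustration.

```python
# Minimal model of the suspect ordering: 'level = h_level' runs
# BEFORE the generation check, so a stale high-level block can
# permanently raise the level floor and hide the true root.

def find_root(blocks, want_gen, search_level=0):
    level = search_level
    for b in blocks:
        if b["level"] < level:      # skipped due to an earlier level bump
            continue
        level = b["level"]          # performed before the generation check
        if b["generation"] == want_gen:
            return b
    return None

# A stale block from an older, taller tree (generation 9, level 2)
# raises 'level' to 2, so the real current root (generation 10,
# level 1, after the tree shrank) is skipped.
blocks = [
    {"level": 2, "generation": 9},   # old, taller tree
    {"level": 1, "generation": 10},  # the real current root
]
assert find_root(blocks, want_gen=10) is None  # root is missed
```

Reordering the two statements (check generation first, then update the level floor) would make the same scan return the level-1 root.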
Re: [RFC PATCH 06/17] btrfs: make use of immutable biovecs
This seems like it could be applied without the rest of the series, right? Might be worth getting it into the btrfs tree ASAP?
Re: [RFC PATCH 06/17] btrfs: make use of immutable biovecs
On 23.12.2014 02:35, Christoph Hellwig wrote:
> This seems like it could be applied without the rest of the series,
> right? Might be worth to get it into the btrfs tree ASAP?

Ah, you're right. While patch #5 must be in this series, patch #6 does not necessarily have to be included. This conversion should belong to the multipage bvecs series. So I'll skip #6 in the next round.

Thanks,
Dongsu
Re: Standards Problems [Was: [PATCH v2 1/3] Btrfs: get more accurate output in df command.]
On 12/18/2014 12:07 PM, Robert White wrote:
> I don't disagree with the _ideal_ of your patch. I just think that it's
> impossible to implement it without lying to the user or making things
> just as bad in a different way. I would _like_ you to be right. But my
> thing is finding and quantifying failure cases and the entire question
> is full of fail.
> ...
> See, you keep giving me these examples where the history of the
> filesystem is uniform. It was made a certain way and stayed that way.
> But in real life this sort of thing is going to happen and your patch
> simply reports a _different_ _wrong_ number. A _friendlier_ wrong
> number, I'll grant you that, but still wrong.

Hi Robert,

Sorry for the late reply (I have been busy with some other work).

Yes, you are providing some examples here to show that in some cases the numbers from df are not helpful for understanding the real space state on disk. But I'm afraid we cannot blame df for reporting a wrong number. You could say, "Hey df, you are stupid, we can store small files in metadata to exceed your @size." But df just honestly reports the information it sees at the FS level; it does not care about anything outside its business.

All the special cases you provided in this thread actually lead to one result: btrfs is special, and the df command will report improper numbers in some cases. That means we need some other method to tell users about the space info. And yes, we have one: btrfs fi df.

You are trying to make df show as much information as possible, even changing the behaviour of df on btrfs to show numbers whose meaning differs from what they mean on other filesystems. My answer is to keep the df command doing what it already does, and solve the problems in our private df, btrfs fi df.

Thanks,
Yang
Re: Can BTRFS handle XATTRs larger than 4K?
On 2014-12-22 19:08, Robert White wrote:
> On 12/22/2014 02:55 PM, Richard Sharpe wrote:
>> On Mon, Dec 22, 2014 at 2:52 PM, Robert White rwh...@pobox.com wrote:
>>> So skipping the full ADS, what's the current demand/payoff for large
>>> XATTR space?
>>
>> Windows Security Descriptors (sometimes incorrectly called ACLs)
>> stored by Samba.
>
> Ah. I know that Linux ACLs are fairly small per entry; I take it
> Windows' can be much bigger?
>
> Having just gone off and read a lot about the many ADS possible in
> Windows, they've sort of treated every file as if it were the name of a
> phantom directory of limited depth... That is, you seem to be able to
> create any name as a stream name, but you can't create any pathname as
> the same. The system-level API -- that is, the complete retooling of
> SYS_open et al. -- and the requisite departure from POSIX -- seems
> unlikely.
>
> On the antipode, it seems like being able to put an inode reference key
> type (e.g. a (name, inode) pair as one of the metadata entries) could
> relieve the space constraint for a limited number of entries. The
> contents of that inode's data region would become the value of the
> single attribute. Does that relieve the Security Descriptor burden? Is
> each descriptor a separate attribute, or are all the descriptors held
> in one attribute as a list-of?
>
> Going full phantom directory to match Windows just seems like we get
> into the business of replacing whole kernel tidbits a la the
> inner-system effect.

Personally, I'm thinking something more like the OS X forks API than the Windows ADS API. Forks are actually the original reason that Windows added ADS to NTFS, and their API is much cleaner than the Windows one. The standard there is that forks are accessible as 'filename/..namedfork/forkname'; ..namedfork doesn't get listed by ls -a, but you can explicitly list 'filename/..namedfork/' to get a list of forks other than the data fork.
Such an API needs almost zero modification to existing programs (just enough to get tar and friends to check whether there are any named forks for each file). The significant advantages I see to having forks are:
1. Windows Security Descriptors
2. Macintosh Resource Forks
3. BeOS (and others) style associated metadata (i.e., associated file icons/thumbnails, per-file program associations; in other words, the same sort of stuff that OS X uses resource forks for)
4. Better flexibility for stuff like IMA and EVM
Re: [RFC PATCH 05/17] btrfs: remove bio splitting and merge_bvec_fn() calls
On Mon, Dec 22, 2014 at 6:48 AM, Dongsu Park dongsu.p...@profitbricks.com wrote:
> From: Kent Overstreet k...@daterainc.com
>
> Btrfs has been doing bio splitting from btrfs_map_bio(), by checking
> device limits as well as calling ->merge_bvec_fn() etc. That is not
> necessary any more, because generic_make_request() is now able to
> handle arbitrarily sized bios. So clean up unnecessary code paths.
>
> Signed-off-by: Kent Overstreet k...@daterainc.com
> [dpark: add more description in commit message]
> Signed-off-by: Dongsu Park dongsu.p...@profitbricks.com
> Cc: Chris Mason c...@fb.com
> Cc: Josef Bacik jba...@fb.com
> Cc: linux-btrfs@vger.kernel.org
> ---

Looks good, I'll test it here too. Thanks!

Signed-off-by: Chris Mason c...@fb.com
Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible
Hi Qu,

On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
> Before this patch, a chunk would be considered bad if the corresponding
> block group was missing, even when the only uncertain data was the
> 'used' member of the block group.
>
> This patch tries to recalculate the 'used' value of the block group and
> rebuild it, so the chunk can be recovered even if only the chunk item
> and dev extent item are found.
>
> If the extent tree is damaged and the needed extent items can't be
> read, the block group's 'used' value is set to the block group length,
> to prevent any later write/block reserve from damaging the block group.
> In that case, we prompt the user and recommend using
> '--init-extent-tree' to rebuild the extent tree if possible.
>
> Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
> ---
>  btrfsck.h       |   3 +-
>  chunk-recover.c | 242 +---
>  cmds-check.c    |  29 ---
>  3 files changed, 234 insertions(+), 40 deletions(-)
>
> diff --git a/btrfsck.h b/btrfsck.h
> index 356c767..7a50648 100644
> --- a/btrfsck.h
> +++ b/btrfsck.h
> @@ -179,5 +179,6 @@ btrfs_new_device_extent_record(struct extent_buffer *leaf,
>  int check_chunks(struct cache_tree *chunk_cache,
>                  struct block_group_tree *block_group_cache,
>                  struct device_extent_tree *dev_extent_cache,
> -                struct list_head *good, struct list_head *bad, int silent);
> +                struct list_head *good, struct list_head *bad,
> +                struct list_head *rebuild, int silent);
>  #endif
>
> diff --git a/chunk-recover.c b/chunk-recover.c
> index 6f43066..dbf98b5 100644
> --- a/chunk-recover.c
> +++ b/chunk-recover.c
> @@ -61,6 +61,7 @@ struct recover_control {
>  	struct list_head good_chunks;
>  	struct list_head bad_chunks;
> +	struct list_head rebuild_chunks;
>  	struct list_head unrepaired_chunks;
>  	pthread_mutex_t rc_lock;
>  };
>
> @@ -203,6 +204,7 @@ static void init_recover_control(struct recover_control *rc, int verbose,
>  	INIT_LIST_HEAD(&rc->good_chunks);
>  	INIT_LIST_HEAD(&rc->bad_chunks);
> +	INIT_LIST_HEAD(&rc->rebuild_chunks);
>  	INIT_LIST_HEAD(&rc->unrepaired_chunks);
>  	rc->verbose = verbose;
>
> @@ -529,22 +531,32 @@ static void print_check_result(struct recover_control *rc)
>  		return;
>  	printf("CHECK RESULT:\n");
> -	printf("Healthy Chunks:\n");
> +	printf("Recoverable Chunks:\n");
>  	list_for_each_entry(chunk, &rc->good_chunks, list) {
>  		print_chunk_info(chunk, );
>  		good++;
>  		total++;
>  	}
> -	printf("Bad Chunks:\n");
> +	list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
> +		print_chunk_info(chunk, );
> +		good++;
> +		total++;
> +	}
> +	list_for_each_entry(chunk, &rc->unrepaired_chunks, list) {
> +		print_chunk_info(chunk, );
> +		good++;
> +		total++;
> +	}
> +	printf("Unrecoverable Chunks:\n");
>  	list_for_each_entry(chunk, &rc->bad_chunks, list) {
>  		print_chunk_info(chunk, );
>  		bad++;
>  		total++;
>  	}
>  	printf("\n");
> -	printf("Total Chunks:\t%d\n", total);
> -	printf("  Heathy:\t%d\n", good);
> -	printf("  Bad:\t%d\n", bad);
> +	printf("Total Chunks:\t\t%d\n", total);
> +	printf("  Recoverable:\t\t%d\n", good);
> +	printf("  Unrecoverable:\t%d\n", bad);
>  	printf("\n");
>  	printf("Orphan Block Groups:\n");
>
> @@ -555,6 +567,7 @@ static void print_check_result(struct recover_control *rc)
>  	printf("Orphan Device Extents:\n");
>  	list_for_each_entry(devext, &rc->devext.no_chunk_orphans, chunk_list)
>  		print_device_extent_info(devext, );
> +	printf("\n");
>  }
>
>  static int check_chunk_by_metadata(struct recover_control *rc,
>
> @@ -938,6 +951,11 @@ static int build_device_maps_by_chunk_records(struct recover_control *rc,
>  		if (ret)
>  			return ret;
>  	}
> +	list_for_each_entry(chunk, &rc->rebuild_chunks, list) {
> +		ret = build_device_map_by_chunk_record(root, chunk);
> +		if (ret)
> +			return ret;
> +	}
>  	return ret;
>  }
>
> @@ -1168,12 +1186,31 @@ static int __rebuild_device_items(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>
> +static int __insert_chunk_item(struct btrfs_trans_handle *trans,
> +			       struct chunk_record *chunk_rec,
> +			       struct btrfs_root *chunk_root)
> +{
> +	struct btrfs_key key;
> +	struct btrfs_chunk *chunk = NULL;
> +	int ret = 0;
> +
> +	chunk = create_chunk_item(chunk_rec);
> +	if (!chunk)
> +		return -ENOMEM;
> +	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
> +	key.type = BTRFS_CHUNK_ITEM_KEY;
> +	key.offset = chunk_rec->offset;
> +
> +	ret =
[PATCH] btrfs-progs: Fix btrfs fi show by uuid and label
Commit 8be2fff ("btrfs-progs: apply realpath for btrfs fi show when mount point is given") changed the behavior of btrfs fi show to return an error if the call to realpath() failed. This broke the ability to specify a filesystem by uuid or label. So let's not consider a failed call to realpath() an error. If the user really specified a bad device, just return nothing like we did before.
---
 cmds-filesystem.c |    9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 8f037dd..a654e6f 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -901,13 +901,8 @@ static int cmd_show(int argc, char **argv)
 	 * realpath do /mnt/btrfs/ -> /mnt/btrfs
 	 * which shall be recognized by btrfs_scan_kernel()
 	 */
-	if (!realpath(search, path)) {
-		fprintf(stderr, "ERROR: Could not show %s: %s\n",
-			search, strerror(errno));
-		return 1;
-	}
-
-	search = path;
+	if (realpath(search, path))
+		search = path;

 	/*
 	 * Needs special handling if input arg is block dev And if
--
1.7.10.4
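The fallback the patch restores can be sketched outside C. This is a hypothetical Python model, not the actual cmds-filesystem.c logic; os.path.realpath plus an existence check stands in for the C realpath(3) call, which fails for nonexistent paths.

```python
import os

def normalize_search_arg(search):
    # Canonicalize only when the argument resolves to a real path;
    # otherwise pass the raw string (likely a UUID or label) through
    # unchanged, which is the pre-8be2fff behavior the fix restores.
    resolved = os.path.realpath(search)
    if os.path.exists(resolved):
        return resolved
    return search

# A real path canonicalizes; a label-like string survives untouched
# instead of producing a hard error.
assert normalize_search_arg("/") == "/"
assert normalize_search_arg("no-such-label-xyz") == "no-such-label-xyz"
```

The design point is the same as in the C patch: a failed canonicalization is not an error, it just means the argument was never a path to begin with.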
Re: Quota limit question
On Tue, Dec 16, 2014 at 11:15:37PM -0200, Christian Robottom Reis wrote:
> # btrfs qgroup limit 2000m 0/261 .
> # touch x
> touch: cannot touch ‘x’: Disk quota exceeded
>
> The strange thing is that it doesn't seem to be actually out of space:
>
> # btrfs qgroup show -p -r -e /var | grep 261
> 0/261  810048  391114752  2097152000  0

Replying to myself, as I had not yet been subscribed in time to receive a reply; I just upgraded to 3.18.1 and am seeing the same issue on the same subvolume (and on no others).

root@riff:/etc# uname -a
Linux riff 3.18.1-031801-generic #201412170637 SMP Wed Dec 17 11:38:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

It's quite odd that this specific subvolume acts up, given that there are quite a few others that are closer to the quota:

    subvol                 group   total           unshared
    -------------------------------------------------------
    (unknown)              0/5     1.37G / none    1.16G / none
    lxc-template1/rootfs   0/259   0.68G / none    0.10G / 2.00G
    machine-2/rootfs       0/261   1.07G / none    0.40G / 2.00G
    machine-3/rootfs       0/265   1.17G / none    0.41G / 2.00G
    lxc-template2/rootfs   0/271   0.77G / none    0.31G / 2.00G
    lxc-template3/rootfs   0/274   0.46G / none    0.02G / 2.00G
    machine-4/rootfs       0/283   7.12G / none    6.21G / 10.00G
    machine-5/rootfs       0/288   1.05G / none    0.34G / 2.00G
    machine-6/rootfs       0/289   11.33G / none   10.74G / 15.00G
    machine-7/rootfs       0/290   1.30G / none    0.68G / 2.00G
    machine-8/rootfs       0/292   1.00G / none    0.33G / 2.00G
    machine-9/rootfs       0/293   1.17G / none    0.38G / 2.00G
    machine-10/rootfs      0/306   1.34G / none    0.62G / 2.00G
    machine-11/rootfs      0/318   9.49G / none    8.75G / 15.00G
    lxc-template4/rootfs   0/320   0.79G / none    0.78G / 2.00G
    machine-14/rootfs      0/323   1.10G / none    0.45G / 2.00G

The LWN article suggests that btrfs is quite conservative with quotas, but shouldn't 265, 290, 306, 320 and 323 all be out of quota as well? Or is there a lot else that goes into the calculation beyond the numbers reported by btrfs qgroup show? What could I do to help investigate further?
--
Christian Robottom Reis | [+55 16] 3376 0125 | http://async.com.br/~kiko
CEO, Async Open Source  | [+55 16] 9 9112 6430 | http://launchpad.net/~kiko
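The arithmetic behind the question can be made explicit with a small sketch. This is illustrative Python, not btrfs code; the column meaning (referenced bytes vs. max_rfer limit) is an assumption based on the qgroup show output quoted above.

```python
# Which of the reported qgroups should be over quota by the raw
# numbers? Values are taken from the table in the report.

GiB = 1024 ** 3

def over_quota(referenced, max_rfer):
    """True if a qgroup's referenced bytes have reached its limit."""
    return max_rfer is not None and referenced >= max_rfer

rows = [
    ("0/261", 1.07 * GiB, 2 * GiB),    # the subvolume getting EDQUOT
    ("0/289", 11.33 * GiB, 15 * GiB),  # much closer to its limit
    ("0/320", 0.79 * GiB, 2 * GiB),
]
# By the raw numbers, none of these has hit its limit, which is why
# EDQUOT on 0/261 alone looks inconsistent with the reported values.
assert [qid for qid, ref, lim in rows if over_quota(ref, lim)] == []
```

If the kernel's internal reservation accounting diverged from the on-disk referenced counts (as later qgroup bug fixes suggest could happen), a subvolume could hit EDQUOT while qgroup show still reports it under the limit.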
Re: [PATCH] btrfs: add regression test for remount with thread_pool resized
On Tue, Dec 23, 2014 at 03:07:32PM +0800, Qu Wenruo wrote:
>> Subject: Re: [PATCH] btrfs: add regression test for remount with thread_pool resized
>> From: gux.fnst gux.f...@cn.fujitsu.com
>> To: fste...@vger.kernel.org
>> Date: Sep 17, 2014, 18:02
>>
>> Hi,
>> Could anyone help to review this patch? Thanks!
>>
>> Regards,
>> Xing Gu
>>
>> On 05/30/2014 04:52 PM, Xing Gu wrote:
>>> Regression test for resizing 'thread_pool' when remounting the fs.
>>>
>>> Signed-off-by: Xing Gu gux.f...@cn.fujitsu.com
>>> ---
>>>  tests/btrfs/055     | 55 +
>>>  tests/btrfs/055.out |  1 +
>>>  tests/btrfs/group   |  1 +
>>>  3 files changed, 57 insertions(+)
>>>  create mode 100755 tests/btrfs/055
>>>  create mode 100644 tests/btrfs/055.out
>>>
>>> diff --git a/tests/btrfs/055 b/tests/btrfs/055
>>> new file mode 100755
>>> index 000..0a0dd34
>>> --- /dev/null
>>> +++ b/tests/btrfs/055
>>> @@ -0,0 +1,55 @@
>>> +#!/bin/bash
>>> +# FS QA Test No. btrfs/055
>>> +#
>>> +# Regression test for resizing 'thread_pool' when remount the fs.
>>> +#
>
> IMHO, it would be better to add the commit number for the regression
> and the fix commit number (Filipe Manana always does a great job of
> this). For reference:
> Regression commit: 08a9ff3264181986d1d692a4e6fce3669700c9f8 (all my fault)
> Fix commit: 800ee2247f483b6d05ed47ef3bbc90b56451746c

That belongs in the commit message, not the test description.

-Dave.
--
Dave Chinner
da...@fromorbit.com
Re: [PATCH v2] btrfs: add regression test for remount with thread_pool resized
On Tue, Dec 23, 2014 at 03:23:10PM +0800, Eryu Guan wrote:
> On Tue, Dec 23, 2014 at 01:41:02PM +0800, Xing Gu wrote:
> Regression test for a btrfs issue of resizing 'thread_pool' when
> remounting the fs.
>
> Signed-off-by: Xing Gu gux.f...@cn.fujitsu.com
> ---
>  tests/btrfs/017     | 64 +
>  tests/btrfs/017.out |  1 +
>  tests/btrfs/group   |  1 +
>  3 files changed, 66 insertions(+)
>  create mode 100755 tests/btrfs/017
>  create mode 100644 tests/btrfs/017.out
>
> diff --git a/tests/btrfs/017 b/tests/btrfs/017
> new file mode 100755
> index 000..81f5af6
> --- /dev/null
> +++ b/tests/btrfs/017
> @@ -0,0 +1,64 @@
> +#!/bin/bash
> +# FS QA Test No. btrfs/017
> +#
> +# Regression test for a btrfs issue of resizing 'thread_pool' when
> +# remount the fs.
> +# This was fixed in the following linux kernel patch:
> +# 800ee22 btrfs: fix crash in remount(thread_pool=) case
> +#
> +#---
> +# Copyright (c) 2014 Fujitsu. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +
> +_cleanup()
> +{
> +    rm -f $tmp.*
> +}
> +
> +trap "_cleanup ; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +
> +_scratch_mkfs > /dev/null
> +
> +dmesg -c > /dev/null
> +nr_null=`dmesg | grep -c "kernel NULL pointer dereference"`
>
> dmesg -c cleared the dmesg buffer and nr_null is zero, so I think this
> is duplicated work; either drop dmesg -c or check nr_null before the
> remount.

Please don't clear the dmesg buffer - it means that people can't review the kernel error messages across the entire test run for errors unless they are capturing that information elsewhere. And yes, I don't use syslog for capturing dmesg output on my test boxes because that leads to root filesystem ENOSPC whenever I'm dumping a lot of debug to the console.

> Actually I'm thinking about checking the dmesg log in check after each
> test, which could detect kernel BUG, WARNING, NULL pointer dereference
> etc., so tests don't need to check dmesg themselves. I'll send it out
> for review soon.

Some tests actually generate errors/warnings on purpose. IOWs, the presence of messages in dmesg is not always indicative of an error.

> +_scratch_mount -o remount,thread_pool=10
> +new_null=`dmesg | grep -c "kernel NULL pointer dereference"`
> +if [ $new_null -ne $nr_null ]; then
> +	_fatal "kernel bug detected, check dmesg for more infomation."
>
> _fail is better than _fatal, and don't forget to remove $seqres.full
> before the test if you use _fail, because it dumps errors to
> $seqres.full. But simply echoing the error message could fail the test
> too; I'm fine with either way.

Echo the error message, set status=1 and exit. Oh, wait, that's what _fail does. ;)

> +exit
>
> diff --git a/tests/btrfs/017.out b/tests/btrfs/017.out
> new file mode 100644
> index 000..90c4b69
> --- /dev/null
> +++ b/tests/btrfs/017.out
> @@ -0,0 +1 @@
> +QA output created by 017

Empty output with no indication that it should be empty is bad form as well. The test should echo something to indicate that an empty output file is expected.
"Silence is golden" is the usual string used to indicate that the golden output expects no output from the test.

Cheers,
Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Uncorrectable errors on RAID-1?
On Sun, Dec 21, 2014 at 05:25:47PM -0700, Chris Murphy wrote:
> For the kernel to automatically fix bad sectors by overwriting them,
> the drive needs to explicitly report read errors. If the SCSI command
> timer value is shorter than the drive's error recovery, the SATA link
> might get reset before the drive reports the read error, and then
> uncorrected errors will persist instead of being automatically fixed.

Is there a way to tell the kernel to go ahead and assume that all timeouts are effectively read errors? For a simple non-removable hard disk (i.e. not removable and not optical), that seems like a reasonable workaround for an assortment of firmware brokenness.

I just did a quick survey of random drives here and found less than 10% support smartctl -l scterc. A lot of server drives (or at least the drives that shipped in servers) don't have it, but laptop drives do. Drives with firmware that has horrifying known bugs do also have this feature. :-P
Re: btrfs is using 25% more disk than it should
On Sat, Dec 20, 2014 at 06:28:22AM -0500, Josef Bacik wrote:
> We now have two extents with the same bytenr but with different
> lengths. [...] Then there is the problem of actually returning the
> free space. Now if we drop all of the refs for an extent, we know the
> space is free and we return it to the allocator. With the above
> example we can't do that anymore; we have to check the extent tree for
> any area that is left overlapping the area we just freed. This adds
> another search to every btrfs_free_extent operation, which slows the
> whole system down and again leaves us with weird corner cases and pain
> for the users. Plus this would be an incompatible format change, so it
> would require setting a feature flag in the fs and being rolled out
> voluntarily. Ouchie.
>
> Now I have another solution, but I'm not convinced it's awesome
> either. Take the same example above, but instead we split the original
> extent in the extent tree so we avoid all the mess of having
> overlapping ranges.

Would this work for a read-only snapshot? For a read-write snapshot it would be as if we had modified both (or all, if there are multiple snapshots) versions of the tree with split extents.

> This wouldn't require a format change, so everybody would get this
> behaviour as soon as we turned it on.

It could be a mount option, like autodefrag, off by default until the bugs were worked out. Arguably there could be a 'garbage-collection tool' similar to 'btrfs fi defrag' that could be used to clean out any large partially-obscured extents from specific files. This might be important for deduplication as well (although the extent-same code looks like it does split extents?). Definitely something to think about.

Thanks for the detailed explanations.
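The range arithmetic behind the "split the extent" idea can be shown with a toy model. Real btrfs extent items carry far more state (refs, generation, flags); this only illustrates how trimming the recorded extent keeps ranges non-overlapping.

```python
# When part of an extent is freed/overwritten, split the recorded
# extent so the extent tree never contains overlapping ranges.

def split_extent(extent, freed):
    """extent and freed are (start, length) tuples.
    Return the surviving pieces of 'extent' after 'freed' is removed."""
    e_start, e_len = extent
    f_start, f_len = freed
    e_end, f_end = e_start + e_len, f_start + f_len
    pieces = []
    if f_start > e_start:
        pieces.append((e_start, f_start - e_start))  # left remainder
    if f_end < e_end:
        pieces.append((f_end, e_end - f_end))        # right remainder
    return pieces

# Freeing the middle 4K of a 16K extent leaves two disjoint pieces,
# so no later lookup has to reason about overlapping records.
assert split_extent((0, 16384), (4096, 4096)) == [(0, 4096), (8192, 8192)]
```

The cost Josef alludes to is visible even here: every partial free turns one record into up to two, so reference bookkeeping multiplies instead of staying one-record-per-extent.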
Re: Uncorrectable errors on RAID-1?
On Tue, Dec 23, 2014 at 2:16 PM, Zygo Blaxell zblax...@furryterror.org wrote:
> On Sun, Dec 21, 2014 at 05:25:47PM -0700, Chris Murphy wrote:
>> For the kernel to automatically fix bad sectors by overwriting them,
>> the drive needs to explicitly report read errors. If the SCSI command
>> timer value is shorter than the drive's error recovery, the SATA link
>> might get reset before the drive reports the read error and then
>> uncorrected errors will persist instead of being automatically fixed.
>
> Is there a way to tell the kernel to go ahead and assume that all
> timeouts are effectively read errors?

The timer in /sys is a kernel command timer; it's not a device timer, even though it's pointed at a block device. You need to change that from 30 to something higher to get the behavior you want. It doesn't really make sense to say "time out in 30 seconds, but instead of reporting a timeout, report it as a read error." They're completely different things. There are all sorts of errors listed in libata, so dumping all of them into a read error doesn't make sense. A lot of those errors don't report back a sector, and the key part of a read error is which sector(s) have the problem, so that they can be fixed. Without that information, the ability to fix it is lost. And it's the drive that needs to report this.

> For a simple non-removable hard disk (i.e. not removable and not
> optical), that seems like a reasonable workaround for an assortment of
> firmware brokenness.

"Oven doesn't work, so let's spray gasoline on it and light it and the kitchen on fire so that we can cook this damn pizza!" That's what I just read. Sorry. It doesn't seem like a good idea to me to map all errors as read errors.

> I just did a quick survey of random drives here and found less than
> 10% support smartctl -l scterc. A lot of server drives (or at least
> the drives that shipped in servers) don't have it, but laptop drives
> do. Drives with firmware that has horrifying known bugs do also have
> this feature. :-P

Any decent server SATA drive should support SCT ERC. The inexpensive WDC Red drives for NASes all have it, and by default it is a reasonable 70 deciseconds, last time I checked. It might be that you're using SAS drives? In that case they may have something different from SCT ERC that serves the same purpose, but I don't have any SAS drives here to check. I'd expect any SAS drive already has short error recoveries by default, but that expectation might be flawed.

Chris Murphy
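The interaction between the two timers can be written down as a small check. This is an illustrative sketch, not a tool: SCT ERC is reported in deciseconds (e.g. 70 for the WDC Red default mentioned above), and the kernel's per-device SCSI command timer defaults to 30 seconds.

```python
# If the drive's internal error recovery can run longer than the
# kernel's SCSI command timer, the link gets reset before the drive
# ever reports the read error, and the bad sector is never rewritten.

def link_reset_risk(erc_deciseconds, scsi_timer_seconds=30):
    """True if in-drive recovery may outlast the kernel command timer."""
    return erc_deciseconds / 10 >= scsi_timer_seconds

assert not link_reset_risk(70)   # 7s ERC under a 30s timer: error reported
assert link_reset_risk(1200)     # ~2 min of in-drive retries: link reset
```

This is why the usual advice is either to cap ERC (smartctl -l scterc,70,70 on drives that support it) or to raise the command timer for drives that cannot cap their recovery time.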
Re: Uncorrectable errors on RAID-1?
The other thing to note is that the SCSI command timer timeout is a maximum: if a command to the drive hasn't completed after 30 seconds, the drive is considered hung and a link reset is done. And whatever error recovery is in the drive is also a maximum. If the sector is really, immediately bad, the drive will produce a read error immediately. The case where you get these long recoveries, with the drive retrying beyond the 30-second SCSI command timer value, is when the drive firmware's ECC thinks it can recover (or reconstruct) the data instead of producing a read error.

A gotcha with changing the SCSI command timer to a much larger value is that it possibly gives the drive enough time to recover the data, report it back to the kernel, and then everything goes on normally. The slow sector doesn't get fixed. Even a scrub wouldn't fix that unless the drive reported wrongly recovered data and the btrfs checksums caught it. So what you want to do with a drive that has, or is suspected of having, such slow sectors is to balance it. Rewriting everything should cause the drive firmware to map out those sectors if they result in persistent write errors.

What ought to happen is that the data from slow sectors, once recovered, gets written to a reserve sector and the old sector is removed from use (remapping, i.e. the LBA is the same but the physical sector is different), but every drive firmware handles this differently. I definitely have had drives where this doesn't happen automatically. Also, I've had drives that, when ATA Secure Erased, did not test for persistent write errors, so bad sectors weren't removed from use; they'd remain persistently bad when doing smartctl -t long tests. In those cases, using badblocks -w fixed the problem, but of course that's destructive.
--
Chris Murphy
Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image + corrupt script fsck test case.
[Sorry to take some time to get to this; it got caught by a spam filter and I only just noticed.]

On Tue, Dec 16, 2014, Filipe David Manana wrote:
> On Tue, Dec 16, 2014 at 1:35 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
>> Subject: Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image +
>> corrupt script fsck test case.
>> From: David Sterba dste...@suse.cz
>> To: Filipe David Manana fdman...@gmail.com
>> Date: Dec 16, 2014, 02:19
>>
>>> On Mon, Dec 15, 2014 at 10:13:45AM, Filipe David Manana wrote:
>>> So another thing I would like to see is a more comprehensive
>>> verification that the repair code worked as expected. Currently we
>>> only check that a readonly fsck, after running fsck --repair,
>>> returns 0. For the improvements you've been doing, it's equally
>>> important to verify that --repair recovered the inodes, links, etc.
>>> to the lost+found directory (or whatever the directory's name is).
>>> So perhaps add a verify.sh script to the tarball, for example? Or, I
>>> forgot before, it might be better to do such verification/tests in
>>> xfstests, since we can create the fs and use the new btrfs-progs
>>> programs to corrupt leafs/nodes. xfstests has a lot of
>>> infrastructure already and is probably run by a lot more people
>>> (compared to the fsck tests of btrfs-progs).
>>
>> I'm thinking about the best way to integrate that, but it seems there
>> will always be some level of code or infrastructure duplication (or
>> other hassle). btrfs-corrupt-block is not installed by default (make
>> install) and it's not the type of utility I'd consider for default
>> installations. The tests would be skipped in the absence of the
>> utility, so there will be test environments where "install xfstests,
>> install btrfs-progs" will not add the desired test coverage. Solvable
>> by packaging the extra progs. Adding corrupt-block into xfsprogs is
>> infeasible (IMO too much code from btrfs-progs to be added).
I don't know how much infrastructure code we'd have to either write or copy from fstests, but I think it would not be that much. Ideally we could write the tests within btrfs-progs and then submit them to fstests once they're considered reliable. If we keep the same syntax of the tests, provide stubs where applicable, the code duplication in test itself would be zero. We'd only have to write the stubs in btrfs-progs and probably extend fstests to provide helpers for preparing/unpacking the images. In my wildest idea, if we have a good enough btrfs debugger(maybe even stronger than debugfs), which can do almost everything from read key/item to corrupt given structure, then we can resolve them all. No binary image since corruption can be done by it and verify can also done by it. (OK, it's just a daydream) But IMHO, isn't xfstests designed to mainly detect kernel defeats? I don't see any fsck tool test case in it. I don't think xfstests is specific to test the kernel implementation of filesystems. I believe it includes user space code too, but I might be wrong so I'm CCing fstests and Dave to get an authoritative answer. We use fstests to test everything we ship for XFS - kernel and userspace. i.e. we have tests that corrupt filesystems with xfs_db and then test that xfs_repair can fix them, and once fixed the filesystem can be mounted and used by the kernel... i.e. fstests is for testing both the kernel code and the utilities used to manipulate filesystems. And I don't see a big problem with btrfs-corrupt not being built by default when running make on btrfs-progs. We can make the xfstest not run and print an informative message if the btrfs-corrupt program isn't available - several xfstests do this, they require some executable which isn't either in the xfstests nor xfsprogs repositories - for example generic/241 which requires 'dbench' and generic/299 which requires 'fio'. 
_require_btrfs_corrupt_prog() Just like we do with lots of other optional userspace tools that are required for various tests to run. I also have a slight preference to get all tests in the same place (xfstests) rather than in multiple repositories (btrfs-progs, xfstests). Definitely my preference as well. Cheers, Dave. -- Dave Chinner da...@fromorbit.com
Re: How btrfs-find-root knows that the block is actually a root?
Original Message Subject: Re: How btrfs-find-root knows that the block is actually a root? From: Alex Lyakas alex.bt...@zadarastorage.com To: Qu Wenruo quwen...@cn.fujitsu.com Date: December 23, 2014, 18:08

Hi Qu,

On Tue, Dec 23, 2014 at 7:27 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

Original Message Subject: How btrfs-find-root knows that the block is actually a root? From: Alex Lyakas alex.bt...@zadarastorage.com To: linux-btrfs linux-btrfs@vger.kernel.org Date: December 22, 2014, 22:57

Greetings, I am looking at the code of search_iobuf() in btrfs-find-root.c (3.17.3). I see that we probe nodesize blocks one by one, and for each block we check:
- its owner is what we are looking for
- its header->bytenr is what we are looking at currently
- its level is not too small
- it has a valid checksum
- it has the desired generation

If all those conditions are true, we declare this block a root and end the program. How do we actually know that it's a root and not a leaf or an intermediate node? What if we are searching for the root of the root tree, which has one node and two leaves (all with the same highest transid), and one of the leaves has a logical address lower than the actual root, i.e., it comes first in our scan? Then we will declare this leaf a root, won't we? Or does the root somehow always have the lowest logical address?

You can refer to this patch: https://patchwork.kernel.org/patch/5285521/

I see that this has not been applied to any of David's branches. Do you have a repo to look at this code in its entirety?

Sadly, no personal repo...

Your questions are mostly right. The best method would be to search through all the metadata; only the highest-level header for a given generation may be the root for that generation. But that method still has some problems.
1) Overwritten old nodes/leaves: as btrfs metadata COW happens, an old node/leaf may be overwritten and become incomplete, so the above method won't always work as expected.
2) Corrupted fs: that will make everything not work as expected.
But sadly, when someone needs to use btrfs-find-root, there is a high possibility the fs is already corrupted.
3) Slow speed: it needs to scan all the sectors of the metadata chunks, which may vary from megabytes to terabytes, making a complete scan impractical.

So the current find-root uses a trade-off: if it finds a header at the position the superblock points to, and the generation matches, it just considers it the desired root and exits.

I think this is a bit optimistic. What if the root tree has several leaves with the same generation as the root? Then we might declare a leaf a root and exit. But further recovery based on that output will get us into trouble.

That's right, but if btrfs-find-root took several hours each time to do a full scan, most users would go crazy... especially when it is executed on a clean btrfs.

Also, I am confused by this line: level = h_level; This means that if we encounter a block that seems good, we will skip all other blocks that have a lower level. Is this intended?

This is intended, for the case where the user already knows the root's level, so it will skip any header whose level is below it.

But this line is executed before the generation check. Let's say that the user did not specify any level (so search_level == 0). Then assume we encounter a block which has a lower generation than what we need, but a higher level. At this point we do level = h_level; and we will skip any blocks below this level from now on. What if the root tree got shrunk (due to subvolume deletion, for example), and the good root has a lower level? We will skip it then, and will not find the root. Thanks for your comments, Alex.

Oh, I misunderstood your meaning; you're completely right, the original code will skip any level lower than the currently found one. That's deadly for a root with a higher generation but a lower level, as you mentioned (item deletion). So the best method would be to only skip a leaf/node whose level is lower than the currently found highest level *among the same generation*.
Oh, that's what I do in my patch. :-)

Thanks, Qu

Thanks, Alex.
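The scheme the thread converges on — prefer the higher generation, and use level only to break ties *within the same generation* — can be sketched in a few lines of C. This is an illustrative standalone model with invented names (struct candidate, consider_block), not the actual btrfs-find-root code:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical candidate tracker for a btrfs-find-root style scan. */
struct candidate {
    uint64_t bytenr;      /* logical address of the best block so far */
    uint64_t generation;
    int level;
    int valid;            /* 0 until the first candidate is seen */
};

/*
 * Consider one block header that already passed the owner/bytenr/csum
 * checks. A higher generation always wins; level only breaks ties
 * among blocks of the SAME generation, so an old-but-tall node can
 * never shadow a newer root of lower level.
 */
static void consider_block(struct candidate *best, uint64_t bytenr,
                           uint64_t generation, int level)
{
    if (!best->valid ||
        generation > best->generation ||
        (generation == best->generation && level > best->level)) {
        best->bytenr = bytenr;
        best->generation = generation;
        best->level = level;
        best->valid = 1;
    }
}
```

With this ordering, the leaf-before-node scan order Alex worries about is harmless: the leaf is recorded first, then replaced as soon as the same-generation node at a higher level is seen.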
Debian/Jessie 3.16.7-ckt2-1 kernel error
I've attached the kernel message log that I get after booting kernel 3.16.7 from Debian/Unstable. This is the kernel branch that will go into Debian/Jessie, so it's important to get it fixed. Below is the start of the errors; the attached file has everything from boot. I've got similar issues on another system; would it help if I collected the logs from multiple systems?

[6.618809] Btrfs loaded
[6.670878] BTRFS: device fsid a2d7cbbc-23be-4d97-b2bb-de99b0c58c7d devid 1 transid 548614 /dev/mapper/root-crypt
[6.706907] BTRFS info (device dm-0): disk space caching is enabled
[6.798762] BTRFS: detected SSD devices, enabling SSD mode
[6.881272] [ cut here ]
[6.881358] WARNING: CPU: 3 PID: 198 at /build/linux-CMiYW9/linux-3.16.7-ckt2/fs/btrfs/delayed-inode.c:1410 btrfs_commit_transaction+0x38a/0x9c0 [btrfs]()

-- My Main Blog http://etbe.coker.com.au/ My Documents Blog http://doc.coker.com.au/
kern-err.gz Description: GNU Zip compressed data
Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible
Original Message Subject: Re: [PATCH] btrfs-progs: rebuild missing block group during chunk recovery if possible From: Alex Lyakas alex.bt...@zadarastorage.com To: Qu Wenruo quwen...@cn.fujitsu.com Date: December 24, 2014, 00:49

Hi Qu,

On Thu, Oct 30, 2014 at 4:54 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
[snipped]
+
+static int __insert_block_group(struct btrfs_trans_handle *trans,
+				struct chunk_record *chunk_rec,
+				struct btrfs_root *extent_root,
+				u64 used)
+{
+	struct btrfs_block_group_item bg_item;
+	struct btrfs_key key;
+	int ret = 0;
+
+	btrfs_set_block_group_used(&bg_item, used);
+	btrfs_set_block_group_chunk_objectid(&bg_item, used);

This looks like a bug. Instead of used, I think it should be BTRFS_FIRST_CHUNK_TREE_OBJECTID.

Oh, my mistake, BTRFS_FIRST_CHUNK_TREE_OBJECTID is right. Thanks for pointing this out.

[snipped]
-- 2.1.2

A couple of questions:
# In remove_chunk_extent_item, should we also consider rebuilt chunks now? It can happen that a rebuilt chunk is a SYSTEM chunk. Should we try to handle it as well?

Not quite sure about the meaning of "rebuild" here. Chunk-recovery has the rebuild_chunk_tree() function to rebuild the whole chunk tree with the good/repaired chunks we found.

# Same question for rebuild_sys_array. Should we also consider rebuilt chunks?

Chunk-recovery has rebuild_sys_array() to handle SYSTEM chunks too.

Thanks, Qu
Thanks, Alex.
Re: [PATCH] btrfs-progs: Fix btrfs fi show by uuid and label
On Tue, 2014-12-23 at 11:34 -0800, Justin Maggard wrote:

Commit 8be2fff (btrfs-progs: apply realpath for btrfs fi show when mount point is given) changed the behavior of btrfs fi show to return an error if the call to realpath() failed. This broke the ability to specify a filesystem by uuid or label.

Oh, that's my fault, thanks for correcting this.

So let's not consider a failed call to realpath() as an error. If the user really specified a bad device, just return nothing like we did before.
---
 cmds-filesystem.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 8f037dd..a654e6f 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -901,13 +901,8 @@ static int cmd_show(int argc, char **argv)
	 * realpath do /mnt/btrfs/ => /mnt/btrfs
	 * which shall be recognized by btrfs_scan_kernel()
	 */
-	if (!realpath(search, path)) {
-		fprintf(stderr, "ERROR: Could not show %s: %s\n",
-			search, strerror(errno));
-		return 1;
-	}
-
-	search = path;
+	if (realpath(search, path))
+		search = path;

This looks nice and stays consistent with the behavior before my faulty commit.
Reviewed-by: Gui Hecheng guihc.f...@cn.fujitsu.com

Thanks, Gui

	/*
	 * Needs special handling if input arg is block dev And if
[PATCH v2 2/2] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
When btrfs allocates a chunk, it will try to allocate up to 1G for data and 256M for metadata (or 10% of all the writeable space, whichever is smaller), if there is enough space for the stripes on the devices. However, when we run out of space, this may cause unbalanced chunk allocation. For example, if there is only 1G of unallocated space and a request to allocate a DATA chunk arrives, all the space will be allocated as a data chunk, leaving a later metadata chunk allocation request impossible to satisfy, which causes ENOSPC. This is one of the common complaints from end users: ENOSPC happens while there is still available space.

This patch tries not to allocate a chunk larger than half of the unallocated space, keeping the last stretch of space more balanced at the small cost of more fragmented chunks in the last 1G.

A simple example, preallocating 17.5G on a 20G empty btrfs fs:

[Before]
# btrfs fi show /mnt/test
Label: none  uuid: da8741b1-5d47-4245-9e94-bfccea34e91e
	Total devices 1 FS bytes used 17.50GiB
	devid    1 size 20.00GiB used 20.00GiB path /dev/sdb

All space is allocated; no space left for later metadata allocation.

[After]
# btrfs fi show /mnt/test
Label: none  uuid: e6935aeb-a232-4140-84f9-80aab1f23d56
	Total devices 1 FS bytes used 17.50GiB
	devid    1 size 20.00GiB used 19.77GiB path /dev/sdb

About 230M is still available for later metadata allocation.
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
Changelog:
v2:
  Remove false dead zone judgement since it won't happen
---
 fs/btrfs/volumes.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 8e74b34..20b3eea 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4237,6 +4237,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
	u64 max_stripe_size;
	u64 max_logical_size;	/* Up limit on chunk's logical size */
	u64 max_physical_size;	/* Up limit on a chunk's on-disk size */
+	u64 total_physical_avail = 0;
	u64 stripe_size;
	u64 num_bytes;
	u64 raid_stripe_len = BTRFS_STRIPE_LEN;
@@ -4349,6 +4350,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
		devices_info[ndevs].max_avail = max_avail;
		devices_info[ndevs].total_avail = total_avail;
		devices_info[ndevs].dev = device;
+		total_physical_avail += total_avail;
		++ndevs;
	}
@@ -4398,6 +4400,23 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
		do_div(stripe_size, num_stripes);
		need_bump = 1;
	}
+
+	/*
+	 * Don't alloc chunk whose physical size is larger than half
+	 * of the rest physical space.
+	 * This will reduce the possibility of ENOSPC when comes to
+	 * last unallocated space.
+	 *
+	 * For the last 16~32M (e.g. 20M), it will first alloc 16M
+	 * (bumped to 16M) and the next time will be the rest size
+	 * (bumped to 16M and reduced to 4M).
+	 * So no dead zone.
+	 */
+	if (stripe_size * num_stripes > total_physical_avail / 2) {
+		stripe_size = total_physical_avail / 2;
+		need_bump = 1;
+
+	}
	/* restrict logical chunk size */
	if (stripe_size * data_stripes > max_logical_size) {
		stripe_size = max_logical_size;
--
2.2.1
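The heuristic the patch adds can be modeled as a small standalone function: cap the request at half of the remaining unallocated space, bump the result up to a 16MB boundary, then clamp it to what is actually left. This is a simplified sketch with invented names, not the kernel code, but it reproduces the 20M example from the patch comment (first pass hands out 16M, second pass the remaining 4M, so there is no dead zone):

```c
#include <assert.h>
#include <stdint.h>

#define SZ_16M (16ULL * 1024 * 1024)

/* Illustrative model of the v2 "half of the rest" allocation rule. */
static uint64_t pick_chunk_size(uint64_t requested, uint64_t total_avail)
{
    uint64_t size = requested;
    uint64_t mask = SZ_16M - 1;

    /* never hand out more than half of the remaining space */
    if (size > total_avail / 2)
        size = total_avail / 2;
    /* bump up to a 16MB boundary, as __btrfs_alloc_chunk() does */
    size = (size + mask) & ~mask;
    /* ...but never beyond the space that is really there */
    if (size > total_avail)
        size = total_avail;
    return size;
}
```

The bump-then-clamp order is what closes the "dead zone": a 2M half-share still rounds up to 16M, and only then gets reduced to the 4M that remains.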
[PATCH 1/2] btrfs: Make the chunk size limit on on-disk/logical more clean.
The original __btrfs_alloc_chunk() uses max_chunk_size to limit the chunk size, but it mixes on-disk space with logical space. For the 10%-of-writable-space limit, max_chunk_size is applied to the on-disk size, yet it is also used as the logical size limit, which is confusing and causes inconsistency between profiles. For example: with M: single, D: single on a 5G single-device btrfs, a data chunk is limited to 512M by the 10% limit. With M: RAID1, D: RAID1 on a 10Gx2 two-device btrfs, a data chunk is limited to 2G because the 10% limit is mixed with on-disk space, making the logical chunk size 1G, twice that of the single-device case. This patch makes the logical and on-disk space limits independent and clear, and resolves the above inconsistency.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
changelog:
v2:
  Newly introduced.
---
 fs/btrfs/volumes.c | 40 ++++++++++++++++++++++++----------------
 1 file changed, 26 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 0144790..8e74b34 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4235,10 +4235,12 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
	int ncopies;	/* how many copies to data has */
	int ret;
	u64 max_stripe_size;
-	u64 max_chunk_size;
+	u64 max_logical_size;	/* Up limit on chunk's logical size */
+	u64 max_physical_size;	/* Up limit on a chunk's on-disk size */
	u64 stripe_size;
	u64 num_bytes;
	u64 raid_stripe_len = BTRFS_STRIPE_LEN;
+	int need_bump = 0;
	int ndevs;
	int i;
	int j;
@@ -4260,7 +4262,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
	if (type & BTRFS_BLOCK_GROUP_DATA) {
		max_stripe_size = 1024 * 1024 * 1024;
-		max_chunk_size = 10 * max_stripe_size;
+		max_logical_size = 10 * max_stripe_size;
		if (!devs_max)
			devs_max = BTRFS_MAX_DEVS(info->chunk_root);
	} else if (type & BTRFS_BLOCK_GROUP_METADATA) {
@@ -4269,12 +4271,12 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
			max_stripe_size = 1024 * 1024 * 1024;
		else
			max_stripe_size = 256 * 1024 * 1024;
-		max_chunk_size = max_stripe_size;
+		max_logical_size = max_stripe_size;
		if (!devs_max)
			devs_max = BTRFS_MAX_DEVS(info->chunk_root);
	} else if (type & BTRFS_BLOCK_GROUP_SYSTEM) {
		max_stripe_size = 32 * 1024 * 1024;
-		max_chunk_size = 2 * max_stripe_size;
+		max_logical_size = 2 * max_stripe_size;
		if (!devs_max)
			devs_max = BTRFS_MAX_DEVS_SYS_CHUNK;
	} else {
@@ -4284,8 +4286,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
	}

	/* we don't want a chunk larger than 10% of writeable space */
-	max_chunk_size = min(div_factor(fs_devices->total_rw_bytes, 1),
-			     max_chunk_size);
+	max_physical_size = div_factor(fs_devices->total_rw_bytes, 1);

	devices_info = kzalloc(sizeof(*devices_info) * fs_devices->rw_devices,
			       GFP_NOFS);
@@ -4391,15 +4392,21 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
		data_stripes = num_stripes - 2;
	}
-	/*
-	 * Use the number of data stripes to figure out how big this chunk
-	 * is really going to be in terms of logical address space,
-	 * and compare that answer with the max chunk size
-	 */
-	if (stripe_size * data_stripes > max_chunk_size) {
-		u64 mask = (1ULL << 24) - 1;
-		stripe_size = max_chunk_size;
+	/* Restrict on-disk chunk size */
+	if (stripe_size * num_stripes > max_physical_size) {
+		stripe_size = max_physical_size;
+		do_div(stripe_size, num_stripes);
+		need_bump = 1;
+	}
+	/* restrict logical chunk size */
+	if (stripe_size * data_stripes > max_logical_size) {
+		stripe_size = max_logical_size;
		do_div(stripe_size, data_stripes);
+		need_bump = 1;
+	}
+
+	if (need_bump) {
+		u64 mask = (1ULL << 24) - 1;

		/* bump the answer up to a 16MB boundary */
		stripe_size = (stripe_size + mask) & ~mask;
@@ -4411,6 +4418,11 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
		stripe_size = devices_info[ndevs-1].max_avail;
	}

+	/*
+	 * Special handle for DUP, since stripe_size is the largest free extent
+	 * we found, DUP can only use half of it. Other profile's dev_stripes
+	 * is always 1.
+	 */
	do_div(stripe_size, dev_stripes);

	/* align to BTRFS_STRIPE_LEN */
--
2.2.1
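The separation the patch introduces — one cap on the on-disk footprint (stripe_size * num_stripes) and an independent cap on the logical size (stripe_size * data_stripes) — can be checked with a toy function. Names and structure are illustrative, not kernel code; the asserts below reproduce the RAID1-vs-single example from the commit message:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative model: apply the two independent limits from the patch.
 * num_stripes counts every on-disk stripe; data_stripes counts only
 * the stripes that carry distinct logical data (1 for RAID1/DUP).
 */
static uint64_t cap_stripe_size(uint64_t stripe_size,
                                int num_stripes, int data_stripes,
                                uint64_t max_logical, uint64_t max_physical)
{
    /* restrict on-disk chunk size */
    if (stripe_size * (uint64_t)num_stripes > max_physical)
        stripe_size = max_physical / num_stripes;
    /* restrict logical chunk size */
    if (stripe_size * (uint64_t)data_stripes > max_logical)
        stripe_size = max_logical / data_stripes;
    return stripe_size;
}
```

With the caps separated, RAID1 on 2x10G (physical cap 2G, num_stripes = 2) yields a 1G stripe, i.e. a 1G logical chunk — consistent with the single-device case instead of twice its size.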
Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image + corrupt script fsck test case.
Original Message Subject: Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image + corrupt script fsck test case. From: Dave Chinner da...@fromorbit.com To: Filipe David Manana fdman...@gmail.com Date: 2014年12月24日 08:03 [ Sorry to take some time to get to this, it got caught by a spam filter and I only just noticed. ] On Tue, Dec 16, 2014 at 02:08:53PM +, Filipe David Manana wrote: On Tue, Dec 16, 2014 at 1:35 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote: Original Message Subject: Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image + corrupt script fsck test case. From: David Sterba dste...@suse.cz To: Filipe David Manana fdman...@gmail.com Date: 2014年12月16日 02:19 On Mon, Dec 15, 2014 at 10:13:45AM +, Filipe David Manana wrote: So another thing I would like to see is doing a more comprehensive verification that the repair code worked as expected. Currently we only check that a readonly fsck, after running fsck --repair, returns 0. For the improvements you've been doing, it's equally important to verify that --repair recovered the inodes, links, etc to the lost+found directory (or whatever is the directory's name). So perhaps adding a verify.sh script to the tarball for example? Or, forgot before, it might be better to do such verification/test in xfstests since we can create the fs and use the new btrfs-progs programs to corrupt leafs/nodes. xfstests has a lot of infrastructure already and probably run by a lot more people (compared to the fsck tests of btrfs-progs). I'm thinking about the best way how to integrate that, but it seems that there will be always some level of code or infrastructure duplication (or other hassle). btrfs-corrupt-block is not installed by default (make install) and it's not a type of utility I'd consider for default installations. The tests would be skipped in absence of the utility, so there will be test environments where install xfstests, install btrfspprogs will not add the desired test coverage. 
Solvable by packaging the extra progs. Adding corrupt-block into xfsprogs is infeasible (IMO too much code from btrfs-progs to be added). I don't know how much infrastructure code we'd have to either write or copy from fstests, but I think it would not be that much. Ideally we could write the tests within btrfs-progs and then submit them to fstests once they're considered reliable. If we keep the same syntax of the tests, provide stubs where applicable, the code duplication in test itself would be zero. We'd only have to write the stubs in btrfs-progs and probably extend fstests to provide helpers for preparing/unpacking the images. In my wildest idea, if we have a good enough btrfs debugger(maybe even stronger than debugfs), which can do almost everything from read key/item to corrupt given structure, then we can resolve them all. No binary image since corruption can be done by it and verify can also done by it. (OK, it's just a daydream) But IMHO, isn't xfstests designed to mainly detect kernel defeats? I don't see any fsck tool test case in it. I don't think xfstests is specific to test the kernel implementation of filesystems. I believe it includes user space code too, but I might be wrong so I'm CCing fstests and Dave to get an authoritative answer. We use fstests to test everything we ship for XFS - kernel and userspace. i.e. we have tests that corrupt filesystems with xfs_db and then test that xfs_repair can fix them, and once fixed the filesystem can be mounted and used by the kernel... i.e. fstests is for testing both the kernel code and the utilities used to manipulate filesystems. That's great. But what will happen if some btrfs cases need binary(still huge even compressed) or btrfs-image dump(some existing dumps are already several MB)? Will it be OK for fstests? Or should we wait until btrfs-progs has a good debug tools like xfs_db or debugfs and use them to generate the corrupted image like xfs testcases do? 
Thanks, Qu

And I don't see a big problem with btrfs-corrupt not being built by default when running make on btrfs-progs. We can make the xfstest not run and print an informative message if the btrfs-corrupt program isn't available - several xfstests do this, they require some executable which isn't either in the xfstests nor xfsprogs repositories - for example generic/241 which requires 'dbench' and generic/299 which requires 'fio'. _require_btrfs_corrupt_prog() Just like we do with lots of other optional userspace tools that are required for various tests to run. I also have a slight preference to get all tests in the same place (xfstests) rather than in multiple repositories (btrfs-progs, xfstests). Definitely my preference as well. Cheers, Dave.
Re: [PATCH][BTRFS-PROGS][CLEANUP] Remove gotos
Hi Goffredo,

On 2014/12/23 4:07, Goffredo Baroncelli wrote:
On 12/22/2014 12:34 PM, Satoru Takeuchi wrote:
On 2014/12/22 3:07, Goffredo Baroncelli wrote:

Change spaghetti-style code (there are some interlaced gotos) to a more modern style... This patch also removes some #defines from utils.h, which define constants used only in cmds-filesystem.c. Instead an enum is used locally in cmds-filesystem.c.

I'd be happy if you had a regression test for this patch, since this kind of change often causes regressions.

I am impressed, this is the first time I have received this kind of request on the btrfs ML.

It's because I'd like to contribute to btrfs not only by submitting patches, but also by reviewing/testing other people's patches to improve its quality.

In which form do you want this regression test? I see a directory tests/ under btrfs-progs, but I also saw references to xfstests... Any format is OK, for example a simple shell script.

Thanks, Satoru
---
 cmds-filesystem.c | 106 --
 utils.h           |   3 --
 2 files changed, 55 insertions(+), 54 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 7eaccb9..08ddb5d 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -40,6 +40,11 @@
 #include "list_sort.h"
 #include "disk-io.h"

+enum filesystem_show_scan_method {
+	BTRFS_SCAN_ANY,
+	BTRFS_SCAN_MOUNTED,
+	BTRFS_SCAN_LBLKID
+};

 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
@@ -853,7 +858,7 @@ static int cmd_show(int argc, char **argv) char *search = NULL; int ret; /* default, search both kernel and udev */ - int where = -1; + int where = BTRFS_SCAN_ANY; int type = 0; char mp[BTRFS_PATH_NAME_MAX + 1]; char path[PATH_MAX]; @@ -930,61 +935,60 @@ static int cmd_show(int argc, char **argv) uuid_unparse(fsid, uuid_buf); search = uuid_buf; type = BTRFS_ARG_UUID; - goto devs_only; + where = BTRFS_SCAN_LBLKID; } } } - if (where == BTRFS_SCAN_LBLKID) - goto devs_only; - - /* show mounted btrfs */ - ret = btrfs_scan_kernel(search); - if (search !ret) { - /* since search is found we are done */ - goto out; - } - - /* shows mounted only */ - if (where == BTRFS_SCAN_MOUNTED) - goto out; - -devs_only: - ret = btrfs_scan_lblkid(); - - if (ret) { - fprintf(stderr, ERROR: %d while scanning\n, ret); - return 1; - } - - found = search_umounted_fs_uuids(all_uuids, search); - if (found 0) { - fprintf(stderr, - ERROR: %d while searching target device\n, ret); - return 1; - } - - /* -* The seed/sprout mapping are not detected yet, -* do mapping build for all umounted fs -*/ - ret = map_seed_devices(all_uuids); - if (ret) { - fprintf(stderr, - ERROR: %d while mapping seed devices\n, ret); - return 1; + if (where == BTRFS_SCAN_MOUNTED || where == BTRFS_SCAN_ANY) { + + /* show mounted btrfs */ + ret = btrfs_scan_kernel(search); + if (search !ret) { + /* since search is found we are done */ + goto out; + } + } - - list_for_each_entry(fs_devices, all_uuids, list) - print_one_uuid(fs_devices); - - if (search !found) - ret = 1; - - while (!list_empty(all_uuids)) { - fs_devices = list_entry(all_uuids.next, - struct btrfs_fs_devices, list); - free_fs_devices(fs_devices); + + if (where == BTRFS_SCAN_LBLKID || where == BTRFS_SCAN_ANY) { + + ret = btrfs_scan_lblkid(); + + if (ret) { + fprintf(stderr, ERROR: %d while scanning\n, ret); + return 1; + } + + found = search_umounted_fs_uuids(all_uuids, search); + if (found 0) { + fprintf(stderr, + ERROR: %d while searching 
target device\n, ret); + return 1; + } + + /* +* The seed/sprout mapping are not detected yet, +* do mapping build for all umounted fs +*/ + ret = map_seed_devices(all_uuids); + if (ret) { + fprintf(stderr, + ERROR: %d while mapping seed devices\n, ret); + return 1; + } + + list_for_each_entry(fs_devices, all_uuids, list) + print_one_uuid(fs_devices); + + if (search
Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image + corrupt script fsck test case.
On Wed, Dec 24, 2014 at 10:56:04AM +0800, Qu Wenruo wrote: Original Message Subject: Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image + corrupt script fsck test case. From: Dave Chinner da...@fromorbit.com To: Filipe David Manana fdman...@gmail.com Date: 2014年12月24日 08:03 [ Sorry to take some time to get to this, it got caught by a spam filter and I only just noticed. ] On Tue, Dec 16, 2014 at 02:08:53PM +, Filipe David Manana wrote: On Tue, Dec 16, 2014 at 1:35 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote: Original Message Subject: Re: [PATCH 1/2] btrfs-progs: Add support for btrfs-image + corrupt script fsck test case. From: David Sterba dste...@suse.cz To: Filipe David Manana fdman...@gmail.com Date: 2014年12月16日 02:19 On Mon, Dec 15, 2014 at 10:13:45AM +, Filipe David Manana wrote: So another thing I would like to see is doing a more comprehensive verification that the repair code worked as expected. Currently we only check that a readonly fsck, after running fsck --repair, returns 0. For the improvements you've been doing, it's equally important to verify that --repair recovered the inodes, links, etc to the lost+found directory (or whatever is the directory's name). So perhaps adding a verify.sh script to the tarball for example? Or, forgot before, it might be better to do such verification/test in xfstests since we can create the fs and use the new btrfs-progs programs to corrupt leafs/nodes. xfstests has a lot of infrastructure already and probably run by a lot more people (compared to the fsck tests of btrfs-progs). I'm thinking about the best way how to integrate that, but it seems that there will be always some level of code or infrastructure duplication (or other hassle). btrfs-corrupt-block is not installed by default (make install) and it's not a type of utility I'd consider for default installations. 
The tests would be skipped in absence of the utility, so there will be test environments where install xfstests, install btrfspprogs will not add the desired test coverage. Solvable by packaging the extra progs. Adding corrupt-block into xfsprogs is infeasible (IMO too much code from btrfs-progs to be added). I don't know how much infrastructure code we'd have to either write or copy from fstests, but I think it would not be that much. Ideally we could write the tests within btrfs-progs and then submit them to fstests once they're considered reliable. If we keep the same syntax of the tests, provide stubs where applicable, the code duplication in test itself would be zero. We'd only have to write the stubs in btrfs-progs and probably extend fstests to provide helpers for preparing/unpacking the images. In my wildest idea, if we have a good enough btrfs debugger(maybe even stronger than debugfs), which can do almost everything from read key/item to corrupt given structure, then we can resolve them all. No binary image since corruption can be done by it and verify can also done by it. (OK, it's just a daydream) But IMHO, isn't xfstests designed to mainly detect kernel defeats? I don't see any fsck tool test case in it. I don't think xfstests is specific to test the kernel implementation of filesystems. I believe it includes user space code too, but I might be wrong so I'm CCing fstests and Dave to get an authoritative answer. We use fstests to test everything we ship for XFS - kernel and userspace. i.e. we have tests that corrupt filesystems with xfs_db and then test that xfs_repair can fix them, and once fixed the filesystem can be mounted and used by the kernel... i.e. fstests is for testing both the kernel code and the utilities used to manipulate filesystems. That's great. But what will happen if some btrfs cases need binary(still huge even compressed) or btrfs-image dump(some existing dumps are already several MB)? Will it be OK for fstests? No. 
Random filesystem images don't make a regression test. They are just a different encoding of the game of whack-a-mole. Corruption recovery tests work best with simple, obvious corruptions. This makes it easy to make sure correct detection and recovery behaviour works without any other complications. e.g. overwriting the primary superblock with zeros is a simple test that every filesystem recovery tool should be able to handle.

Or should we wait until btrfs-progs has a good debug tool like xfs_db or debugfs and use it to generate the corrupted image like the xfs test cases do?

You don't need a tool like xfs_db to do repeatable structure corruption - you can do this with dd by writing to known offsets. xfs_db just makes that really easy by having a filesystem format parser, which means finding the offset we want to write to is trivial.

Cheers, Dave.
-- Dave Chinner da...@fromorbit.com
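Dave's dd-at-a-known-offset approach translates directly into a few lines of C. The sketch below zeroes a byte range at a fixed offset in an image file — 65536 (0x10000) is the btrfs primary superblock offset, so zeroing 4K there is exactly the "simple, obvious corruption" a repair tool must detect. The function name is invented, and this is meant for scratch image files, not live devices:

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Overwrite len bytes at a known offset with zeros, the moral
 * equivalent of: dd if=/dev/zero of=img bs=1 seek=offset count=len
 * Returns 0 on success, -1 on any I/O failure.
 */
static int zero_region(const char *path, off_t offset, size_t len)
{
    char buf[4096];
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    memset(buf, 0, sizeof(buf));
    while (len > 0) {
        size_t chunk = len < sizeof(buf) ? len : sizeof(buf);

        if (pwrite(fd, buf, chunk, offset) != (ssize_t)chunk) {
            close(fd);
            return -1;
        }
        offset += chunk;
        len -= chunk;
    }
    return close(fd);
}
```

A test would then call, e.g., `zero_region("scratch.img", 65536, 4096)` and assert that fsck detects the damage and repair restores a mountable filesystem from the backup superblocks.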
Re: Quota limit question
Christian Robottom Reis posted on Tue, 23 Dec 2014 18:36:02 -0200 as excerpted:

On Tue, Dec 16, 2014 at 11:15:37PM -0200, Christian Robottom Reis wrote:

# btrfs qgroup limit 2000m 0/261 .
# touch x
touch: cannot touch 'x': Disk quota exceeded

The strange thing is that it doesn't seem to be actually out of space:

# btrfs qgroup show -p -r -e /var | grep 261
0/261    810048    391114752    2097152000    0    ---

Replying to myself, as I had not yet been subscribed in time to receive a reply; I just upgraded to 3.18.1 and am seeing the same issue on the same subvolume (and on no others).

Looking at the thread here on gmane.org (list2news and list2web gateway), it appears my reply was the only reply in any case, and it was general, as I don't run quotas myself. Basically I suggested upgrading, as the quota code has some rather huge bugs in it (quotas could go seriously negative!) in the old versions you were running. But you've upgraded at least the kernel now (userspace you didn't say).

Here's a link to the thread on the gmane web interface for completeness, but the above about covers my reply, which as I said was the only one until your thread bump and my reply here, so there's not much new there unless someone posts further followups to this thread...

http://comments.gmane.org/gmane.comp.file-systems.btrfs/41491

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/2] Remove unnecessary placeholder in btrfs_err_code
Hi Gui,

On 2014/12/23 10:59, Gui Hecheng wrote:
On Mon, 2014-12-22 at 19:39 +0900, Satoru Takeuchi wrote:

Sorry, I forgot to add the Signed-off-by line.

---
From: Satoru Takeuchi takeuchi_sat...@jp.fujitsu.com

The `notused` placeholder is not necessary. Setting the first entry to 1 is enough.

Hi Satoru,

Actually, the struct is copied from the kernel header include/uapi/linux/btrfs.h. I think they should stay the same, so either change the kernel header in another patch, or just keep them in the old style.

OK, I'll also send kernel patches soon.

Thanks,
Satoru

Thanks,
Gui

Signed-off-by: Satoru Takeuchi takeuchi_sat...@jp.fujitsu.com
---
 ioctl.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/ioctl.h b/ioctl.h
index 2c2c7c1..c377b96 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -474,8 +474,7 @@ struct btrfs_ioctl_qgroup_create_args {
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
-	notused,
-	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
+	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
 	BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
 	BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
 	BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
--
1.8.3.1
[PATCH] Btrfs: Remove unnecessary placeholder in btrfs_err_code
From: Satoru Takeuchi takeuchi_sat...@jp.fujitsu.com

The `notused` placeholder is not necessary. Setting the first entry to 1 is enough.

Signed-off-by: Takeuchi Satoru takeuchi_sat...@jp.fujitsu.com
Cc: Gui Hecheng guihc.f...@cn.fujitsu.com
---
I once submitted a similar patch to btrfs-progs. Then Gui Hecheng told me that fixing the original code in the kernel is better.
http://permalink.gmane.org/gmane.comp.file-systems.btrfs/41679
---
 include/uapi/linux/btrfs.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 611e1c5..b6dec05 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -495,8 +495,7 @@ struct btrfs_ioctl_send_args {
 /* Error codes as returned by the kernel */
 enum btrfs_err_code {
-	notused,
-	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET,
+	BTRFS_ERROR_DEV_RAID1_MIN_NOT_MET = 1,
 	BTRFS_ERROR_DEV_RAID10_MIN_NOT_MET,
 	BTRFS_ERROR_DEV_RAID5_MIN_NOT_MET,
 	BTRFS_ERROR_DEV_RAID6_MIN_NOT_MET,
--
1.8.3.1
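The equivalence the patch relies on can be checked with a toy enum: dropping the placeholder and writing `= 1` on the first real entry leaves every value unchanged, because subsequent enumerators keep counting up from the previous one. The `OLD_`/`NEW_` names below are made up for the comparison; only `notused` matches the kernel header:

```c
/* Original kernel spelling: a `notused` placeholder occupies value 0 so
 * that the first real error code becomes 1. */
enum err_old {
    notused,                   /* implicitly 0 */
    OLD_RAID1_MIN_NOT_MET,     /* 1 */
    OLD_RAID10_MIN_NOT_MET,    /* 2 */
    OLD_RAID5_MIN_NOT_MET,     /* 3 */
};

/* Patched spelling: the first entry is explicitly initialized to 1, and
 * the rest continue from it, so no placeholder is needed. */
enum err_new {
    NEW_RAID1_MIN_NOT_MET = 1, /* explicit 1 */
    NEW_RAID10_MIN_NOT_MET,    /* 2 */
    NEW_RAID5_MIN_NOT_MET,     /* 3 */
};
```

Since both spellings produce identical numeric values, the change is ABI-neutral for the error codes the kernel returns.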
[PATCH] Btrfs: call inode_dec_link_count() on mkdir error path
From: Wang Shilong wangshilong1...@gmail.com

In btrfs_mkdir(), if we fail to create the directory, we should clean up the existing items and set the inode's link count properly, to make sure it can be cleaned up properly.

Signed-off-by: Wang Shilong wangshilong1...@gmail.com
---
 fs/btrfs/inode.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8de2335..8a036ed 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -6255,8 +6255,10 @@ static int btrfs_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 out_fail:
 	btrfs_end_transaction(trans, root);
-	if (drop_on_err)
+	if (drop_on_err) {
+		inode_dec_link_count(inode);
 		iput(inode);
+	}
 	btrfs_balance_delayed_items(root);
 	btrfs_btree_balance_dirty(root);
 	return err;
--
1.8.3.1
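The shape of the fix, stripped of all btrfs specifics, is the usual "undo in reverse order" error path: the link count bumped for the new directory must be dropped again before the last reference is put, otherwise the inode is left with a stale link and is never evicted. A toy model of that pattern (all `toy_*` names are invented for illustration; this is not the kernel API):

```c
/* Minimal stand-in for an inode: a link count and a reference count. */
struct toy_inode {
    int i_nlink;
    int i_count;
};

static void toy_inc_link(struct toy_inode *inode) { inode->i_nlink++; }
static void toy_dec_link(struct toy_inode *inode) { inode->i_nlink--; }
static void toy_iput(struct toy_inode *inode)     { inode->i_count--; }

/* Sketch of a mkdir-like operation. `fail` simulates a mid-operation
 * error (e.g. a failed item insertion inside the transaction). */
static int toy_mkdir(struct toy_inode *inode, int fail)
{
    inode->i_count++;       /* caller-visible reference, dropped by iput() */
    toy_inc_link(inode);    /* new directory gets a link */

    if (fail)
        goto out_fail;
    return 0;

out_fail:
    toy_dec_link(inode);    /* the fix: undo the link bump first... */
    toy_iput(inode);        /* ...then drop the reference, so eviction
                             * sees a zero link count */
    return -1;
}
```

On the failure path both counters return to their pre-call values, which is exactly the invariant the added inode_dec_link_count() call restores in btrfs_mkdir().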
[PATCH] btrfs-progs: doc: fix incorrect format of 'l' option in mkfs.btrfs
The format of the 'l' option in mkfs.btrfs.txt is wrong. Also, when a line begins with the string "65536.", the following warning is displayed:

$ make
Making all in Documentation
    [ASCII]  mkfs.btrfs.xml
asciidoc: WARNING: mkfs.btrfs.xml.tmp1: line 67: list item index: expected 1 got 65536
    [XMLTO]  mkfs.btrfs.8
    [GZ]     mkfs.btrfs.8.gz
rm mkfs.btrfs.8 mkfs.btrfs.xml

So, fix it.

Signed-off-by: Tsutomu Itoh t-i...@jp.fujitsu.com
---
 Documentation/mkfs.btrfs.txt | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/Documentation/mkfs.btrfs.txt b/Documentation/mkfs.btrfs.txt
index ba7e42b..01b615d 100644
--- a/Documentation/mkfs.btrfs.txt
+++ b/Documentation/mkfs.btrfs.txt
@@ -59,19 +59,18 @@ By default, mkfs.btrfs will not write to the device if it
 suspects that there is a filesystem or partition table on the device already.
 
 -n|--nodesize size
-+
+
 -l|--leafsize size::
 Specify the nodesize, the tree block size in which btrfs stores data. The
 default value is 16KB (16384) or the page size, whichever is
-bigger. Must be a multiple of the sectorsize, but not larger than
-65536. Leafsize always equals nodesize and the options are aliases.
+bigger. Must be a multiple of the sectorsize, but not larger than 65536.
+Leafsize always equals nodesize and the options are aliases.
 
 -L|--label name::
 Specify a label for the filesystem.
 +
 NOTE: name should be less than 256 characters.
-
 -m|--metadata profile::
 Specify how metadata must be spanned across the devices specified. Valid
 values are 'raid0', 'raid1', 'raid5', 'raid6', 'raid10', 'single' or 'dup'.
--
2.1.0