[PATCH] btrfs-progs: tests: add 020-extent-ref-cases
In order to confirm that btrfsck supports checking a variety of refs, add
the following cases:

* keyed_block_ref
* keyed_data_ref
* shared_block_ref
* shared_data_ref
* no_inline_ref (an extent item without an inline ref)
* no_skinny_ref

Signed-off-by: Lu Fengqi
---
In order to run btrfsck on the restored file system, we should use the
patch "btrfs-progs: make btrfs-image restore to support dup". This patch
makes btrfs-image correctly restore the image in the dup case.
---
 .../fsck-tests/020-extent-ref-cases/keyed_block_ref.img | Bin 0 -> 10240 bytes
 .../fsck-tests/020-extent-ref-cases/keyed_data_ref.img  | Bin 0 -> 4096 bytes
 tests/fsck-tests/020-extent-ref-cases/no_inline_ref.img | Bin 0 -> 4096 bytes
 tests/fsck-tests/020-extent-ref-cases/no_skinny_ref.img | Bin 0 -> 3072 bytes
 .../020-extent-ref-cases/shared_block_ref.img           | Bin 0 -> 23552 bytes
 .../fsck-tests/020-extent-ref-cases/shared_data_ref.img | Bin 0 -> 5120 bytes
 tests/fsck-tests/020-extent-ref-cases/test.sh           | 14 ++
 7 files changed, 14 insertions(+)
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/keyed_block_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/keyed_data_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/no_inline_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/no_skinny_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/shared_block_ref.img
 create mode 100644 tests/fsck-tests/020-extent-ref-cases/shared_data_ref.img
 create mode 100755 tests/fsck-tests/020-extent-ref-cases/test.sh

diff --git a/tests/fsck-tests/020-extent-ref-cases/keyed_block_ref.img b/tests/fsck-tests/020-extent-ref-cases/keyed_block_ref.img
new file mode 100644
index ..289d37bc309fb8c33bff13cb71de8b9c5d83f1bb
GIT binary patch
literal 10240
[base64-encoded GIT binary patch data for the .img files omitted]
Re: [regression] make sure seed is writeable sprout after device add
On Sun, May 29, 2016 at 8:03 PM, Chris Murphy wrote:
>
> # lsblk -o +UUID
> NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT UUID
> loop0   7:0    0  10G  0 loop /mnt/0     63288b0c-9216-4f11-aed4-cc054ae90e07
> loop1   7:1    0  10G  0 loop            63288b0c-9216-4f11-aed4-cc054ae90e07
>
> This is worse.
>
> # btrfs fi show
> Label: none  uuid: 63288b0c-9216-4f11-aed4-cc054ae90e07
>         Total devices 2 FS bytes used 384.00KiB
>         devid 1 size 10.00GiB used 2.02GiB path /dev/loop0
>         devid 2 size 10.00GiB used 0.00B path /dev/loop1
>
> So is this. Where is the new UUID for the newly created sprout volume?
>
> /dev/loop0: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
> UUID_SUB="e379aedb-6d14-4d56-be7d-1772c9984bc5" TYPE="btrfs"
> /dev/loop1: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
> UUID_SUB="b282f566-8382-468e-b9ec-f748244b703b" TYPE="btrfs"
>
> Uhh? Identical UUIDs for seed and sprout? That's not right.

Just to confirm that the UUID for the sprout is the same as the seed at
mkfs time, which I didn't include in the previous email:

# mkfs.btrfs /dev/loop0
btrfs-progs v4.5.2
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (10.00GiB) ...
Label:  (null)
UUID:   63288b0c-9216-4f11-aed4-cc054ae90e07

--
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [regression] make sure seed is writeable sprout after device add
With 4.5.5 the 'mount -o remount,rw' works like the wiki describes, and
is in my opinion contrary to the mount man page. After the -o remount,rw
following btrfs dev add VG/sprout, I get this partial:

# lsblk -o +UUID
NAME                      MAJ:MIN RM SIZE RO TYPE MOUNTPOINT UUID
│ └─VG-thintastic_tdata   253:1    0  90G  0 lvm
│   └─VG-thintastic-tpool 253:2    0  90G  0 lvm
│     ├─VG-thintastic     253:3    0  90G  0 lvm
│     ├─VG-seed           253:5    0  50G  0 lvm  /mnt/0    59828c01-8354-43ac-a92d-f22d1b5d0e22
│     └─VG-sprout         253:6    0  50G  0 lvm            e8de3a52-34d1-46af-98c1-8620642be884

And

# mount
[...trimmed...]
/dev/mapper/VG-seed on /mnt/0 type btrfs (rw,relatime,seclabel,space_cache,subvolid=5,subvol=/)

This is just wrong. The wrong volume UUID is associated with /mnt/0 by
lsblk. And mount shows VG/seed as rw, which is not possible because by
definition the seed is read-only. I've always thought this was misleading
and confusing.

Next I tried it with kernel-4.6.0...

[root@f24m ~]# losetup
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE    DIO
/dev/loop0         0      0         0  0 /root/seed     0
/dev/loop1         0      0         0  0 /root/sprout   0

# mount
[...trimmed...]
/dev/loop0 on /mnt/0 type btrfs (rw,relatime,seclabel,space_cache,subvolid=5,subvol=/)

So it's the same incorrect information as 4.5.5.

# lsblk -o +UUID
NAME  MAJ:MIN RM SIZE RO TYPE MOUNTPOINT UUID
loop0   7:0    0  10G  0 loop /mnt/0     63288b0c-9216-4f11-aed4-cc054ae90e07
loop1   7:1    0  10G  0 loop            63288b0c-9216-4f11-aed4-cc054ae90e07

This is worse.

# btrfs fi show
Label: none  uuid: 63288b0c-9216-4f11-aed4-cc054ae90e07
        Total devices 2 FS bytes used 384.00KiB
        devid 1 size 10.00GiB used 2.02GiB path /dev/loop0
        devid 2 size 10.00GiB used 0.00B path /dev/loop1

So is this. Where is the new UUID for the newly created sprout volume?

/dev/loop0: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
UUID_SUB="e379aedb-6d14-4d56-be7d-1772c9984bc5" TYPE="btrfs"
/dev/loop1: UUID="63288b0c-9216-4f11-aed4-cc054ae90e07"
UUID_SUB="b282f566-8382-468e-b9ec-f748244b703b" TYPE="btrfs"

Uhh? Identical UUIDs for seed and sprout? That's not right.
Same version of btrfs-progs in both cases, just different kernels, BUT on
different machines, hence LVM in the first case and fallocated files on
loop for the second. I'm not sure what's causing this.

Chris Murphy
Re: Functional difference between "replace" vs "add" then "delete missing" with a missing disk in a RAID56 array
Chris Johnson posted on Sun, 29 May 2016 09:33:49 -0700 as excerpted:

> Situation: A six disk RAID5/6 array with a completely failed disk. The
> failed disk is removed and an identical replacement drive is plugged in.

First of all, be aware (as you already will be if you're following the
list) that there are currently two, possibly related, (semi-?)critical
known bugs still affecting raid56 mode, with the result that despite
raid56's nominal completion in 3.19 and the fix of a couple even more
critical bugs early on, by the 4.1 release, raid56 mode remains
negatively recommended for anything but testing.

One of the two bugs is that restriping (as done by balance either with
the restripe filters after adding devices, or triggered automatically by
device delete) can, in SOME cases only, with the trigger variable unknown
at this point, take an order of magnitude (or even more) longer than it
should -- we're talking over a week for a rebalance that would be
expected to be done in under a day, up to possibly months for the
multi-TB filesystems that are a common use-case for raid5/6, that might
be expected to take a day or two under normal circumstances.

This rises to critical because, other than the impractical time involved,
once you're talking weeks to months of restripe time, the danger of
another device going out, thereby killing the entire array, increases
unacceptably, to the point that raid56 cannot be considered usable for
the normal things people use it for -- thus the critical bug rating, even
if in theory the restripe is completing correctly and the data isn't in
immediate danger. Obviously you're not hitting it if your results show
balance as significantly faster, but because we don't know what triggers
the problem yet, that's no guarantee that you won't hit it later, after
somehow triggering the problem.

The second bug is equally alarming, but in a different way.
A number of people have reported that replacing (by one method or the
other) a first device appears to work, but if a second replace is
attempted, it kills the array(!!), so obviously something's going wrong
with the first replace, as it's not returning the array to full
undegraded functionality, even though all the current tools, as well as
operations before the second replace is attempted, suggest that it has
done just that. This one too remains untraced to an ultimate cause, and
while the two bugs appear quite different, because they are both critical
and remain untraced, it remains possible that they are actually simply
two different symptoms of the same root bug.

So, if you're using raid56 only for testing as is recommended, great, but
if you're using it for live data, for sure have your backups ready, as
there remains an uncomfortably high chance that you may need to use them
if something goes wrong with that raid56 and these bug(s) prevent you
from recovering the array. Or alternatively, switch to the more mature
raid1 or raid10 modes if realistic in your use-case, or to more
traditional solutions such as md/dm-raid underneath btrfs or some other
filesystem.

(One very interesting solution is btrfs raid1 mode on top of a pair of
md/dm-raid0 virtual devices, each of which can then be composed of
multiple physical devices. This allows btrfs raid1 mode data and metadata
integrity checking and repair that underlying raid modes don't have, and
includes the repair of detected checksum errors that btrfs single mode
won't be able to do, because it can detect problems but not correct them.
Meanwhile, the underlying raid0 helps make up somewhat for btrfs' poor
raid1 optimization and performance, as it tends to serialize access to
multiple devices that other raid solutions parallelize.)

Of course, the more mature zfs on linux can be another alternative, if
you're prepared to overlook the licensing issues and have hardware up to
the task.
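The raid1-over-raid0 layering suggested above can be sketched as follows. This is only an illustrative outline: the device names and the two-disk pairing are assumptions, and the commands need root and real (empty) block devices.

```shell
# Two md RAID0 stripes, each over a pair of disks (example devices)
mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdd /dev/sde

# btrfs raid1 for both data and metadata across the two stripes, so a
# bad checksum on one copy can be repaired from the other
mkfs.btrfs -d raid1 -m raid1 /dev/md0 /dev/md1
mount /dev/md0 /mnt
```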
With that warning explained and alternatives provided, to your actual
question...

> Here I have two options for replacing the disk, assuming the old drive
> is device 6 in the superblock and the replacement disk is /dev/sda.
>
> 'btrfs replace start 6 /dev/sda /mnt'
> This will start a rebuild of the array using the new drive, copying data
> that would have been on device 6 to the new drive from the parity data.
>
> 'btrfs device add /dev/sda /mnt && btrfs device delete missing /mnt'
> This adds a new device (the replacement disk) to the array and dev
> delete missing appears to trigger a rebalance before deleting the
> missing disk from the array. The end result appears to be identical to
> option 1.
>
> A few weeks back I recovered an array with a failed drive using 'delete
> missing' because 'replace' caused a kernel panic. I later discovered
> that this was not (just) a failed drive but some other failed hardware
> that I've yet to start diagnosing. Either motherboard or HBA. The drives
> are in a new server now and I am currently rebuilding the array with
Re: [regression] make sure seed is writeable sprout after device add
On Sun, May 29, 2016 at 1:48 PM, Anand Jain wrote:
> Originally a seed FS becomes a writeable sprout FS after a
> device is added to it, however as at 4.6 I don't see this
> behavior anymore.

I think the old behavior, where it's possible to use -o remount,rw, is
actually confusing. Strictly speaking, what is mounted is the seed volume
with that particular volume UUID. Doing a remount to effectively cause a
behind-the-scenes umount and then a mount of a different volume and
volume UUID is not obvious. What if there's more than one sprout? It
seems really ambiguous what the user wants with an -o remount, so I'm
wondering if this operation should fail instead?

From the mount man page:

    remount
        Attempt to remount an already-mounted filesystem. This is
        commonly used to change the mount flags for a filesystem,
        especially to make a readonly filesystem writable. It does not
        change device or mount point.

Two problems:

1. The seed and sprout are two filesystems, and one of them is not
already mounted. So remount should not mean switch from a mounted
filesystem to an unmounted one.

2. The former remount behavior did change device, from that of the seed
to that of the sprout.

So both of those seem to depart from the remount definition rather
significantly.

--
Chris Murphy
Re: Hot data tracking / hybrid storage
On Sun, 29 May 2016 12:33:06 -0600, Chris Murphy wrote:

> On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte wrote:
>> On 05/29/16 19:53, Chris Murphy wrote:
>>> But I'm skeptical of bcache using a hidden area historically for the
>>> bootloader, to put its device metadata. I didn't realize that was the
>>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>>> mdadm. Egads.
>>
>> On the matter of bcache in general this seems noteworthy:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>>
>> bummer..
>
> Well it doesn't mean no one will take it, just that no one has taken it
> yet. But the future of SSD caching may only be with LVM.
>
> --
> Chris Murphy

I think all the above posts underline exactly my point: instead of using
an SSD cache (be it bcache or dm-cache), it would be much better to have
the btrfs allocator be aware of SSDs in the pool and prioritize
allocations to the SSDs to maximize performance. This would make it easy
to add more SSDs or replace worn-out ones, without the mentioned
headaches. After all, adding/replacing drives in a pool is one of btrfs's
biggest advantages.
Re: [PATCH 1/5] Btrfs: test_check_exists: Fix infinite loop when searching for free space entries
On 2016/5/27 23:43, Josef Bacik wrote:
>>  fs/btrfs/free-space-cache.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
>> index 5e6062c..05c9ef8 100644
>> --- a/fs/btrfs/free-space-cache.c
>> +++ b/fs/btrfs/free-space-cache.c
>> @@ -3662,6 +3662,7 @@ have_info:
>>  		if (tmp->offset + tmp->bytes < offset)
>>  			break;
>>  		if (offset + bytes < tmp->offset) {
>> +			info = tmp;
>>  			n = rb_prev(&info->offset_index);
>>  			continue;
>>  		}
>> @@ -3676,6 +3677,7 @@ have_info:
>>  		if (offset + bytes < tmp->offset)
>>  			break;
>>  		if (tmp->offset + tmp->bytes < offset) {
>> +			info = tmp;
>>  			n = rb_next(&info->offset_index);
>>  			continue;
>>  		}
>
> Just make it rb_next(&tmp->offset_index)/rb_prev(&tmp->offset_index)
> instead of doing the info = tmp thing.  Thanks,
>
> Josef

I will change it in v2, thanks

Feifei
[regression] make sure seed is writeable sprout after device add
Originally a seed FS becomes a writeable sprout FS after a device is
added to it, however as at 4.6 I don't see this behavior anymore.

Thus, the above feature is quite unique to btrfs, and there are some good
future solutions on top of it. So please preserve this feature, and here
is a test case [1] which is to make sure we do.

On the point of fixing the regression, I am trying; I have traced it back
as far as 3.8 and it looks like it fails there as well. Will continue
after my vacation.

Further, while digging this out, I found another bug in
btrfs_init_new_device(), which [2] will fix (again, [2] is not a fix for
the regression).

[1] [PATCH] fstests: btrfs: test case to make sure seed FS is writable after device add
[2] [PATCH] btrfs: failed to create sprout should set back to rdonly
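The historical behavior being defended here can be reproduced by hand roughly as follows. This is a sketch only: the loop-device names are illustrative assumptions, not taken from the referenced patches, and the commands need root.

```shell
mkfs.btrfs -f /dev/loop0
btrfstune -S 1 /dev/loop0          # set the seeding flag; device becomes read-only
mount /dev/loop0 /mnt              # a seed FS always mounts read-only
btrfs device add /dev/loop1 /mnt   # sprout: historically this flipped the mount rw
touch /mnt/tf1                     # should succeed without an explicit remount
```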
[PATCH] btrfs: failed to create sprout should set back to rdonly
btrfs_init_new_device() should put the FS back to RDONLY if init fails
in the seed_device context.

Further it adds the following clean up:
 - changes a BUG_ON to a goto to label error_trans:
 - moves btrfs_abort_transaction() to label error_trans: and
 - as there is no code to undo btrfs_prepare_sprout(), temporarily
   calls a BUG_ON

Signed-off-by: Anand Jain
---
 fs/btrfs/volumes.c | 31 +++++++++++++++++--------------
 1 file changed, 17 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2b88127bba5b..a637e99e4c6b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2351,7 +2351,8 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 	if (seeding_dev) {
 		sb->s_flags &= ~MS_RDONLY;
 		ret = btrfs_prepare_sprout(root);
-		BUG_ON(ret); /* -ENOMEM */
+		if (ret)
+			goto error_trans;
 	}
 
 	device->fs_devices = root->fs_info->fs_devices;
@@ -2398,26 +2399,20 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 		lock_chunks(root);
 		ret = init_first_rw_device(trans, root, device);
 		unlock_chunks(root);
-		if (ret) {
-			btrfs_abort_transaction(trans, root, ret);
-			goto error_trans;
-		}
+		if (ret)
+			goto error_sysfs;
 	}
 
 	ret = btrfs_add_device(trans, root, device);
-	if (ret) {
-		btrfs_abort_transaction(trans, root, ret);
-		goto error_trans;
-	}
+	if (ret)
+		goto error_sysfs;
 
 	if (seeding_dev) {
 		char fsid_buf[BTRFS_UUID_UNPARSED_SIZE];
 		ret = btrfs_finish_sprout(trans, root);
-		if (ret) {
-			btrfs_abort_transaction(trans, root, ret);
-			goto error_trans;
-		}
+		if (ret)
+			goto error_sysfs;
 
 		/* Sprouting would change fsid of the mounted root,
 		 * so rename the fsid on the sysfs
@@ -2460,10 +2455,18 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
 	update_dev_time(device_path);
 	return ret;
 
+error_sysfs:
+	if (seeding_dev) {
+		/* undo of btrfs_prepare_sprout is missing */
+		BUG_ON(1);
+	}
+	btrfs_sysfs_rm_device_link(root->fs_info->fs_devices, device);
 error_trans:
+	if (seeding_dev)
+		sb->s_flags |= MS_RDONLY;
+	btrfs_abort_transaction(trans, root, ret);
 	btrfs_end_transaction(trans, root);
 	rcu_string_free(device->name);
-	btrfs_sysfs_rm_device_link(root->fs_info->fs_devices, device);
 	kfree(device);
 error:
 	blkdev_put(bdev, FMODE_EXCL);
-- 
2.7.0
[PATCH] fstests: btrfs: test case to make sure seed FS is writable after device add
Originally when a device is added to a seed FS, the mount point becomes
writeable. However there appears to be a regression: in 4.6 the sprouted
FS still remains read-only. Traced back until 3.8 and the regression is
still there.

The seed/sprout feature is one of the unique features of btrfs, and
interesting solutions can be developed using it. So this test case makes
sure the original expected behavior is preserved.
---
 tests/btrfs/125     | 81 +++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/125.out |  1 +
 tests/btrfs/group   |  1 +
 3 files changed, 83 insertions(+)
 create mode 100755 tests/btrfs/125
 create mode 100644 tests/btrfs/125.out

diff --git a/tests/btrfs/125 b/tests/btrfs/125
new file mode 100755
index ..189d30614ad0
--- /dev/null
+++ b/tests/btrfs/125
@@ -0,0 +1,81 @@
+#! /bin/bash
+# FS QA Test No. btrfs/125
+#
+# Test BTRFS seed device add
+#
+# Steps:
+# Create seed FS and mount
+# Device add
+# Check if the FS is now RW-able
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2016 Oracle.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_nocheck
+_require_scratch_dev_pool 2
+
+rm -f $seqres.full
+
+_scratch_dev_pool_get 1
+_spare_dev_get
+
+_scratch_pool_mkfs >> $seqres.full 2>&1
+
+btrfstune -S 1 $SCRATCH_DEV_POOL || \
+	_fail "btrfstune failed to mark '$SCRATCH_DEV_POOL' as seed"
+
+_scratch_mount >> $seqres.full 2>&1
+
+_run_btrfs_util_prog filesystem show -m
+
+_run_btrfs_util_prog device add $SPARE_DEV "$SCRATCH_MNT"
+
+_run_btrfs_util_prog filesystem show -m
+
+touch "$SCRATCH_MNT"/tf1 || _fail "FS not Writeable"
+
+_scratch_unmount
+_spare_dev_put
+_scratch_dev_pool_put
+
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/125.out b/tests/btrfs/125.out
new file mode 100644
index ..4f22ab0cb5e9
--- /dev/null
+++ b/tests/btrfs/125.out
@@ -0,0 +1 @@
+QA output created by 125
diff --git a/tests/btrfs/group b/tests/btrfs/group
index 1866b17aa6df..0afc82940f61 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -126,3 +126,4 @@
 122 auto quick snapshot qgroup
 123 auto replace
 124 auto replace
+125 auto replace
-- 
2.7.0
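Assuming an xfstests checkout configured with a suitable SCRATCH_DEV_POOL (directory name illustrative), the new case would then be run the usual way:

```shell
cd xfstests-dev
./check btrfs/125
```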
Re: Hot data tracking / hybrid storage
On Sun, May 29, 2016 at 12:03 PM, Holger Hoffstätte wrote:
> On 05/29/16 19:53, Chris Murphy wrote:
>> But I'm skeptical of bcache using a hidden area historically for the
>> bootloader, to put its device metadata. I didn't realize that was the
>> case. Imagine if LVM were to stuff metadata into the MBR gap, or
>> mdadm. Egads.
>
> On the matter of bcache in general this seems noteworthy:
>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667
>
> bummer..

Well it doesn't mean no one will take it, just that no one has taken it
yet. But the future of SSD caching may only be with LVM.

--
Chris Murphy
Re: Resize doesnt work as expected
On Sun, May 29, 2016 at 12:16 PM, Peter Becker wrote:
> 2016-05-29 19:11 GMT+02:00 Chris Murphy:
>> On Sat, May 28, 2016 at 3:42 PM, Peter Becker wrote:
>>> Thanks for the clarification. I've probably overlooked this.
>>>
>>> But shouldn't "resize max" do what you expect instead of falling
>>> back on an "invisible" 1?
>>
>> How does it know what the user expects?
>
> Then simply remove the default deviceid and let the user choose what
> they want.

They can already choose what they want, but they have to specify it; it's
not an interactive UI. Plus the shrink case has to be considered.

What it could do is state what happened rather than completing without
any message, i.e. if a devid is not specified it would say something
like:

devid 1 resized from X to X

At least there's feedback. It doesn't exactly make sense to require the
most common case, a single device, to have to specify the single device,
hence why devid 1 is assumed.

> IMHO it's a bad thing to automatically choose an option if it's not
> clear what the user wants. And in particular there is no hint in the
> output of which deviceid is used.
> The output is:
>
> "Resize '/mnt' of 'max'" .. no hint that this only affects deviceid 1.
> The suggestion to me is that the whole pool is resized to max.

That would mean the command affects all devids at the same time. This
would require a lot more logic and safeguards for the shrink case. So
someone would need to do that work. I think such UX improvements are
happening in a separate github thread on revising the btrfs-progs UI/UX.

> Possible solutions:
> 1. remove the default deviceid
> 2. resize without a deviceid affects the whole pool
> 3. improve the output of the resize command by adding the deviceid
> 4. remove the inconsistency between add+remove and replace by
> triggering resize max after replace is finished.

1 negatively impacts single device setups.
2 doesn't account for shrink, where now every device is reduced by some
unknown amount and could end up a mess in some cases.

3 & 4 are reasonable.

Right now the command is rather explicit, with an exception for the
single device case. That's really what you're seeing here.

--
Chris Murphy
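As an aside on the syntax under discussion, the resize command already accepts an explicit devid prefix; a short sketch (mount point and devids are illustrative):

```shell
# grow only devid 2 to fill its block device
btrfs filesystem resize 2:max /mnt

# the bare form applies only to devid 1, which is the surprise
# discussed in this thread
btrfs filesystem resize max /mnt
```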
Re: Resize doesnt work as expected
2016-05-29 19:11 GMT+02:00 Chris Murphy:
> On Sat, May 28, 2016 at 3:42 PM, Peter Becker wrote:
>> Thanks for the clarification. I've probably overlooked this.
>>
>> But shouldn't "resize max" do what you expect instead of falling
>> back on an "invisible" 1?
>
> How does it know what the user expects?

Then simply remove the default deviceid and let the user choose what they
want. IMHO it's a bad thing to automatically choose an option if it's not
clear what the user wants. And in particular there is no hint in the
output of which deviceid is used. The output is:

"Resize '/mnt' of 'max'" .. no hint that this only affects deviceid 1.
The suggestion to me is that the whole pool is resized to max.

Possible solutions:
1. remove the default deviceid
2. resize without a deviceid affects the whole pool
3. improve the output of the resize command by adding the deviceid
4. remove the inconsistency between add+remove and replace by triggering
resize max after replace is finished.

> I think the issue is not with the resize command, but rather the
> replace command does not include the resize max operation. Presumably
> the user intends the entire block device provided as the target for
> replacement to be used.
>
> So I think the mistake is replace assumes the user wants to use the
> same amount of space as the former block device. I think if the user
> wanted to use the former block device size on the new block device,
> they'd partition it and use the partition as the target.
>
> --
> Chris Murphy
Re: Hot data tracking / hybrid storage
On 05/29/16 19:53, Chris Murphy wrote:
> But I'm skeptical of bcache using a hidden area historically for the
> bootloader, to put its device metadata. I didn't realize that was the
> case. Imagine if LVM were to stuff metadata into the MBR gap, or
> mdadm. Egads.

On the matter of bcache in general this seems noteworthy:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=4d1034eb7c2f5e32d48ddc4dfce0f1a723d28667

bummer..

Holger
Re: Hot data tracking / hybrid storage
On Sun, May 29, 2016 at 12:23 AM, Andrei Borzenkov wrote:
> 20.05.2016 20:59, Austin S. Hemmelgarn wrote:
>> On 2016-05-20 13:02, Ferry Toth wrote:
>>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>>> is in subvolume @boot.
>> If you have GRUB installed on all 4, then you don't actually have the
>> full 2047 sectors between the MBR and the partition free, as GRUB is
>> embedded in that space. I forget exactly how much space it takes up,
>> but I know it's not the whole 1023.5K. I would not suggest risking
>> usage of the final 8k there though.
>
> If you mean grub2, the required space is variable and depends on where
> /boot/grub is located (i.e. which drivers it needs to access it).
> Assuming plain btrfs on legacy BIOS MBR, the required space is around
> 40-50KB.
>
> Note that grub2 detects some post-MBR gap software signatures and skips
> over them (the space need not be contiguous). It is entirely possible
> to add bcache detection if enough demand exists.

Might not be a bad idea, just to avoid it getting stepped on and causing
later confusion. If it is stepped on, I don't think there's data loss,
except possibly in the case where there's an unclean shutdown while the
SSD has bcache data that hasn't been committed to the HDD?

But I'm skeptical of bcache using a hidden area historically for the
bootloader, to put its device metadata. I didn't realize that was the
case. Imagine if LVM were to stuff metadata into the MBR gap, or mdadm.
Egads.

--
Chris Murphy
Re: Resize doesnt work as expected
On Sat, May 28, 2016 at 3:42 PM, Peter Becker wrote:
> Thanks for the clarification. I've probably overlooked this.
>
> But shouldn't "resize max" do what you expect instead of falling
> back on an "invisible" 1?

How does it know what the user expects?

I think the issue is not with the resize command, but rather that the
replace command does not include the resize max operation. Presumably the
user intends the entire block device provided as the target for
replacement to be used.

So I think the mistake is that replace assumes the user wants to use the
same amount of space as the former block device. I think if the user
wanted to use the former block device size on the new block device,
they'd partition it and use the partition as the target.

--
Chris Murphy
Functional difference between "replace" vs "add" then "delete missing" with a missing disk in a RAID56 array
Situation: A six disk RAID5/6 array with a completely failed disk. The
failed disk is removed and an identical replacement drive is plugged in.

Here I have two options for replacing the disk, assuming the old drive is
device 6 in the superblock and the replacement disk is /dev/sda.

'btrfs replace start 6 /dev/sda /mnt'
This will start a rebuild of the array using the new drive, copying data
that would have been on device 6 to the new drive from the parity data.

'btrfs device add /dev/sda /mnt && btrfs device delete missing /mnt'
This adds a new device (the replacement disk) to the array, and dev
delete missing appears to trigger a rebalance before deleting the missing
disk from the array. The end result appears to be identical to option 1.

A few weeks back I recovered an array with a failed drive using 'delete
missing' because 'replace' caused a kernel panic. I later discovered that
this was not (just) a failed drive but some other failed hardware that
I've yet to start diagnosing. Either motherboard or HBA. The drives are
in a new server now and I am currently rebuilding the array with
'replace', which I believe is the "more correct" way to replace a bad
drive in an array.

Both work, but 'replace' seems to be slower, so I'm curious what the
functional differences are between the two under the hood. I thought
replace would be faster, as I assumed it would need to read fewer blocks,
since instead of a complete rebalance it's just rebuilding a drive from
parity data.

What are the differences between the two under the hood? The only obvious
difference I could see is that when I ran 'replace' the space on the
replacement drive was instantly allocated under 'filesystem show', while
when I used 'device delete' the drive usage slowly crept up through the
course of the rebalance.
Re: Some ideas for improvements
On 2016-05-25 21:03, Duncan wrote:
> Dmitry Katsubo posted on Wed, 25 May 2016 16:45:41 +0200 as excerpted:
>> * Would be nice if 'btrfs scrub status' shows estimated finishing time
>> (ETA) and throughput (in Mb/s).
>
> That might not be so easy to implement. (Caveat, I'm not a dev, just a
> btrfs user and list regular, so if a dev says different...)
>
> Currently, a running scrub simply outputs progress to a file (/var/lib/
> btrfs/scrub.status.), and scrub status is simply a UI to pretty-print
> that file. Note that there's nothing in there which lists the total
> number of extents or bytes to go -- that's not calculated ahead of time.
>
> So implementing some form of percentage done or ETA is likely to
> increase the processing time dramatically, as it could involve doing a
> dry-run first, in order to get the total figures against which to
> calculate percentage done.

Indeed, this cannot (should not) be done at the user-space level: the kernel module should provide that information. I am not a dev :) but I think the module should know the number of extents; at least something is shown in "btrfs fi usage ..." output. The information need not be 100% exact, but at least some indication would be great. In the worst case the module could remember the duration of the last scrub and estimate based on that (similar to how some CD burning utilities do).

>> * Not possible to start scrub for all devices in the volume without
>> mounting it.
>
> Interesting. It's news to me that you can scrub individual devices
> without mounting. But given that, this would indeed be a useful feature,
> and given that btrfs filesystem show can get the information, scrub
> should be able to get and make use of it as well. =:^)

Moreover, I fell into a trap when I tried the "btrfs scrub start /dev/..." syntax, as it only scrubs the given device. When I scrubbed the whole volume after mounting it, the result was different. I understood it only after reading man btrfs-scrub more attentively:

start ...
| Start a scrub on all devices of the filesystem identified by <path> or on a single <device>.

Other (shorter) forms of help misled me, giving the impression that it does not matter whether I specify a path or a device.

On 2016-05-26 00:05, Duncan wrote:
> Nicholas D Steeves posted on Wed, 25 May 2016 16:36:13 -0400 as excerpted:
>> On 25 May 2016 at 15:03, Duncan <1i5t5.dun...@cox.net> wrote:
>>> Dmitry Katsubo posted on Wed, 25 May 2016 16:45:41 +0200 as excerpted:
>>>> btrfs-restore [needs an o]ption that applies (y) to all questions
>>>> (completely unattended recovery)
>>>
>>> That['s] a known sore spot that a lot of people have complained about.
>>
>> I'm surprised no one has mentioned, in any of these discussions, what I
>> believe is the standard method of providing this functionality:
>> yes | btrfs-restore -options /dev/disk
>
> Good point.
>
> I didn't bring it up because while I've used btrfs restore a few times,
> my btrfs are all on relatively small SSD partitions, so I both needed
> fewer y's, and the total time per restore is a few minutes, not hours, so
> it wasn't a big deal. As a result, while I know of yes, I didn't need to
> think about automation, and as I never used it, it didn't occur to me to
> suggest it for others.

Thanks for the advice, Nicholas. Last time I tried it I used the following command:

while true; do echo y; done | btrfs restore -voxmSi /dev/sda /mnt/tmp &> btrfs_restore &

which presumably is equivalent to what you suggest. The command was in "running" state in "jobs" output for a while, but then turned into "waiting" state and did not progress. I suspect that btrfs-restore somehow reads directly from the terminal, not from stdin. I will try the "yes | btrfs-restore ..." solution once I get a chance.

--
With best regards,
Dmitry
Re: [PULL] Btrfs for 4.7, part 2
On Sat, May 28, 2016 at 01:14:13PM +0800, Anand Jain wrote:

On 05/27/2016 11:42 PM, Chris Mason wrote:

I'm getting errors from btrfs fi show -d, after the very last round of device replaces. A little extra debugging:

bytenr mismatch, want=4332716032, have=0
ERROR: cannot read chunk root
ERROR reading /dev/vdh
failed /dev/vdh

Which is cute, because the very next command we run fscks /dev/vdh and succeeds.

I checked the code paths; both 'btrfs fi show -d' and 'btrfs check' call flush during their respective open_ctree in progs. However the flush is called after we have read the superblock. That means the superblock read during the 'show' cli (only) happens without a flush, and 'check' won't hit this, because 011 calls 'check' after 'show'. But it still does not explain the above error, which is during open_ctree, not at superblock read. It remains a strange case as of now.

It's because we're just not done writing it out yet when btrfs fi show is run. I think replace is special here. Also, I can't reproduce. I'm in a relatively new test rig using kvm, which probably explains why I haven't seen it before.

You can probably make it easier to hit by adding a sleep inside the actual __free_device() func.

So the page cache is stale and this isn't related to any of our patches. close_ctree() calls into btrfs_close_devices(), which calls btrfs_close_one_device(), which uses:

call_rcu(&device->rcu, free_device);

close_ctree() also does an rcu_barrier() to make sure and wait for free_device() to finish. But free_device() just puts the work into schedule_work(), so we don't know for sure the blkdev_put is done when we exit.

Right, saw that before. Any idea why it's like that? Or if it should be fixed?

It's just trying to limit the work that is done from call_rcu, and it should definitely be fixed. It might cause EBUSY or other problems. Probably easiest to add a counter or completion object that gets changed by the __free_device function.
-chris
Re: [PATCH v3 16/22] btrfs-progs: convert: Introduce function to migrate reserved ranges
On 05/28/2016 11:16 AM, Liu Bo wrote:
On Fri, Jan 29, 2016 at 01:03:26PM +0800, Qu Wenruo wrote:

Introduce a new function, migrate_reserved_ranges(), to migrate used fs data in btrfs reserved space.

Unlike the old implementation, which needed to relocate all the complicated csum and reference relocation, previous patches already ensure such reserved ranges won't be allocated. So here we only need to copy the data out and create new extent/csum/reference items.

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
---
 btrfs-convert.c | 124 +++-
 1 file changed, 122 insertions(+), 2 deletions(-)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index 16e2309..f6126db 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -1679,6 +1679,123 @@ static int create_image_file_range_v2(struct btrfs_trans_handle *trans,
 	return ret;
 }
+
+/*
+ * Relocate old fs data in one reserved range
+ *
+ * Since all old fs data in a reserved range is not covered by any chunk nor
+ * data extent, we don't need to handle any reference but add new
+ * extent/reference, which makes the code clearer
+ */
+static int migrate_one_reserved_range(struct btrfs_trans_handle *trans,
+				      struct btrfs_root *root,
+				      struct cache_tree *used,
+				      struct btrfs_inode_item *inode, int fd,
+				      u64 ino, u64 start, u64 len, int datacsum)
+{
+	u64 cur_off = start;
+	u64 cur_len = len;
+	struct cache_extent *cache;
+	struct btrfs_key key;
+	struct extent_buffer *eb;
+	int ret = 0;
+
+	while (cur_off < start + len) {
+		cache = lookup_cache_extent(used, cur_off, cur_len);
+		if (!cache)
+			break;
+		cur_off = max(cache->start, cur_off);
+		cur_len = min(cache->start + cache->size, start + len) -
+			  cur_off;
+		BUG_ON(cur_len < root->sectorsize);
+
+		/* reserve extent for the data */
+		ret = btrfs_reserve_extent(trans, root, cur_len, 0, 0, (u64)-1,
+					   &key, 1);
+		if (ret < 0)
+			break;
+
+		eb = malloc(sizeof(*eb) + cur_len);
+		if (!eb) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		ret = pread(fd, eb->data, cur_len, cur_off);
+		if (ret < cur_len) {
+			ret = (ret < 0 ? ret : -EIO);
+			free(eb);
+			break;
+		}
+		eb->start = key.objectid;
+		eb->len = key.offset;
+
+		/* Write the data */
+		ret = write_and_map_eb(trans, root, eb);
+		free(eb);
+		if (ret < 0)
+			break;

With write_data_to_disk(), we don't have to create eb for write.

Thanks,
-liubo

Nice advice. I don't remember whether write_data_to_disk() was there when the patchset was written, but it's always a good idea to get rid of the temporary eb.

Thanks,
Qu

+
+		/* Now handle extent item and file extent things */
+		ret = btrfs_record_file_extent(trans, root, ino, inode, cur_off,
+					       key.objectid, key.offset);
+		if (ret < 0)
+			break;
+		/* Finally, insert csum items */
+		if (datacsum)
+			ret = csum_disk_extent(trans, root, key.objectid,
+					       key.offset);
+
+		cur_off += key.offset;
+		cur_len = start + len - cur_off;
+	}
+	return ret;
+}
+
+/*
+ * Relocate the used ext2 data in reserved ranges
+ * [0,1M)
+ * [btrfs_sb_offset(1), +BTRFS_STRIPE_LEN)
+ * [btrfs_sb_offset(2), +BTRFS_STRIPE_LEN)
+ */
+static int migrate_reserved_ranges(struct btrfs_trans_handle *trans,
+				   struct btrfs_root *root,
+				   struct cache_tree *used,
+				   struct btrfs_inode_item *inode, int fd,
+				   u64 ino, u64 total_bytes, int datacsum)
+{
+	u64 cur_off;
+	u64 cur_len;
+	int ret = 0;
+
+	/* 0 ~ 1M */
+	cur_off = 0;
+	cur_len = 1024 * 1024;
+	ret = migrate_one_reserved_range(trans, root, used, inode, fd, ino,
+					 cur_off, cur_len, datacsum);
+	if (ret < 0)
+		return ret;
+
+	/* second sb (first sb is included in 0~1M) */
+	cur_off = btrfs_sb_offset(1);
+	cur_len = min(total_bytes, cur_off + BTRFS_STRIPE_LEN) - cur_off;
+	if (cur_off > total_bytes)
+		return ret;
+	ret =
Re: [PATCH v3 21/22] btrfs-progs: convert: Strictly avoid meta or system chunk allocation
On 05/28/2016 11:30 AM, Liu Bo wrote:
On Fri, Jan 29, 2016 at 01:03:31PM +0800, Qu Wenruo wrote:

Before this patch, btrfs-convert only relied on a large enough initial system/metadata chunk size to ensure no newer system/meta chunk will be created. But that's not safe enough.

So add two new members in fs_info, the avoid_sys/meta_chunk_alloc flags, to prevent any newer system or metadata chunks from being created before init_btrfs_v2().

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
---
 btrfs-convert.c | 9 +
 ctree.h | 3 +++
 extent-tree.c | 10 ++
 3 files changed, 22 insertions(+)

diff --git a/btrfs-convert.c b/btrfs-convert.c
index efa3b02..333f413 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -2322,6 +2322,13 @@ static int init_btrfs_v2(struct btrfs_mkfs_config *cfg, struct btrfs_root *root,
 	struct btrfs_fs_info *fs_info = root->fs_info;
 	int ret;
+	/*
+	 * Don't alloc any metadata/system chunk, as we don't want
+	 * any meta/sys chunk allocated before all data chunks are inserted.
+	 * Or we screw up the chunk layout just like the old implementation.
+	 */

I don't get this. With this patch set, we can allocate data from a DATA chunk and metadata from a METADATA chunk, but then we're not allowed to allocate new chunks?

Thanks,

For new convert, we go through the following steps:

1) Create initial meta/sys chunks into unused space, manually
2) Open the fs
3) Insert data chunks to cover all ext* used data
4) Do per-inode copying

The whole patchset relies on a key assumption: all data chunks are already allocated to cover all ext* used data, so new chunk/extent allocation can follow the normal routine.

Before that, only the chunks created in step 1) are completely safe, and any new chunk allocation before step 3) is unsafe, as the key assumption is not met yet.

So, until step 3), we must not allocate any new data/metadata chunks.

Thanks,
Qu

-liubo

+	fs_info->avoid_sys_chunk_alloc = 1;
+	fs_info->avoid_meta_chunk_alloc = 1;
 	trans = btrfs_start_transaction(root, 1);
 	BUG_ON(!trans);
 	ret = btrfs_fix_block_accounting(trans, root);
@@ -2359,6 +2366,8 @@ static int init_btrfs_v2(struct btrfs_mkfs_config *cfg, struct btrfs_root *root,
 		goto err;
 	ret = btrfs_commit_transaction(trans, root);
+	fs_info->avoid_sys_chunk_alloc = 0;
+	fs_info->avoid_meta_chunk_alloc = 0;
 err:
 	return ret;
 }
diff --git a/ctree.h b/ctree.h
index 1443746..187bd27 100644
--- a/ctree.h
+++ b/ctree.h
@@ -1030,6 +1030,9 @@ struct btrfs_fs_info {
 	unsigned int quota_enabled:1;
 	unsigned int suppress_check_block_errors:1;
 	unsigned int ignore_fsid_mismatch:1;
+	unsigned int avoid_meta_chunk_alloc:1;
+	unsigned int avoid_sys_chunk_alloc:1;
+
 	int (*free_extent_hook)(struct btrfs_trans_handle *trans,
 				struct btrfs_root *root,
diff --git a/extent-tree.c b/extent-tree.c
index 93b1945..e7c61b1 100644
--- a/extent-tree.c
+++ b/extent-tree.c
@@ -1904,6 +1904,16 @@ static int do_chunk_alloc(struct btrfs_trans_handle *trans,
 	    thresh)
 		return 0;
+	/*
+	 * Avoid allocating given chunk type
+	 */
+	if (extent_root->fs_info->avoid_meta_chunk_alloc &&
+	    (flags & BTRFS_BLOCK_GROUP_METADATA))
+		return 0;
+	if (extent_root->fs_info->avoid_sys_chunk_alloc &&
+	    (flags & BTRFS_BLOCK_GROUP_SYSTEM))
+		return 0;
+
 	ret = btrfs_alloc_chunk(trans, extent_root, &start, &num_bytes,
 				space_info->flags);
 	if (ret == -ENOSPC) {
--
2.7.0
Re: [PATCH v3 05/22] btrfs-progs: Introduce function to setup temporary superblock
On 05/28/2016 11:04 AM, Liu Bo wrote:
On Fri, Jan 29, 2016 at 01:03:15PM +0800, Qu Wenruo wrote:

Introduce a new function, setup_temp_super(), to set up a temporary superblock for make_btrfs_v2().

Signed-off-by: Qu Wenruo
Signed-off-by: David Sterba
---
 utils.c | 117 
 1 file changed, 117 insertions(+)

diff --git a/utils.c b/utils.c
index bc10293..ed5476d 100644
--- a/utils.c
+++ b/utils.c
@@ -212,6 +212,98 @@ static int reserve_free_space(struct cache_tree *free_tree, u64 len,
 	return 0;
 }
+static inline int write_temp_super(int fd, struct btrfs_super_block *sb,
+				   u64 sb_bytenr)
+{
+	u32 crc = ~(u32)0;
+	int ret;
+
+	crc = btrfs_csum_data(NULL, (char *)sb + BTRFS_CSUM_SIZE, crc,
+			      BTRFS_SUPER_INFO_SIZE - BTRFS_CSUM_SIZE);
+	btrfs_csum_final(crc, (char *)&sb->csum[0]);
+	ret = pwrite(fd, sb, BTRFS_SUPER_INFO_SIZE, sb_bytenr);
+	if (ret < BTRFS_SUPER_INFO_SIZE)
+		ret = (ret < 0 ? -errno : -EIO);
+	else
+		ret = 0;
+	return ret;
+}
+
+/*
+ * Setup temporary superblock at cfg->super_bytenr
+ * Needed info is extracted from cfg, root_bytenr and chunk_bytenr
+ *
+ * For now the sys chunk array will be empty and dev_item is empty too.
+ * They will be re-initialized at temp chunk tree setup.
+ */
+static int setup_temp_super(int fd, struct btrfs_mkfs_config *cfg,
+			    u64 root_bytenr, u64 chunk_bytenr)
+{
+	unsigned char chunk_uuid[BTRFS_UUID_SIZE];
+	char super_buf[BTRFS_SUPER_INFO_SIZE];
+	struct btrfs_super_block *super = (struct btrfs_super_block *)super_buf;
+	int ret;
+
+	/*
+	 * We rely on cfg->chunk_uuid and cfg->fs_uuid to pass uuids
+	 * to other functions.
+	 * Caller must allocate space for them
+	 */
+	BUG_ON(!cfg->chunk_uuid || !cfg->fs_uuid);
+	memset(super_buf, 0, BTRFS_SUPER_INFO_SIZE);
+	cfg->num_bytes = round_down(cfg->num_bytes, cfg->sectorsize);
+
+	if (cfg->fs_uuid && *cfg->fs_uuid) {
+		if (uuid_parse(cfg->fs_uuid, super->fsid) != 0) {
+			error("could not parse UUID: %s", cfg->fs_uuid);
+			ret = -EINVAL;
+			goto out;
+		}
+		if (!test_uuid_unique(cfg->fs_uuid)) {
+			error("non-unique UUID: %s", cfg->fs_uuid);
+			ret = -EINVAL;
+			goto out;
+		}
+	} else {
+		uuid_generate(super->fsid);
+		uuid_unparse(super->fsid, cfg->fs_uuid);
+	}
+	uuid_generate(chunk_uuid);
+	uuid_unparse(chunk_uuid, cfg->chunk_uuid);
+
+	btrfs_set_super_bytenr(super, cfg->super_bytenr);
+	btrfs_set_super_num_devices(super, 1);
+	btrfs_set_super_magic(super, BTRFS_MAGIC);
+	btrfs_set_super_generation(super, 1);
+	btrfs_set_super_root(super, root_bytenr);
+	btrfs_set_super_chunk_root(super, chunk_bytenr);
+	btrfs_set_super_total_bytes(super, cfg->num_bytes);
+	/*
+	 * The temporary btrfs will only have 6 tree roots:
+	 * chunk tree, root tree, extent_tree, device tree, fs tree
+	 * and csum tree.
+	 */
+	btrfs_set_super_bytes_used(super, 6 * cfg->nodesize);
+	btrfs_set_super_sectorsize(super, cfg->sectorsize);
+	btrfs_set_super_leafsize(super, cfg->nodesize);
+	btrfs_set_super_nodesize(super, cfg->nodesize);
+	btrfs_set_super_stripesize(super, cfg->stripesize);
+	btrfs_set_super_csum_type(super, BTRFS_CSUM_TYPE_CRC32);
+	btrfs_set_super_chunk_root(super, chunk_bytenr);
+	btrfs_set_super_cache_generation(super, -1);
+	btrfs_set_super_incompat_flags(super, cfg->features);
+	if (cfg->label)
+		strncpy(super->label, cfg->label, BTRFS_LABEL_SIZE - 1);

Why not use __strncpy_null?

Thanks,
-liubo

Good idea, I'll add a new patch to use it.

Thanks,
Qu

+
+	/* Sys chunk array will be re-initialized at chunk tree init time */
+	super->sys_chunk_array_size = 0;
+
+	ret = write_temp_super(fd, super, cfg->super_bytenr);
+out:
+	return ret;
+}
+
 /*
  * Improved version of make_btrfs().
  *
@@ -230,6 +322,10 @@ static int make_convert_btrfs(int fd, struct btrfs_mkfs_config *cfg,
 	struct cache_tree *used = >used;
 	u64 sys_chunk_start;
 	u64 meta_chunk_start;
+	/* chunk tree bytenr, in system chunk */
+	u64 chunk_bytenr;
+	/* metadata trees bytenr, in metadata chunk */
+	u64 root_bytenr;
 	int ret;
 	/* Shouldn't happen */
@@ -260,6 +356,27 @@ static int make_convert_btrfs(int fd, struct btrfs_mkfs_config *cfg,
 	if (ret < 0)
 		goto out;
+	/*
+	 * Inside the allocated metadata chunk, its layout will be:
+
Re: [PATCH] btrfs,vfs: allow FILE_EXTENT_SAME on a file opened ro
29.05.2016 03:56, Zygo Blaxell wrote:
>>
>> I don't think this can happen on btrfs: the superblock is updated only
>> after a barrier, when both the data and extent refs are already on the
>> disk.
>
> If and only if the filesystem is mounted with the flushoncommit option,
> that's true. This is not the default, though, and I lost a fair amount
> of time and data before I discovered this.

According to the wiki, this is the default on "reasonably recent kernels"; unfortunately it does not say which kernel is recent enough. I am surprised it can be disabled at all.
Re: Hot data tracking / hybrid storage
20.05.2016 20:59, Austin S. Hemmelgarn wrote:
> On 2016-05-20 13:02, Ferry Toth wrote:
>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>> partitions are in the same pool, which is in btrfs RAID10 format. /boot
>> is in subvolume @boot.
> If you have GRUB installed on all 4, then you don't actually have the
> full 2047 sectors between the MBR and the partition free, as GRUB is
> embedded in that space. I forget exactly how much space it takes up,
> but I know it's not the whole 1023.5K. I would not suggest risking usage
> of the final 8k there though.

If you mean grub2, the required space is variable and depends on where /boot/grub is located (i.e. which drivers it needs to access it). Assuming plain btrfs on legacy BIOS MBR, the required space is around 40-50KB. Note that grub2 detects some post-MBR gap software signatures and skips over them (the space need not be contiguous). It is entirely possible to add bcache detection if enough demand exists.