[PATCH RFC RESEND] btrfs: harden agaist duplicate fsid
(Thanks for the comments on requiring to warn_on if we fail the device change.) (This fixes an ugly bug, I appreciate if you have any further comments). Its not that impossible to imagine that a device OR a btrfs image is been copied just by using the dd or the cp command. Which in case both the copies of the btrfs will have the same fsid. If on the system with automount enabled, the copied FS gets scanned. We have a known bug in btrfs, that we let the device path be changed after the device has been mounted. So using this loop hole the new copied device would appears as if its mounted immediately after its been copied. For example: Initially.. /dev/mmcblk0p4 is mounted as / lsblk NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT mmcblk0 179:00 29.2G 0 disk |-mmcblk0p4 179:404G 0 part / |-mmcblk0p2 179:20 500M 0 part /boot |-mmcblk0p3 179:30 256M 0 part [SWAP] `-mmcblk0p1 179:10 256M 0 part /boot/efi btrfs fi show Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4 Copy mmcblk0 to sda dd if=/dev/mmcblk0 of=/dev/sda And immediately after the copy completes the change in the device superblock is notified which the automount scans using btrfs device scan and the new device sda becomes the mounted root device. lsblk NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:01 14.9G 0 disk |-sda48:414G 0 part / |-sda28:21 500M 0 part |-sda38:31 256M 0 part `-sda18:11 256M 0 part mmcblk0 179:00 29.2G 0 disk |-mmcblk0p4 179:404G 0 part |-mmcblk0p2 179:20 500M 0 part /boot |-mmcblk0p3 179:30 256M 0 part [SWAP] `-mmcblk0p1 179:10 256M 0 part /boot/efi btrfs fi show / Label: none uuid: 07892354-ddaa-4443-90ea-f76a06accaba Total devices 1 FS bytes used 1.40GiB devid1 size 4.00GiB used 3.00GiB path /dev/sda4 The bug is quite nasty that you can't either unmount /dev/sda4 or /dev/mmcblk0p4. And the problem does not get solved until you take sda out of the system on to another system to change its fsid using the 'btrfstune -u' command. Signed-off-by: Anand Jain --- Hi, There was previous attempt to fix this bug ref: www.spinics.net/lists/linux-btrfs/msg37466.html which broke the Ubuntu subvol mount at boot. The reason for that is, Ubuntu changes the device path in the boot process, and the earlier fix checked for the device-path instead of block_device and and so we failed the subvol mount request and thus the bootup process. This patch instead checks for the block_device instead of the device-path. I have tested this with Oracle Linux with btrfs as boot device with a subvol to be mounted at boot. And also have verified with new test case btrfs/173. It will be good if someone run this through Ubuntu boot test case. fs/btrfs/volumes.c | 23 +++ 1 file changed, 23 insertions(+) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index f4405e430da6..62173a3abcc4 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -850,6 +850,29 @@ static noinline struct btrfs_device *device_list_add(const char *path, return ERR_PTR(-EEXIST); } + /* +* we are going to replace the device path, make sure its the +* same device if the device mounted +*/ + if (device->bdev) { + struct block_device *path_bdev; + + path_bdev = lookup_bdev(path); + if (IS_ERR(path_bdev)) { + mutex_unlock(_devices->device_list_mutex); + return ERR_CAST(path_bdev); + } + + if (device->bdev != path_bdev) { + bdput(path_bdev); + mutex_unlock(_devices->device_list_mutex); + return ERR_PTR(-EEXIST); + } + bdput(path_bdev); + pr_info("BTRFS: device fsid:devid %pU:%llu old path:%s new path:%s\n", + disk_super->fsid, devid, rcu_str_deref(device->name), path); + } + name = rcu_string_strdup(path, GFP_NOFS); if (!name) { mutex_unlock(_devices->device_list_mutex); -- 1.8.3.1
Re: [PATCH 0/6] Some trivail cleanup about dealyed-refs
On Thu, Oct 11, 2018 at 01:51:37PM +0200, David Sterba wrote: >On Thu, Oct 11, 2018 at 01:40:32PM +0800, Lu Fengqi wrote: >> There is no functional change. Just improve readablity. >> >> PATCH 1-4 parameter cleanup patches >> PATCH 5 cleanup about btrfs_select_ref_head >> PATCH 6 switch int to bool; add some comment >> >> Lu Fengqi (6): >> btrfs: delayed-ref: pass delayed_refs directly to >> btrfs_select_ref_head() >> btrfs: delayed-ref: pass delayed_refs directly to >> btrfs_delayed_ref_lock() >> btrfs: remove fs_info from btrfs_check_space_for_delayed_refs >> btrfs: remove fs_info from btrfs_should_throttle_delayed_refs >> btrfs: simplify btrfs_select_ref_head and cleanup some local variables >> btrfs: switch return_bigger to bool in find_ref_head > >1-4 and 6 added to misc-next, thanks. There is not patch 2 at the misc-next branch. So it was forgotten? -- Thanks, Lu
Re: [PATCH 5/6] btrfs: simplify btrfs_select_ref_head and cleanup some local variables
On Thu, Oct 11, 2018 at 02:45:04PM +0200, David Sterba wrote: >On Thu, Oct 11, 2018 at 03:28:15PM +0300, Nikolay Borisov wrote: >> > I noticed that there is a macro called SCRAMBLE_DELAYED_REFS in the >> > extent-tree.c. I am a bit curious whether it has been forgotten by >> > everyone, I have not found any test results about its performance impact. >> >> I guess it was used during testing but nothing currently sets it. I.e it >> might make sense to enable it if BTRFS_DEBUG is set. > >Agreed, the way the scrambling is supposed to be used does not align >very well with the typical testing workflow so adding to ti the >BTRFS_DEBUG set is ok, unless there are severe performance problems. I will add it to the BTRFS_DEBUG set, and test if it has severe performance problems. > >The part in btrfs_run_delayed_refs would be better hidden in a function >similar to btrfs_debug_check_extent_io_range or btrfs_leak_debug_check. Got it. -- Thanks, Lu
Re: [PATCH 5/6] btrfs: simplify btrfs_select_ref_head and cleanup some local variables
On Thu, Oct 11, 2018 at 03:28:15PM +0300, Nikolay Borisov wrote: > > >On 11.10.2018 15:15, Lu Fengqi wrote: >> On Thu, Oct 11, 2018 at 09:40:52AM +0300, Nikolay Borisov wrote: >>> >>> >>> On 11.10.2018 08:40, Lu Fengqi wrote: If the return value of find_ref_head() is NULL, the only possibility is that delayed_refs' head ref rbtree is empty. Hence, the second find_ref_head() is pointless. > Besides, the local variables loop and start are unnecessary, just remove them. >>> >>> So the objective of that function is to get a reference to the first >>> delayed head which is not processed. This is done by essentially keeping >>> track of the last range that was processed in >>> delayed_refs->run_delayed_start Signed-off-by: Lu Fengqi --- fs/btrfs/delayed-ref.c | 17 +++-- 1 file changed, 3 insertions(+), 14 deletions(-) diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index 885581852bea..2726d2fb4bbe 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -354,20 +354,11 @@ struct btrfs_delayed_ref_head * btrfs_select_ref_head(struct btrfs_delayed_ref_root *delayed_refs) { struct btrfs_delayed_ref_head *head; - u64 start; - bool loop = false; again: - start = delayed_refs->run_delayed_start; - head = find_ref_head(delayed_refs, start, 1); - if (!head && !loop) { + head = find_ref_head(delayed_refs, delayed_refs->run_delayed_start, 1); + if (!head) { delayed_refs->run_delayed_start = 0; - start = 0; - loop = true; - head = find_ref_head(delayed_refs, start, 1); - if (!head) - return NULL; - } else if (!head && loop) { >>> >>> I believe this will have a negative impact since it actually will >>> prevent finding a head which was added BEFORE the last processed head. >>> So when a ref head is selected in btrfs_obtain_ref_head then the >>> delayed_refs->lock is dropped and the given head is locked and >>> delayed_refs->run_delayed_start points to the end of the selected range >>> that the head represents. At this point it's possible that another >>> thread modifies a different range which is before the one we have >>> selected so graphically it will be something like: >>> >>> >>> ---[HEAD2]->[HEAD1]-- >>> 0N >>> >>> Where HEAD1 is the head returned from first invocation of >>> btrfs_obtain_ref_head. Once btrfs_obtain_ref_head is called the 2nd >>> time it will not find HEAD2 so will just reset run_delayed_start to 0 >>> and return. So it will be up to another run of the delayed refs to >>> actually find head2. Essentially you made btrfs_obtain_ref_head less >> >> Not exactly. In fact, find_ref_head hides such a logic. When >> return_bigger is set, if there is no larger entry to return, the first >> entry will be returned. Please see the comment I add in the PATCH 6. >> >> Hence, the 2nd invocation of btrfs_obtain_ref_head still will return >> HEAD2. There is no functional change here. >> >> However, your question makes me consider whether such hidden logic >> should be extracted from find_ref_head to btrfs_select_ref_head. > >Right I agree with your. As it stands I will expect that if >return_bigger is true to specifically return a bigger entry or if >nothing is found to return null. IMO this behavior is higher level and This is also exactly what I want. The patch is on the way. >belongs to btrfs_delayed_ref_head. > >> >>> greedy. Have you characterized what kind of performance impact this have? >> >> I noticed that there is a macro called SCRAMBLE_DELAYED_REFS in the >> extent-tree.c. I am a bit curious whether it has been forgotten by >> everyone, I have not found any test results about its performance impact. > >I guess it was used during testing but nothing currently sets it. I.e it >might make sense to enable it if BTRFS_DEBUG is set. > Make sense. -- Thanks, Lu
Re: reproducible builds with btrfs seed feature
On Sun, Oct 14, 2018 at 1:09 PM, Cerem Cem ASLAN wrote: > Thanks for the explanation, I got it now. I still think this is > related with my needs, so I'll keep an eye on this. > > What is the possible use case? I can think of only one scenario: You > have a rootfs that contains a distro installer and you want to > generate distro.img files which uses Btrfs under the hood in different > locations and still have the same hash, so you can publish your > verified image hash by a single source (https://your-distro.org). The first step is accepting reproducible builds as a worthy goal in and of itself independent of Btrfs. Specifically "Why does it matter?" found here https://reproducible-builds.org/ Btrfs does bring valuable features for installation images: always on checksumming; seed feature permits a straightforward way to setup a volatile overlay on zram device; ability to convert it to a non-volatile overlay, and boot either the seed or overlay; and even installation by adding the install target and removing both overlay and seed. And yet it remains compatible with a conventional copy to another file system if it's not desirable to use Btrfs as root. Win win. By subsetting Btrfs features we don't care about in the installation seed context, can we achieve reproducibility as a consequence, while retaining some of the more interesting features? Of course once sprouted, those limitations wouldn't apply. Basically it's a "btrfs seed device 2.0" idea. But Btrfs is so complicated it's maybe too much work, hence the question. -- Chris Murphy
Re: reproducible builds with btrfs seed feature
Thanks for the explanation, I got it now. I still think this is related with my needs, so I'll keep an eye on this. What is the possible use case? I can think of only one scenario: You have a rootfs that contains a distro installer and you want to generate distro.img files which uses Btrfs under the hood in different locations and still have the same hash, so you can publish your verified image hash by a single source (https://your-distro.org). You'll sync next release files with the remote servers by using diffs (btrfs send/receive) and they will generate distro.img independently, still having the same hash that you'll later verify by https://your-distro.org. Chris Murphy , 14 Eki 2018 Paz, 21:10 tarihinde şunu yazdı: > > On Sun, Oct 14, 2018 at 6:20 AM, Cerem Cem ASLAN > wrote: > > I'm not sure I could fully understand the desired achievement but it > > sounds like (or this would be an example of selective perception) it's > > somehow related with "creating reproducible snapshots" > > (https://unix.stackexchange.com/q/462451/65781), no? > > No the idea is to be able to consistently reproduce a distro installer > image (like an ISO file) with the same hash. Inside the ISO image, is > typically a root.img or squash.img which itself contains a file system > like ext4 or squashfs, to act as the system root. And that root.img is > the main thing I'm talking about here. There is work to make squashfs > deterministic, as well as ext4. And I'm wondering if there are sane > ways to constrain Btrfs features to make it likewise deterministic. > > For example: > > fallocate -l 5G btrfsroot.img > losetup /dev/loop0 btrfsroot.img > mkfs.btrfs -m single -d single -rseed --rootdir /tmp/ -T > "20181010T1200" --uuidv $X --uuidc $Y --uuidd $Z ... > shasum btrfsroot.img > > And then do it again, and the shasum's should be the same. I realize > today it's not that way. And that inode assignment, extent allocation > (number, size, locality) are all factors in making Btrfs quickly > non-determinstic, and also why I'm assuming this needs to be done in > user space. That would be the point of the -rseed flag: set the seed > flag, possibly set a compat_ro flag, fix generation/transid to 1, > require the use of -T (similar to make_ext4) to set all timestamps to > this value, and configurable uuid's for everything that uses uuids, > and whatever other constraints are necessary. > > > -- > Chris Murphy
Re: reproducible builds with btrfs seed feature
On Sun, Oct 14, 2018 at 6:20 AM, Cerem Cem ASLAN wrote: > I'm not sure I could fully understand the desired achievement but it > sounds like (or this would be an example of selective perception) it's > somehow related with "creating reproducible snapshots" > (https://unix.stackexchange.com/q/462451/65781), no? No the idea is to be able to consistently reproduce a distro installer image (like an ISO file) with the same hash. Inside the ISO image, is typically a root.img or squash.img which itself contains a file system like ext4 or squashfs, to act as the system root. And that root.img is the main thing I'm talking about here. There is work to make squashfs deterministic, as well as ext4. And I'm wondering if there are sane ways to constrain Btrfs features to make it likewise deterministic. For example: fallocate -l 5G btrfsroot.img losetup /dev/loop0 btrfsroot.img mkfs.btrfs -m single -d single -rseed --rootdir /tmp/ -T "20181010T1200" --uuidv $X --uuidc $Y --uuidd $Z ... shasum btrfsroot.img And then do it again, and the shasum's should be the same. I realize today it's not that way. And that inode assignment, extent allocation (number, size, locality) are all factors in making Btrfs quickly non-determinstic, and also why I'm assuming this needs to be done in user space. That would be the point of the -rseed flag: set the seed flag, possibly set a compat_ro flag, fix generation/transid to 1, require the use of -T (similar to make_ext4) to set all timestamps to this value, and configurable uuid's for everything that uses uuids, and whatever other constraints are necessary. -- Chris Murphy
Re: reproducible builds with btrfs seed feature
I'm not sure I could fully understand the desired achievement but it sounds like (or this would be an example of selective perception) it's somehow related with "creating reproducible snapshots" (https://unix.stackexchange.com/q/462451/65781), no? Chris Murphy , 14 Eki 2018 Paz, 02:05 tarihinde şunu yazdı: > > On Sat, Oct 13, 2018 at 4:28 PM, Chris Murphy wrote: > > Is it practical and desirable to make Btrfs based OS installation > > images reproducible? Or is Btrfs simply too complex and > > non-deterministic? [1] > > > > The main three problems with Btrfs right now for reproducibility are: > > a. many objects have uuids other than the volume uuid; and mkfs only > > lets us set the volume uuid > > b. atime, ctime, mtime, otime; and no way to make them all the same > > c. non-deterministic allocation of file extents, compression, inode > > assignment, logical and physical address allocation > > d. generation, just pick a consistent default because the entire image > is made with mkfs and then never rw mounted so it's not a problem > > > - Possibly disallow subvolumes and snapshots > > There's no actual mechanism to do either of these with mkfs, so it's > not a problem. And if a sprout is created, it's fine for newly created > subvolumes to follow the usual behavior of having unique UUID and > incrementing generation. Thing is, the sprout will inherit the seeds > preset chunk uuid, which while it shouldn't cause a problem is a kind > of violation of uuid uniqueness; but ultimately I'm not sure how big > of a problem it is for such uuids to spread. > > > > -- > Chris Murphy
Re: BTRFS bad block management. Does it exist?
On 2018/10/14 下午7:08, waxhead wrote: > In case BTRFS fails to WRITE to a disk. What happens? Normally it should return error when we flush disk. And in that case, error will leads to transaction abort and the fs goes RO to prevent further corruption. > Does the bad area get mapped out somehow? No. > Does it try again until it > succeed or until it "times out" or reach a threshold counter? Unless it's done by block layer, btrfs doesn't try that. > Does it eventually try to write to a different disk (in case of using > the raid1/10 profile?) No. That's not what RAID is designed to do. It's only allowed to have any flush error if using "degraded" mount option and the error is under tolerance. Thanks, Qu signature.asc Description: OpenPGP digital signature
BTRFS bad block management. Does it exist?
In case BTRFS fails to WRITE to a disk. What happens? Does the bad area get mapped out somehow? Does it try again until it succeed or until it "times out" or reach a threshold counter? Does it eventually try to write to a different disk (in case of using the raid1/10 profile?)