[PATCH RFC RESEND] btrfs: harden agaist duplicate fsid

2018-10-14 Thread Anand Jain
(Thanks for the comments on requiring to warn_on if we fail the device change.)
(This fixes an ugly bug, I appreciate if you have any further comments).

Its not that impossible to imagine that a device OR a btrfs image is
been copied just by using the dd or the cp command. Which in case both
the copies of the btrfs will have the same fsid. If on the system with
automount enabled, the copied FS gets scanned.

We have a known bug in btrfs, that we let the device path be changed
after the device has been mounted. So using this loop hole the new
copied device would appears as if its mounted immediately after its
been copied.

For example:

Initially.. /dev/mmcblk0p4 is mounted as /

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part /
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi

btrfs fi show
   Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
   Total devices 1 FS bytes used 1.40GiB
   devid1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4

Copy mmcblk0 to sda
   dd if=/dev/mmcblk0 of=/dev/sda

And immediately after the copy completes the change in the device
superblock is notified which the automount scans using
btrfs device scan and the new device sda becomes the mounted root
device.

lsblk
NAMEMAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda   8:01 14.9G  0 disk
|-sda48:414G  0 part /
|-sda28:21  500M  0 part
|-sda38:31  256M  0 part
`-sda18:11  256M  0 part
mmcblk0 179:00 29.2G  0 disk
|-mmcblk0p4 179:404G  0 part
|-mmcblk0p2 179:20  500M  0 part /boot
|-mmcblk0p3 179:30  256M  0 part [SWAP]
`-mmcblk0p1 179:10  256M  0 part /boot/efi

btrfs fi show /
 Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
 Total devices 1 FS bytes used 1.40GiB
 devid1 size 4.00GiB used 3.00GiB path /dev/sda4

The bug is quite nasty that you can't either unmount /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take
sda out of the system on to another system to change its fsid
using the 'btrfstune -u' command.

Signed-off-by: Anand Jain 
---

Hi,

There was previous attempt to fix this bug ref: 
   www.spinics.net/lists/linux-btrfs/msg37466.html

which broke the Ubuntu subvol mount at boot. The reason
for that is, Ubuntu changes the device path in the boot
process, and the earlier fix checked for the device-path
instead of block_device and and so we failed the
subvol mount request and thus the bootup process.
This patch instead checks for the block_device instead of
the device-path.

I have tested this with Oracle Linux with btrfs as boot device
with a subvol to be mounted at boot. And also have verified
with new test case btrfs/173.

It will be good if someone run this through Ubuntu boot test case.

 fs/btrfs/volumes.c | 23 +++
 1 file changed, 23 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4405e430da6..62173a3abcc4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -850,6 +850,29 @@ static noinline struct btrfs_device *device_list_add(const 
char *path,
return ERR_PTR(-EEXIST);
}
 
+   /*
+* we are going to replace the device path, make sure its the
+* same device if the device mounted
+*/
+   if (device->bdev) {
+   struct block_device *path_bdev;
+
+   path_bdev = lookup_bdev(path);
+   if (IS_ERR(path_bdev)) {
+   mutex_unlock(_devices->device_list_mutex);
+   return ERR_CAST(path_bdev);
+   }
+
+   if (device->bdev != path_bdev) {
+   bdput(path_bdev);
+   mutex_unlock(_devices->device_list_mutex);
+   return ERR_PTR(-EEXIST);
+   }
+   bdput(path_bdev);
+   pr_info("BTRFS: device fsid:devid %pU:%llu old path:%s 
new path:%s\n",
+   disk_super->fsid, devid, 
rcu_str_deref(device->name), path);
+   }
+
name = rcu_string_strdup(path, GFP_NOFS);
if (!name) {
mutex_unlock(_devices->device_list_mutex);
-- 
1.8.3.1



Re: [PATCH 0/6] Some trivail cleanup about dealyed-refs

2018-10-14 Thread Lu Fengqi
On Thu, Oct 11, 2018 at 01:51:37PM +0200, David Sterba wrote:
>On Thu, Oct 11, 2018 at 01:40:32PM +0800, Lu Fengqi wrote:
>> There is no functional change. Just improve readablity.
>> 
>> PATCH 1-4 parameter cleanup patches
>> PATCH 5 cleanup about btrfs_select_ref_head
>> PATCH 6 switch int to bool; add some comment
>> 
>> Lu Fengqi (6):
>>   btrfs: delayed-ref: pass delayed_refs directly to
>> btrfs_select_ref_head()
>>   btrfs: delayed-ref: pass delayed_refs directly to
>> btrfs_delayed_ref_lock()
>>   btrfs: remove fs_info from btrfs_check_space_for_delayed_refs
>>   btrfs: remove fs_info from btrfs_should_throttle_delayed_refs
>>   btrfs: simplify btrfs_select_ref_head and cleanup some local variables
>>   btrfs: switch return_bigger to bool in find_ref_head
>
>1-4 and 6 added to misc-next, thanks.

There is not patch 2 at the misc-next branch. So it was forgotten?

-- 
Thanks,
Lu




Re: [PATCH 5/6] btrfs: simplify btrfs_select_ref_head and cleanup some local variables

2018-10-14 Thread Lu Fengqi
On Thu, Oct 11, 2018 at 02:45:04PM +0200, David Sterba wrote:
>On Thu, Oct 11, 2018 at 03:28:15PM +0300, Nikolay Borisov wrote:
>> > I noticed that there is a macro called SCRAMBLE_DELAYED_REFS in the
>> > extent-tree.c. I am a bit curious whether it has been forgotten by
>> > everyone, I have not found any test results about its performance impact.
>> 
>> I guess it was used during testing but nothing currently sets it. I.e it
>> might make sense to enable it if BTRFS_DEBUG is set.
>
>Agreed, the way the scrambling is supposed to be used does not align
>very well with the typical testing workflow so adding to ti the
>BTRFS_DEBUG set is ok, unless there are severe performance problems.

I will add it to the BTRFS_DEBUG set, and test if it has severe
performance problems.

>
>The part in btrfs_run_delayed_refs would be better hidden in a function
>similar to btrfs_debug_check_extent_io_range or btrfs_leak_debug_check.

Got it.

-- 
Thanks,
Lu




Re: [PATCH 5/6] btrfs: simplify btrfs_select_ref_head and cleanup some local variables

2018-10-14 Thread Lu Fengqi
On Thu, Oct 11, 2018 at 03:28:15PM +0300, Nikolay Borisov wrote:
>
>
>On 11.10.2018 15:15, Lu Fengqi wrote:
>> On Thu, Oct 11, 2018 at 09:40:52AM +0300, Nikolay Borisov wrote:
>>>
>>>
>>> On 11.10.2018 08:40, Lu Fengqi wrote:
 If the return value of find_ref_head() is NULL, the only possibility is
 that delayed_refs' head ref rbtree is empty. Hence, the second
 find_ref_head() is pointless.
> Besides, the local variables loop and start are unnecessary, just remove
 them.
>>>
>>> So the objective of that function is to get a reference to the first
>>> delayed head which is not processed. This is done by essentially keeping
>>> track of the last range that was processed in
>>> delayed_refs->run_delayed_start

 Signed-off-by: Lu Fengqi 
 ---
  fs/btrfs/delayed-ref.c | 17 +++--
  1 file changed, 3 insertions(+), 14 deletions(-)

 diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
 index 885581852bea..2726d2fb4bbe 100644
 --- a/fs/btrfs/delayed-ref.c
 +++ b/fs/btrfs/delayed-ref.c
 @@ -354,20 +354,11 @@ struct btrfs_delayed_ref_head *
  btrfs_select_ref_head(struct btrfs_delayed_ref_root *delayed_refs)
  {
struct btrfs_delayed_ref_head *head;
 -  u64 start;
 -  bool loop = false;
  
  again:
 -  start = delayed_refs->run_delayed_start;
 -  head = find_ref_head(delayed_refs, start, 1);
 -  if (!head && !loop) {
 +  head = find_ref_head(delayed_refs, delayed_refs->run_delayed_start, 1);
 +  if (!head) {
delayed_refs->run_delayed_start = 0;
 -  start = 0;
 -  loop = true;
 -  head = find_ref_head(delayed_refs, start, 1);
 -  if (!head)
 -  return NULL;
 -  } else if (!head && loop) {
>>>
>>> I believe this will have a negative impact since it actually will
>>> prevent finding a head which was added BEFORE the last processed head.
>>> So when a ref head is selected in btrfs_obtain_ref_head then the
>>> delayed_refs->lock is dropped and the given head is locked and
>>> delayed_refs->run_delayed_start points to the end of the selected range
>>> that the head represents. At this point it's possible that another
>>> thread modifies a different range which is before the one we have
>>> selected so graphically it will be something like:
>>>
>>>
>>> ---[HEAD2]->[HEAD1]--
>>> 0N
>>>
>>> Where HEAD1 is the head returned from first invocation of
>>> btrfs_obtain_ref_head. Once  btrfs_obtain_ref_head is called the 2nd
>>> time it will not find HEAD2 so will just reset run_delayed_start to 0
>>> and return. So it will be up to another run of the delayed refs to
>>> actually find head2. Essentially you made btrfs_obtain_ref_head less
>> 
>> Not exactly. In fact, find_ref_head hides such a logic. When
>> return_bigger is set, if there is no larger entry to return, the first
>> entry will be returned. Please see the comment I add in the PATCH 6.
>> 
>> Hence, the 2nd invocation of btrfs_obtain_ref_head still will return
>> HEAD2. There is no functional change here.
>> 
>> However, your question makes me consider whether such hidden logic
>> should be extracted from find_ref_head to btrfs_select_ref_head.
>
>Right I agree with your. As it stands I will expect that if
>return_bigger is true to specifically return a bigger entry or if
>nothing is found to return null. IMO this behavior is higher level and

This is also exactly what I want. The patch is on the way.

>belongs to btrfs_delayed_ref_head.
>
>> 
>>> greedy. Have you characterized what kind of performance impact this have?
>> 
>> I noticed that there is a macro called SCRAMBLE_DELAYED_REFS in the
>> extent-tree.c. I am a bit curious whether it has been forgotten by
>> everyone, I have not found any test results about its performance impact.
>
>I guess it was used during testing but nothing currently sets it. I.e it
>might make sense to enable it if BTRFS_DEBUG is set.
>

Make sense.

-- 
Thanks,
Lu




Re: reproducible builds with btrfs seed feature

2018-10-14 Thread Chris Murphy
On Sun, Oct 14, 2018 at 1:09 PM, Cerem Cem ASLAN  wrote:
> Thanks for the explanation, I got it now. I still think this is
> related with my needs, so I'll keep an eye on this.
>
> What is the possible use case? I can think of only one scenario: You
> have a rootfs that contains a distro installer and you want to
> generate distro.img files which uses Btrfs under the hood in different
> locations and still have the same hash, so you can publish your
> verified image hash by a single source (https://your-distro.org).

The first step is accepting reproducible builds as a worthy goal in
and of itself independent of Btrfs. Specifically "Why does it matter?"
found here https://reproducible-builds.org/

Btrfs does bring valuable features for installation images: always on
checksumming; seed feature permits a straightforward way to setup a
volatile overlay on zram device; ability to convert it to a
non-volatile overlay, and boot either the seed or overlay; and even
installation by adding the install target and removing both overlay
and seed. And yet it remains compatible with a conventional copy to
another file system if it's not desirable to use Btrfs as root. Win
win.

By subsetting Btrfs features we don't care about in the installation
seed context, can we achieve reproducibility as a consequence, while
retaining some of the more interesting features? Of course once
sprouted, those limitations wouldn't apply.

Basically it's a "btrfs seed device 2.0" idea. But Btrfs is so
complicated it's maybe too much work, hence the question.



-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-14 Thread Cerem Cem ASLAN
Thanks for the explanation, I got it now. I still think this is
related with my needs, so I'll keep an eye on this.

What is the possible use case? I can think of only one scenario: You
have a rootfs that contains a distro installer and you want to
generate distro.img files which uses Btrfs under the hood in different
locations and still have the same hash, so you can publish your
verified image hash by a single source (https://your-distro.org).
You'll sync next release files with the remote servers by using diffs
(btrfs send/receive) and they will generate distro.img independently,
still having the same hash that you'll later verify by
https://your-distro.org.
Chris Murphy , 14 Eki 2018 Paz, 21:10
tarihinde şunu yazdı:
>
> On Sun, Oct 14, 2018 at 6:20 AM, Cerem Cem ASLAN  
> wrote:
> > I'm not sure I could fully understand the desired achievement but it
> > sounds like (or this would be an example of selective perception) it's
> > somehow related with "creating reproducible snapshots"
> > (https://unix.stackexchange.com/q/462451/65781), no?
>
> No the idea is to be able to consistently reproduce a distro installer
> image (like an ISO file) with the same hash. Inside the ISO image, is
> typically a root.img or squash.img which itself contains a file system
> like ext4 or squashfs, to act as the system root. And that root.img is
> the main thing I'm talking about here. There is work to make squashfs
> deterministic, as well as ext4. And I'm wondering if there are sane
> ways to constrain Btrfs features to make it likewise deterministic.
>
> For example:
>
> fallocate -l 5G btrfsroot.img
> losetup /dev/loop0 btrfsroot.img
> mkfs.btrfs -m single -d single -rseed --rootdir /tmp/ -T
> "20181010T1200" --uuidv $X --uuidc $Y --uuidd $Z ...
> shasum btrfsroot.img
>
> And then do it again, and the shasum's should be the same. I realize
> today it's not that way. And that inode assignment, extent allocation
> (number, size, locality) are all factors in making Btrfs quickly
> non-determinstic, and also why I'm assuming this needs to be done in
> user space. That would be the point of the -rseed flag: set the seed
> flag, possibly set a compat_ro flag, fix generation/transid to 1,
> require the use of -T (similar to make_ext4) to set all timestamps to
> this value, and configurable uuid's for everything that uses uuids,
> and whatever other constraints are necessary.
>
>
> --
> Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-14 Thread Chris Murphy
On Sun, Oct 14, 2018 at 6:20 AM, Cerem Cem ASLAN  wrote:
> I'm not sure I could fully understand the desired achievement but it
> sounds like (or this would be an example of selective perception) it's
> somehow related with "creating reproducible snapshots"
> (https://unix.stackexchange.com/q/462451/65781), no?

No the idea is to be able to consistently reproduce a distro installer
image (like an ISO file) with the same hash. Inside the ISO image, is
typically a root.img or squash.img which itself contains a file system
like ext4 or squashfs, to act as the system root. And that root.img is
the main thing I'm talking about here. There is work to make squashfs
deterministic, as well as ext4. And I'm wondering if there are sane
ways to constrain Btrfs features to make it likewise deterministic.

For example:

fallocate -l 5G btrfsroot.img
losetup /dev/loop0 btrfsroot.img
mkfs.btrfs -m single -d single -rseed --rootdir /tmp/ -T
"20181010T1200" --uuidv $X --uuidc $Y --uuidd $Z ...
shasum btrfsroot.img

And then do it again, and the shasum's should be the same. I realize
today it's not that way. And that inode assignment, extent allocation
(number, size, locality) are all factors in making Btrfs quickly
non-determinstic, and also why I'm assuming this needs to be done in
user space. That would be the point of the -rseed flag: set the seed
flag, possibly set a compat_ro flag, fix generation/transid to 1,
require the use of -T (similar to make_ext4) to set all timestamps to
this value, and configurable uuid's for everything that uses uuids,
and whatever other constraints are necessary.


-- 
Chris Murphy


Re: reproducible builds with btrfs seed feature

2018-10-14 Thread Cerem Cem ASLAN
I'm not sure I could fully understand the desired achievement but it
sounds like (or this would be an example of selective perception) it's
somehow related with "creating reproducible snapshots"
(https://unix.stackexchange.com/q/462451/65781), no?
Chris Murphy , 14 Eki 2018 Paz, 02:05
tarihinde şunu yazdı:
>
> On Sat, Oct 13, 2018 at 4:28 PM, Chris Murphy  wrote:
> > Is it practical and desirable to make Btrfs based OS installation
> > images reproducible? Or is Btrfs simply too complex and
> > non-deterministic? [1]
> >
> > The main three problems with Btrfs right now for reproducibility are:
> > a. many objects have uuids other than the volume uuid; and mkfs only
> > lets us set the volume uuid
> > b. atime, ctime, mtime, otime; and no way to make them all the same
> > c. non-deterministic allocation of file extents, compression, inode
> > assignment, logical and physical address allocation
>
> d. generation, just pick a consistent default because the entire image
> is made with mkfs and then never rw mounted so it's not a problem
>
> > - Possibly disallow subvolumes and snapshots
>
> There's no actual mechanism to do either of these with mkfs, so it's
> not a problem. And if a sprout is created, it's fine for newly created
> subvolumes to follow the usual behavior of having unique UUID and
> incrementing generation. Thing is, the sprout will inherit the seeds
> preset chunk uuid, which while it shouldn't cause a problem is a kind
> of violation of uuid uniqueness; but ultimately I'm not sure how big
> of a problem it is for such uuids to spread.
>
>
>
> --
> Chris Murphy


Re: BTRFS bad block management. Does it exist?

2018-10-14 Thread Qu Wenruo


On 2018/10/14 下午7:08, waxhead wrote:
> In case BTRFS fails to WRITE to a disk. What happens?

Normally it should return error when we flush disk.
And in that case, error will leads to transaction abort and the fs goes
RO to prevent further corruption.

> Does the bad area get mapped out somehow?

No.

> Does it try again until it
> succeed or until it "times out" or reach a threshold counter?

Unless it's done by block layer, btrfs doesn't try that.

> Does it eventually try to write to a different disk (in case of using
> the raid1/10 profile?)

No. That's not what RAID is designed to do.

It's only allowed to have any flush error if using "degraded" mount
option and the error is under tolerance.

Thanks,
Qu



signature.asc
Description: OpenPGP digital signature


BTRFS bad block management. Does it exist?

2018-10-14 Thread waxhead

In case BTRFS fails to WRITE to a disk. What happens?
Does the bad area get mapped out somehow? Does it try again until it 
succeed or until it "times out" or reach a threshold counter?
Does it eventually try to write to a different disk (in case of using 
the raid1/10 profile?)