date:20160226

On Fri, Feb 26, 2016 at 10:36:50PM +0100, Stanislav Brabec wrote:

> It should definitely report error whenever trying -oloop on top of
> anything else than a file. Or at least a warning.
> 
> Well, even losetup should report a warning.

Keep in mind that with crypto in the game it just might be useful to have
loop-over-loop - it might be _not_ a no-op (hell, you might have two
layers of encryption - not the smartest thing to do, but if that's what
got dumped on your lap, you deal with what you've got).  So such warnings
shouldn't be hard errors.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes


On Feb 26, 2016 at 22:03 Al Viro wrote:
> And I'm not sure how
> to deal with -o loop in a sane way, TBH - automagical losetup is bloody
> hard to get right.

See another reply in this thread for the idea:
Fri, 26 Feb 2016 22:00:44 +0100

> Keep in mind that loop-over-loop is also possible...

Indeed! Let's remember that mount(8) should never do it.

# losetup /dev/loop0 /dev/sda2
# losetup /dev/loop1 /dev/loop0
# losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
/dev/loop0         0      0         0  0 /dev/sda2
/dev/loop1         0      0         0  0 /dev/loop0


But it actually does, if the command line is "overlooped":

oct:~ # mount -oloop /dev/loop1 /mnt

as it does exactly that:
oct:~ # losetup -l
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
/dev/loop0         0      0         0  0 /dev/sda2
/dev/loop1         0      0         0  0 /dev/loop0
/dev/loop2         0      0         1  0 /dev/loop1

It should definitely report an error whenever -oloop is tried on top of
anything other than a file. Or at least a warning.

Well, even losetup should report a warning.
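
A minimal sketch of such a check, written in Python for illustration only.
It is not what mount(8) or losetup do today. It parses a "losetup -l"
listing like the one above and reports every loop device whose BACK-FILE
is itself a /dev/loop* node. The function name is hypothetical, and the
parser assumes that paths contain no whitespace.

```python
import re

# Listing copied from the transcript above.
SAMPLE_LISTING = """\
NAME       SIZELIMIT OFFSET AUTOCLEAR RO BACK-FILE
/dev/loop0         0      0         0  0 /dev/sda2
/dev/loop1         0      0         0  0 /dev/loop0
/dev/loop2         0      0         1  0 /dev/loop1
"""

LOOP_NODE = re.compile(r"^/dev/loop\d+$")


def stacked_loops(listing):
    """Return (device, backing) pairs where the backing file is a loop device."""
    rows = [line.split() for line in listing.splitlines() if line.strip()]
    header, body = rows[0], rows[1:]
    name_col = header.index("NAME")
    back_col = header.index("BACK-FILE")
    pairs = []
    for row in body:
        if len(row) <= max(name_col, back_col):
            continue  # row without a backing file: nothing to check
        if LOOP_NODE.match(row[back_col]):
            pairs.append((row[name_col], row[back_col]))
    return pairs


if __name__ == "__main__":
    for device, backing in stacked_loops(SAMPLE_LISTING):
        # A warning only: stacking can be intentional (e.g. layered crypto).
        print(f"warning: {device} is stacked on top of {backing}")
```

The sketch only warns, so a deliberately stacked setup would still work.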

--
Best Regards / S pozdravem,

Stanislav Brabec
software developer
-
SUSE LINUX, s. r. o.                         e-mail: sbra...@suse.com
Lihovarská 1060/12                            tel: +49 911 7405384547
190 00 Praha 9                                 fax:  +420 284 084 001
Czech Republic                                    http://www.suse.cz/
PGP: 830B 40D5 9E05 35D8 5E27 6FA3 717C 209F A04F CD76

Re: upgrading kernel 3.13 to 3.16

2016-02-26 Thread Duncan

Vytautas D posted on Fri, 26 Feb 2016 10:50:12 + as excerpted:

> Hi all,
> 
> Are there any known issues upgrading btrfs running ubuntu kernel 3.13 to
> 3.16 ? System was once converted from ext4 using btrfs-convert (
> btrfs-progs 3.17 ).
> 
> The commit that worries me is following:
> *  Btrfs: incompatible format change to remove hole extents (+373/-56)
> (
> http://linux-btrfs.vger.kernel.narkive.com/syNRZbHS/patch-btrfs-incompatible-format-change-to-remove-hole-extents-v3#post1
> )
> 
> would this block me from reverting system with a snapshot back to kernel
> 3.13 ?
> After upgrade would the system continue writing more metadata ?

As Austin H says about that commit, and in the broader sense as well,
btrfs policy since inclusion in the mainline kernel has been that on
existing btrfs, new features that affect the on-device format must be
specifically enabled.  There was one exception, in the first kernel
series after inclusion, and Linus made it *very* clear he didn't
appreciate it: he was actually running btrfs on something, and the
change made switching kernels back and forth across that exception,
for testing, nearly impossible.

IOW, no worries about incompatible upgrades on existing filesystems -- if 
such bugs happen at all they're treated exactly as that, regression bugs, 
and are fixed at high priority, with enough people running btrfs now that 
such bugs are likely to be found rather fast and a *BIG* stink raised 
about them not being found and fixed before they even reached mainline.

/New/ btrfs, or fresh conversions from ext* using the converter when the 
creation/conversion is done from a newer kernel with intent to use an 
older kernel, are different.  In that case, the creation/conversion 
options will often enable new features and old kernels won't be able to 
mount the filesystem.  However, there are options available to create 
older on-device formats as well, when the filesystems are intended to be 
mounted on older kernels.


All that said, you are **WAY** behind list-recommended kernels, even with 
kernel 3.16.  Btrfs is considered "stabilizing, but not yet entirely 
stable and mature."  As such, the strong on-list recommendation is to 
choose either the current or mainline LTS kernel series, and to run no 
further back than next to last kernel series in either.  With the current 
4.4 kernel also being LTS, that would be 4.4 or 4.3 if you choose the 
current kernels, and the LTS 4.4 or 4.1 series if you're doing LTS.  With 
4.4 reasonably new, it's understood if you're still on the previous LTS 
before that, 3.18, but if you're on 3.18, you'd be strongly encouraged to 
upgrade to at least 4.1 and preferably 4.4 ASAP.

Older kernels, at least back to 3.12 where the "experimental" label was 
officially peeled off and btrfs (semi-)officially reached its current
status of "stabilizing, but not entirely stable or mature", are "best 
effort" support.  We do still try to help as best we can, but the first 
recommendation you'll get upon posting to the list is "please upgrade to 
a kernel more in line with btrfs' 'stabilizing but not fully stable' 
status."
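
The recommendation above can be written down as a small check. This is
only a sketch in Python under stated assumptions: the series lists are a
snapshot transcribed from this message (February 2016), and the function
names and the returned labels are hypothetical.

```python
import re

# Snapshot of the recommendation as stated in this message.
CURRENT_SERIES = [(4, 4), (4, 3)]
LTS_SERIES = [(4, 4), (4, 1)]
BEST_EFFORT_FLOOR = (3, 12)  # where the "experimental" label came off


def kernel_series(release):
    """Extract (major, minor) from a release string such as `uname -r` prints."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def list_support(release):
    """Classify a kernel release against the snapshot above."""
    series = kernel_series(release)
    if series in CURRENT_SERIES or series in LTS_SERIES:
        return "recommended"
    if series > max(CURRENT_SERIES):
        return "newer than this snapshot"
    if series >= BEST_EFFORT_FLOOR:
        return "best effort"
    return "pre-3.12"


if __name__ == "__main__":
    for release in ("3.13.0-24-generic", "3.16", "4.4.1-1-default"):
        print(release, "->", list_support(release))
```

Both kernels from the question, 3.13 and 3.16, land in the "best effort"
bucket under this reading.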

Yes, there are reasons people may wish to run really old kernels.
However, such reasons really aren't compatible with running a still
stabilizing filesystem like btrfs in the first place.  So many bugs
have been fixed, and development focus has moved on so far since then,
that supporting btrfs on such old kernels really isn't practical; for
us it really is ancient, and buggy, history.  So if you /do/ have a
reason to run such old kernels (generally, a wish for stability and
lack of change), you really should consider running something other
than btrfs, because the fact of the matter is that it's still changing
fast and simply doesn't yet reach the level of stability that running
such old kernels indicates you want/need.  So choose one or the other:
btrfs on reasonably current kernels if you want it, or stability on
older kernels, without btrfs, if you want/need that.

All /that/ said, yes, some distros do claim support on older kernels, and 
indeed, they may well be backporting bug patches as appropriate to 
properly support that claim.  But that's their claim and their support.  
On the list we're focused on newer kernels and features, and while we try 
not to break older and doing so is a bug we'll patch if we find it, as a 
rule we don't track those distros and what patches they may or may not 
have backported, and thus have no way of properly supporting them.  So if 
you're relying on distro support for btrfs on such old kernels, you 
really should be looking to them for that support, not to this list, as 
we'll still do our best effort, but the fact is, it's not going to be to 
the level of support we'd be able to give if you were running kernels 
within our recommended kernel support time frame, the last two of either 
current or LTS kernel series, and often, the best we'll be able to do 
with

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

On Fri, Feb 26, 2016 at 09:37:22PM +0100, Stanislav Brabec wrote:

> Do I understand, that you are saying:
> 
> Yes, mounting multiple loop devices associated with one file is a
> legitimate use, but mount(8) should never do it, because it has other
> ugly side effects?

It's on the same level as "hey, let's have an nbd daemon run in qemu
guest, exporting a host file over nbd, import it to host as /dev/nbd69,
set a loopback device over the underlying file as /dev/loop42 and
ask e.g. xfs to recognize that it's dealing with the same underlying array
of bytes in both cases - wouldn't it be neat if it could do that?"

There's no magic.  Really.  Unexpected sharing of backing store between
apparently unrelated devices can cause trouble.  And I'm not sure how
to deal with -o loop in a sane way, TBH - automagical losetup is bloody
hard to get right.  Keep in mind that loop-over-loop is also possible...

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes


On Feb 26, 2016 at 21:30 Al Viro wrote:
> IMO on-demand losetup a-la -o loop is simply a bad idea...

So the correct behavior of -o loop should be:

- Check whether another mount command already did losetup.
- If not, allocate a new loop device.
- If yes, reuse the existing loop device.

Well, it seems to be safe even if the loop device was not allocated by
mount(8) itself, as ioctl(fd, LOOP_CLR_FD) never returns EBUSY:

# losetup /dev/loop2 /ext4.img
# mount /dev/loop2 /mnt
# strace losetup -d /dev/loop2 2>&1 | tail -n7 | head -n3
open("/dev/loop2", O_RDONLY|O_CLOEXEC)  = 3
ioctl(3, LOOP_CLR_FD)                   = 0
close(3)                                = 0

If recycling "alien" loop devices is not considered a good idea, then
(if possible):

- If the loop device was allocated by mount(8) itself, recycle it.
- If the loop device was not allocated by mount(8) itself, return an
  error.
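
The two policies above can be sketched as one lookup function. This is
Python for illustration, not mount(8) code, and every name in it is
hypothetical. It makes two loud assumptions: the AUTOCLEAR flag is used as
a stand-in for "allocated by mount(8)" (the device created by -oloop in the
losetup listing earlier in this thread has AUTOCLEAR set), and backing
files are compared by path string, where a real implementation would more
likely compare device and inode numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopDev:
    name: str
    backing_file: str
    offset: int = 0
    sizelimit: int = 0
    autoclear: bool = False  # assumption: set on devices made by mount -o loop


class AlienLoopDevice(Exception):
    """A matching loop device exists but was not set up by mount(8)."""


def pick_loop_device(backing_file, existing, offset=0, sizelimit=0,
                     reuse_alien=True):
    """Return the name of a loop device to reuse, or None to allocate one."""
    wanted = (backing_file, offset, sizelimit)
    for dev in existing:
        if (dev.backing_file, dev.offset, dev.sizelimit) != wanted:
            continue  # different file or different byte range
        if dev.autoclear or reuse_alien:
            return dev.name
        raise AlienLoopDevice(dev.name)
    return None


if __name__ == "__main__":
    devices = [
        LoopDev("/dev/loop0", "/btrfs.img", autoclear=True),
        LoopDev("/dev/loop2", "/ext4.img"),
    ]
    print(pick_loop_device("/btrfs.img", devices))  # reuse /dev/loop0
    print(pick_loop_device("/other.img", devices))  # None: allocate a new one
```

With reuse_alien=False the function implements the stricter variant and
raises for /ext4.img, which was set up by hand.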

--
Stanislav Brabec

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

On Feb 26, 2016 at 21:05 Austin S. Hemmelgarn wrote:

> It's kind of interesting, but I can't reproduce _any_ of this behavior
> with either ext4 or BTRFS when I manually set up the loop devices and
> point mount(8) at those instead of using -o loop on a file. That really
> seems to indicate that this is caused by something mount(8) is doing
> when it's calling losetup.

The behavior of "-oloop" is more similar to "losetup -f /fs.img" than to
"losetup /dev/loop0 /fs.img".

Anyway, I can reproduce without -oloop:
# losetup /dev/loop0 /btrfs.img
# mount /dev/loop0 /mnt/1
# grep /mnt /proc/self/mountinfo
107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop0 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
# losetup /dev/loop1 /btrfs.img
# mount -osubvol=/ /dev/loop1 /mnt/2
# grep /mnt /proc/self/mountinfo
107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
108 59 0:59 / /mnt/2 rw,relatime shared:48 - btrfs /dev/loop1 rw,space_cache,subvolid=5,subvol=/
# uname -a
Linux oct 4.4.1-1-default #1 SMP PREEMPT Mon Feb 15 11:03:27 UTC 2016 (6398c2d) x86_64 x86_64 x86_64 GNU/Linux

(Note that the system was freshly rebooted. After other experiments,
the second line of mountinfo can be missing completely.)
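
To make the rewritten source easier to see, here is a small mountinfo
parser, sketched in Python. The field layout follows the proc(5)
description of /proc/[pid]/mountinfo; the function and type names are
hypothetical. The sample is the final mountinfo output from the
reproducer above, with each entry joined into a single line.

```python
from collections import namedtuple

MountEntry = namedtuple(
    "MountEntry",
    "mount_id parent_id dev root mount_point options fstype source super_options",
)

SAMPLE = """\
107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
108 59 0:59 / /mnt/2 rw,relatime shared:48 - btrfs /dev/loop1 rw,space_cache,subvolid=5,subvol=/
"""


def parse_mountinfo(text):
    """Parse mountinfo text into a list of MountEntry tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10:
            continue  # blank or truncated line
        # Optional fields (shared:N, master:N, ...) end at a lone "-".
        sep = fields.index("-", 6)
        entries.append(MountEntry(
            mount_id=int(fields[0]),
            parent_id=int(fields[1]),
            dev=fields[2],
            root=fields[3],
            mount_point=fields[4],
            options=fields[5],
            fstype=fields[sep + 1],
            source=fields[sep + 2],
            super_options=fields[sep + 3] if len(fields) > sep + 3 else "",
        ))
    return entries


def sources_per_device(entries):
    """Map each major:minor to the set of mount sources reported for it."""
    result = {}
    for entry in entries:
        result.setdefault(entry.dev, set()).add(entry.source)
    return result


if __name__ == "__main__":
    for dev, sources in sources_per_device(parse_mountinfo(SAMPLE)).items():
        print(dev, sorted(sources))
```

Both entries carry the same major:minor (0:59) and both now name
/dev/loop1, although /mnt/1 was mounted from /dev/loop0.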

>> 2) mount(2) called after the reproducer returns OK but does nothing.
>>
> OK, we've determined that mount(2) is misbehaving.  That doesn't change
> the fact that mount(8) is triggering this, and therefore should itself
> be corrected.

> Assume that mount(2) gets fixed so it doesn't lose it's
> mind and /proc/self/mountinfo doesn't change.  There will still be
> issues resulting from mount(8)'s behavior:
> 1. BTRFS will lose it's mind and corrupt data when using a multi-device
> filesystem (due to the problems with duplicate FS UUID's).
> 2. XFS might have similar issues to 1 when using metadata checksumming,
> although it's more likely that it won't allow the second mount to succeed.
> 3. Most other filesystems will likely end up corrupting data.

Do I understand, that you are saying:

Yes, mounting multiple loop devices associated with one file is a
legitimate use, but mount(8) should never do it, because it has other
ugly side effects?

OK, it looks like the next thing to fix in mount(8).

--
Stanislav Brabec

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes


On 2016-02-26 15:30, Al Viro wrote:
> On Fri, Feb 26, 2016 at 03:05:27PM -0500, Austin S. Hemmelgarn wrote:
> > >Where is /mnt/2?
> > It's kind of interesting, but I can't reproduce _any_ of this
> > behavior with either ext4 or BTRFS when I manually set up the loop
> > devices and point mount(8) at those instead of using -o loop on a
> > file. That really seems to indicate that this is caused by something
> > mount(8) is doing when it's calling losetup. I'm running a mostly
> > unmodified version of 4.4.2 (the only modification that would come
> > even remotely close to this is that I changed the default mount
> > options for everything from relatime to noatime), and util-linux
> > 2.27.1 from Gentoo.
>
> Sigh...  sys_mount() (mount_bdev(), actually) has no way to tell if two
> loop devices refer to the same underlying object.  As far as it's
> concerned, you are asking to mount a completely unrelated block device.
> Which just happens to see the data (living in separate pagecache, even)
> modified behind its back (with some delay) after it gets written to another
> device.  Filesystem drivers generally don't like when something is screwing
> the underlying data, to put it mildly...
>
> When you ask to mount the _same_ device, mount_bdev(), as well as btrfs
> counterpart, makes sure that you get a reference to the same struct
> super_block, which avoids all coherency problems - all mounted instances
> refer to the same in-core objects (dentries, inodes, page cache, etc.).
> They get separate struct vfsmount instances, but that only matters for
> mountpoint crossing.
>
> As soon as you've set the second /dev/loop alias for the same underlying
> file, you are asking for all kinds of trouble.  If you use the same one
> consistently, you are OK.  BTW, even
> losetup /dev/loop0 /dev/sda1
> mount -t ext2 /dev/sda1 /mnt/1
> mount -t ext2 /dev/loop0 /mnt/2
> is enough for trouble - you get (as far as ext2 knows) unrelated devices
> screwing each other, with no good way to predict that.  And you need to
> check propagation through more than one layer - loop over loop over block
> is also possible.
>
> IMO on-demand losetup a-la -o loop is simply a bad idea...

I agree wholeheartedly and wasn't disputing any of this; I meant that
I'm not seeing any of the odd mount(2) or /proc/self/mountinfo behavior
that Stanislav started the thread about.  It was, however, entirely
trivial to get the filesystem images I used into a state where they
couldn't be mounted again afterwards.


Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

On Fri, Feb 26, 2016 at 03:05:27PM -0500, Austin S. Hemmelgarn wrote:
> >Where is /mnt/2?
> It's kind of interesting, but I can't reproduce _any_ of this
> behavior with either ext4 or BTRFS when I manually set up the loop
> devices and point mount(8) at those instead of using -o loop on a
> file. That really seems to indicate that this is caused by something
> mount(8) is doing when it's calling losetup. I'm running a mostly
> unmodified version of 4.4.2 (the only modification that would come
> even remotely close to this is that I changed the default mount
> options for everything from relatime to noatime), and util-linux
> 2.27.1 from Gentoo.

Sigh...  sys_mount() (mount_bdev(), actually) has no way to tell if two
loop devices refer to the same underlying object.  As far as it's
concerned, you are asking to mount a completely unrelated block device.
Which just happens to see the data (living in separate pagecache, even)
modified behind its back (with some delay) after it gets written to another
device.  Filesystem drivers generally don't like when something is screwing
the underlying data, to put it mildly...

When you ask to mount the _same_ device, mount_bdev(), as well as btrfs
counterpart, makes sure that you get a reference to the same struct
super_block, which avoids all coherency problems - all mounted instances
refer to the same in-core objects (dentries, inodes, page cache, etc.).
They get separate struct vfsmount instances, but that only matters for
mountpoint crossing.
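
The difference between the two paragraphs above can be shown with a toy
model. This is plain Python with hypothetical names, not kernel code: it
only mimics the idea that the superblock is looked up by block device, so
the same device gives the same superblock while a second loop alias for
the same file gives a separate one.

```python
class ToyKernel:
    """Toy model: one superblock per block device, as mount_bdev() sees it."""

    def __init__(self):
        self.loop_backing = {}  # loop device -> backing file
        self.superblocks = {}   # block device -> superblock id
        self.mounts = []        # (mount point, superblock id)

    def losetup(self, loopdev, backing):
        self.loop_backing[loopdev] = backing

    def mount(self, bdev, mount_point):
        # The lookup key is the device itself; the backing file of a loop
        # device is invisible at this level.
        sb = self.superblocks.setdefault(bdev, len(self.superblocks) + 1)
        self.mounts.append((mount_point, sb))
        return sb


if __name__ == "__main__":
    k = ToyKernel()
    k.losetup("/dev/loop0", "/fs.img")
    k.losetup("/dev/loop1", "/fs.img")
    first = k.mount("/dev/loop0", "/mnt/1")
    again = k.mount("/dev/loop0", "/mnt/2")  # same device: shared superblock
    alias = k.mount("/dev/loop1", "/mnt/3")  # alias: separate superblock
    print(first == again, first == alias)
```

In the model, /mnt/1 and /mnt/2 share all in-core state, while /mnt/3
works on the same bytes through an unrelated instance.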

As soon as you've set the second /dev/loop alias for the same underlying
file, you are asking for all kinds of trouble.  If you use the same one
consistently, you are OK.  BTW, even
losetup /dev/loop0 /dev/sda1
mount -t ext2 /dev/sda1 /mnt/1
mount -t ext2 /dev/loop0 /mnt/2
is enough for trouble - you get (as far as ext2 knows) unrelated devices
screwing each other, with no good way to predict that.  And you need to
check propagation through more than one layer - loop over loop over block
is also possible.

IMO on-demand losetup a-la -o loop is simply a bad idea...

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes


On 2016-02-26 14:12, Stanislav Brabec wrote:
> Al Viro wrote:
>> On Fri, Feb 26, 2016 at 11:39:11AM -0500, Austin S. Hemmelgarn wrote:
>>
>>> That's just it though, from what I can tell based on what I've seen
>>> and what you said above, mount(8) isn't doing things correctly in
>>> this case.  If we were to do this with something like XFS or ext4,
>>> the filesystem would probably end up completely messed up just
>>> because of the log replay code (assuming they actually mount the
>>> second time, I'm not sure what XFS would do in this case, but I
>>> believe that ext4 would allow the mount as long as the mmp feature
>>> is off).  It would make sense that this behavior wouldn't have been
>>> noticed before (and probably wouldn't have mattered even if it had
>>> been), because most filesystems don't allow multiple mounts even if
>>> they're all RO, and most people don't try to mount other filesystems
>>> multiple times as a result of this.
>
> Well, in such a case the kernel should return an error when mount(8) is
> trying to use multiple loop devices for a single file for mount(2).

As I said in my other e-mail, there are perfectly legitimate reasons to
be doing this.  And I should also point out that anybody who has one of
those reasons for doing this should be setting up the loop devices
themselves, so mount(8) behaving this way is still wrong.

> But the kernel does not return an error; it starts to do strange things.
>
>> They most certainly do.  The problem is mount(8) treatment of -o loop -
>> you can mount e.g. ext4 many times, it'll just get you extra references
>> to the same struct super_block from those new vfsmounts.  IOW, that'll
>> behave the same way as if you were doing mount --bind on subsequent ones.
>
> I just tested the same with ext4. The rewriting of mountinfo happens
> only with btrfs.
>
> But after that mount(2) stops working. See the last mount(2). It
> returns 0, but nothing is mounted! Kernel mount(2) refuses to work!
>
> # mount -oloop /ext4.img /mnt/1
> # cat /proc/self/mountinfo | grep /mnt
> 238 59 7:0 / /mnt/1 rw,relatime shared:153 - ext4 /dev/loop0 rw,data=ordered
> # mount -oloop /ext4.img /mnt/2
> # cat /proc/self/mountinfo | grep /mnt
> 238 59 7:0 / /mnt/1 rw,relatime shared:153 - ext4 /dev/loop0 rw,data=ordered
> 243 59 7:1 / /mnt/2 rw,relatime shared:156 - ext4 /dev/loop1 rw,data=ordered
> # umount /mnt/*
> # mount -oloop /btrfs.img /mnt/1
> # cat /proc/self/mountinfo | grep /mnt
> 238 59 0:94 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:153 - btrfs /dev/loop0 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
> # mount -oloop,subvol=/ /btrfs.img /mnt/2
> # cat /proc/self/mountinfo | grep /mnt
> 238 59 0:94 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:153 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
>
> It is really strange! Mount was called, but nothing appeared in the
> mountinfo. Just a rewritten /dev/loop0 -> /dev/loop1 in the existing
> mount.
>
> To be sure that it is a mount(2) issue and not a mount(8) one, let's
> try it again with strace.
>
> # strace mount -oloop,subvol=/ /btrfs.img /mnt/2 2>&1 | tail -n 7
> mount("/dev/loop1", "/mnt/2", "btrfs", MS_MGC_VAL, "subvol=/") = 0
> access("/mnt/2", W_OK)                  = 0
> close(4)                                = 0
> close(1)                                = 0
> close(2)                                = 0
> exit_group(0)                           = ?
> +++ exited with 0 +++
> # cat /proc/self/mountinfo | grep /mnt
> 238 59 0:94 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:153 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
>
> Where is /mnt/2?

It's kind of interesting, but I can't reproduce _any_ of this behavior
with either ext4 or BTRFS when I manually set up the loop devices and
point mount(8) at those instead of using -o loop on a file. That really
seems to indicate that this is caused by something mount(8) is doing
when it's calling losetup. I'm running a mostly unmodified version of
4.4.2 (the only modification that would come even remotely close to this
is that I changed the default mount options for everything from relatime
to noatime), and util-linux 2.27.1 from Gentoo.

>> And as far as kernel is concerned, /dev/loop* isn't special in any respects;
>> if you do explicit losetup and mount the resulting /dev/loop as many
>> times as you wish, it'll work just fine.
>
> mount(8) just calls losetup internally for every -o loop. Once per
> "loop" option. Probably nobody tried to loop mount the same ext4 volume
> more than once, so no problems appeared.
>
> But for btrfs, one would. And mounting two btrfs subvolumes with two
> "-oloop" calls losetup twice for the same file.
>
>> And from the kernel POV it's not
>> different from what it sees with -o loop; setting the loop device up is
>> done first by separate syscall, then mount(2) for that device is issued.
>
> Yes, it is different.
> - You have one file.
> - You have two loop devices pointing to the same file.
> - btrfs subvolumes are internally handled similarly to bind mounts.
>   This means that all

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes


On Feb 26, 2016 at 19:22 Austin S. Hemmelgarn wrote:
> The first commit is just test cases, and the others are specific issues
> that only affected BTRFS which have nothing to do with this thread at
> all other than involving mount(8) and BTRFS.

Yes, it is a bit off topic. It just demonstrates how complex mount(8)
for btrfs is.

The test case is all-fail in util-linux 2.27.1.

> mount(8) behavior has the potential to cause either data
> corruption or similar behavior in the future (I would expect that XFS
> with metadata checksumming enabled would cause a similar interaction,
> although they probably would handle it better).

Especially "mount -a" has a hard time recognizing what was already
mounted and what still needs to be mounted.

The only information mount(8) has is the one from mountinfo.
Interpreting the mountinfo contents to reconstruct the mount options
that were used is far from trivial.

Some cases are even impossible to distinguish:

Suppose you have:
mount -osubvol=/ /dev/sda2 /mnt/1
mount -osubvol=/sbv /dev/sda2 /mnt/2

Case 1:
mount -obind /mnt/1/sbv/bnd /mnt/3

Case 2:
mount -obind /mnt/2/bnd /mnt/3

Case 1 and case 2 have exactly the same mountinfo, but different 
reference counts for /mnt/1 and /mnt/2.
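
A short sketch of why the two cases look alike. It assumes, as the
mountinfo output elsewhere in this thread suggests, that the root field
of a btrfs mount is the path inside the filesystem, and that a bind mount
simply extends that path. The function name is hypothetical and the code
is Python for illustration only.

```python
import posixpath


def bind_root(mount_root, mount_point, bind_source):
    """Root field mountinfo would report for a bind mount of bind_source.

    mount_root:  root field of the mount that contains bind_source
    mount_point: where that containing mount is attached
    """
    relative = posixpath.relpath(bind_source, mount_point)
    return posixpath.normpath(posixpath.join(mount_root, relative))


if __name__ == "__main__":
    # /mnt/1 is mounted with subvol=/, /mnt/2 with subvol=/sbv.
    case1 = bind_root("/", "/mnt/1", "/mnt/1/sbv/bnd")
    case2 = bind_root("/sbv", "/mnt/2", "/mnt/2/bnd")
    print(case1, case2, case1 == case2)
```

Both calls give /sbv/bnd on the same device, so nothing in mountinfo
records which of /mnt/1 or /mnt/2 the bind mount was created from.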


--
Stanislav Brabec

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

Al Viro wrote:
> On Fri, Feb 26, 2016 at 11:39:11AM -0500, Austin S. Hemmelgarn wrote:
> 
>> That's just it though, from what I can tell based on what I've seen
>> and what you said above, mount(8) isn't doing things correctly in
>> this case.  If we were to do this with something like XFS or ext4,
>> the filesystem would probably end up completely messed up just
>> because of the log replay code (assuming they actually mount the
>> second time, I'm not sure what XFS would do in this case, but I
>> believe that ext4 would allow the mount as long as the mmp feature
>> is off).  It would make sense that this behavior wouldn't have been
>> noticed before (and probably wouldn't have mattered even if it had
>> been), because most filesystems don't allow multiple mounts even if
>> they're all RO, and most people don't try to mount other filesystems
>> multiple times as a result of this.

Well, in such a case the kernel should return an error when mount(8) is
trying to use multiple loop devices for a single file for mount(2).

But the kernel does not return an error; it starts to do strange things.

> They most certainly do.  The problem is mount(8) treatment of -o loop -
> you can mount e.g. ext4 many times, it'll just get you extra references
> to the same struct super_block from those new vfsmounts.  IOW, that'll
> behave the same way as if you were doing mount --bind on subsequent ones.

I just tested the same with ext4. The rewriting of mountinfo happens
only with btrfs.

But after that mount(2) stops working. See the last mount(2). It
returns 0, but nothing is mounted! Kernel mount(2) refuses to work!

# mount -oloop /ext4.img /mnt/1
# cat /proc/self/mountinfo | grep /mnt
238 59 7:0 / /mnt/1 rw,relatime shared:153 - ext4 /dev/loop0 rw,data=ordered
# mount -oloop /ext4.img /mnt/2
# cat /proc/self/mountinfo | grep /mnt
238 59 7:0 / /mnt/1 rw,relatime shared:153 - ext4 /dev/loop0 rw,data=ordered
243 59 7:1 / /mnt/2 rw,relatime shared:156 - ext4 /dev/loop1 rw,data=ordered
# umount /mnt/*
# mount -oloop /btrfs.img /mnt/1
# cat /proc/self/mountinfo | grep /mnt
238 59 0:94 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:153 - btrfs /dev/loop0 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
# mount -oloop,subvol=/ /btrfs.img /mnt/2
# cat /proc/self/mountinfo | grep /mnt
238 59 0:94 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:153 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2

It is really strange! Mount was called, but nothing appeared in the
mountinfo. Just a rewritten /dev/loop0 -> /dev/loop1 in the existing
mount.

To be sure that it is a mount(2) issue and not a mount(8) one, let's
try it again with strace.

# strace mount -oloop,subvol=/ /btrfs.img /mnt/2 2>&1 | tail -n 7
mount("/dev/loop1", "/mnt/2", "btrfs", MS_MGC_VAL, "subvol=/") = 0
access("/mnt/2", W_OK)                  = 0
close(4)                                = 0
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++
# cat /proc/self/mountinfo | grep /mnt
238 59 0:94 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:153 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2

Where is /mnt/2?

> And as far as kernel is concerned, /dev/loop* isn't special in any respects;
> if you do explicit losetup and mount the resulting /dev/loop as many
> times as you wish, it'll work just fine.

mount(8) just calls losetup internally for every -o loop. Once per
"loop" option. Probably nobody tried to loop mount the same ext4 volume
more than once, so no problems appeared.

But for btrfs, one would. And mounting two btrfs subvolumes with two
"-oloop" calls losetup twice for the same file.

> And from the kernel POV it's not
> different from what it sees with -o loop; setting the loop device up is
> done first by separate syscall, then mount(2) for that device is issued.

Yes, it is different.
- You have one file.
- You have two loop devices pointing to the same file.
- btrfs subvolumes are internally handled similarly to bind mounts.
  This means that all subvolumes should have the same mount source. But
  these two mounts don't.

> It's mount(8) that screws up here.

Yes mount(8) screws mount(2). And it corrupts kernel:

1) /proc/self/mountinfo changes its contents.

2) mount(2) called after the reproducer returns OK but does nothing.

--
Stanislav Brabec

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

On 2016-02-26 12:07, Stanislav Brabec wrote:

> Austin S. Hemmelgarn wrote:
>> On 2016-02-26 10:50, Stanislav Brabec wrote:
>>
>> That's just it though, from what I can tell based on what I've seen and
>> what you said above, mount(8) isn't doing things correctly in this case.
>> If we were to do this with something like XFS or ext4, the filesystem
>> would probably end up completely messed up just because of the log
>> replay code (assuming they actually mount the second time, I'm not sure
>> what XFS would do in this case, but I believe that ext4 would allow the
>> mount as long as the mmp feature is off). It would make sense that this
>> behavior wouldn't have been noticed before (and probably wouldn't have
>> mattered even if it had been), because most filesystems don't allow
>> multiple mounts even if they're all RO, and most people don't try to
>> mount other filesystems multiple times as a result of this. If this
>> behavior of allocating a new loop device for each call on a given file
>> is in fact not BTRFS specific (as implied by your statement about a
>> possible workaround in mount(8)), then mount(8) really should be fixed
>> to not do that before we even consider looking at the issues in BTRFS,
>> as that is behavior that has serious potential to result in data
>> corruption for any filesystem, not just BTRFS.
>
> Well, kernel could "fix" it in a simple way:
>
> - don't allow two loop devices pointing to the same file
> or
> - don't allow two loop devices pointing to the same file being used by
>   mount(2).

This has legitimate usage in testing multipath configuration and
operation, and in testing that filesystems handle this correctly. On
top of that, it becomes decidedly non-trivial to handle when you
consider that loop devices can map a fixed range of a file independent
of the rest of the file (this used to be the way to pull partitions out
of raw disk images before the device mapper became as commonplace as it
is now).
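
The point about fixed ranges can be made concrete with a small sketch. It
is Python for illustration, with hypothetical names and made-up partition
sizes. It relies on the losetup convention that a sizelimit of 0 means
"up to the end of the file", and shows that two loop devices on the same
file only share bytes when their ranges overlap.

```python
MIB = 1024 * 1024


def byte_range(offset, sizelimit, file_size):
    """Half-open [start, end) range of the backing file a loop device maps."""
    if sizelimit == 0:
        end = file_size  # no limit: map everything after the offset
    else:
        end = min(offset + sizelimit, file_size)
    return offset, end


def ranges_overlap(a, b):
    """True if two half-open ranges share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


if __name__ == "__main__":
    image_size = 1024 * MIB                       # hypothetical raw disk image
    part1 = byte_range(1 * MIB, 499 * MIB, image_size)
    part2 = byte_range(500 * MIB, 0, image_size)
    whole = byte_range(0, 0, image_size)
    print(ranges_overlap(part1, part2))  # two partitions of one image
    print(ranges_overlap(whole, part1))  # whole image versus a partition
```

A rule such as "never two loop devices per file" would therefore also
reject the harmless case of two disjoint partitions in one image.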

Then util-linux would need a behavior change for sure.

I already found another inconsistency caused by this implementation:

/proc/self/mountinfo reports subvolid of the nearest upper sub-volume
root for the bind mount, not the sub-volume that was used for creating
this bind mount, and subvolid that potentially does not correspond to
any subvolume root.

This could causes problem for evaluation of order of umount(2) that
should prevent EBUSY.

I was talking about it with David Sterba, and he told, that in the
current implementation is not optimal. btrfs driver does not have
sufficient information to evaluate true root of the bind mount.

I've noticed this before myself, but I've never seen any issues
resulting from it; however, I've also not tried calling BTRFS related
ioctls on or from such a mount, so I may just have been lucky.

I can imagine two side effects deeply inside mount(8):

- "mount -a" uses subvol internally for a path lookup of the default
volume or volume corresponding to subvolid. (Only the GIT version,
not yet in 2.27.1.) I could imagine that the lookup is confused by a
bind mount reporting the searched subvolid and a "random" subvol.
But I don't have a reproducer yet, and I am not sure whether it is
really possible.

- "umount -a" could have a problem to find a proper order to umount(2)
without EBUSY. I did not check the algorithm, so I am not sure,
whether it is a real issue.

If BTRFS can't get the correct ref on the FS root internally, then there
are all kinds of things that could go wrong when you try to do any of
the typical maintenance stuff on it (like balancing, scrub, defrag,
snapshot/subvolume creation/deletion, etc). In essence, if you try to
do almost anything using the btrfs command line tools on that mount
point, it might fail in new and interesting ways.

P. S.: There were many problems with btrfs in mount(8):

https://git.kernel.org/cgit/utils/util-linux/util-linux.git/commit/?id=c4af75a84ef3430003c77be2469869aaf3a63e2a
https://git.kernel.org/cgit/utils/util-linux/util-linux.git/commit/?id=618a88140e26a134727a39c906c9cdf6d0c04513
https://git.kernel.org/cgit/utils/util-linux/util-linux.git/commit/?id=d2f8267847ecbe763a3b63af1289bf1179cd8c45
https://git.kernel.org/cgit/utils/util-linux/util-linux.git/commit/?id=2cd28fc82d0c947472a4700d5e764265916fba1e
https://git.kernel.org/cgit/utils/util-linux/util-linux.git/commit/?id=352740e88e2c9cb180fe845ce210b1c7b5ad88c7

The first commit is just test cases, and the others are specific issues
that only affected BTRFS which have nothing to do with this thread at
all other than involving mount(8) and BTRFS. The originally stated
issue that this thread is about is specific to loop mounting a BTRFS
filesystem stored in a file multiple times. The issue can be
empirically demonstrated to be a result of an interaction between BTRFS
behavior regarding duplicate filesystems and an implementation detail of
mount(8). The BTRFS behavior WRT duplicate FS UUID's is not going away
any time soon (believe me, it's

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

On Fri, Feb 26, 2016 at 11:39:11AM -0500, Austin S. Hemmelgarn wrote:

> That's just it though, from what I can tell based on what I've seen
> and what you said above, mount(8) isn't doing things correctly in
> this case.  If we were to do this with something like XFS or ext4,
> the filesystem would probably end up completely messed up just
> because of the log replay code (assuming they actually mount the
> second time, I'm not sure what XFS would do in this case, but I
> believe that ext4 would allow the mount as long as the mmp feature
> is off).  It would make sense that this behavior wouldn't have been
> noticed before (and probably wouldn't have mattered even if it had
> been), because most filesystems don't allow multiple mounts even if
> they're all RO, and most people don't try to mount other filesystems
> multiple times as a result of this.

They most certainly do.  The problem is mount(8) treatment of -o loop -
you can mount e.g. ext4 many times, it'll just get you extra references
to the same struct super_block from those new vfsmounts.  IOW, that'll
behave the same way as if you were doing mount --bind on subsequent ones.

And as far as kernel is concerned, /dev/loop* isn't special in any respects;
if you do explicit losetup and mount the resulting /dev/loop as many
times as you wish, it'll work just fine.  And from the kernel POV it's not
different from what it sees with -o loop; setting the loop device up is
done first by separate syscall, then mount(2) for that device is issued.

It's mount(8) that screws up here.

Btrfs progs release 4.4.1

2016-02-26 Thread David Sterba

Hi,

btrfs-progs 4.4.1 has been released, with minor bugfixes.

Changes:

* find-root: don't skip the first chunk
* free-space-tree compat bits fix
* build: target symlinks
* documentation updates
* test updates

Not many updates for the pending patchsets. I've tried to merge the mkfs feature
guessing patches but had to write them from scratch. The other pending
patchsets are in the integration branch. As my kernel work for the next
merge window is almost done, I will hopefully spend more time on progs
during the next weeks.

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (4):
  btrfs-progs: fix compat_ro mask for free space tree
  btrfs-progs: tests: store checksums in /tmp
  btrfs-progs: tests: use common variables and helpers
  Btrfs progs v4.4.1

Hongxu Jia (1):
  btrfs-progs: fix symlink creation multiple times

Lakshmipathi.G (1):
  btrfs-progs: tests: do checksum verification with convert-tests

Mark Fasheh (1):
  btrfs-progs: Import interval tree implemenation from Linux v4.0-rc7.

Mike Gilbert (1):
  btrfs-progs: Makefile.in: Simplify/correct install-static

Qu Wenruo (5):
  btrfs-progs: volume: Fix a bug causing btrfs-find-root to skip first chunk
  btrfs-progs: Allow open_ctree to return fs_info even chunk tree is corrupted
  btrfs-progs: Add support for tree block operations on fs_info without roots
  btrfs-progs: find-root: Allow btrfs-find-root to search chunk root even chunk root is corrupted
  btrfs-progs: misc-test: Add regression test for find-root gives empty result

Satoru Takeuchi (3):
  btrfs-progs: Fix self-reference of man btrfs-subvolume
  btrfs-progs: describe btrfs-send requires read-only subvolume
  btrfs-progs: write down the meaning of BTRFS_ARG_BLKDEV

Tsutomu Itoh (2):
  btrfs-progs: doc: fix typo of some documents
  btrfs-progs: doc: fix size suffix in mkfs.btrfs


Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

Austin S. Hemmelgarn wrote:
> On 2016-02-26 10:50, Stanislav Brabec wrote:

Well, kernel could "fix" it in a simple way:

- don't allow two loop devices pointing to the same file
or
- don't allow two loop devices pointing to the same file being used by
mount(2).

Then util-linux would need a behavior change for sure.

I already found another inconsistency caused by this implementation:

This could cause problems when evaluating the umount(2) order that
should prevent EBUSY.

I was talking about it with David Sterba, and he told me that the
current implementation is not optimal: the btrfs driver does not have
sufficient information to evaluate the true root of the bind mount.

I've noticed this before myself, but I've never seen any issues
resulting from it; however, I've also not tried calling BTRFS related
ioctls on or from such a mount, so I may just have been lucky.

I can imagine two side effects deeply inside mount(8):

- "umount -a" could have a problem to find a proper order to umount(2)
without EBUSY. I did not check the algorithm, so I am not sure,
whether it is a real issue.

P. S.: There were many problems with btrfs in mount(8):

--
Best Regards / S pozdravem,

Stanislav Brabec
software developer
-
SUSE LINUX, s. r. o.          e-mail: sbra...@suse.com
Lihovarská 1060/12            tel: +49 911 7405384547
190 00 Praha 9                fax: +420 284 084 001
Czech Republic                http://www.suse.cz/
PGP: 830B 40D5 9E05 35D8 5E27 6FA3 717C 209F A04F CD76

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

On 2016-02-26 10:50, Stanislav Brabec wrote:

Austin S. Hemmelgarn wrote:
 > Added linux-btrfs as this should be documented there as a known issue
 > until it gets fixed (although I have no idea which side is the issue).

This is very bad behavior, as it makes it impossible to safely use btrfs
loop bind mounts in fstab. (Well, it is possible to write a work-around
in util-linux: Remember the source file, and if -oloop is specified
next time, and source file is already assigned to a loop device, use
existing loop device.)

I'm not 100% certain, but I think this is an interaction between how
BTRFS handles multiple mounts of the same filesystem on a given system
and how mount handles loop mounts.  AFAIUI, all instances of a given
BTRFS filesystem being mounted on a given system are internally
identical to bind mounts of a hidden mount of that filesystem.  This is
what allows both manual mounting of sub-volumes, and multiple mounting
of the FS in general.

Yes, internal implementation is the same.

But here it causes real trouble: although both mounts point to the
same file, the first and second mounts use different loop devices. To
create a bind mount, something ugly needs to be done. And it is done in
an incorrect way.

That's just it though, from what I can tell based on what I've seen and
what you said above, mount(8) isn't doing things correctly in this case. 
 If we were to do this with something like XFS or ext4, the filesystem 
would probably end up completely messed up just because of the log 
replay code (assuming they actually mount the second time, I'm not sure 
what XFS would do in this case, but I believe that ext4 would allow the 
mount as long as the mmp feature is off).  It would make sense that this 
behavior wouldn't have been noticed before (and probably wouldn't have 
mattered even if it had been), because most filesystems don't allow 
multiple mounts even if they're all RO, and most people don't try to 
mount other filesystems multiple times as a result of this.  If this 
behavior of allocating a new loop device for each call on a given file 
is in fact not BTRFS specific (as implied by your statement about a 
possible workaround in mount(8)), then mount(8) really should be fixed 
to not do that before we even consider looking at the issues in BTRFS, 
as that is behavior that has serious potential to result in data 
corruption for any filesystem, not just BTRFS.

Now, if this does get fixed, mount(8) doesn't necessarily need to 
maintain its own copy of the state of /dev/loop mappings, it could
simply check the currently allocated loop devices.  You would of course 
need some form of locking relative to other mount -o loop instances and 
losetup, and it would be slow, but if you're using enough loop devices 
that this causes noticeable delays, then you really shouldn't be 
complaining all that much about performance.
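
The check described above (look at the loop devices that are already allocated before binding a new one) could be sketched roughly as follows. This is only an illustration, not what mount(8) does: find_loop_for_file is a hypothetical helper, the optional sysfs-root argument exists only so the function can be pointed at a fake tree, and the sketch ignores offset/sizelimit as well as the locking mentioned above. It relies on the loop driver exposing each bound device's backing file in /sys/block/loopN/loop/backing_file; "losetup -j FILE" reports the same association.

```shell
#!/bin/sh
# find_loop_for_file FILE [SYSFS_ROOT]
# Hypothetical helper: prints the first /dev/loopN whose backing file is
# exactly FILE and returns 0; returns 1 if no bound loop device matches.
# SYSFS_ROOT defaults to /sys (the override is an assumption for testing).
find_loop_for_file() {
    target=$1
    sysfs=${2:-/sys}
    for attr in "$sysfs"/block/loop*/loop/backing_file; do
        # Unbound loop devices have no backing_file attribute; an
        # unmatched glob also lands here and is skipped.
        [ -r "$attr" ] || continue
        IFS= read -r backing < "$attr" || [ -n "$backing" ] || continue
        if [ "$backing" = "$target" ]; then
            dev=${attr#"$sysfs"/block/}   # loopN/loop/backing_file
            echo "/dev/${dev%%/*}"        # keep only the loopN component
            return 0
        fi
    done
    return 1
}
```

A caller would try find_loop_for_file first and fall back to allocating a new device only when it returns non-zero. Without a lock shared with losetup and other mount -o loop instances, two callers can still race between the check and the allocation.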

I already found another inconsistency caused by this implementation:

/proc/self/mountinfo reports subvolid of the nearest upper sub-volume
root for the bind mount, not the sub-volume that was used for creating
this bind mount, and subvolid that potentially does not correspond to
any subvolume root.

This could cause problems when evaluating the umount(2) order that
should prevent EBUSY.

I was talking about it with David Sterba, and he told me that the
current implementation is not optimal: the btrfs driver does not have
sufficient information to evaluate the true root of the bind mount.

I've noticed this before myself, but I've never seen any issues
resulting from it; however, I've also not tried calling BTRFS related 
ioctls on or from such a mount, so I may just have been lucky.

Maybe the same is valid for the reported loop issue, and this is just
an ugly side effect.

I'd be more than willing to bet that that isn't the case, loop mounts
and bind mounts are entirely different inside the kernel, and I think 
the loop mount issue on the BTRFS side is a result of the issues it has 
when dealing with filesystems with the same UUID (if this is in fact the 
case, similar behavior should be seen when trying to either mount 
multiple lower level components of a multi-path device, or by manually 
creating multiple /dev/loop associations for the same file and mounting 
them all at once using the /dev/loop names instead of the file).

P. S.: There are some use differences between bind mounts and btrfs
sub-volumes:

- Bind mounts can be created for any file or directory.
- Sub-volume mounts can be created only for inodes marked as sub-volume
   root.

- Bind mounts can be mounted only if any of upper sub-volume root is
   mounted.
- Sub-volumes can be mounted even if volume root is not mounted.

FWIW, it's actually possible to simulate this behavior with bind mounts
by mounting the root at the eventual mount point, then bind mounting the 
desired directory from that root over top of it.  Of course, there is 
almost zero practical purpose to anyone doing this on most traditional

Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

Austin S. Hemmelgarn wrote:
> Added linux-btrfs as this should be documented there as a known issue
> until it gets fixed (although I have no idea which side is the issue).

This is very bad behavior, as it makes it impossible to safely use btrfs
loop bind mounts in fstab. (Well, it is possible to write a work-around
in util-linux: Remember the source file, and if -oloop is specified
next time, and source file is already assigned to a loop device, use
existing loop device.)

I'm not 100% certain, but I think this is an interaction between how
BTRFS handles multiple mounts of the same filesystem on a given system
and how mount handles loop mounts.  AFAIUI, all instances of a given
BTRFS filesystem being mounted on a given system are internally
identical to bind mounts of a hidden mount of that filesystem.  This is
what allows both manual mounting of sub-volumes, and multiple mounting
of the FS in general.

Yes, internal implementation is the same.

But here it causes real trouble: although both mounts point to the
same file, the first and second mounts use different loop devices. To
create a bind mount, something ugly needs to be done. And it is done in
an incorrect way.

I already found another inconsistency caused by this implementation:

/proc/self/mountinfo reports subvolid of the nearest upper sub-volume
root for the bind mount, not the sub-volume that was used for creating
this bind mount, and subvolid that potentially does not correspond to
any subvolume root.

This could cause problems when evaluating the umount(2) order that
should prevent EBUSY.

I was talking about it with David Sterba, and he told me that the
current implementation is not optimal: the btrfs driver does not have
sufficient information to evaluate the true root of the bind mount.

Maybe the same is valid for the reported loop issue, and this is just
an ugly side effect.

P. S.: There are some use differences between bind mounts and btrfs
sub-volumes:

- Bind mounts can be created for any file or directory.
- Sub-volume mounts can be created only for inodes marked as sub-volume
  root.

- Bind mounts can be mounted only if any of upper sub-volume root is
  mounted.
- Sub-volumes can be mounted even if volume root is not mounted.

--
Best Regards / S pozdravem,

Stanislav Brabec
software developer
-
SUSE LINUX, s. r. o.          e-mail: sbra...@suse.com
Lihovarská 1060/12            tel: +49 911 7405384547
190 00 Praha 9                fax: +420 284 084 001
Czech Republic                http://www.suse.cz/
PGP: 830B 40D5 9E05 35D8 5E27 6FA3 717C 209F A04F CD76

[PULL] Btrfs for 4.6

2016-02-26 Thread David Sterba

Hi,

this is my final pull request for 4.6 branches that I've been tracking. All of
them were in for-next for some time.

Summary:

* Chandan's preparatory work for subpage-blocksize
* Qu's updates to mount options (usebackuproot, nologreplay, norecovery)
* Zhao's readahead bugfixes
* Josef's updates to space handling
* from me, GFP flag updates, b-tree key space renamings
* collection of misc patches (sent out of series)
* misc cleanups

The patchset to allow device deletion by id is not part of this pull.


The following changes since commit 18558cae0272f8fd9647e69d3fec1565a7949865:

  Linux 4.5-rc4 (2016-02-14 13:05:20 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-chris

for you to fetch changes up to f5bc27c71a1b0741cb93dbec0f216b012b21d93f:

  Merge branch 'dev/control-ioctl' into for-chris-4.6 (2016-02-26 15:38:34 +0100)


Arnd Bergmann (1):
  btrfs: avoid uninitialized variable warning

Byongho Lee (2):
  btrfs: simplify expression in btrfs_calc_trans_metadata_size()
  btrfs: remove redundant error check

Chandan Rajendra (12):
  Btrfs: __btrfs_buffered_write: Reserve/release extents aligned to block size
  Btrfs: Compute and look up csums based on sectorsized blocks
  Btrfs: Direct I/O read: Work on sectorsized blocks
  Btrfs: fallocate: Work with sectorsized blocks
  Btrfs: btrfs_page_mkwrite: Reserve space in sectorsized units
  Btrfs: Search for all ordered extents that could span across a page
  Btrfs: Use (eb->start, seq) as search key for tree modification log
  Btrfs: btrfs_submit_direct_hook: Handle map_length < bio vector length
  Btrfs: Limit inline extents to root->sectorsize
  Btrfs: Fix block size returned to user space
  Btrfs: Clean pte corresponding to page straddling i_size
  Btrfs: btrfs_ioctl_clone: Truncate complete page after performing clone operation

Dave Jones (1):
  btrfs: remove open-coded swap() in backref.c:__merge_refs

David Sterba (30):
  btrfs: send: use GFP_KERNEL everywhere
  btrfs: reada: use GFP_KERNEL everywhere
  btrfs: scrub: use GFP_KERNEL on the submission path
  btrfs: let callers of btrfs_alloc_root pass gfp flags
  btrfs: fallocate: use GFP_KERNEL
  btrfs: readdir: use GFP_KERNEL
  btrfs: device add and remove: use GFP_KERNEL
  btrfs: extent same: use GFP_KERNEL for page array allocations
  btrfs: switch to kcalloc in btrfs_cmp_data_prepare
  btrfs: introduce key type for persistent temporary items
  btrfs: switch balance item to the temporary item key
  btrfs: introduce key type for persistent permanent items
  btrfs: switch dev stats item to the permanent item key
  btrfs: teach print_leaf about permanent item subtypes
  btrfs: teach print_leaf about temporary item subtypes
  btrfs: use proper type for failrec in extent_state
  btrfs: remove error message from search ioctl for nonexistent tree
  btrfs: change max_inline default to 2048
  btrfs: add GET_SUPPORTED_FEATURES to the control device ioctls
  btrfs: drop unused argument in btrfs_ioctl_get_supported_features
  Merge branch 'chandan/prep-subpage-blocksize' into for-chris-4.6
  Merge branch 'dev/gfp-flags' into for-chris-4.6
  Merge branch 'dev/rename-keys' into for-chris-4.6
  Merge branch 'foreign/qu/norecovery-v7' into for-chris-4.6
  Merge branch 'foreign/zhaolei/reada' into for-chris-4.6
  Merge branch 'foreign/josef/space-updates' into for-chris-4.6
  Merge branch 'foreign/liubo/replace-lockup' into for-chris-4.6
  Merge branch 'cleanups-4.6' into for-chris-4.6
  Merge branch 'misc-4.6' into for-chris-4.6
  Merge branch 'dev/control-ioctl' into for-chris-4.6

Deepa Dinamani (1):
  btrfs: Replace CURRENT_TIME by current_fs_time()

Josef Bacik (4):
  Btrfs: change how we update the global block rsv
  Btrfs: fix truncate_space_check
  Btrfs: add transaction space reservation tracepoints
  Btrfs: check reserved when deciding to background flush

Kinglong Mee (2):
  btrfs: drop null testing before destroy functions
  btrfs: fix memory leak of fs_info in block group cache

Liu Bo (1):
  Btrfs: fix lockdep deadlock warning due to dev_replace

Qu Wenruo (3):
  btrfs: Introduce new mount option usebackuproot to replace recovery
  btrfs: Introduce new mount option to disable tree log replay
  btrfs: Introduce new mount option alias for nologreplay

Sudip Mukherjee (1):
  btrfs: fix build warning

Zhao Lei (18):
  btrfs: reada: Fix in-segment calculation for reada
  btrfs: reada: reduce additional fs_info->reada_lock in reada_find_zone
  btrfs: reada: Add missed segment checking in reada_find_zone
  btrfs: reada: Avoid many times of empty loop
  btrfs:

Re: [PATCH 1/3] btrfs: Continue write in case of can_not_nocow

2016-02-26 Thread Chris Mason

On Fri, Feb 26, 2016 at 10:41:31AM +0500, Roman Mamedov wrote:
> On Wed, 6 Jan 2016 19:00:17 +0800
> Zhao Lei  wrote:
> 
> > btrfs failed in xfstests btrfs/080 with -o nodatacow.
> > 
> > Can be reproduced by following script:
> >   DEV=/dev/vdg
> >   MNT=/mnt/tmp
> > 
> >   umount $DEV &>/dev/null
> >   mkfs.btrfs -f $DEV
> >   mount -o nodatacow $DEV $MNT
> > 
> >   dd if=/dev/zero of=$MNT/test bs=1 count=2048 &
> >   btrfs subvolume snapshot -r $MNT $MNT/test_snap &
> >   wait
> >   --
> >   We can see dd failed on NO_SPACE.
> > 
> > Reason:
> >   __btrfs_buffered_write should run cow write when no_cow impossible,
> >   and current code is designed with above logic.
> >   But check_can_nocow() have 2 type of return value(0 and <0) on
> >   can_not_no_cow, and current code only continue write on first case,
> >   the second case happened in doing subvolume.
> > 
> > Fix:
> >   Continue write when check_can_nocow() return 0 and <0.
> > 
> > Signed-off-by: Zhao Lei 
> 
> Guys please don't forget about this patch. It solves real problem for people,
> http://www.spinics.net/lists/linux-btrfs/msg51276.html
> http://www.spinics.net/lists/linux-btrfs/msg51819.html
> but it's not in 4.4.1, not in 4.4.2... and now not in 4.4.3

Dave already has it queued for the next merge window.  Thanks!

-chris

[GIT PULL] Btrfs fixes for 4.6

2016-02-26 Thread fdmanana

From: Filipe Manana 

Hi Chris,

Please consider the following changes for the 4.6 kernel merge window.
Nothing particularly outstanding, just the usual sort of bug fixes.
These have all been sent to the mailing list before (I just changed in
my repo the changelog for the deadlock fix patch to fix a typo pointed
by Liu Bo, other than that it's exactly the same as the version sent to
the mailing list). Some xfstests for these were already merged upstream
and one more sent earlier this week (for the listxattrs issue) that is
not yet merged.

Thanks.

The following changes since commit 0fcb760afa6103419800674e22fb7f4de1f9670b:

  Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux into for-linus-4.6 (2016-02-24 10:21:44 -0800)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git integration-4.6

for you to fetch changes up to 97c86c11a5cb9839609a9df195e998c3312e68b0:

  Btrfs: do not collect ordered extents when logging that inode exists (2016-02-26 04:28:15 +)


Filipe Manana (7):
  Btrfs: fix unreplayable log after snapshot delete + parent dir fsync
  Btrfs: fix file loss on log replay after renaming a file and fsync
  Btrfs: fix extent_same allowing destination offset beyond i_size
  Btrfs: fix deadlock between direct IO reads and buffered writes
  Btrfs: fix listxattrs not listing all xattrs packed in the same item
  Btrfs: fix race when checking if we can skip fsync'ing an inode
  Btrfs: do not collect ordered extents when logging that inode exists

 fs/btrfs/file.c     |  9
 fs/btrfs/inode.c    | 25
 fs/btrfs/ioctl.c    |  6
 fs/btrfs/tree-log.c | 99
 fs/btrfs/tree-log.h |  2
 fs/btrfs/xattr.c    | 65
 6 files changed, 165 insertions(+), 41 deletions(-)

-- 
2.7.0.rc3


[PATCH] Btrfs: do not collect ordered extents when logging that inode exists

2016-02-26 Thread fdmanana

From: Filipe Manana 

When logging that an inode exists, for example as part of a directory
fsync operation, we were collecting any ordered extents for the inode but
we ended up doing nothing with them except tagging them as processed, by
setting the flag BTRFS_ORDERED_LOGGED on them, which prevented a
subsequent fsync of that inode (using the LOG_INODE_ALL mode) from
collecting and processing them. This created a time window where a second
fsync against the inode, using the fast path, ended up not logging the
checksums for the new extents but it logged the extents since they were
part of the list of modified extents. This happened because the ordered
extents were not collected and checksums were not yet added to the csum
tree - the ordered extents have not gone through btrfs_finish_ordered_io()
yet (which is where we add them to the csum tree by calling
inode.c:add_pending_csums()).

So fix this by not collecting an inode's ordered extents if we are logging
it with the LOG_INODE_EXISTS mode.

Signed-off-by: Filipe Manana 
---
 fs/btrfs/tree-log.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 9f6372d..9d2e8ec 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4500,7 +4500,22 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 
 	mutex_lock(&BTRFS_I(inode)->log_mutex);
 
-	btrfs_get_logged_extents(inode, &logged_list, start, end);
+	/*
+	 * Collect ordered extents only if we are logging data. This is to
+	 * ensure a subsequent request to log this inode in LOG_INODE_ALL mode
+	 * will process the ordered extents if they still exists at the time,
+	 * because when we collect them we test and set for the flag
+	 * BTRFS_ORDERED_LOGGED to prevent multiple log requests to process the
+	 * same ordered extents. The consequence for the LOG_INODE_ALL log mode
+	 * not processing the ordered extents is that we end up logging the
+	 * corresponding file extent items, based on the extent maps in the
+	 * inode's extent_map_tree's modified_list, without logging the
+	 * respective checksums (since the may still be only attached to the
+	 * ordered extents and have not been inserted in the csum tree by
+	 * btrfs_finish_ordered_io() yet).
+	 */
+	if (inode_only == LOG_INODE_ALL)
+		btrfs_get_logged_extents(inode, &logged_list, start, end);
 
 	/*
 	 * a brute force approach to making sure we get the most uptodate
-- 
2.7.0.rc3


Re: loop subsystem corrupted after mounting multiple btrfs sub-volumes

Added linux-btrfs as this should be documented there as a known issue 
until it gets fixed (although I have no idea which side is the issue).

On 2016-02-25 14:22, Stanislav Brabec wrote:

While writing a test suite for util-linux[1], I experienced a strange
behavior of loop device:

When two loop devices refer to the same file, and two btrfs mounts are
called on them, the second mount changes loop device of the first,
already mounted sub-volume. (Note that the current implementation of
util-linux mount -oloop works exactly in this way, and it allocates new
loop device for each mount command, so this bug can be easily
reproduced without losetup, just using "mount -oloop" or fstab.)

I'm not 100% certain, but I think this is an interaction between how
BTRFS handles multiple mounts of the same filesystem on a given system 
and how mount handles loop mounts.  AFAIUI, all instances of a given 
BTRFS filesystem being mounted on a given system are internally 
identical to bind mounts of a hidden mount of that filesystem.  This is 
what allows both manual mounting of sub-volumes, and multiple mounting 
of the FS in general.


/proc/self/mountinfo after first btrfs loop mount:

107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop0 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2

This line changes after the second btrfs loop mount to:

107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2

See the change of /dev/loop0 to /dev/loop1!
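
To compare the source device across mounts without reading whole mountinfo lines, the field that follows the "-" separator and the filesystem type can be extracted. A small sketch, assuming the field layout documented in proc(5); mount_source is a hypothetical helper name, and the optional second argument (a file to read instead of /proc/self/mountinfo) is only there so the function can be tried on saved output:

```shell
#!/bin/sh
# mount_source MOUNTPOINT [MOUNTINFO_FILE]
# Hypothetical helper: prints the mount source (e.g. /dev/loop0) recorded
# for MOUNTPOINT. Field 5 is the mount point; optional fields start at
# field 7 and end at a lone "-", which is followed by the filesystem type
# and then the mount source.
mount_source() {
    awk -v mp="$1" '
        $5 == mp {
            for (i = 7; i <= NF; i++)
                if ($i == "-") { print $(i + 2); break }
        }
    ' "${2:-/proc/self/mountinfo}"
}
```

Running "mount_source /mnt/1" before and after the second mount would make the /dev/loop0 to /dev/loop1 switch visible in a single word.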

It is apparently not only proc file change, but it also causes a
corruption of loop device subsystem, as I observed severe problems
on the affected system later:

- mount(2) returning 0 but doing nothing.

- mount(8) entering an infinite loop while searching for free loop
device.

This seems odd that it would cause such a degree of inconsistency in the
kernel itself.  My guess though is that mount(8) sees that you're trying 
to mount a file and unconditionally tries to bind it to a loop device 
without checking any in-use loop devices to see if it's already bound to 
them, and then when it calls mount(2), this ends up somehow confusing 
the BTRFS driver (probably because you've now mounted two filesystems 
with effectively identical super-blocks, BTRFS already has issues if 
multiple filesystems have the same UUID, and I have no idea how it might 
react to filesystems that appear identical but are on separate devices).



Here is a main reproducer:

=
#!/bin/sh
# Prepare the environment:
/btrfs.sh
mkdir -p /mnt/1 /mnt/2
losetup /dev/loop0 /btrfs.img
# Verify that nothing is mounted:
cat /proc/self/mountinfo | grep /mnt
mount /dev/loop0 /mnt/1
echo "One file system should be mounted now."
cat /proc/self/mountinfo | grep /mnt
# Create another loop.
losetup /dev/loop1 /btrfs.img
echo "Going to mount second one."
mount -osubvol=/ /dev/loop1 /mnt/2 2>&1
echo "Two file system should be mounted now."
cat /proc/self/mountinfo | grep /mnt
echo "Strange. First mount changed its loop device!"
umount /mnt/2
echo "And now check, whether it remains changed after umount."
cat /proc/self/mountinfo | grep /mnt
umount /mnt/1
losetup -d /dev/loop1
losetup -d /dev/loop0
rmdir /mnt/1 /mnt/2
=

And here is its output:

One file system should be mounted now.
107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop0 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
Going to mount second one.
Two file system should be mounted now.
107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2
108 59 0:59 / /mnt/2 rw,relatime shared:47 - btrfs /dev/loop1 rw,space_cache,subvolid=5,subvol=/
Strange. First mount changed its loop device!
And now check, whether it remains changed after umount.
107 59 0:59 /d0/dd0/ddd0/s1/d1/dd1/ddd1/s2 /mnt/1 rw,relatime shared:45 - btrfs /dev/loop1 rw,space_cache,subvolid=257,subvol=/d0/dd0/ddd0/s1/d1/dd1/ddd1/s2

It was actually reproduced on linux-4.4.1 on openSUSE Tumbleweed.


Test image creator:

= /btrfs.sh =
#!/bin/sh
truncate -s 42M /btrfs.img
mkfs.btrfs -f -d single -m single /btrfs.img >/dev/null
mount -o loop /btrfs.img /mnt
pushd . >/dev/null
cd /mnt
mkdir -p d0/dd0/ddd0
cd ./d0/dd0/ddd0
touch file{1..5}
btrfs subvol create s1 >/dev/null
cd ./s1
touch file{1..5}
mkdir bind-point
mkdir -p d1/dd1/ddd1
cd ./d1/dd1/ddd1
btrfs subvol create s2 >/dev/null
DEFAULT_SUBVOLID=$(btrfs inspect rootid s2)
btrfs subvol set-default $DEFAULT_SUBVOLID . >/dev/null
NON_DEFAULT_SUBVOLID=$(btrfs subvol list /mnt |
while read dummy id rest ; do if test $id = $DEFAULT_SUBVOLID ; then
continue ; fi ; echo $id ; done)
cd ../../../..
mkdir -p d2/dd2/ddd2
cd ./d2/dd2/ddd2
btrfs subvol create s3 >/dev/null
mkdir -p s3/bind-mnt
popd >/dev/null

Re: upgrading kernel 3.13 to 3.16