Re: degraded raid scribbling upon wrong device

2017-07-22 Thread Adam Borowski
On Thu, Jul 13, 2017 at 08:40:12AM +0200, Adam Borowski wrote:
> Here's a set of test cases; two of them in some cases seem to scribble upon
> the wrong device:
> 
> * deg-mid-missing
> * deg-last-replaced (not on the innocent "re")
> * but never deg-last-missing
> 
> When all goes OK, there are no errors other than a wrong generation on the
> re-added disk (expected).  When it goes bad, there's a lot of corruption.
> In all cases, though, the "Device missing:" field is wrong.

I have not explored this adequately yet, in good part because of ENOSPC
triggering a lot of the time for an unrelated reason that Omar just fixed
(thanks!).  So, here's what I know so far:

* copying in, say, a 2.2GB /usr/share is a lot more likely to trigger it
  than dd-ing 2.2GB from /dev/zero
* no "real" degrading is needed: in the original scripts, the missing device
  is empty, so all blocks are doubled anyway.  It's not about degraded
  chunks but about a bogus device.
* bogus output of "btrfs fi usage" is a sure predictor that, with enough
  tries, you'll get corruption -- if it shows something when it should say
  "missing", shit is likely to happen


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: btrfs device ready purpose

2017-07-22 Thread Chris Murphy
On Sat, Jul 22, 2017 at 1:58 PM, Adam Borowski  wrote:
> On Sat, Jul 22, 2017 at 06:15:58PM +, Hugo Mills wrote:
>> On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
>> > I just did an additional test that shows pretty icky behavior.
>> >
>> > Two-device Btrfs volume on USB HDDs. Connect both devices and `btrfs
>> > device ready` exits with 0 as expected. Physically remove both USB
>> > devices. Reconnect one device. `btrfs device ready` still exits 0.
>> > That's definitely not good. (If I leave that one device connected and
>> > reboot, `btrfs device ready` exits 1.)
>>
>> In a slightly less-specific way, this has been a problem pretty
>> much since the inception of the FS. It's not possible to do the
>> reverse of the "scan" operation on a device -- that is, invalidate/
>> remove the device's record in the kernel. So, as you've discovered
>> here, if you have a device which is removed (overwritten, unplugged),
>> the kernel still thinks it's a part of the FS.
>
> Alas, this needs to be fixed.  The reproducers I posted last week give data
> corruption when a device that was once a part of the FS is reconnected.
> It doesn't matter what it contains now -- be it another part of the FS or
> something totally unrelated; as long as the device node (/dev/loop0,
> /dev/sda1, etc.) is reused, degraded mounts get confused.
>
> It wasn't urgent before, as degraded mounts were broken anyway until Qu's
> chunk check patch (which isn't even merged yet) -- but once running
> degraded is no longer an emergency-only measure, there'll be folks doing
> so for an extended time.
>
>> It's something I recall being talked about a bit, some years ago. I
>> don't recall now why it was going to be useful, though. I think you
>> have a good use-case for such a new ioctl (or extension to the
>> SCAN_DEV ioctl) now, though.
>
> Such an ioctl would be inherently racy.  Even the current udev code is --
> mounting right after losetup often fails; sometimes you even need to sleep
> for longer than a second.  With the above in mind, I see no way other than
> invalidating and re-checking all known devices at mount time.


If we go back even further in time, what I'm trying to avoid is the
problem with DEs where the user connects a two-device Btrfs and then
wants to eject it. The DE is already confused because behind the
scenes it has actually mounted each device to two different mount
points, which Btrfs allows (it's one file system, on two mount
points). That's confusing, but not a big problem. The big problem
happens when the user wants to stop using that file system. They
eject one of the two devices shown (Btrfs should, of course, show
only one), and behind the scenes udisksd unmounts just one of the
mount points and then appears to delete that device node, which in
effect makes the still-mounted file system degraded, and results in
corruption.

Btrfs fixes this up on the next mount of both devices. But it's just
asking for trouble.

Output of this behavior here:
https://bugs.freedesktop.org/show_bug.cgi?id=87277#c3

So then I started to look at whether it's possible to easily determine
in advance whether a Btrfs file system is single or multiple device, and
let udisksd have a policy where it just ignores multiple-device Btrfs
entirely -- simply don't support it until the guts of all this
infrastructure get better.

'strace btrfs filesystem show' curiously shows BTRFS_IOC_FS_INFO being
called only for single-device Btrfs. There is seemingly a much more
esoteric, btrfs-progs-only method for getting information about
multiple-device Btrfs volumes. And therefore I'm not certain whether
BTRFS_IOC_FS_INFO supports multiple-device Btrfs and would return
num_devices, which would make it possible to know whether to ignore the
devices of a multiple-device Btrfs volume.
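
One way to find out would be to poke the ioctl directly and look at
num_devices. Here's a minimal sketch of that probe in Python; the ioctl
number and field offsets are my reading of struct btrfs_ioctl_fs_info_args
in the linux/btrfs.h uapi header, so treat it as an untested sketch rather
than a known-good answer:

#!/usr/bin/env python3
# Untested sketch: call BTRFS_IOC_FS_INFO on a mounted btrfs path and
# print num_devices.  Layout per my reading of linux/btrfs.h
# (struct btrfs_ioctl_fs_info_args, 1024 bytes).
import fcntl
import os
import struct
import sys
import uuid

BTRFS_IOC_FS_INFO = 0x8400941f   # _IOR(0x94, 31, 1024-byte args struct)

def fs_info(mountpoint):
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        args = bytearray(1024)   # the kernel fills this in
        fcntl.ioctl(fd, BTRFS_IOC_FS_INFO, args)
    finally:
        os.close(fd)
    # layout: __u64 max_id; __u64 num_devices; __u8 fsid[16]; ...
    max_id, num_devices = struct.unpack_from('=QQ', args, 0)
    fsid = uuid.UUID(bytes=bytes(args[16:32]))
    return max_id, num_devices, fsid

if __name__ == '__main__':
    max_id, num_devices, fsid = fs_info(sys.argv[1])
    print('fsid %s: %d device(s), highest devid %d'
          % (fsid, num_devices, max_id))

If that reliably reports num_devices > 1 on a mounted multiple-device
volume, it would be exactly the early-out test udisksd needs.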

*sigh*


Chris Murphy



-- 
Chris Murphy


Re: btrfs device ready purpose

2017-07-22 Thread Adam Borowski
On Sat, Jul 22, 2017 at 06:15:58PM +, Hugo Mills wrote:
> On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
> > I just did an additional test that shows pretty icky behavior.
> > 
> > Two-device Btrfs volume on USB HDDs. Connect both devices and `btrfs
> > device ready` exits with 0 as expected. Physically remove both USB
> > devices. Reconnect one device. `btrfs device ready` still exits 0.
> > That's definitely not good. (If I leave that one device connected and
> > reboot, `btrfs device ready` exits 1.)
> 
> In a slightly less-specific way, this has been a problem pretty
> much since the inception of the FS. It's not possible to do the
> reverse of the "scan" operation on a device -- that is, invalidate/
> remove the device's record in the kernel. So, as you've discovered
> here, if you have a device which is removed (overwritten, unplugged),
> the kernel still thinks it's a part of the FS.

Alas, this needs to be fixed.  The reproducers I posted last week give data
corruption when a device that was once a part of the FS is reconnected.
It doesn't matter what it contains now -- be it another part of the FS or
something totally unrelated; as long as the device node (/dev/loop0,
/dev/sda1, etc.) is reused, degraded mounts get confused.

It wasn't urgent before, as degraded mounts were broken anyway until Qu's
chunk check patch (which isn't even merged yet) -- but once running
degraded is no longer an emergency-only measure, there'll be folks doing
so for an extended time.

> It's something I recall being talked about a bit, some years ago. I
> don't recall now why it was going to be useful, though. I think you
> have a good use-case for such a new ioctl (or extension to the
> SCAN_DEV ioctl) now, though.

Such an ioctl would be inherently racy.  Even the current udev code is --
mounting right after losetup often fails; sometimes you even need to sleep
for longer than a second.  With the above in mind, I see no way other than
invalidating and re-checking all known devices at mount time.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: btrfs device ready purpose

2017-07-22 Thread Hugo Mills
On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
> I just did an additional test that shows pretty icky behavior.
> 
> Two-device Btrfs volume on USB HDDs. Connect both devices and `btrfs
> device ready` exits with 0 as expected. Physically remove both USB
> devices. Reconnect one device. `btrfs device ready` still exits 0.
> That's definitely not good. (If I leave that one device connected and
> reboot, `btrfs device ready` exits 1.)

   In a slightly less-specific way, this has been a problem pretty
much since the inception of the FS. It's not possible to do the
reverse of the "scan" operation on a device -- that is, invalidate/
remove the device's record in the kernel. So, as you've discovered
here, if you have a device which is removed (overwritten, unplugged),
the kernel still thinks it's a part of the FS.

   It's something I recall being talked about a bit, some years ago. I
don't recall now why it was going to be useful, though. I think you
have a good use-case for such a new ioctl (or extension to the
SCAN_DEV ioctl) now, though.

   Hugo.

-- 
Hugo Mills | UNIX: Italian pen maker
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs device ready purpose

2017-07-22 Thread Chris Murphy
I just did an additional test that shows pretty icky behavior.

Two-device Btrfs volume on USB HDDs. Connect both devices and `btrfs
device ready` exits with 0 as expected. Physically remove both USB
devices. Reconnect one device. `btrfs device ready` still exits 0.
That's definitely not good. (If I leave that one device connected and
reboot, `btrfs device ready` exits 1.)


Chris Murphy


Re: btrfs device ready purpose

2017-07-22 Thread Chris Murphy
On Fri, Jul 21, 2017 at 11:55 PM, Andrei Borzenkov  wrote:
> 21.07.2017 17:36, Chris Murphy wrote:
>>>
>>> The command is just a simple wrapper around the DEVICES_READY ioctl, but
>>> now that systemd has its own wrapper tool, there are probably no users
>>> of that subcommand in 'btrfs' tool itself. We can enhance the
>>> documentation to state the expected purpose and that normal users can
>>> ignore it.
>>
>> What is the expected purpose? It flat out does not seem to work at
>> all. It doesn't wait when devices are missing, even though the man page
>> description says it does.
>
> That's the man page being misleading. The intent was to let the caller of
> "btrfs device ready" know when it has to wait.
>
>> And `echo $?` returns 0 instead of 1. I'd expect exit code 0 to mean
>> "yes, all devices are ready", and exit code 1 "some devices not ready".
>> But right now, I get the same result no matter what.
>>
>
> That's not what I observe.
>
> linux-gtrk:~ # btrfs device ready /dev/sdb
> linux-gtrk:~ # echo $?
> 1
> linux-gtrk:~ # btrfs-debug-tree /dev/sdb
> btrfs-progs v4.5.3+20160729
> warning, device 2 is missing
> ...
>
> But if you call "btrfs device ready" AFTER the kernel has already seen (or
> decided about) all devices, then it returns 0. Basically, this is not
> "filesystem ready" but "does the kernel know about all devices for this
> filesystem".

OK! Super! This is the critical bit of behavior. My test is flawed.

The multiple-device volume was visible to the kernel, and then I
merely deactivated the LV. The kernel had already seen it, so it isn't
"missing" at least in terms of 'btrfs device ready', whereas 'btrfs fi
show' does report it as missing -- but that command uses different
ioctls. Even if I use 'btrfs device scan', a subsequent 'btrfs device
ready' exits 0. But if I set skip activation ('lvchange -ky') and
reboot, 'btrfs device ready' on the non-missing device does exit 1.


> Please do not confuse independent things. "btrfs device ready" simply
> tells the caller whether all devices have been seen by the kernel. It is
> a poor man's solution for "can I mount it". What the caller does with this
> information is outside the scope of btrfs.

Got it. Thanks.
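
For anyone else following along: since the command is just a thin wrapper
around the DEVICES_READY ioctl on /dev/btrfs-control, the whole check boils
down to something like the following Python sketch (ioctl number and struct
layout are my reading of the linux/btrfs.h uapi header, not the actual
btrfs-progs code; untested, and it needs root):

#!/usr/bin/env python3
# Untested sketch of what `btrfs device ready <dev>` boils down to: one
# DEVICES_READY ioctl asking whether the kernel has seen all member
# devices of the filesystem that <dev> belongs to.
# Exit 0 = all devices seen, 1 = some still missing.
import fcntl
import os
import sys

BTRFS_IOC_DEVICES_READY = 0x90009427   # _IOR(0x94, 39, 4096-byte vol_args)

def devices_ready(devpath):
    # struct btrfs_ioctl_vol_args { __s64 fd; char name[4088]; }
    args = bytearray(4096)
    name = devpath.encode()
    args[8:8 + len(name)] = name
    ctl = os.open('/dev/btrfs-control', os.O_RDWR)
    try:
        # the ioctl itself returns 0 or 1, which Python hands back to us
        return fcntl.ioctl(ctl, BTRFS_IOC_DEVICES_READY, args)
    finally:
        os.close(ctl)

if __name__ == '__main__':
    sys.exit(devices_ready(sys.argv[1]))

And per the above, "seen" is all it means -- it says nothing about whether
the devices are still actually there.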


>
>> I think it'd be better to return a code.
>> 0: is complete, degraded not required
>> 1: is incomplete, degraded should mount it
>> 2: is incomplete, degraded won't mount it
>>
>
> There is no way systemd can make use of this information with the current
> static unit dependencies. Really, this topic has come up more than once
> (including from you as well). systemd does not have adequate ways to
> represent multi-device objects (this goes beyond btrfs; Linux MD is a good
> example). Sometimes it is possible to work around it (Linux MD again).
> But in the end, systemd needs to offer a framework where btrfs et al. can
> plug in by providing status. Until this happens, discussion on this list
> is pointless.

Understood.


-- 
Chris Murphy


Re: [PATCH v3 0/4] Add xxhash and zstd modules

2017-07-22 Thread Adam Borowski
On Fri, Jul 21, 2017 at 11:56:21AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-20 17:27, Nick Terrell wrote:
> > This patch set adds xxhash, zstd compression, and zstd decompression
> > modules. It also adds zstd support to BtrFS and SquashFS.
> > 
> > Each patch has relevant summaries, benchmarks, and tests.
> 
> For patches 2-3, I've compile-tested and have had runtime testing running
> for about 18 hours now with no issues, so you can add:
> 
> Tested-by: Austin S. Hemmelgarn 

I assume you haven't tried it on arm64, right?

I had no time to get 'round to it before, and just got the following build
failure:

  CC  fs/btrfs/zstd.o
In file included from fs/btrfs/zstd.c:28:0:
fs/btrfs/compression.h:39:2: error: unknown type name ‘refcount_t’
  refcount_t pending_bios;
  ^~
scripts/Makefile.build:302: recipe for target 'fs/btrfs/zstd.o' failed

It's trivially fixable by:
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -24,6 +24,7 @@
 #include <linux/init.h>
 #include <linux/kernel.h>
 #include <linux/mm.h>
+#include <linux/refcount.h>
 #include <linux/sched.h>
 #include "compression.h"

after which it works fine, although half an hour of testing isn't exactly
exhaustive.


Alas, the armhf machine I ran stress tests (Debian archive rebuilds) on
doesn't boot with 4.13-rc1 due to some unrelated regression; bisecting that
would be quite painful, so I haven't tried yet.  I guess re-testing your
patch set on 4.12, even with btrfs-for-4.13 (which it had for a while),
wouldn't be of much help.  So far, previous versions have been running for
weeks, with no issues since you fixed the workspace flickering.


On amd64 all is fine.


I haven't tested SquashFS at all.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: How to detect "orphaned" subvolume attachment point in snapshot?

2017-07-22 Thread Hans van Kranenburg
On 07/22/2017 09:58 AM, Andrei Borzenkov wrote:
> Here is the structure of snapshots in openSUSE; all snapshots of the root
> volume are created under the /.snapshots subvolume:
> 
> linux-gtrk:/host/home/src/python-btrfs/examples # sudo mount -o
> ro,subvol=/ /dev/sda3 /mnt
> linux-gtrk:/host/home/src/python-btrfs/examples #
> ./show_directory_contents.py /mnt/
> directory /mnt/@ tree 257 inum 256
> inode generation 6 transid 315 size 72 nbytes 0 block_group 0 mode 40755
> nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
> inode ref index 0 name utf-8 ..
> ...
> dir item list hash 1921786525 size 1
> dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
> ...
> linux-gtrk:/host/home/src/python-btrfs/examples #
> ./show_directory_contents.py /mnt/@/.snapshots/251/snapshot/
> directory /mnt/@/.snapshots/251/snapshot/ tree 774 inum 256
> inode generation 6 transid 15867 size 164 nbytes 0 block_group 0 mode
> 40755 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
> inode ref index 0 name utf-8 ..
> ...
> dir item list hash 1921786525 size 1
> dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
> ...
> linux-gtrk:/host/home/src/python-btrfs/examples #
> 
> Note that the directory items in both /mnt/@ and
> /mnt/@/.snapshots/251/snapshot store the same tree ID for the .snapshots
> entry: 258. This causes a loop in the grub2 btrfs driver: when it descends
> into /mnt/@/.snapshots/251/snapshot and looks up .snapshots, it jumps back
> to the /mnt/@/.snapshots tree.
> 
> I see that the Linux kernel somehow distinguishes between the two; I am
> not sure how it actually does it, though.
> 
> What on-disk information should we check to detect an "orphaned" snapshot
> directory?

The information is not stored in the subvolume that contains the
"attachment point", so you cannot get the info at that location.

If it were, creating a snapshot would require some process to walk the
entire directory structure and rewrite every location in the tree where
another nested subvolume appeared to be attached before.

In tree 1, the tree of trees, there's information about root 258:

-# btrfs inspect-internal dump-tree -t 1 /dev/[...]/blaat
[...]
item 19 key (258 ROOT_ITEM 0) itemoff 12635 itemsize 439
    root data bytenr 21397504 level 0 dirid 256 refs 1 gen 11 lastsnap 0
    flags 0x0(none)
    uuid d7fe436b-35b5-9b4e-805d-20b9294a55d0
    ctransid 11 otransid 9 stransid 0 rtransid 0
item 20 key (258 ROOT_BACKREF 257) itemoff 12616 itemsize 19
    root backref key dirid 256 sequence 2 name b

I think the ROOT_BACKREF says that the only location where the contents
of the nested subvolume should really be shown is when it's reached via
the "attachment point" in tree 257: directory inode 256, index 2, name b.

When looking at it via the VFS, you get the special inode number 2
whenever you reach it from a place that does not match the BACKREF:

-# stat b
  File: b
  Size: 0   Blocks: 0  IO Block: 4096   directory
Device: 55h/85d Inode: 2   Links: 1
Access: (0755/drwxr-xr-x)  Uid: (0/root)   Gid: (0/root)
Access: 2017-07-22 13:08:59.217456707 +0200
Modify: 2017-07-22 13:08:59.217456707 +0200
Change: 2017-07-22 13:08:59.217456707 +0200
 Birth: -
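
So from userspace, a quick (VFS-level, not on-disk) test for such an
orphaned attachment point is simply that inode number; a trivial sketch:

import os

def looks_like_orphaned_attachment_point(path):
    # btrfs shows the dead "attachment point" as an empty directory
    # with the special inode number 2 (see the stat output above)
    return os.stat(path).st_ino == 2

That doesn't help grub2, of course, which has to work from the on-disk
structures.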

It seems I don't have all the structures of the root tree in python-btrfs
yet. It would be nice to create an example script that pretty-prints
tree 1.
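
In the meantime, here's a rough sketch that walks tree 1 with the raw
TREE_SEARCH ioctl and prints the ROOT_BACKREF items. The ioctl number and
struct layouts are hand-rolled from my reading of the linux/btrfs.h and
linux/btrfs_tree.h uapi headers rather than taken from python-btrfs, so
consider it untested:

#!/usr/bin/env python3
# Untested sketch: walk tree 1 (the tree of trees) with TREE_SEARCH and
# print every ROOT_BACKREF item, i.e. the single "official" attachment
# point of each subvolume.
import fcntl
import os
import struct
import sys

BTRFS_IOC_TREE_SEARCH = 0xd0009411   # _IOWR(0x94, 17, 4096-byte args)
BTRFS_ROOT_TREE_OBJECTID = 1
BTRFS_ROOT_BACKREF_KEY = 144

def dump_root_backrefs(mountpoint):
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        min_objectid = 0
        while True:
            # struct btrfs_ioctl_search_key (104 bytes), followed by a
            # 3992 byte result buffer
            key = struct.pack('=7Q4L4Q',
                              BTRFS_ROOT_TREE_OBJECTID,
                              min_objectid, 2**64 - 1,   # objectid range
                              0, 2**64 - 1,              # offset range
                              0, 2**64 - 1,              # transid range
                              BTRFS_ROOT_BACKREF_KEY,    # min_type
                              BTRFS_ROOT_BACKREF_KEY,    # max_type
                              4096, 0,                   # nr_items, unused
                              0, 0, 0, 0)                # unused1..4
            args = bytearray(key) + bytearray(4096 - len(key))
            fcntl.ioctl(fd, BTRFS_IOC_TREE_SEARCH, args)
            nr_items = struct.unpack_from('=L', args, 64)[0]
            if nr_items == 0:
                return
            off = 104
            for _ in range(nr_items):
                # struct btrfs_ioctl_search_header
                transid, objectid, offset, _type, length = \
                    struct.unpack_from('=3Q2L', args, off)
                off += 32
                # struct btrfs_root_ref (packed): dirid, sequence, name_len
                dirid, sequence, name_len = \
                    struct.unpack_from('<QQH', args, off)
                name = bytes(args[off + 18:off + 18 + name_len]).decode()
                print('subvol %d attached in tree %d, dir %d, index %d, '
                      'name %s' % (objectid, offset, dirid, sequence, name))
                off += length
            # a subvolume has a single backref, so this skips nothing
            min_objectid = objectid + 1
    finally:
        os.close(fd)

if __name__ == '__main__':
    dump_root_backrefs(sys.argv[1])

A directory entry pointing at a subvolume would then only be the real
thing if tree 1 has a matching ROOT_BACKREF for that exact location;
anything else is one of these orphaned leftovers.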

-- 
Hans van Kranenburg


How to detect "orphaned" subvolume attachment point in snapshot?

2017-07-22 Thread Andrei Borzenkov
Here is the structure of snapshots in openSUSE; all snapshots of the root
volume are created under the /.snapshots subvolume:

linux-gtrk:/host/home/src/python-btrfs/examples # sudo mount -o
ro,subvol=/ /dev/sda3 /mnt
linux-gtrk:/host/home/src/python-btrfs/examples #
./show_directory_contents.py /mnt/
directory /mnt/@ tree 257 inum 256
inode generation 6 transid 315 size 72 nbytes 0 block_group 0 mode 40755
nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
inode ref index 0 name utf-8 ..
...
dir item list hash 1921786525 size 1
dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
...
linux-gtrk:/host/home/src/python-btrfs/examples #
./show_directory_contents.py /mnt/@/.snapshots/251/snapshot/
directory /mnt/@/.snapshots/251/snapshot/ tree 774 inum 256
inode generation 6 transid 15867 size 164 nbytes 0 block_group 0 mode
40755 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
inode ref index 0 name utf-8 ..
...
dir item list hash 1921786525 size 1
dir item location (258 ROOT_ITEM -1) type DIR name utf-8 .snapshots
...
linux-gtrk:/host/home/src/python-btrfs/examples #

Note that the directory items in both /mnt/@ and
/mnt/@/.snapshots/251/snapshot store the same tree ID for the .snapshots
entry: 258. This causes a loop in the grub2 btrfs driver: when it descends
into /mnt/@/.snapshots/251/snapshot and looks up .snapshots, it jumps back
to the /mnt/@/.snapshots tree.

I see that the Linux kernel somehow distinguishes between the two; I am
not sure how it actually does it, though.

What on-disk information should we check to detect an "orphaned" snapshot
directory?

TIA

-andrei




Re: kernel btrfs file system wedged -- is it toast?

2017-07-22 Thread Marat Khalili

> The btrfs developers should have known this, and announced this,
> a long time ago, in various prominent ways that it would be difficult
> for potential new users to miss.

I'm also a user like you, and I felt the same way when I came here (BTW,
there are several traps in BTRFS, and others cause partial or whole
filesystem loss, so you're lucky). There's truth in your words that some
warning is needed, but in this open-source business it is not clear who
should give it, or to whom. Developers on the list actually do spend their
time adding such warnings to the kernel and command-line tools, but e.g.
people using a GUI and not reading dmesg over breakfast won't see them
anyway. The whole situation is unfortunate because hardware and OS vendors
keep hyping BTRFS and making it the default in their products when it is
clearly not ready -- but you're now talking to, and blaming, the wrong
people.

Personally, coming to this list was the most helpful thing for
understanding BTRFS's current state and limitations. I'm still using it,
although in a very careful and controlled manner. But browsing the list
every day sadly takes time. If you can't afford that, or you're running
something absolutely critical, better look to other, more mature
filesystems. After all, as the adage goes: "legacy is what we run in
production".
-- 

With Best Regards,
Marat Khalili