Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-30 14:50, Tomasz Pala wrote:

On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:


I personally think the degraded mount option is a mistake as this
assumes that a lightly degraded system is not able to work which is false.
If the system can mount to some working state then it should mount
regardless if it is fully operative or not. If the array is in a bad
state you need to learn about it by issuing a command or something. The
same goes for a MD array (and yes, I am aware of the block layer vs
filesystem thing here).

The problem with this is that right now, it is not safe to run a BTRFS
volume degraded and writable, but for an even remotely usable system


Mounting read-only is still better than not mounting at all.
Agreed, but what most people who are asking about this are asking for is 
to have the system just keep running with a missing drive.


For example, my emergency.target has limited network access and starts
ssh server so I could recover from this situation remotely.


with pretty much any modern distro, you need your root filesystem to be
writable (or you need to have jumped through the hoops to make sure /var
and /tmp are writable even if / isn't).


Easy to handle by systemd. Not only this, but much more is planned:

http://0pointer.net/blog/projects/stateless.html

It's reasonably easy to handle even in a normal init system.  The issue 
is that most distros don't really support it well.  Arch and Gentoo make 
it trivial, but they let you configure storage however the hell you 
want.  Pretty much everybody else is designed to assume that /var 
is a part of /; they mostly work if it's not, but certain odd things 
cause problems, and you have to go through somewhat unfriendly 
configuration work during install to get a system set up that way (well, 
unfriendly if you're a regular user; it's perfectly fine for a seasoned 
sysadmin).


Also, slightly OT, but has anyone involved in the development described 
in the article you linked ever looked beyond the typical Fedora/Debian 
environment for any of the stuff the conclusions section says you're 
trying to achieve?  Just curious, since NixOS can do almost all of it 
with near zero effort except for the vendor data part (NixOS still 
stores its config in /etc, but it can work with just one or two files), 
and a handful of the other specific items have reasonably easy ways to 
implement them that just aren't widely supported (for example, factory 
resets have at least three options already, OverlayFS (bottom layer is 
your base image, stored in a read-only verified manner, top layer is 
writable for user customization), BTRFS seed devices (similar to an 
overlay, just at the block level), and bootable, self-installing, 
compressed system images).
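
(For the curious, the seed-device route looks roughly like this - device
names are placeholders and the exact sequence is a sketch, not a recipe:

    # mark the prepared base image as a read-only seed
    btrfstune -S 1 /dev/sdX1
    # mounting the seed gives a read-only view; add a writable device on top
    mount /dev/sdX1 /mnt
    btrfs device add /dev/sdY1 /mnt
    mount -o remount,rw /mnt
    # a "factory reset" is then just wiping and re-adding the writable device

so the base image never changes and all user state lives on the added device.)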



Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:

>> I personally think the degraded mount option is a mistake as this 
>> assumes that a lightly degraded system is not able to work which is false.
>> If the system can mount to some working state then it should mount 
>> regardless if it is fully operative or not. If the array is in a bad 
>> state you need to learn about it by issuing a command or something. The 
>> same goes for a MD array (and yes, I am aware of the block layer vs 
>> filesystem thing here).
> The problem with this is that right now, it is not safe to run a BTRFS 
> volume degraded and writable, but for an even remotely usable system 

Mounting read-only is still better than not mounting at all.

For example, my emergency.target has limited network access and starts
ssh server so I could recover from this situation remotely.

> with pretty much any modern distro, you need your root filesystem to be 
> writable (or you need to have jumped through the hoops to make sure /var 
> and /tmp are writable even if / isn't).

Easy to handle by systemd. Not only this, but much more is planned:

http://0pointer.net/blog/projects/stateless.html

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
Just one final word, as all was already said:

On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> In other words, is it:
>> - the systemd that treats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires from systemd to be treated BETTER than other fss?
> Or maybe it's both?  I'm more than willing to admit that what BTRFS does 
> expose currently is crap in terms of usability.  The reason it hasn't 
> changed is that we (that is, the BTRFS people and the systemd people) 
> can't agree on what it should look like.

Hard to agree with someone who refuses to do _anything_.

You can choose to follow whatever - MD, LVM, ZFS - invent something
totally different, write a custom daemon or put timeout logic inside the
kernel itself. It doesn't matter. You know the ecosystem - it is udev
that must be signalled somehow, and systemd WILL follow.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> - crypto below software RAID means double-encryption (wasted CPU),
> It also means you leak no information about your storage stack.  If 

JBOD

> you're sufficiently worried about data protection that you're using 
> block-level encryption, you should be thinking _very_ hard about whether 
> or not that's an acceptable risk (and it usually isn't).

Nonsense. Block-level encryption is the last line of defence; your
primary concern is to encrypt at the highest level possible. Anyway,
I don't need to care about encryption at all; one of my customers might.
Just stop extending the justification of your narrow usage pattern to the
rest of the world.

BTW if YOU are sufficiently worried about data protection, you need to
use some hardware solution, like OPAL, and completely avoid using
consumer-grade (especially SSD) drives. This also saves CPU cycles,
but let's not discuss the gory details here.

If you can't imagine people have different requirements than you, then
this is your mental problem, go solve it somewhere else.

>> - RAID below LVM means you're stuck with the same RAID-profile for all
>>the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>>system and RAID0 for various system caches (like ccache on software
>>builder machine) or transient LVM-level snapshots.
> Then you skip MD and do the RAID work in LVM with DM-RAID (which 
> technically _is_ MD, just with a different frontend).

1. how is write-mostly handled by LVM-initiated RAID1?
2. how can one split an LVM RAID1 into separate volumes in case of a bit-rot
situation that requires manual intervention to recover a specific copy of
the data (just like btrfs checksumming does automatically in raid1 mode)?

>> - RAID below filesystem means losing btrfs-RAID extra functionality,
>>like recovering data from different mirror when CRC mismatch happens,
> That depends on your choice of RAID and the exact configuration of the 

There is no data checksumming in MD-RAID, there is no voting in MD-RAID.
There is FEC mode in dm-verity.

> storage stack.  As long as you expose two RAID devices, BTRFS 
> replication works just fine on top of them.

Taking up 4 times the space? Or going crazy with 2*MD-RAID0?

>> - crypto below LVM means encrypting everything, including data that is
>>not sensitive - more CPU wasted,
> Encrypting only sensitive data is never a good idea unless you can prove 

Encrypting the sensitive data _AT_or_ABOVE_ the filesystem level is
crucial for any really sensitive data.

> with certainty that you will keep it properly segregated, and even then 
> it's still a pretty bad idea because it makes it obvious exactly where 
> the information you consider sensitive is stored.

ROTFL

Do you really think this would make breaking the XTS easier than it
would be if the _entire_ drive were encrypted using THE SAME secret?
With the attacker having access to the plaintexts _AND_ the ciphertexts?

Wow... - stop doing the crypto, seriously. You're doing this wrong.

Do you think that my customer or cooperative would happily share HIS
secret with mine, just because we're running on the same server?

Have you ever heard about zero-knowledge databases?
Can you imagine that someone might want to do the decryption remotely,
because they don't trust me as the owner of the machine?

How is me KNOWING that their data is encrypted supposed to ease the attack?

>> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>>space using MD write-mostly functionality.
> Again, just use LVM's DM-RAID and throw in DM-cache.  Also, there were 

Obviously you've never used write-mostly, as you're apparently not aware
of the difference in maintenance burden.
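
(For readers who have not used it: write-mostly is a per-member flag on an MD
RAID1, set at creation time or toggled later through sysfs - a sketch, device
names are placeholders:

    # serve reads from the SSD, keep the HDD as a write-mostly mirror
    mdadm --create /dev/md0 --level=1 --raid-devices=2 \
          /dev/ssd-part --write-mostly /dev/hdd-part
    # or flip the flag on a running array
    echo writemostly > /sys/block/md0/md/dev-sdb1/state

so a mixed SSD/HDD mirror reads at SSD speed while both copies stay in sync.)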

> some patches just posted for BTRFS that indirectly allow for this 
> (specifically, they let you change the read-selection algorithm, with 
> the option of specifying to preferentially read from a specific device).

When they become available in an LTS kernel, some people will definitely use
them and create even more complicated stacks.

>> It is the bottom layer, but I might be attached into volumes at virtually
>> any place of the logical topology tree. E.g. bare network drive added as
>> device-mapper mirror target for on-line volume cloning.
> And you seriously think that that's going to be a persistent setup? 

Persistent setups are archeology in IT.

> One-shot stuff like that is almost never an issue unless your init 
> system is absolutely brain-dead _and_ you need it back working as it was 
> immediately (and a live-clone of a device doesn't need that if you're doing it right).

Brain-dead is a state of mind: rejecting usage scenarios that you
completely fail to understand, hopefully only due to limited experience.

>> The point is: maintaining all of this logic is NOT a job for the init system.
>> With systemd you need exactly N-N=0 lines of code to make this work.
> So, I find it very hard to believe that systemd requires absolutely 

Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-30 10:09, Tomasz Pala wrote:

On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:


Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.

I think you mean any setup that isn't sensibly layered.


No, I mean any setup that wasn't considered by init system authors.
Your 'sensibly' is not sensible for me.


BCP for over a
decade has been to put multipathing at the bottom, then crypto, then
software RAID, then LVM, and then whatever filesystem you're using.


Really? Let's enumerate some caveats of this:

- crypto below software RAID means double-encryption (wasted CPU),
It also means you leak no information about your storage stack.  If 
you're sufficiently worried about data protection that you're using 
block-level encryption, you should be thinking _very_ hard about whether 
or not that's an acceptable risk (and it usually isn't).


- RAID below LVM means you're stuck with the same RAID-profile for all
   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
   system and RAID0 for various system caches (like ccache on software
   builder machine) or transient LVM-level snapshots.
Then you skip MD and do the RAID work in LVM with DM-RAID (which 
technically _is_ MD, just with a different frontend).
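
(Concretely, the per-LV flexibility asked about above looks something like
this with dm-raid - names and sizes are placeholders:

    # one VG, a different RAID profile per LV
    lvcreate --type raid1 -m2 -L100G -n crucial vg0    # 3-way mirror
    lvcreate --type raid1 -m1 -L20G  -n system  vg0    # 2-way mirror
    lvcreate --type raid0 -i2 -L50G  -n scratch vg0    # plain stripe for caches

i.e. the "same RAID profile for everything" objection applies to MD sitting
under LVM, not to RAID done inside LVM.)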


- RAID below filesystem means losing btrfs-RAID extra functionality,
   like recovering data from different mirror when CRC mismatch happens,
That depends on your choice of RAID and the exact configuration of the 
storage stack.  As long as you expose two RAID devices, BTRFS 
replication works just fine on top of them.


- crypto below LVM means encrypting everything, including data that is
   not sensitive - more CPU wasted,
Encrypting only sensitive data is never a good idea unless you can prove 
with certainty that you will keep it properly segregated, and even then 
it's still a pretty bad idea because it makes it obvious exactly where 
the information you consider sensitive is stored.


- RAID below LVM means no way to use SSD acceleration of part of the HDD
   space using MD write-mostly functionality.
Again, just use LVM's DM-RAID and throw in DM-cache.  Also, there were 
some patches just posted for BTRFS that indirectly allow for this 
(specifically, they let you change the read-selection algorithm, with 
the option of specifying to preferentially read from a specific device).


What you present is only some sane default, which doesn't mean it covers
all the real-world cases.

My recent server is using:
- raw partitioning for base volumes,
- LVM,
- MD on top of some LVs (varying levels),
- partitioned SSD cache attached to specific VGs,
- crypto on top of selected LV/MD,
- btrfs RAID1 on top of non-MDed LVs.


Multipathing has to be the bottom layer for a given node because it
interacts directly with hardware topology which gets obscured by the
other layers.


It is the bottom layer, but it might be attached to volumes at virtually
any place of the logical topology tree. E.g. a bare network drive added as
a device-mapper mirror target for on-line volume cloning.
And you seriously think that that's going to be a persistent setup? 
One-shot stuff like that is almost never an issue unless your init 
system is absolutely brain-dead _and_ you need it back working as it was 
immediately (and a live-clone of a device doesn't need that if you're doing it right).



Crypto essentially has to be next, otherwise you leak
info about the storage stack.


I'm encrypting only the containers that require block-level encryption.
Others might have more effective filesystem-level encryption or even be
some TrueCrypt/whatever images.
Again, you're leaking information by doing so.  At a minimum, you're 
leaking info about where the data you consider sensitive is stored, and 
that's not counting volume names (exposed by LVM), container 
configuration (possibly exposed depending on how your container stack 
handles it), and other storage stack configuration info (exposed by the 
metadata of the various layers and possibly by files in /etc if you 
don't have your root filesystem encrypted).



Swapping LVM and software RAID ends up
giving you a setup which is difficult for most people to understand and
therefore is hard to reliably maintain.


It's more difficult, as you need to manually maintain two (or more) separate 
VGs with matching LVs inside. Harder, but more flexible.
And could also be trivially simplified by eliminating MD and using LVM's 
native support for DM-RAID, which provides essentially the same 
functionality because DM-RAID is largely just a DM frontend for MD.



Other init systems enforce things being this way because it maintains
people's sanity, not because they have significant difficulty doing
things differently (and in fact, it is _trivial_ to change the ordering
in some of them, OpenRC on Gentoo for example quite literally requires
exactly N-1 lines to change in each of N 

Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 16:09:50 +0100, Tomasz Pala wrote:

>> BCP for over a 
>> decade has been to put multipathing at the bottom, then crypto, then 
> software RAID, then LVM, and then whatever filesystem you're using. 
> 
> Really? Let's enumerate some caveats of this:
> 
> - crypto below software RAID means double-encryption (wasted CPU),
> 
> - RAID below LVM means you're stuck with the same RAID-profile for all
>   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>   system and RAID0 for various system caches (like ccache on software
>   builder machine) or transient LVM-level snapshots.
> 
> - RAID below filesystem means losing btrfs-RAID extra functionality,
>   like recovering data from different mirror when CRC mismatch happens,
> 
> - crypto below LVM means encrypting everything, including data that is
>   not sensitive - more CPU wasted,

And, what is much worse - encrypting everything using the same secret.
BIG show-stopper.

I would shred such a BCP as ineffective and insecure for both data
integrity and confidentiality.

> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>   space using MD write-mostly functionality.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 10:05:34 -0500, Austin S. Hemmelgarn wrote:

>> Instead, they should move their legs continuously and if the train is not
>> on the station yet, just climb back and retry.
> No, that's really not a good analogy given the fact that that check for 
> the presence of a train takes a normal person milliseconds while the 
> event being raced against (the train departing) takes minutes.  In the 

OMG... preventing races by "this would always take longer"? Seriously?

> You're already looping forever _waiting_ for the volume to appear.  How 

udev is waiting for events, not systemd. Nobody will do some crazy
cross-layered shortcuts to overcome others' laziness.

> is that any different from looping forever trying to _mount_ the volume 

Yes, because udev doesn't mount anything, ever. Not this binary, dude!

> instead given that failing to mount the volume is not going to damage 
> things. 

A failed premature attempt to mount prevents the system from booting even WHEN
the devices become ready - this is fatal. The system boots randomly under racy
conditions.

But hey, "the devices will always appear faster than the init's attempt
to do the mount"!

Have you ever had a hardware RAID controller? Never heard of
devices appearing after 5 minutes of warming up?

> The issue here is that systemd refuses to implement any method 
> of actually retrying things that fail during startup.

1. Such methods are trivial and I've already mentioned them a dozen times.
2. They should be implemented in btrfs-upstream, not systemd-upstream,
   but I personally would happily help with writing them here.
3. They require full-circle path of 'allow-degraded' to be passed
   through btrfs code.

>> mounting BEFORE volume is complete is FATAL - since no userspace daemon
>> would ever retrigger the mount and the system won't come up. Provide one
>> btrfsd volume manager and systemd could probably switch to using it.
> And here you've lost any respect I might have had for you.

Going personal? So thank you for the discussion and goodbye.

Please refrain from answering me, I'm not going to discuss this any
further with you.

> **YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Sorry dude, but I won't repeat all the alternatives for the 5th time.

You *all* refuse to step into ANY possible solution mentioned.
You *all* expect systemd to do ALL the work, just like other init
systems were forced to do, against good design principles.

Good luck having btrfs degraded mount under systemd.

> 
> This is one of the two biggest things I hate about systemd(the journal 
> is the other one for those who care).

The journal currently has *many* drawbacks, but this is not 'by design',
just 'appropriate code missing for now'. The same applies to btrfs,
doesn't it?

> You don't need some special daemon to set the time,

Ever heard of NTP?

> or to set the hostname,

FUD - no such daemon

> or to fetch account data,

FUD

> or even to track who's logged in

FUD

> As much as it may surprise the systemd developers, people got on just 
> fine handling setting the system time, setting the hostname, fetching 
> account info, tracking active users, and any number of myriad other 
> tasks before systemd decided they needed to have their own special daemon.
> 

Sure, in a myriad of scattered distro-specific files. The only
reason systemd stepped in for some of these is that nobody else could
introduce and force a Linux-wide consensus. And if anyone had succeeded,
there would be some Austins blaming them for 'turning the good old
trash heap into a coherent de facto standard.'

> In this particular case, you don't need a daemon because the kernel does 
> the state tracking. 

Sure, MD doesn't require a daemon and LVM doesn't either. But they
do provide some - I know, they are all wrong.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 21:44:23 -0700, Chris Murphy wrote:

> Btrfs is orthogonal to systemd's willingness to wait forever while
> making no progress. It doesn't matter what it is, it shouldn't wait
> forever.

It times out after 90 seconds (by default) and then it fails the mount
entirely.
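
(That 90 seconds is the device unit's job timeout; for what it's worth, it is
already tunable per mount from fstab - a sketch, the UUID is a placeholder:

    UUID=<volume-uuid>  /data  btrfs  defaults,x-systemd.device-timeout=30s  0 0

which only changes how long systemd waits for the device to show up, not what
happens once the mount itself fails.)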

> It occurs to me there are such systemd service units specifically for
> waiting, for example:
> 
> systemd-networkd-wait-online.service, systemd-networkd-wait-online -
> Wait for network to come online
> 
>  chrony-wait.service - Wait for chrony to synchronize system clock
> 
> NetworkManager has a version of this. I don't see why there can't be a
> wait for Btrfs to normally mount,

Because mounting a degraded btrfs without -o degraded won't WAIT for
anything - it just immediately returns failure.

> just simply try to mount, it fails, wait 10, try again, wait 10 try again.

For the last time:

No
Such
Logic
In
Systemd
CORE

Every wait/repeat is done using UNITS - as you already noticed
yourself. And these are plain, regular UNITS.

Is there anything that prevents YOU, Chris, from writing these UNITS for
btrfs?

I know what makes ME stop writing these units - it's the lack of feedback
from the btrfs.ko ioctl handler. Without this I am unable to write UNITS
handling fstab mount entries, because the logic would PROBABLY have to
be hardcoded inside systemd-fstab-generator.

And such logic MUST NOT be hardcoded - this MUST be user-configurable,
i.e. made on UNITS level.

You might argue that some distros' SysV units or Gentoo's OpenRC have
support for this, and if you want to change anything it is only a few
lines of shell code to be altered. But systemd-fstab-generator is a
compiled binary and so WON'T allow the behaviour to be user-configurable.

> And then fail the unit so we end up at a prompt.

This can also be easily done, just like the emergency shell spawns
when configured. If only btrfs could accept and keep information about
a volume being allowed for degraded mount.


OK, to be honest I _can_ write such rules now, keeping the
'allow-degraded' state somewhere else (in a file for example).

But since this is some non-standardized side channel, such code won't
be accepted in systemd upstream, especially because it requires the
current udev rule to be slightly changed.
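
(To illustrate the kind of change I mean - this is NOT an upstream rule, the
flag-file name is made up and the stock behaviour is only paraphrased here:
the shipped rules import the "btrfs ready" builtin and map ID_BTRFS_READY=0
to SYSTEMD_READY=0, so an admin-owned override could look roughly like:

    SUBSYSTEM=="block", ENV{ID_FS_TYPE}=="btrfs", ENV{ID_BTRFS_READY}=="0", TEST=="/etc/btrfs-allow-degraded", ENV{SYSTEMD_READY}="1"

with the side-channel file standing in for the 'allow-degraded' state that
btrfs itself refuses to keep.)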

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 14:00:53 -0500, Austin S. Hemmelgarn wrote:

> We already do so in the accepted standard manner.  If the mount fails 
> because of a missing device, you get a very specific message in the 
> kernel log about it, as is the case for most other common errors (for 
> uncommon ones you usually just get a generic open_ctree error).  This is 
> really the only option too, as the mount() syscall (which the mount 
> command calls) returns only 0 on success or -1 and an appropriate errno 
> value on failure, and we can't exactly go about creating a half dozen 
> new error numbers just for this (well, technically we could, but I very 
> much doubt that they would be accepted upstream, which defeats the purpose).

This is exactly why a separate communication channel, the ioctl, is
currently used. And I really don't understand why you fight against
expanding this ioctl's response.

> With what you're proposing for BTRFS however, _everything_ is a 
> complicated decision, namely:
> 1. Do you retry at all?  During boot, the answer should usually be yes, 
> but during normal system operation it should normally be no (because we 
> should be letting the user handle issues at that point).

This is exactly why I propose to introduce an ioctl in btrfs.ko that
accepts userspace-configured (per-volume policy) expectations.

> 2. How long should you wait before you retry?  There is no right answer 
> here that will work in all cases (I've seen systems which take multiple 
> minutes for devices to become available on boot), especially considering 
> those of us who would rather have things fail early.

btrfs-last-resort@.timer, by analogy with mdadm-last-resort@.timer

> 3. If the retry fails, do you retry again?  How many times before it 
> just outright fails?  This is going to be system specific policy.  On 
> systems where devices may take a while to come online, the answer is 
> probably yes and some reasonably large number, while on systems where 
> devices are known to reliably be online immediately, it makes no sense 
> to retry more than once or twice.

All of this is a systemd timer/service job.

> 4. If you are going to retry, should you try a degraded mount?  Again, 
> this is going to be system specific policy (regular users would probably 
> want this to be a yes, while people who care about data integrity over 
> availability would likely want it to be a no).

Just like above - user-configured in systemd timers/services easily.

> 5. Assuming you do retry with the degraded mount, how many times should 
> a normal mount fail before things go degraded?  This ties in with 3 and 
> has the same arguments about variability I gave there.

As above.

> 6. How many times do you try a degraded mount before just giving up? 
> Again, similar variability to 3.
> 7. Should each attempt try first a regular mount and then a degraded 
> one, or do you try just normal a couple times and then switch to 
> degraded, or even start out trying normal and then start alternating? 
> Any of those patterns has valid arguments both for and against it, so 
> this again needs to be user configurable policy.
> 
> Altogether, that's a total of 7 policy decisions that should be user 
> configurable. 

All of them are easy to implement if btrfs.ko could accept an
'allow-degraded' per-volume instruction and return 'try-degraded' in the
ioctl.

> Having a config file other than /etc/fstab for the mount 
> command should probably be avoided for sanity reasons (again, BTRFS is a 
> filesystem, not a volume manager), so they would all have to be handled 
> through mount options.  The kernel will additionally have to understand 
> that those options need to be ignored (things do try to mount 
> filesystems without calling a mount helper, most notably the kernel when 
> it mounts the root filesystem on boot if you're not using an initramfs). 
>   All in all, this type of thing gets out of hand _very_ fast.

You need to think about the two separately:
1. tracking STATE - this is remembering the 'allow-degraded' option for now,
2. configured POLICY - this is to be handled by the init system.
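
(To make the mdadm-last-resort@ analogy above concrete - a purely hypothetical
sketch, these units do not exist upstream and the names, timings and paths are
invented:

    # btrfs-last-resort@.timer
    [Unit]
    Description=Timeout waiting for all devices of btrfs volume %i
    [Timer]
    OnActiveSec=30s

    # btrfs-last-resort@.service
    [Unit]
    Description=Degraded fallback mount for btrfs volume %i
    [Service]
    Type=oneshot
    ExecStart=/bin/mount -o degraded /dev/disk/by-uuid/%i /sysroot

i.e. the POLICY - whether and when to fall back - lives in unit files the
admin can override, while the STATE stays with the filesystem/kernel.)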

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 08:42:32 -0500, Austin S. Hemmelgarn wrote:

>> Yes. They are stupid enough to fail miserably with any more complicated
>> setups, like stacking volume managers, crypto layer, network attached
>> storage etc.
> I think you mean any setup that isn't sensibly layered.

No, I mean any setup that wasn't considered by init system authors.
Your 'sensibly' is not sensible for me.

> BCP for over a 
> decade has been to put multipathing at the bottom, then crypto, then 
> software RAID, then LVM, and then whatever filesystem you're using. 

Really? Let's enumerate some caveats of this:

- crypto below software RAID means double-encryption (wasted CPU),

- RAID below LVM means you're stuck with the same RAID-profile for all
  the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
  system and RAID0 for various system caches (like ccache on software
  builder machine) or transient LVM-level snapshots.

- RAID below filesystem means losing btrfs-RAID extra functionality,
  like recovering data from different mirror when CRC mismatch happens,

- crypto below LVM means encrypting everything, including data that is
  not sensitive - more CPU wasted,

- RAID below LVM means no way to use SSD acceleration of part of the HDD
  space using MD write-mostly functionality.

What you present is only some sane default, which doesn't mean it covers
all the real-world cases.

My recent server is using:
- raw partitioning for base volumes,
- LVM,
- MD on top of some LVs (varying levels),
- partitioned SSD cache attached to specific VGs,
- crypto on top of selected LV/MD,
- btrfs RAID1 on top of non-MDed LVs.

> Multipathing has to be the bottom layer for a given node because it 
> interacts directly with hardware topology which gets obscured by the 
> other layers.

It is the bottom layer, but it might be attached to volumes at virtually
any place of the logical topology tree. E.g. a bare network drive added as
a device-mapper mirror target for on-line volume cloning.

> Crypto essentially has to be next, otherwise you leak
> info about the storage stack.

I'm encrypting only the containers that require block-level encryption.
Others might have more effective filesystem-level encryption or even be
some TrueCrypt/whatever images.

> Swapping LVM and software RAID ends up 
> giving you a setup which is difficult for most people to understand and 
> therefore is hard to reliably maintain.

It's more difficult, as you need to manually maintain two (or more) separate 
VGs with matching LVs inside. Harder, but more flexible.

> Other init systems enforce things being this way because it maintains 
> people's sanity, not because they have significant difficulty doing 
> things differently (and in fact, it is _trivial_ to change the ordering 
> in some of them, OpenRC on Gentoo for example quite literally requires 
> exactly N-1 lines to change in each of N files when re-ordering N 
> layers), provided each layer occurs exactly once for a given device and 
> the relative ordering is the same on all devices.  And you know what? 

The point is: maintaining all of this logic is NOT a job for the init system.
With systemd you need exactly N-N=0 lines of code to make this work.

The appropriate unit files are provided by MD and LVM upstream.
And they include a fallback mechanism for degrading volumes.

> Given my own experience with systemd, it has exactly the same constraint 
> on relative ordering.  I've tried to run split setups with LVM and 
> dm-crypt where one device had dm-crypt as the bottom layer and the other 
> had it as the top layer, and things locked up during boot on _every_ 
> generalized init system I tried.

Hard to tell without access to the failing system, but this MIGHT have been:

- old/missing/broken-by-distro-maintainers-who-know-better LVM rules,
- old/buggy systemd, possibly with broken/old cryptsetup rules.

>> It's quite obvious who's the culprit: every single remaining filesystem
>> manages to mount under systemd without problems. They just expose
>> information about their state.
> No, they don't (except ZFS).

They don't expose information (as there is none), but they DO mount.

> There is no 'state' to expose for anything but BTRFS (and ZFS)

Does ZFS expose its state or not?

> except possibly if the filesystem needs checked or 
> not.  You're conflating filesystems and volume management.

btrfs is a filesystem, device manager and volume manager.
I might add a DEVICE to a btrfs-thingy.
I might mount the same btrfs-thingy selecting a different VOLUME 
(subVOL=something_other)

> The alternative way of putting what you just said is:
> Every single remaining filesystem manages to mount under systemd without 
> problems, because it doesn't try to treat them as a block layer.

Or: every other volume manager exposes separate block devices.

Anyway - however we put this into words, it is btrfs that behaves differently.

>> The 'needless complication', as you named it, usually should be the 

Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-30 08:46, Tomasz Pala wrote:

On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote:


Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF
THIS_.  It's functionally no different than prefacing an attempt to send
a signal to a process by checking if the process exists, or trying to
see if some other process is using a file that might be locked by


Seriously, there is a race condition on train stations. People check if
the train has stopped and opened the door before they move their legs to
get in, but the train might be already gone - so this is pointless.

Instead, they should move their legs continuously and if the train is not on 
the station yet, just climb back and retry.
No, that's really not a good analogy given the fact that that check for 
the presence of a train takes a normal person milliseconds while the 
event being raced against (the train departing) takes minutes.  In the 
case being discussed, the check takes milliseconds and the event being 
raced against also takes milliseconds.  The scale here is drastically 
different.

See the difference? I hope now you know what the race condition is.
It is the condition where the CONSEQUENCES are fatal.
Yes, the consequences of the condition being discussed functionally are 
fatal (you completely fail to mount the volume), because systemd doesn't 
retry mounting the root filesystem, it just breaks, which is absolutely 
at odds with the whole 'just works' mentality I always hear from the 
systemd fanboys and developers.


You're already looping forever _waiting_ for the volume to appear.  How 
is that any different from looping forever trying to _mount_ the volume 
instead, given that failing to mount the volume is not going to damage 
things?  The issue here is that systemd refuses to implement any method 
of actually retrying things that fail during startup.

mounting BEFORE volume is complete is FATAL - since no userspace daemon
would ever retrigger the mount and the system won't come up. Provide one
btrfsd volume manager and systemd could probably switch to using it.

And here you've lost any respect I might have had for you.

**YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Period, end of story.


This is one of the two biggest things I hate about systemd (the journal 
is the other one for those who care).  You don't need some special 
daemon to set the time, or to set the hostname, or to fetch account 
data, or even to track who's logged in (though I understand that the 
last one is not systemd's fault originally).


As much as it may surprise the systemd developers, people got on just 
fine handling setting the system time, setting the hostname, fetching 
account info, tracking active users, and any number of myriad other 
tasks before systemd decided they needed to have their own special daemon.



In this particular case, you don't need a daemon because the kernel does 
the state tracking.  It only checks that state completely though _when 
you ask it to mount the filesystem_ because it requires doing 99% of the 
work of mounting the filesystem (quite literally, you're doing pretty 
much everything short of actually hooking things up in the VFS layer). 
We are not a case like MD where there's just a tiny bit of metadata to 
parse to check what the state is supposed to be.  Imagine if LVM 
required you to unconditionally activate all the LV's in a VG when you 
activate the VG and what logic would be required to validate the VG 
then, and you're pretty close to what's needed to check state for a 
BTRFS volume (translating LV's to chunks and the VG to the filesystem as 
a whole).  There is no point in trying to parse that data every time a 
new device shows up, it's a waste of time (at a minimum, you're almost 
doubling the amount of time it takes to mount a volume if you are doing 
this each time a device shows up), energy, and resources in general.


mounting AFTER volume is complete is FINE - and if the "pseudo-race" happens
and the volume disappears, then this was either some operator action, so the
umount SHOULD happen, or we are facing some MALFUNCTION, which is fatal
itself, not by being a "race condition".
Short of catastrophic failure, the _volume_ doesn't disappear, a 
component device does, and that is where the problem lies, especially 
given that the ioctl only tracks that each component device has been 
seen, not that all are present at the moment the ioctl is invoked.



Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote:

> Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF 
> THIS_.  It's functionally no different than prefacing an attempt to send 
> a signal to a process by checking if the process exists, or trying to 
> see if some other process is using a file that might be locked by 

Seriously, there is a race condition on train stations. People check if
the train has stopped and opened the door before they move their legs to
get in, but the train might be already gone - so this is pointless.

Instead, they should move their legs continuously and if the train is
not on the station yet, just climb back and retry.


See the difference? I hope now you know what the race condition is.
It is the condition where the CONSEQUENCES are fatal.


mounting BEFORE volume is complete is FATAL - since no userspace daemon
would ever retrigger the mount and the system won't come up. Provide one
btrfsd volume manager and systemd could probably switch to using it.

mounting AFTER volume is complete is FINE - and if the "pseudo-race" happens
and the volume disappears, then this was either some operator action, so the
umount SHOULD happen, or we are facing some MALFUNCTION, which is fatal
itself, not by being a "race condition".

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-30 Thread Austin S. Hemmelgarn

On 2018-01-29 16:54, waxhead wrote:



Austin S. Hemmelgarn wrote:

On 2018-01-29 12:58, Andrei Borzenkov wrote:

29.01.2018 14:24, Adam Borowski wrote:
...


So any event (the user's request) has already happened.  A rc 
system, of
which systemd is one, knows whether we reached the "want root 
filesystem" or
"want secondary filesystems" stage.  Once you're there, you can 
issue the

mount() call and let the kernel do the work.

It is a btrfs choice to not expose compound device as separate one 
(like

every other device manager does)


Btrfs is not a device manager, it's a filesystem.

it is a btrfs drawback that doesn't provide anything else except for this
IOCTL with its logic


How can it provide you with something it doesn't yet have?  If you 
want the
information, call mount().  And as others in this thread have 
mentioned,
what, pray tell, would you want to know "would a mount succeed?" for 
if you

don't want to mount?

it is a btrfs drawback that there is nothing to push assembling 
into "OK,

going degraded" state


The way to do so is to timeout, then retry with -o degraded.



That's a possible way to solve it. This likely requires support from
mount.btrfs (or btrfs.ko) to return a proper indication that the filesystem is
incomplete so the caller can decide whether to retry or to try a degraded 
mount.
We already do so in the accepted standard manner.  If the mount fails 
because of a missing device, you get a very specific message in the 
kernel log about it, as is the case for most other common errors (for 
uncommon ones you usually just get a generic open_ctree error).  This 
is really the only option too, as the mount() syscall (which the mount 
command calls) returns only 0 on success or -1 and an appropriate 
errno value on failure, and we can't exactly go about creating a half 
dozen new error numbers just for this (well, technically we could, but 
I very much doubt that they would be accepted upstream, which defeats 
the purpose).
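
(In practice that means anything wanting to know *why* a mount failed has to
pair the failed mount with the kernel log - a sketch, device and mountpoint
are placeholders:

    if ! mount -t btrfs /dev/sdb1 /mnt/data; then
        # mount(2) only reported -1 plus an errno; the real reason
        # (missing device, bad superblock, ...) is in the kernel log
        dmesg | grep -iE 'btrfs|open_ctree' | tail -n 5
    fi

which is the "accepted standard manner" described above.)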


Or maybe mount.btrfs should implement this logic internally. This would
really be the simplest way to make it acceptable to the other side by
not needing to accept anything :)
And would also be another layering violation which would require a 
proliferation of extra mount options to control the mount command 
itself and adjust the timeout handling.


This has been done before with mount.nfs, but for slightly different 
reasons (primarily to allow nested NFS mounts, since the local 
directory that the filesystem is being mounted on not being present is 
treated like a mount timeout), and it had near zero control.  It works 
there because they push the complicated policy decisions to userspace 
(namely, there is no support for retrying with different options or 
trying a different server).


I just felt like commenting a bit on this from a regular user's point of 
view.


Remember that at some point BTRFS will probably be the default 
filesystem for the average penguin.
BTRFS's big selling point is redundancy and a guarantee that whatever you 
write is the same as what you will read sometime later.


Many users will probably build their BTRFS system on a redundant array 
of storage devices. As long as there are sufficient (not necessarily 
all) storage devices present they expect their system to come up and 
work. If the system is not able to come up in a fully operative state it 
must at least be able to limp until the issue is fixed.


Starting an argument about what init system is the most sane or most 
shiny is not helping. The truth is that systemd is not going away 
sometime soon and one might as well try to become friends if nothing 
else for the sake of having things working which should be a common goal 
regardless of the religion.
FWIW, I don't care that it's systemd in this case, I care that people 
are arguing for the forced use of a coding anti-pattern that ends up 
being covered as bad practice in first year computer science courses 
(no, seriously, every professional programmer I've asked about this had 
time-of-check-time-of-use race conditions covered in one of their 
first-year CS classes) or the enforcement of an event-based model that 
really doesn't make any sense for this (OK, it makes a little sense for 
handling of devices reappearing, but systemd doesn't need to be involved 
in that beyond telling the kernel that the device reappeared, except 
that that's udev's job).


I personally think the degraded mount option is a mistake as this 
assumes that a lightly degraded system is not able to work which is false.
If the system can mount to some working state then it should mount 
regardless if it is fully operative or not. If the array is in a bad 
state you need to learn about it by issuing a command or something. The 
same goes for a MD array (and yes, I am aware of the block layer vs 
filesystem thing here).
The problem with this is that right now, it is not safe to run a BTRFS 
volume degraded and writable, but for an even remotely usable system 

Re: degraded permanent mount option

2018-01-29 Thread Chris Murphy
On Mon, Jan 29, 2018 at 1:54 AM, Tomasz Pala  wrote:
> On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:
>
>> systemd can't possibly need to know more information than a person
>> does in the exact same situation in order to do the right thing. No
>> human would wait 10 minutes, let alone literally the heat death of the
>> planet for "all devices have appeared" but systemd will. And it does
>
> We're already repeating - systemd waits for THE btrfs-compound-device,
> not ALL the block-devices. Just like it 'waits' for someone to plug a USB
> pendrive in.

Btrfs is orthogonal to systemd's willingness to wait forever while
making no progress. It doesn't matter what it is, it shouldn't wait
forever.

It occurs to me there are such systemd service units specifically for
waiting, for example:

systemd-networkd-wait-online.service, systemd-networkd-wait-online -
Wait for network to come online

 chrony-wait.service - Wait for chrony to synchronize system clock

NetworkManager has a version of this. I don't see why there can't be a
wait for Btrfs to mount normally: just try to mount, if it fails,
wait 10, try again, wait 10, try again. And then fail the unit so we
end up at a prompt. Or some people can optionally ask for a mount -o
degraded instead of a fail, and then if that also doesn't work, the
unit fails. Of course service units can have such conditionals rather
than waiting forever.
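
(The loop being described is a handful of lines of shell, wherever it ends up
living - a sketch, the UUID and mountpoint are placeholders:

    # try a normal mount a few times, then optionally fall back, then give up
    for try in 1 2 3; do
        mount -t btrfs UUID=<volume-uuid> /sysroot && exit 0
        sleep 10
    done
    # opt-in, admin-chosen policy:
    exec mount -t btrfs -o degraded UUID=<volume-uuid> /sysroot

the argument in this thread is only about who should run it and where the
"allow degraded" decision gets stored.)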





-- 
Chris Murphy


Re: degraded permanent mount option

2018-01-29 Thread waxhead



Austin S. Hemmelgarn wrote:

On 2018-01-29 12:58, Andrei Borzenkov wrote:

29.01.2018 14:24, Adam Borowski wrote:
...


So any event (the user's request) has already happened.  A rc system, of
which systemd is one, knows whether we reached the "want root 
filesystem" or
"want secondary filesystems" stage.  Once you're there, you can issue 
the

mount() call and let the kernel do the work.

It is a btrfs choice to not expose compound device as separate one 
(like

every other device manager does)


Btrfs is not a device manager, it's a filesystem.

it is a btrfs drawback that doesn't provide anything else except for this
IOCTL with its logic


How can it provide you with something it doesn't yet have?  If you 
want the

information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for 
if you

don't want to mount?

it is a btrfs drawback that there is nothing to push assembling into 
"OK,

going degraded" state


The way to do so is to timeout, then retry with -o degraded.



That's a possible way to solve it. This likely requires support from
mount.btrfs (or btrfs.ko) to return a proper indication that the filesystem is
incomplete so the caller can decide whether to retry or to try a degraded 
mount.
We already do so in the accepted standard manner.  If the mount fails 
because of a missing device, you get a very specific message in the 
kernel log about it, as is the case for most other common errors (for 
uncommon ones you usually just get a generic open_ctree error).  This is 
really the only option too, as the mount() syscall (which the mount 
command calls) returns only 0 on success or -1 and an appropriate errno 
value on failure, and we can't exactly go about creating a half dozen 
new error numbers just for this (well, technically we could, but I very 
much doubt that they would be accepted upstream, which defeats the 
purpose).


Or maybe mount.btrfs should implement this logic internally. This would
really be the simplest way to make it acceptable to the other side by
not needing to accept anything :)
And would also be another layering violation which would require a 
proliferation of extra mount options to control the mount command itself 
and adjust the timeout handling.


This has been done before with mount.nfs, but for slightly different 
reasons (primarily to allow nested NFS mounts, since the local directory 
that the filesystem is being mounted on not being present is treated 
like a mount timeout), and it had near zero control.  It works there 
because they push the complicated policy decisions to userspace (namely, 
there is no support for retrying with different options or trying a 
different server).


I just felt like commenting a bit on this from a regular user's point of 
view.


Remember that at some point BTRFS will probably be the default 
filesystem for the average penguin.
BTRFS's big selling point is redundancy and a guarantee that whatever you 
write is the same as what you will read sometime later.


Many users will probably build their BTRFS system on a redundant array 
of storage devices. As long as there are sufficient (not necessarily 
all) storage devices present they expect their system to come up and 
work. If the system is not able to come up in a fully operative state it 
must at least be able to limp until the issue is fixed.


Starting an argument about what init system is the most sane or most 
shiny is not helping. The truth is that systemd is not going away 
sometime soon and one might as well try to become friends if nothing 
else for the sake of having things working which should be a common goal 
regardless of the religion.


I personally think the degraded mount option is a mistake as this 
assumes that a lightly degraded system is not able to work which is false.
If the system can mount to some working state then it should mount 
regardless if it is fully operative or not. If the array is in a bad 
state you need to learn about it by issuing a command or something. The 
same goes for a MD array (and yes, I am aware of the block layer vs 
filesystem thing here).



Re: degraded permanent mount option

2018-01-29 Thread Austin S. Hemmelgarn

On 2018-01-29 12:58, Andrei Borzenkov wrote:

29.01.2018 14:24, Adam Borowski wrote:
...


So any event (the user's request) has already happened.  A rc system, of
which systemd is one, knows whether we reached the "want root filesystem" or
"want secondary filesystems" stage.  Once you're there, you can issue the
mount() call and let the kernel do the work.


It is a btrfs choice to not expose compound device as separate one (like
every other device manager does)


Btrfs is not a device manager, it's a filesystem.


it is a btrfs drawback that doesn't provide anything else except for this
IOCTL with its logic


How can it provide you with something it doesn't yet have?  If you want the
information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for if you
don't want to mount?


it is a btrfs drawback that there is nothing to push assembling into "OK,
going degraded" state


The way to do so is to timeout, then retry with -o degraded.



That's a possible way to solve it. This likely requires support from
mount.btrfs (or btrfs.ko) to return a proper indication that the filesystem is
incomplete so the caller can decide whether to retry or to try a degraded mount.
We already do so in the accepted standard manner.  If the mount fails 
because of a missing device, you get a very specific message in the 
kernel log about it, as is the case for most other common errors (for 
uncommon ones you usually just get a generic open_ctree error).  This is 
really the only option too, as the mount() syscall (which the mount 
command calls) returns only 0 on success or -1 and an appropriate errno 
value on failure, and we can't exactly go about creating a half dozen 
new error numbers just for this (well, technically we could, but I very 
much doubt that they would be accepted upstream, which defeats the purpose).


Or maybe mount.btrfs should implement this logic internally. This would
really be the simplest way to make it acceptable to the other side by
not needing to accept anything :)
And would also be another layering violation which would require a 
proliferation of extra mount options to control the mount command itself 
and adjust the timeout handling.


This has been done before with mount.nfs, but for slightly different 
reasons (primarily to allow nested NFS mounts, since the local directory 
that the filesystem is being mounted on not being present is treated 
like a mount timeout), and it had near zero control.  It works there 
because they push the complicated policy decisions to userspace (namely, 
there is no support for retrying with different options or trying a 
different server).


With what you're proposing for BTRFS however, _everything_ is a 
complicated decision, namely:
1. Do you retry at all?  During boot, the answer should usually be yes, 
but during normal system operation it should normally be no (because we 
should be letting the user handle issues at that point).
2. How long should you wait before you retry?  There is no right answer 
here that will work in all cases (I've seen systems which take multiple 
minutes for devices to become available on boot), especially considering 
those of us who would rather have things fail early.
3. If the retry fails, do you retry again?  How many times before it 
just outright fails?  This is going to be system specific policy.  On 
systems where devices may take a while to come online, the answer is 
probably yes and some reasonably large number, while on systems where 
devices are known to reliably be online immediately, it makes no sense 
to retry more than once or twice.
4. If you are going to retry, should you try a degraded mount?  Again, 
this is going to be system specific policy (regular users would probably 
want this to be a yes, while people who care about data integrity over 
availability would likely want it to be a no).
5. Assuming you do retry with the degraded mount, how many times should 
a normal mount fail before things go degraded?  This ties in with 3 and 
has the same arguments about variability I gave there.
6. How many times do you try a degraded mount before just giving up? 
Again, similar variability to 3.
7. Should each attempt try first a regular mount and then a degraded 
one, or do you try just normal a couple times and then switch to 
degraded, or even start out trying normal and then start alternating? 
Any of those patterns has valid arguments both for and against it, so 
this again needs to be user configurable policy.


Altogether, that's a total of 7 policy decisions that should be user 
configurable.  Having a config file other than /etc/fstab for the mount 
command should probably be avoided for sanity reasons (again, BTRFS is a 
filesystem, not a volume manager), so they would all have to be handled 
through mount options.  The kernel will additionally have to understand 
that those options need to be ignored (things do try to mount 
filesystems 

Re: degraded permanent mount option

2018-01-29 Thread Andrei Borzenkov
29.01.2018 14:24, Adam Borowski wrote:
...
> 
> So any event (the user's request) has already happened.  A rc system, of
> which systemd is one, knows whether we reached the "want root filesystem" or
> "want secondary filesystems" stage.  Once you're there, you can issue the
> mount() call and let the kernel do the work.
> 
>> It is a btrfs choice to not expose compound device as separate one (like
>> every other device manager does)
> 
> Btrfs is not a device manager, it's a filesystem.
> 
>> it is a btrfs drawback that doesn't provide anything else except for this
>> IOCTL with its logic
> 
> How can it provide you with something it doesn't yet have?  If you want the
> information, call mount().  And as others in this thread have mentioned,
> what, pray tell, would you want to know "would a mount succeed?" for if you
> don't want to mount?
> 
>> it is a btrfs drawback that there is nothing to push assembling into "OK,
>> going degraded" state
> 
> The way to do so is to timeout, then retry with -o degraded.
> 

That's a possible way to solve it. This likely requires support from
mount.btrfs (or btrfs.ko) to return a proper indication that the filesystem is
incomplete so the caller can decide whether to retry or to try a degraded mount.

Or maybe mount.btrfs should implement this logic internally. This would
really be the simplest way to make it acceptable to the other side, by
not needing the other side to accept anything :)


Re: degraded permanent mount option

2018-01-29 Thread Austin S. Hemmelgarn

On 2018-01-27 17:42, Tomasz Pala wrote:

On Sat, Jan 27, 2018 at 14:26:41 +0100, Adam Borowski wrote:


It's quite obvious who's the culprit: every single remaining rc system
manages to mount degraded btrfs without problems.  They just don't try to
outsmart the kernel.


Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.
I think you mean any setup that isn't sensibly layered.  BCP for over a 
decade has been to put multipathing at the bottom, then crypto, then 
software RAID, then LVM, and then whatever filesystem you're using. 
Multipathing has to be the bottom layer for a given node because it 
interacts directly with hardware topology which gets obscured by the 
other layers.  Crypto essentially has to be next, otherwise you leak 
info about the storage stack.  Swapping LVM and software RAID ends up 
giving you a setup which is difficult for most people to understand and 
therefore is hard to reliably maintain.


Other init systems enforce things being this way because it maintains 
people's sanity, not because they have significant difficulty doing 
things differently (and in fact, it is _trivial_ to change the ordering 
in some of them, OpenRC on Gentoo for example quite literally requires 
exactly N-1 lines to change in each of N files when re-ordering N 
layers), provided each layer occurs exactly once for a given device and 
the relative ordering is the same on all devices.  And you know what? 
Given my own experience with systemd, it has exactly the same constraint 
on relative ordering.  I've tried to run split setups with LVM and 
dm-crypt where one device had dm-crypt as the bottom layer and the other 
had it as the top layer, and things locked up during boot on _every_ 
generalized init system I tried.



Recently I've started mdadm on top of a bunch of LVM volumes, with others
using btrfs and others prepared for crypto. And you know what? systemd
assembled everything just fine.

So with argument just like yours:

It's quite obvious who's the culprit: every single remaining filesystem
manages to mount under systemd without problems. They just expose
information about their state.
No, they don't (except ZFS).  There is no 'state' to expose for anything 
but BTRFS (and ZFS) except possibly whether the filesystem needs to be checked or 
not.  You're conflating filesystems and volume management.


The alternative way of putting what you just said is:
Every single remaining filesystem manages to mount under systemd without 
problems, because it doesn't try to treat them as a block layer.



This is not a systemd issue, but apparently btrfs design choice to allow
using any single component device name also as volume name itself.


And what other user interface would you propose? The only alternative I see
is inventing a device manager (like you're implying below that btrfs does),
which would needlessly complicate the usual, single-device, case.


The 'needless complication', as you named it, usually should be the default
to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
No easy way to RAID the drive (there are device-mapper tricks, they are
just way more complicated). Even attaching an SSD cache is not trivial 
without preparations (for bcache these are absolutely necessary; much 
easier with LVM in place).
For a bog-standard client system, all of those _ARE_ overkill (and 
actually, so is BTRFS in many cases too, it's just that we're the only 
option for main-line filesystem-level snapshots at the moment).



If btrfs pretends to be device manager it should expose more states,


But it doesn't pretend to.


Why does mounting sda2 require sdb2 in my setup then?
First off, it shouldn't unless you're using a profile that doesn't 
tolerate any missing devices and have provided the `degraded` mount 
option.  It doesn't in your case because you are using systemd.


Second, BTRFS is not a volume manager, it's a filesystem with 
multi-device support.  The difference is that it's not a block layer, 
despite the fact that systemd is treating it as such.   Yes, BTRFS has 
failure modes that result in regular operations being refused based on 
what storage devices are present, but so does every single distributed 
filesystem in existence, and none of those are volume managers either.



especially "ready to be mounted, but not fully populated" (i.e.
"degraded mount possible"). Then systemd could _fallback_ after timing
out to degraded mount automatically according to some systemd-level
option.


You're assuming that btrfs somehow knows this itself.


"It's quite obvious who's the culprit: every single volume manager keeps
track of its component devices".


  Unlike the bogus assumption systemd makes, that by counting devices you 
can know whether a degraded or non-degraded mount is possible, it is in 
general not possible to know whether a mount attempt will succeed without 
actually trying.


There is a term 

Re: degraded permanent mount option

2018-01-29 Thread Austin S. Hemmelgarn

On 2018-01-29 06:24, Adam Borowski wrote:

On Mon, Jan 29, 2018 at 09:54:04AM +0100, Tomasz Pala wrote:

it is a btrfs drawback that doesn't provide anything else except for this
IOCTL with its logic


How can it provide you with something it doesn't yet have?  If you want the
information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for if you
don't want to mount?
And more importantly, WHY THE HELL DO YOU _WANT_ A TOCTOU RACE CONDITION 
INVOLVED?


Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF 
THIS_.  It's functionally no different than prefacing an attempt to send 
a signal to a process by checking if the process exists, or trying to 
see if some other process is using a file that might be locked by 
scanning /proc instead of just trying to lock the file yourself, or 
scheduling something to check if a RAID array is out of sync before even 
trying to start a scrub.  No sane programmer would do any of that 
(although a lot of rather poorly educated sysadmins do the third), 
because _IT'S NOT RELIABLE_.  The process you're trying to send a signal 
to might disappear after checking for it (or worse, might be a different 
process), the file might get locked by something with a low PID while 
you're busy scanning /proc, or the array could completely die right 
after you check if it's OK.
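
For what it's worth, the difference can be shown in a few lines.  This is a 
sketch only; is_ready() is a placeholder stub standing in for any "would a 
mount succeed?" pre-check (such as the readiness ioctl), not a real API:

/* Sketch of the race: check-then-mount versus just mounting. */
#include <errno.h>
#include <sys/mount.h>

static int is_ready(const char *dev)
{
    (void)dev;
    return 1;   /* placeholder: imagine this asks the kernel */
}

/* Racy: the set of present devices can change between the check and the
 * mount, so the check proves nothing about what mount() will do. */
int mount_checked(const char *dev, const char *dir)
{
    if (!is_ready(dev))
        return -EAGAIN;
    return mount(dev, dir, "btrfs", 0, "") ? -errno : 0;
}

/* Not racy: attempt the operation and look at the result. */
int mount_direct(const char *dev, const char *dir)
{
    return mount(dev, dir, "btrfs", 0, "") ? -errno : 0;
}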



Re: degraded permanent mount option

2018-01-29 Thread Adam Borowski
On Mon, Jan 29, 2018 at 09:54:04AM +0100, Tomasz Pala wrote:
> On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:
> 
> > systemd can't possibly need to know more information than a person
> > does in the exact same situation in order to do the right thing. No
> > human would wait 10 minutes, let alone literally the heat death of the
> > planet for "all devices have appeared" but systemd will. And it does
> 
> We're already repeating - systemd waits for THE btrfs-compound-device,
> not ALL the block-devices.

Because there is NO compound device.  You can't wait for something that
doesn't exist.  The user wants a filesystem, not some mythical compound
device, and as knowing whether we have enough requires doing most of the mount
work, we might as well complete the mount instead of backing off and
reporting, so you can then racily repeat the work.

> Just like it 'waits' for someone to plug USB pendrive in.

Plugging in a USB pendrive is an event -- there's no user request.  On the
other hand, we already know we want to mount -- the user requested so either
by booting ("please mount everything in fstab") or by an explicit mount
command.

So any event (the user's request) has already happened.  A rc system, of
which systemd is one, knows whether we reached the "want root filesystem" or
"want secondary filesystems" stage.  Once you're there, you can issue the
mount() call and let the kernel do the work.

> It is a btrfs choice to not expose compound device as separate one (like
> every other device manager does)

Btrfs is not a device manager, it's a filesystem.

> it is a btrfs drawback that doesn't provide anything else except for this
> IOCTL with its logic

How can it provide you with something it doesn't yet have?  If you want the
information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for if you
don't want to mount?

> it is a btrfs drawback that there is nothing to push assembling into "OK,
> going degraded" state

The way to do so is to timeout, then retry with -o degraded.

> I've told already - pretend the /dev/sda1 device doesn't
> exist until assembled.

It does... you're confusing a block device (a _part_ of the filesystem) with
the filesystem itself.  MD takes a bunch of such block devices and provides
you with another block device; btrfs takes a bunch of block devices and
provides you with a filesystem.

> If this overlapping usage was designed with 'easier mounting' on mind,
> this is simply bad design.

No other rc system but systemd has a problem.

> > that by its own choice, its own policy. That's the complaint. It's
> > choosing to do something a person wouldn't do, given identical
> > available information.
> 
> You are expecting systemd to mix in functions of kernel and udev.
> There is NO concept of 'assembled stuff' in systemd AT ALL.
> There is NO concept of 'waiting' in udev AT ALL.
> If you want to do some crazy interlayer shortcuts just implement btrfsd.

No, I don't want systemd, or any userspace daemon, to try knowing kernel
stuff better than the kernel.  Just call mount(), and that's it.

Let me explain via a car analogy.  There is a flood that covers many roads,
the phone network is unreliable, and you want to drive to help relatives at
place X.

You can ask someone who was there yesterday how to get there (ie, ask a
device; it can tell you "when I was a part of the filesystem last time, its
layout was such and such").  Usually, this is reliable (you don't reshape an
array every day), but if there's flooding (you're contemplating a degraded
mount), yesterday's data being stale shouldn't be a surprise.

So, you climb into the car and drive.  It's possible that the road you
wanted to take has changed, it's also possible some other roads you didn't
even know about are now driveable.  Once you have X in sight, do you retrace
all the way home, tell your mom (systemd) who's worrying but has no way to
help, that the road is clear, and only then get to X?  Or do you stop,
search for a spot with working phone coverage to phone mom asking for
advice, despite her having no information you don't have?  The reasonable
thing to do (and what all other rc systems do) is to get to X, help the
relatives, and only then tell mom that all is ok.

But with mom wanting to control everything, things can get worse.  If you,
without mom's prior knowledge (the user typed "mount" by hand) manage to
find a side road to X, she shouldn't tell you "I hear you telling me you're
at X -- as the road is flooded, that's impossible, so get home this instant"
(ie, systemd thinking the filesystem is not complete, despite it being
already mounted).

> > There's nothing the kernel is doing that's
> > telling systemd to wait for goddamn ever.
> 
> There's nothing the kernel is doing that's
> telling udev there IS a degraded device assembled to be used.

Because there is no device.

> There's nothing a userspace-thing is doing that's
> 

Re: degraded permanent mount option

2018-01-29 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:

> systemd can't possibly need to know more information than a person
> does in the exact same situation in order to do the right thing. No
> human would wait 10 minutes, let alone literally the heat death of the
> planet for "all devices have appeared" but systemd will. And it does

We're already repeating - systemd waits for THE btrfs-compound-device,
not ALL the block-devices. Just like it 'waits' for someone to plug USB
pendrive in.

It is a btrfs choice to not expose the compound device as a separate one (like
every other device manager does), it is a btrfs drawback that it doesn't
provide anything else except for this IOCTL with its logic, it is a
btrfs drawback that there is nothing to push assembling into the "OK, going
degraded" state, it is a btrfs drawback that there are no states...

I've told you already - pretend the /dev/sda1 device doesn't
exist until assembled. If this overlapping usage was designed with
'easier mounting' in mind, it is simply bad design.

> that by its own choice, its own policy. That's the complaint. It's
> choosing to do something a person wouldn't do, given identical
> available information.

You are expecting systemd to mix in functions of kernel and udev.
There is NO concept of 'assembled stuff' in systemd AT ALL.
There is NO concept of 'waiting' in udev AT ALL.
If you want to do some crazy interlayer shortcuts just implement btrfsd.

> There's nothing the kernel is doing that's
> telling systemd to wait for goddamn ever.

There's nothing the kernel is doing that's
telling udev there IS a degraded device assembled to be used.

There's nothing a userspace-thing is doing that's
telling udev to mark degraded device as mountable.

There is NO DEVICE to be mounted, so systemd doesn't mount it.

The difference is:

YOU think that the sda1 device is ephemeral, as it's covered by the sda1 btrfs device 
that COULD BE mounted.

I think that there is a real sda1 device, following the Linux rules of system
registration, which CAN be overtaken by an ephemeral btrfs-compound device.
Can I mount that thing above the sda1 block device? ONLY when it's properly
registered in the system.

Does the btrfs-compound-device register in the system? - Yes, but only when 
fully populated.

Just don't expect people to break their code with broken designs just
to overcome your own limitations. If you want systemd to mount a degraded
btrfs volume, just MAKE IT REGISTER in the system.

How can btrfs register in the system being degraded? Either by some
userspace daemon handling btrfs volumes states (which are missing from
the kernel), or by some IOCTLs altering in-kernel states.


So for the last time: nobody will break his own code to patch missing
code from another (actively maintained) subsystem.

If you expect degraded mounts, there are 2 choices:

1. implement degraded STATE _some_where_ - udev would handle falling
   back to degraded mount after specified timeout,

2. change this IOCTL to _always_ return 1 - udev would register any
   btrfs device, but you will get random behaviour of mounting
   degraded/populated. But you should expect that since there is no
   concept of any state below.


Actually, this is ridiculous - you expect the degradation to be handled
in some 3rd-party software?! In an init system? With the only thing you've got
being the 'degraded' mount option?!
What next - moving MD and LVM logic into systemd?

This is not systemd's job - there are
btrfs-specific kernel cmdline options to be parsed (allowing degraded
volumes), and there is tracking of volume health required.
Yes, a device manager needs to track its components, a RAID controller
needs to track the minimum required redundancy. It's not only about
mounting. But doing the degraded mounting is easy, only this one
particular ioctl needs to be fixed:

1. counted devices not enough even for degraded => not_ready

2. counted devices enough for degraded (but not all) => ok_degraded

3. counted devices==all => ok


If btrfs DISTINGUISHES these two states, systemd would be able to use them.
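
Nothing like this exists in btrfs today, but as a sketch, the distinction 
being asked for could look roughly like the following.  The caveat raised 
elsewhere in the thread still applies: a real implementation could not just 
count devices, it would have to check that at least one copy of every chunk 
is present.

/* Hypothetical, not an existing kernel interface: the three states being
 * asked for, expressed as an extended readiness result. */
enum btrfs_assembly_state {
    BTRFS_ASSEMBLY_NOT_READY   = 0, /* too few devices even for degraded */
    BTRFS_ASSEMBLY_OK_DEGRADED = 1, /* enough devices for -o degraded */
    BTRFS_ASSEMBLY_OK          = 2, /* all devices present */
};

/* Naive device-counting version; a real check would have to walk the
 * chunk tree instead of trusting counts. */
static enum btrfs_assembly_state
assembly_state(unsigned present, unsigned total, unsigned min_degraded)
{
    if (present == total)
        return BTRFS_ASSEMBLY_OK;
    if (present >= min_degraded)
        return BTRFS_ASSEMBLY_OK_DEGRADED;
    return BTRFS_ASSEMBLY_NOT_READY;
}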


You might ask why it is important for the state to be kept inside some
btrfs-related piece, like the kernel or a btrfsd, while a systemd timer
could do the same and 'just mount degraded'. The answer is simple: a
systemd timer is just a sane default CONFIGURATION, one that can be EASILY
changed by the system administrator. But somewhere, sometime, someone will
have a NEED for a totally different set of rules for handling degraded
volumes, just like MD or LVM allows. It would be totally irresponsible
to hardcode any mount-degraded rule inside systemd itself.

That is exactly why this must go through udev - udev is responsible
for handling devices in the Linux world. How can I register a btrfs device
in udev, since it's overlapping the block device? I can't - the ioctl
is one-way and doesn't accept any userspace feedback.

-- 
Tomasz Pala 

Re: degraded permanent mount option

2018-01-28 Thread Chris Murphy
On Sun, Jan 28, 2018 at 3:39 PM, Tomasz Pala  wrote:
> On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote:
>
>>> Tell me please, if you mount -o degraded btrfs - what would
>>> BTRFS_IOC_DEVICES_READY return?
>>
>> case BTRFS_IOC_DEVICES_READY:
>> ret = btrfs_scan_one_device(vol->name, FMODE_READ,
>> &btrfs_fs_type, &fs_devices);
>> if (ret)
>> break;
>> ret = !(fs_devices->num_devices == fs_devices->total_devices);
>> break;
>>
>>
>> All it cares about is whether the number of devices found is the same
>> as the number of devices any of that volume's supers claim make up
>> that volume. That's it.
>>
>>> This is not "outsmarting" nor "knowing better", on the contrary, this is 
>>> "FOLLOWING the
>>> kernel-returned data". The umounting case is simply a bug in btrfs.ko
>>> that should change to READY state *if* someone has tried and apparently
>>> succeeded mounting the not-ready volume.
>>
>> Nope. That is not what the ioctl does.
>
> So who is to blame for creating utterly useless code? Userspace
> shouldn't depend on some stats (as the number of devices is nothing more
> than that), but on overall _availability_.

There's quite a lot missing. Btrfs doesn't even really have a degraded
state concept. It has a degraded mount option, but this is not a
state. e.g. if you have a normally mounted volume, and a drive dies or
vanishes, there's no way for the user to know the array is degraded.
They can only infer that it's degraded by a.) metric f tons of
read/write errors to a bdev b.) the application layer isn't pissed off
about it; or in lieu of a. they see via 'btrfs fi show' that a device
is missing. Likewise, when a device is failing to read and write,
Btrfs doesn't consider it faulty and boot it out of the array, it just
keeps on trying, the spew of which can cause disk contention if those 
errors are written to a log on spinning rust.

Anyway, the fact many state features are missing doesn't mean the
necessary information to do the right thing is missing.



> I do not care if there are 2, 5 or 100 devices. I do care if there is
> ENOUGH devices to run regular (including N-way mirroring and hot spares)
> and if not - if there is ENOUGH devices to run degraded. Having ALL the
> devices is just the edge case.

systemd can't possibly need to know more information than a person
does in the exact same situation in order to do the right thing. No
human would wait 10 minutes, let alone literally the heat death of the
planet for "all devices have appeared" but systemd will. And it does
that by its own choice, its own policy. That's the complaint. It's
choosing to do something a person wouldn't do, given identical
available information. There's nothing the kernel is doing that's
telling systemd to wait for goddamn ever.



-- 
Chris Murphy


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 13:28:55 -0700, Chris Murphy wrote:

>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> "not ready" is rather vague terminology but yes that's how systemd
> ends up using the ioctl this rule depends on, even though the rule has
> nothing to do with readiness per se. If all devices for a volume

If you avoid using THIS ioctl, then you'd have nothing to fire the rule
on at all. One way or another, it is btrfs that must emit _some_ event or
be polled _somehow_.

> aren't found, we can correctly conclude a normal mount attempt *will*
> fail. But that's all we can conclude. What I can't parse in all of
> this is if the udev rule is a one shot, if the ioctl is a one shot, if
> something is constantly waiting for "not all devices are found" to
> transition to "all devices are found" or what. I can't actually parse

It's not one shot. This works like this:

sda1 appears -> udev catches event -> udev detects btrfs and IOCTLs => not ready
sdb1 appears -> udev catches event -> udev detects btrfs and IOCTLs => ready

The end.

If there were some other device appearing after assembly, like /dev/md1,
or if there were some event generated by btrfs code itself, udev could
catch this and follow. Now, if you unplug sdb1, there's no such event at
all.

Since this IOCTL is the *only* thing that udev can rely on, it cannot be
removed from the logic. So even if you create a timer to force assembly,
you must do it by influencing the IOCTL response.

Or creating some other IOCTL for this purpose, or creating some
userspace daemon or whatever.

> the two critical lines in this rule. I
> 
> # let the kernel know about this btrfs filesystem, and check if it is complete
> IMPORT{builtin}="btrfs ready $devnode"

This sends IOCTL.

> # mark the device as not ready to be used by the system
> ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
  ^^this is IOCTL response being checked

and SYSTEMD_READY set to 0 prevents systemd from mounting.
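
To make that concrete, what the "btrfs ready $devnode" builtin does is 
roughly the following (a simplified sketch using the ioctl from 
linux/btrfs.h, error handling trimmed).  Note that the same call both 
registers the device with the kernel and asks whether the filesystem it 
belongs to is complete:

/* Roughly what udev's "btrfs ready $devnode" builtin does: register the
 * device with the kernel and ask whether all devices of the filesystem it
 * belongs to have been seen. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>

int main(int argc, char **argv)
{
    struct btrfs_ioctl_vol_args args = { 0 };
    int fd, ret;

    if (argc != 2)
        return 2;

    fd = open("/dev/btrfs-control", O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return 1;

    strncpy(args.name, argv[1], sizeof(args.name) - 1);
    ret = ioctl(fd, BTRFS_IOC_DEVICES_READY, &args);
    close(fd);
    if (ret < 0)
        return 1;

    /* 0 means "all devices seen"; anything else means "not yet". */
    printf("ID_BTRFS_READY=%d\n", ret == 0);
    return 0;
}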

> I think the Btrfs ioctl is a one shot. Either they are all present or not.

The rules are called once per (block) device.
So when btrfs has scanned all the devices and returns READY, the device would
finally be systemd-ready. It is trivial to re-trigger the udev rule (udevadm
trigger), but there is no way to force btrfs to return READY after any timeout.

> The waiting is a policy by systemd udev rule near as I can tell.

There is no problem in waiting or re-triggering. This can be done in ~10
lines of rules. The problem is that the IOCTL won't EVER return READY until
ALL the components are present.

It's as simple as that: there MUST be some mechanism at the device-manager
level that tells whether a compound device is mountable, degraded or not;
upper layers (systemd-mount) do not care about degradation; handling
redundancy/mirrors/chunks/stripes/spares is not their job.
It (systemd) can (easily!) handle an expiration timer to push a pending
compound device to be force-assembled, but currently there is no way to push.


If the IOCTL were extended to return TRYING_DEGRADED (when
instructed to do so after an expired timeout), systemd could handle
additional per-filesystem fstab options, like x-systemd.allow-degraded.

Then it would be possible to have a best-effort policy for rootfs (to make
the machine boot), and a stricter one for crucial data (do not mount it
when there is no redundancy, wait for operator intervention).

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote:

>> Tell me please, if you mount -o degraded btrfs - what would
>> BTRFS_IOC_DEVICES_READY return?
> 
> case BTRFS_IOC_DEVICES_READY:
> ret = btrfs_scan_one_device(vol->name, FMODE_READ,
> &btrfs_fs_type, &fs_devices);
> if (ret)
> break;
> ret = !(fs_devices->num_devices == fs_devices->total_devices);
> break;
> 
> 
> All it cares about is whether the number of devices found is the same
> as the number of devices any of that volume's supers claim make up
> that volume. That's it.
>
>> This is not "outsmarting" nor "knowing better", on the contrary, this is 
>> "FOLLOWING the
>> kernel-returned data". The umounting case is simply a bug in btrfs.ko
>> that should change to READY state *if* someone has tried and apparently
>> succeeded mounting the not-ready volume.
> 
> Nope. That is not what the ioctl does.

So who is to blame for creating utterly useless code? Userspace
shouldn't depend on some stats (as the number of devices is nothing more
than that), but on overall _availability_.

I do not care if there are 2, 5 or 100 devices. I do care if there is
ENOUGH devices to run regular (including N-way mirroring and hot spares)
and if not - if there is ENOUGH devices to run degraded. Having ALL the
devices is just the edge case.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-28 Thread Chris Murphy
On Sun, Jan 28, 2018 at 1:06 AM, Andrei Borzenkov  wrote:
> 27.01.2018 18:22, Duncan wrote:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>>
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
 On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:

>> I just tested to boot with a single drive (raid1 degraded), even
>> with degraded option in fstab and grub, unable to boot !  The boot
>> process stop on initramfs.
>>
>> Is there a solution to boot with systemd and degraded array ?
>
> No. It is finger pointing. Both btrfs and systemd developers say
> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>>
>> No kidding.
>>
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
>
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

"not ready" is rather vague terminology but yes that's how systemd
ends up using the ioctl this rule depends on, even though the rule has
nothing to do with readiness per se. If all devices for a volume
aren't found, we can correctly conclude a normal mount attempt *will*
fail. But that's all we can conclude. What I can't parse in all of
this is if the udev rule is a one shot, if the ioctl is a one shot, if
something is constantly waiting for "not all devices are found" to
transition to "all devices are found" or what. I can't actually parse
the two critical lines in this rule.


$ cat /usr/lib/udev/rules.d/64-btrfs.rules
# do not edit this file, it will be overwritten on update

SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"

# let the kernel know about this btrfs filesystem, and check if it is complete
IMPORT{builtin}="btrfs ready $devnode"

# mark the device as not ready to be used by the system
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"

LABEL="btrfs_end"




And udev builtin btrfs, which I guess the above rule is referring to:

https://github.com/systemd/systemd/blob/master/src/udev/udev-builtin-btrfs.c

I think the Btrfs ioctl is a one shot. Either they are all present or
not. The waiting is a policy of the systemd udev rule, as near as I can tell.

-- 
Chris Murphy


Re: degraded permanent mount option

2018-01-28 Thread Chris Murphy
On Sat, Jan 27, 2018 at 5:39 PM, Tomasz Pala  wrote:
> On Sat, Jan 27, 2018 at 15:22:38 +, Duncan wrote:
>
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>>
>> No kidding.
>>
>> All systemd has to do is leave the mount alone that the kernel has
>> already done, instead of insisting it knows what's going on better than
>> the kernel does, and immediately umounting it.
>
> Tell me please, if you mount -o degraded btrfs - what would
> BTRFS_IOC_DEVICES_READY return?


case BTRFS_IOC_DEVICES_READY:
ret = btrfs_scan_one_device(vol->name, FMODE_READ,
&btrfs_fs_type, &fs_devices);
if (ret)
break;
ret = !(fs_devices->num_devices == fs_devices->total_devices);
break;


All it cares about is whether the number of devices found is the same
as the number of devices any of that volume's supers claim make up
that volume. That's it.


>
> This is not "outsmarting" nor "knowing better", on the contrary, this is 
> "FOLLOWING the
> kernel-returned data". The umounting case is simply a bug in btrfs.ko
> that should change to READY state *if* someone has tried and apparently
> succeeded mounting the not-ready volume.


Nope. That is not what the ioctl does.


-- 
Chris Murphy


Re: degraded permanent mount option

2018-01-28 Thread Andrei Borzenkov
28.01.2018 18:57, Duncan wrote:
> Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:
> 
>> 27.01.2018 18:22, Duncan wrote:
>>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>>>
 On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>
>>> I just tested to boot with a single drive (raid1 degraded), even
>>> with degraded option in fstab and grub, unable to boot !  The boot
>>> process stop on initramfs.
>>>
>>> Is there a solution to boot with systemd and degraded array ?
>>
>> No. It is finger pointing. Both btrfs and systemd developers say
>> everything is fine from their point of view.

 It's quite obvious who's the culprit: every single remaining rc system
 manages to mount degraded btrfs without problems.  They just don't try
 to outsmart the kernel.
>>>
>>> No kidding.
>>>
>>> All systemd has to do is leave the mount alone that the kernel has
>>> already done,
>>
>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> As Tomasz indicates, I'm talking about manual mounting (after the initr* 
> drops to a maintenance prompt if it's root being mounted, or on manual 
> mount later if it's an optional mount) here.  The kernel accepts the 
> degraded mount and it's mounted for a fraction of a second, but systemd 
> actually undoes the successful work of the kernel to mount it, so by the 
> time the prompt returns and a user can check, the filesystem is unmounted 
> again, with the only indication that it was mounted at all being the log.
> 

This is fixed in current systemd (actually for quite some time). If you
still observe it with more or less recent systemd, report a bug.

> He says that's because the kernel still says it's not ready, but that's 
> for /normal/ mounting.  The kernel accepted the degraded mount and 
> actually mounted the filesystem, but systemd undoes that.
> 



Re: degraded permanent mount option

2018-01-28 Thread Duncan
Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:

> 27.01.2018 18:22, Duncan wrote:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>> 
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
 On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:

>> I just tested to boot with a single drive (raid1 degraded), even
>> with degraded option in fstab and grub, unable to boot !  The boot
>> process stop on initramfs.
>>
>> Is there a solution to boot with systemd and degraded array ?
>
> No. It is finger pointing. Both btrfs and systemd developers say
> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>> 
>> No kidding.
>> 
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

As Tomasz indicates, I'm talking about manual mounting (after the initr* 
drops to a maintenance prompt if it's root being mounted, or on manual 
mount later if it's an optional mount) here.  The kernel accepts the 
degraded mount and it's mounted for a fraction of a second, but systemd 
actually undoes the successful work of the kernel to mount it, so by the 
time the prompt returns and a user can check, the filesystem is unmounted 
again, with the only indication that it was mounted at all being the log.

He says that's because the kernel still says it's not ready, but that's 
for /normal/ mounting.  The kernel accepted the degraded mount and 
actually mounted the filesystem, but systemd undoes that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 01:00:16 +0100, Tomasz Pala wrote:

> It can't mount degraded, because the "missing" device might go online a
> few seconds ago.

s/ago/after/

>> The central problem is the lack of a timer and time out.
> 
> You got mdadm-last-resort@.timer/service above, if btrfs doesn't lack
> anything, as you all state here, this should be easy to make this work.
> Go ahead please.

And just to make it even easier - this is how you can react to events
inside udev (this is to eliminate the btrfs-scan tool being required as it sux):

https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

One could even try to trick systemd by SETTING (note the single '=')

ENV{ID_BTRFS_READY}="0"

- which would probably break as soon as btrfs.ko emits next 'changed' event.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 11:06:06 +0300, Andrei Borzenkov wrote:

>> All systemd has to do is leave the mount alone that the kernel has 
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

And even after a successful manual mount (with -o degraded) btrfs.ko
insists that the device is not ready.

That schizophrenia makes systemd umount it immediately, because this
is the only proper way to handle missing devices (only the failed ones
should go r/o). And there is really nothing systemd can do about this
until the underlying code stops lying, unless we're going back to the 1990s
when devices were never unplugged or detached during system uptime. But even
floppies could be ejected without a system reboot.

BTRFS is no exception here - when it is marked as 'not available',
don't expect it to be kept in use. Just fix the code to match reality.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-28 Thread Andrei Borzenkov
27.01.2018 18:22, Duncan wrote:
> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
> 
>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>
> I just tested to boot with a single drive (raid1 degraded), even
> with degraded option in fstab and grub, unable to boot !  The boot
> process stop on initramfs.
>
> Is there a solution to boot with systemd and degraded array ?

 No. It is finger pointing. Both btrfs and systemd developers say
 everything is fine from their point of view.
>>
>> It's quite obvious who's the culprit: every single remaining rc system
>> manages to mount degraded btrfs without problems.  They just don't try
>> to outsmart the kernel.
> 
> No kidding.
> 
> All systemd has to do is leave the mount alone that the kernel has 
> already done,

Are you sure you really understand the problem? No mount happens because
systemd waits for indication that it can mount and it never gets this
indication.


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 15:22:38 +, Duncan wrote:

>> manages to mount degraded btrfs without problems.  They just don't try
>> to outsmart the kernel.
> 
> No kidding.
> 
> All systemd has to do is leave the mount alone that the kernel has 
> already done, instead of insisting it knows what's going on better than 
> the kernel does, and immediately umounting it.

Tell me please, if you mount -o degraded btrfs - what would
BTRFS_IOC_DEVICES_READY return?

This is not "outsmarting" nor "knowing better", on the contrary, this is 
"FOLLOWING the
kernel-returned data". The umounting case is simply a bug in btrfs.ko
that should change to READY state *if* someone has tried and apparently
succeeded mounting the not-ready volume.

Otherwise - how should any system part behave when you detach some drive? Insist
that "the kernel has already mounted it" and ignore kernel screaming
"the device is (not yet there/gone)"?


Just update the internal state after successful mount and this
particular problem is gone. Unless there is some race condition and the
state should be changed before the mount is announced to the userspace.
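
One possible shape of such a fix, sketched against the handler quoted 
earlier in the thread: also report ready when the fs_devices set has already 
been opened (i.e. somebody managed to mount it, possibly with -o degraded). 
This fragment is untested, ignores locking, and whether fs_devices->opened 
is the right thing to test is itself an assumption:

case BTRFS_IOC_DEVICES_READY:
    ret = btrfs_scan_one_device(vol->name, FMODE_READ,
                                &btrfs_fs_type, &fs_devices);
    if (ret)
        break;
    /* Sketch: treat an already-opened (i.e. mounted, possibly with
     * -o degraded) fs_devices as ready too, instead of insisting on
     * every device being present. */
    ret = !(fs_devices->opened ||
            fs_devices->num_devices == fs_devices->total_devices);
    break;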

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 14:12:01 -0700, Chris Murphy wrote:

> doesn't count devices itself. The Btrfs systemd udev rule defers to
> Btrfs kernel code by using BTRFS_IOC_DEVICES_READY. And it's totally
> binary. Either they are all ready, in which case it exits 0, and if
> they aren't all ready it exits 1.
> 
> But yes, mounting whether degraded or not is sufficiently complicated
> that you just have to try it. I don't get the point of wanting to know
> whether it's possible without trying. Why would this information be

If you want to blind-try it, just tell the btrfs.ko to flip the IOCTL bit.

No shortcuts please, do it legit, where it belongs.

>> Ie, the thing systemd can safely do, is to stop trying to rule everything,
>> and refrain from telling the user whether he can mount something or not.
> 
> Right. Open question is whether the timer and timeout can be
> implemented in the systemd world and I don't see why not, I certainly

It can. The reasons why it's not already there follow:

1. no one has created udev rules and systemd units for btrfs-progs yet (that
   is trivial),
2. btrfs is not degraded-safe yet (the rules would have to check that the
   filesystem won't get stuck in read-only mode, for example; this is NOT
   trivial),
3. there is no way to tell the kernel that we want degraded (probably
   some new IOCTL) - this is the path that the timer would use to trigger the
   udev event releasing the systemd mount.
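
To make point 3 concrete, the kind of addition being talked about might look 
like this.  It is purely hypothetical - no such ioctl exists, and both the 
name and the number are made up:

/* Hypothetical only: let userspace tell btrfs that degraded assembly of
 * this volume is acceptable, so that a later BTRFS_IOC_DEVICES_READY
 * check can report it as usable.  Name and number are invented. */
#include <linux/btrfs.h>

#define BTRFS_IOC_ALLOW_DEGRADED _IOW(BTRFS_IOCTL_MAGIC, 250, \
                                      struct btrfs_ioctl_vol_args)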

Let me repeat this so it is clear: this is NOT going to work
as some systemd shortcut that simply runs "mount -o degraded"; this must go
through the kernel IOCTL -> udev -> systemd path, i.e.:

timer expires -> executes IOCTL with "OK, give me degraded /dev/blah" ->
BTRFS_IOC_DEVICES_READY returns "READY" (or new value "DEGRADED") -> udev
catches event and changes SYSTEMD_READY -> systemd mounts the volume.


This is really simple. All you need to do is pass "degraded" to
btrfs.ko, so that BTRFS_IOC_DEVICES_READY returns "go ahead".

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 13:57:29 -0700, Chris Murphy wrote:

> The Btrfs systemd udev rule is a sledgehammer because it has no
> timeout. It neither times out and tries to mount anyway, nor does it
> time out and just drop to a dracut prompt. There are a number of
> things in systemd startups that have timeouts, I have no idea how they
> get defined, but that single thing would make this a lot better. Right
> now the Btrfs udev rule means if all devices aren't available, hang
> indefinitely.

You're mixing up udev with systemd:
- udev doesn't wait for anything - it REACTS to events. Blame the part
  that doesn't emit the one that you want.
- systemd WAITS for a device to appear AND SETTLE. The timeout for devices
  is 90 seconds by default and can be changed in fstab with
  x-systemd.device-timeout.

It cannot "just try-mount this-or-that ON boot" as there is NO state of
"booting" or "shutting down" in systemd flow, there is only a chain of
events.

Mounting a device happens when it becomes available. If the device is
not crucial for booting up, just add "nofail". If you want to lie
about the device being available, put the appropriate code into the device
handler (btrfs could parse the kernel cmdline and return READY for degraded).

> this rule can have a timer. Service units absolutely can have timers,
> so maybe there's a way to marry a udev rule with a service which has a
> timer. The absolute dumbest thing that's better than now, is at the
> timer just fail and drop to a dracut prompt. Better would be to try a
> normal mount anyway, which also fails to a dracut prompt, but
> additionally gives us a kernel error for Btrfs (the missing device
> open ctree error you'd expect to get when mounting without -o degraded
> when you're missing a device). And even better would be a way for the
> user to edit the service unit to indicate "upon timeout being reached,
> use mount -o degraded rather than just mount". This is the simplest of
> Boolean logic, so I'd be surprised if systemd doesn't offer a way for
> us to do exactly what I'm describing.

Any fallback to degraded mode requires the volume manager to handle this
gracefully first. Until btrfs is degraded-safe, systemd cannot offer to
mount it degraded in _any_ way.

> Again the central problem is the udev rule now means "wait for device
> to appear" with no timed fallback.

If I've got photos on an external drive, is it also the udev rule that waits
for me to plug it in?

You should blame btrfs.ko for not "plugging in" (emiting event to udev).

> The mdadm case has this, and it's done by dracut. At this same stage
> of startup with a missing device, there is in fact no fs volume UUID
> yet because the array hasn't started. Dracut+mdadm knows there's a
> missing device so it's just iterating: look, sleep 3, look, sleep 3,
> look, sleep 3. It's on a loop. And after that loop hits something like
> 100, the script says f it, start array anyway, so now there is a

So you need some btrfsd to iterate and eventually say "go degraded",
as systemd isn't the volume manager here.

> degraded array, and for the first time the fs volume UUID appears, and
> systemd goes "ahaha! mount that!" and it does it normally.

You can see for yourself: systemd mounts the ready md device. It doesn't handle
the 'ready-or-not-yet' timing-out logic.

> So the timer and timeout and what happens at the timeout is defined by
> dracut. That's probably why the systemd folks say "not our problem"
> and why the kernel folks say "not our problem".

And they are both right - there should be a btrfsd for handling this.
Well, except for the kernel cmdline, which should be parsed by the kernel
guys. But since it is impossible to assemble a multi-device btrfs by the kernel
itself anyway (i.e. without an initrd), this could all go into the daemon.

>> If btrfs pretends to be device manager it should expose more states,
>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> No, mdadm is a device manager and it has no such facility. Something

It has - the ultimate "signalling" in the case of mdadm is the appearance of
the /dev/mdX device. Until the device comes up, systemd obviously
won't mount it.
In the case of btrfs the situation is abnormal - there IS a /dev/sda1 device
available, but in fact it might not be available. So there is the IOCTL
to check whether the available device is really available. And guess what -
it returns NOT_READY... And you want systemd to mount this
ready-not_ready? The same way you could ask systemd to mount an MD/LVM
device that does not exist.

This is btrfs's fault for:
1. reusing the device node for different purposes [*],
2. lacking the timing-out/degraded logic implemented somewhere.

> issues a command to start the array anyway, and only then do you find
> out if there are enough devices to start it. I don't understand the
> value of knowing whether it is possible. Just 

Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 14:26:41 +0100, Adam Borowski wrote:

> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try to
> outsmart the kernel.

Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.
Recently I've started mdadm on top of a bunch of LVM volumes, with others
using btrfs and others prepared for crypto. And you know what? systemd
assembled everything just fine.

So with argument just like yours:

It's quite obvious who's the culprit: every single remaining filesystem
manages to mount under systemd without problems. They just expose
information about their state.

>> This is not a systemd issue, but apparently btrfs design choice to allow
>> using any single component device name also as volume name itself.
> 
> And what other user interface would you propose? The only alternative I see
> is inventing a device manager (like you're implying below that btrfs does),
> which would needlessly complicate the usual, single-device, case.

The 'needless complication', as you named it, usually should be the default
to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
No easy way to RAID the drive (there are device-mapper tricks, they are
just way more complicated). Even attaching an SSD cache is not trivial
without preparations (for bcache these are absolutely necessary; much
easier with LVM in place).

>> If btrfs pretends to be device manager it should expose more states,
> 
> But it doesn't pretend to.

Why does mounting sda2 require sdb2 in my setup then?

>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> You're assuming that btrfs somehow knows this itself.

"It's quite obvious who's the culprit: every single volume manager keeps
track of its component devices".

>  Unlike the bogus assumption systemd makes, that by counting devices you
> can know whether a degraded or non-degraded mount is possible, it is in
> general not possible to know whether a mount attempt will succeed without
> actually trying.

There is a term for such a situation: broken by design.

> Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
> naive counting of this kind, it had to be replaced by actually checking
> whether at least one copy of every block group is actually present.

And you still blame systemd for using BTRFS_IOC_DEVICES_READY?

[...]
> just slow to initialize (USB...).  So, systemd asks sda how many devices
> there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
> even ask for UUIDs -- all devices are present.  So, mount will succeed,
> right?

Systemd doesn't count anything, it asks BTRFS_IOC_DEVICES_READY as
implemented in btrfs/super.c.

> Ie, the thing systemd can safely do, is to stop trying to rule everything,
> and refrain from telling the user whether he can mount something or not.

Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-27 Thread Chris Murphy
On Sat, Jan 27, 2018 at 6:26 AM, Adam Borowski  wrote:

> You're assuming that btrfs somehow knows this itself.  Unlike the bogus
> assumption systemd makes, that by counting devices you can know whether a
> degraded or non-degraded mount is possible, it is in general not possible
> to know whether a mount attempt will succeed without actually trying.

That's right, although a small clarification is in order: systemd
doesn't count devices itself. The Btrfs systemd udev rule defers to
Btrfs kernel code by using BTRFS_IOC_DEVICES_READY. And it's totally
binary. Either they are all ready, in which case it exits 0, and if
they aren't all ready it exits 1.

But yes, mounting whether degraded or not is sufficiently complicated
that you just have to try it. I don't get the point of wanting to know
whether it's possible without trying. Why would this information be
useful if you were NOT going to mount it?


> Ie, the thing systemd can safely do, is to stop trying to rule everything,
> and refrain from telling the user whether he can mount something or not.

Right. The open question is whether the timer and timeout can be
implemented in the systemd world, and I don't see why not; I certainly
see it put various services on timers, some of which are indefinite,
some are 1m30s and others are 3m. Pretty much every unit gets a
discrete boot line with a green dot or red cylon eye as it waits. I
don't see why, at the very least, we don't have that for Btrfs rootfs
mounts, because *that* alone would at least clue a user in as to why their
startup is totally hung indefinitely.



-- 
Chris Murphy


Re: degraded permanent mount option

2018-01-27 Thread Chris Murphy
On Sat, Jan 27, 2018 at 4:06 AM, Tomasz Pala  wrote:

> As for the regular by-UUID mounts: these links are created by udev WHEN
> underlying devices appear. Does btrfs volume appear? No.

If I boot with rd.break=pre-mount I can absolutely mount a Btrfs
multiple-device volume that has a missing device by UUID with the --uuid flag, or
by /dev/sdXY, along with -o degraded. And I can then use the exit
command to continue the startup process. In fact I can try to mount
without -o degraded, and the mount command "works" in that it does not
complain about an invalid node or UUID.

The Btrfs systemd udev rule is a sledgehammer because it has no
timeout. It neither times out and tries to mount anyway, nor does it
time out and just drop to a dracut prompt. There are a number of
things in systemd startups that have timeouts, I have no idea how they
get defined, but that single thing would make this a lot better. Right
now the Btrfs udev rule means if all devices aren't available, hang
indefinitely.

I don't know systemd or systemd-udev well enough at all to know if
this rule can have a timer. Service units absolutely can have timers,
so maybe there's a way to marry a udev rule with a service which has a
timer. The absolute dumbest thing that's better than now is to just fail
at the timer and drop to a dracut prompt. Better would be to try a
normal mount anyway, which also fails to a dracut prompt, but
additionally gives us a kernel error for Btrfs (the missing device
open ctree error you'd expect to get when mounting without -o degraded
when you're missing a device). And even better would be a way for the
user to edit the service unit to indicate "upon timeout being reached,
use mount -o degraded rather than just mount". This is the simplest of
Boolean logic, so I'd be surprised if systemd doesn't offer a way for
us to do exactly what I'm describing.

Again the central problem is the udev rule now means "wait for device
to appear" with no timed fallback.

The mdadm case has this, and it's done by dracut. At this same stage
of startup with a missing device, there is in fact no fs volume UUID
yet because the array hasn't started. Dracut+mdadm knows there's a
missing device so it's just iterating: look, sleep 3, look, sleep 3,
look, sleep 3. It's on a loop. And after that loop hits something like
100, the script says f it, start array anyway, so now there is a
degraded array, and for the first time the fs volume UUID appears, and
systemd goes "ahaha! mount that!" and it does it normally.

So the timer and timeout and what happens at the timeout is defined by
dracut. That's probably why the systemd folks say "not our problem"
and why the kernel folks say "not our problem".


> If btrfs pretends to be device manager it should expose more states,
> especially "ready to be mounted, but not fully populated" (i.e.
> "degraded mount possible"). Then systemd could _fallback_ after timing
> out to degraded mount automatically according to some systemd-level
> option.

No, mdadm is a device manager and it has no such facility. Something
issues a command to start the array anyway, and only then do you find
out if there are enough devices to start it. I don't understand the
value of knowing whether it is possible. Just try to mount it degraded
and then if it fails we fail; nothing can be done automatically, it's
up to an admin.

And even if you had this "degraded mount possible" state, you still
need a timer. So just build the timer.

If all devices ready ioctl is true, the timer doesn't start, it means
all devices are available, mount normally.
If all devices ready ioctl is false, the timer starts, if all devices
appear later the ioctl goes to true, the timer is belayed, mount
normally.
If all devices ready ioctl is false, the timer starts, when the timer
times out, mount normally which fails and gives us a shell to
troubleshoot at.
OR
If all devices ready ioctl is false, the timer starts, when the timer
times out, mount with -o degraded which either succeeds and we boot or
it fails and we have a troubleshooting shell.


The central problem is the lack of a timer and time out.
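
For concreteness, those branches collapse into one loop.  The sketch below 
wraps the readiness ioctl and mount(2); where such a loop should actually 
live (dracut, a udev rule plus a timer, a btrfsd) and what the timeout 
should be is exactly the policy question under discussion:

#include <fcntl.h>
#include <stdbool.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mount.h>
#include <unistd.h>
#include <linux/btrfs.h>

/* Same readiness probe the udev builtin uses, wrapped into a bool. */
static bool devices_ready(const char *dev)
{
    struct btrfs_ioctl_vol_args args = { 0 };
    int fd, ret;

    fd = open("/dev/btrfs-control", O_RDWR | O_CLOEXEC);
    if (fd < 0)
        return false;
    strncpy(args.name, dev, sizeof(args.name) - 1);
    ret = ioctl(fd, BTRFS_IOC_DEVICES_READY, &args);
    close(fd);
    return ret == 0;
}

int mount_with_timeout(const char *dev, const char *dir,
                       unsigned timeout_s, bool degraded_fallback)
{
    /* "look, sleep 3, look, sleep 3" until the timeout expires. */
    for (unsigned waited = 0; waited < timeout_s; waited += 3) {
        if (devices_ready(dev))
            return mount(dev, dir, "btrfs", 0, "");
        sleep(3);
    }
    /* Timed out: fail loudly with a normal mount, or try degraded,
     * depending on local policy. */
    return mount(dev, dir, "btrfs", 0, degraded_fallback ? "degraded" : "");
}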


> Unless there is *some* signalling from btrfs, there is really not much
> systemd can *safely* do.

That is not true. It's not how mdadm works anyway.


-- 
Chris Murphy


Re: degraded permanent mount option

2018-01-27 Thread Adam Borowski
On Sat, Jan 27, 2018 at 03:36:48PM +0100, Goffredo Baroncelli wrote:
> I think that the real problem relies that the mounting a btrfs filesystem
> cannot be a responsibility of systemd (or whichever rc-system). 
> Unfortunately in the past it was thought that it would be sufficient to
> assemble a devices list in the kernel, then issue a simple mount...

Yeah... every device that comes online may have its own idea what devices
are part of the filesystem.  There's also a quite separate question whether
we have enough chunks for a degraded mount (implemented by Qu), which
requires reading the chunk tree.

> In the past[*] I proposed a mount helper, which would perform all the
> device registering and mounting in degraded mode (depending by the
> option).  My idea is that all the policies should be placed only in one
> place.  Now some policies are in the kernel, some in udev, some in
> systemd...  It is a mess.  And if something goes wrong, you have to look
> to several logs to understand which/where is the problem..

Since most of the logic needs to be in the kernel anyway, I believe it'd be
best to keep as much as possible in the kernel, and let the userspace
request at most "try regular/degraded mount, block/don't block".  Anything
else would be duplicating functionality.

> I have to point out that there is not a sane default for mounting in
> degraded mode or not.  May be that now RAID1/10 are "mount-degraded"
> friendly, so it would be a sane default; but for other (raid5/6) I think
> that this is not mature enough.  And it is possible to exist hybrid
> filesystem (both RAID1/10 and RAID5/6)

Not yet: if one of the devices comes a bit late, btrfs won't let it into the
filesystem yet (patches to do so have been proposed), and if you run
degraded for even a moment, a very lengthy action is required.  That lengthy
action could be improved -- we can note the last generation when the raid
was complete[1], and scrub/balance only extents newer than that[2] -- but
that's a SMOC then SMOR, and I don't see volunteers yet.

Thus, auto-degrading without a hearty timeout first is currently sitting
strongly in the "do not want" land.

> Mounting in degraded mode would be better for a root filesystem, than a
> non-root one (think about remote machine)

I for one use ext4-on-md for root, and btrfs raid for the actual data.  It's
not like production servers see much / churn anyway.


Meow!

[1]. Extra fun for raid6 (or possible future raid1×N where N>2 modes):
there's "fully complete", "degraded missing A", "degraded missing B",
"degraded missing A and B".

[2]. NOCOW extents would require an artificial generation bump upon writing
to whenever the level of degradeness changes.
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: degraded permanent mount option

2018-01-27 Thread Duncan
Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:

> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>> 
>> >> I just tested to boot with a single drive (raid1 degraded), even
>> >> with degraded option in fstab and grub, unable to boot !  The boot
>> >> process stop on initramfs.
>> >> 
>> >> Is there a solution to boot with systemd and degraded array ?
>> > 
>> > No. It is finger pointing. Both btrfs and systemd developers say
>> > everything is fine from their point of view.
> 
> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try
> to outsmart the kernel.

No kidding.

All systemd has to do is leave alone the mount that the kernel has 
already done, instead of insisting it knows what's going on better than 
the kernel does and immediately unmounting it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: degraded permanent mount option

2018-01-27 Thread Goffredo Baroncelli
On 01/27/2018 02:26 PM, Adam Borowski wrote:
> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>
 I just tested to boot with a single drive (raid1 degraded), even with
 degraded option in fstab and grub, unable to boot !  The boot process
 stop on initramfs.

 Is there a solution to boot with systemd and degraded array ?
>>>
>>> No. It is finger pointing. Both btrfs and systemd developers say
>>> everything is fine from their point of view.
> 
> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try to
> outsmart the kernel.

I think that the real problem is that mounting a btrfs filesystem 
cannot be the responsibility of systemd (or whichever rc-system). Unfortunately, 
in the past it was thought that it would be sufficient to assemble a device 
list in the kernel, then issue a simple mount...

I think that the possible scenarios for a btrfs filesystem are a lot wider 
than for a conventional one, and this approach is too basic.

Systemd is another factor (it spreads the responsibilities), but it is not 
the real problem.

In the past[*] I proposed a mount helper, which would perform all the device 
registering and mounting in degraded mode (depending on the option). My idea is 
that all the policies should be placed in only one place. Now some policies are 
in the kernel, some in udev, some in systemd... It is a mess. And if something 
goes wrong, you have to look at several logs to understand what/where the 
problem is.
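
Purely as an illustration of the idea, since mount(8) already calls a
/sbin/mount.<fstype> helper when one exists, such a helper could be a small
wrapper along these lines (the policy file name and its option are invented
here just for the sketch):

  #!/bin/sh
  # hypothetical /sbin/mount.btrfs wrapper -- a sketch, not the old proposal
  dev="$1"; mnt="$2"; shift 2       # remaining args are usually "-o <options>"

  # make sure every btrfs member device is registered with the kernel
  btrfs device scan >/dev/null 2>&1

  # try a normal mount first (-i keeps mount(8) from re-invoking this helper)
  if ! mount -i -t btrfs "$@" "$dev" "$mnt"; then
      # fall back to degraded only if local policy explicitly allows it
      if grep -qs '^allow_degraded=yes' /etc/btrfs/mount-policy.conf; then
          exec mount -i -t btrfs "$@" -o degraded "$dev" "$mnt"
      fi
      exit 1
  fi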

I have to point out that there is no sane default for mounting in degraded 
mode or not. Maybe RAID1/10 is now "mount-degraded" friendly, so it would be a 
sane default; but for the others (raid5/6) I think this is not mature enough. 
And hybrid filesystems (both RAID1/10 and RAID5/6) can exist.

Mounting in degraded mode makes more sense for a root filesystem than for a 
non-root one (think about a remote machine).

BR
G.Baroncelli

[*]
https://www.spinics.net/lists/linux-btrfs/msg39706.html




> 
>> Treating btrfs volume as ready by systemd would open a window of
>> opportunity when volume would be mounted degraded _despite_ all the
>> components are (meaning: "would soon") be ready - just like Chris Murphy
>> wrote; provided there is -o degraded somewhere.
> 
> For this reason, currently hardcoding -o degraded isn't a wise choice.  This
> might change once autoresync and devices coming back at runtime are
> implemented.
> 
>> This is not a systemd issue, but apparently btrfs design choice to allow
>> using any single component device name also as volume name itself.
> 
> And what other user interface would you propose?  The only alternative I see
> is inventing a device manager (like you're implying below that btrfs does),
> which would needlessly complicate the usual, single-device, case.
>  
>> If btrfs pretends to be device manager it should expose more states,
> 
> But it doesn't pretend to.
> 
>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> You're assuming that btrfs somehow knows this itself.  Unlike the bogus
> assumption systemd does that by counting devices you can know whether a
> degraded or non-degraded mount is possible, it is in general not possible to
> know whether a mount attempt will succeed without actually trying.
> 
> Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
> naive counting of this kind, it had to be replaced by actually checking
> whether at least one copy of every block group is actually present.
> 
> An example scenario: you have a 3-device filesystem, sda sdb sdc.  Suddenly,
> sda goes offline due to a loose cable, controller hiccup, evil fairies, or
> something of this kind.  The sysadmin notices this, rushes in with an
> USB-attached disk (sdd), rebalances.  After reboot, sda works well (or got
> its cable reseated, etc), while sdd either got accidentally removed or is
> just slow to initialize (USB...).  So, systemd asks sda how many devices
> there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
> even ask for UUIDs -- all devices are present.  So, mount will succeed,
> right?
>  
>> Unless there is *some* signalling from btrfs, there is really not much
>> systemd can *safely* do.
> 
> Btrfs already tells everything it knows.  To learn more, you need to do most
> of the mount process (whether you continue or abort is another matter). 
> This can't be done sanely from outside the kernel.  Adding finer control
> would be reasonable ("wait and block" vs "try and return immediately") but
> that's about all.  It's be also wrong to have a different interface for
> daemon X than for humans.
> 
> Ie, the 

Re: degraded permanent mount option

2018-01-27 Thread Adam Borowski
On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
> 
> >> I just tested to boot with a single drive (raid1 degraded), even with
> >> degraded option in fstab and grub, unable to boot !  The boot process
> >> stop on initramfs.
> >> 
> >> Is there a solution to boot with systemd and degraded array ?
> > 
> > No. It is finger pointing. Both btrfs and systemd developers say
> > everything is fine from their point of view.

It's quite obvious who's the culprit: every single remaining rc system
manages to mount degraded btrfs without problems.  They just don't try to
outsmart the kernel.

> Treating btrfs volume as ready by systemd would open a window of
> opportunity when volume would be mounted degraded _despite_ all the
> components are (meaning: "would soon") be ready - just like Chris Murphy
> wrote; provided there is -o degraded somewhere.

For this reason, currently hardcoding -o degraded isn't a wise choice.  This
might change once autoresync and devices coming back at runtime are
implemented.

> This is not a systemd issue, but apparently btrfs design choice to allow
> using any single component device name also as volume name itself.

And what other user interface would you propose?  The only alternative I see
is inventing a device manager (like you're implying below that btrfs does),
which would needlessly complicate the usual, single-device, case.
 
> If btrfs pretends to be device manager it should expose more states,

But it doesn't pretend to.

> especially "ready to be mounted, but not fully populated" (i.e.
> "degraded mount possible"). Then systemd could _fallback_ after timing
> out to degraded mount automatically according to some systemd-level
> option.

You're assuming that btrfs somehow knows this itself.  Unlike systemd's bogus
assumption that by counting devices you can know whether a degraded or
non-degraded mount is possible, it is in general not possible to know whether
a mount attempt will succeed without actually trying.

Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
naive counting of this kind, it had to be replaced by actually checking
whether at least one copy of every block group is actually present.

An example scenario: you have a 3-device filesystem, sda sdb sdc.  Suddenly,
sda goes offline due to a loose cable, controller hiccup, evil fairies, or
something of this kind.  The sysadmin notices this, rushes in with an
USB-attached disk (sdd), rebalances.  After reboot, sda works well (or got
its cable reseated, etc), while sdd either got accidentally removed or is
just slow to initialize (USB...).  So, systemd asks sda how many devices
there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
even ask for UUIDs -- all devices are present.  So, mount will succeed,
right?
 
> Unless there is *some* signalling from btrfs, there is really not much
> systemd can *safely* do.

Btrfs already tells everything it knows.  To learn more, you need to do most
of the mount process (whether you continue or abort is another matter). 
This can't be done sanely from outside the kernel.  Adding finer control
would be reasonable ("wait and block" vs "try and return immediately") but
that's about all.  It'd also be wrong to have a different interface for
daemon X than for humans.
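
Concretely, what it tells today can be poked at from a shell (the property
name is the one the shipped udev rule imports, as far as I remember; the
device is a placeholder):

  # exits 0 once the kernel has seen every member device of the fs on sdb1
  btrfs device ready /dev/sdb1 && echo complete || echo "members missing"
  # the same answer as udev records it
  udevadm info /dev/sdb1 | grep ID_BTRFS_READY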

Ie, the thing systemd can safely do is to stop trying to rule everything,
and refrain from telling the user whether he can mount something or not.
And especially, refrain from unmounting after the user mounts manually...


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:

>> I just tested to boot with a single drive (raid1 degraded), even with 
>> degraded option in fstab and grub, unable to boot ! The boot process stop on 
>> initramfs.
>> 
>> Is there a solution to boot with systemd and degraded array ?
> 
> No. It is finger pointing. Both btrfs and systemd developers say
> everything is fine from their point of view.

Treating a btrfs volume as ready by systemd would open a window of
opportunity in which the volume would be mounted degraded _despite_ all the
components being (meaning: "soon to be") ready - just like Chris Murphy
wrote; provided there is -o degraded somewhere.

This is not a systemd issue, but apparently a btrfs design choice to allow
using any single component device name also as the volume name itself.

IF a volume has the degraded flag, then it is btrfs' job to mark it as ready:

>> ... and it still does not work even if I change it to root=/dev/sda1
>> explicitly because sda1 will *not* be announced as "present" to
>> systemd> until all devices have been seen once ...

...so this scenario would obviously and magically start working.

As for the regular by-UUID mounts: these links are created by udev WHEN
the underlying devices appear. Does the btrfs volume appear? No.

If btrfs pretends to be device manager it should expose more states,
especially "ready to be mounted, but not fully populated" (i.e.
"degraded mount possible"). Then systemd could _fallback_ after timing
out to degraded mount automatically according to some systemd-level
option.

Unless there is *some* signalling from btrfs, there is really not much
systemd can *safely* do.

-- 
Tomasz Pala 


Re: degraded permanent mount option

2018-01-27 Thread Andrei Borzenkov
27.01.2018 13:08, Christophe Yayon пишет:
> I just tested to boot with a single drive (raid1 degraded), even with 
> degraded option in fstab and grub, unable to boot ! The boot process stop on 
> initramfs.
> 
> Is there a solution to boot with systemd and degraded array ?

No. It is finger pointing. Both btrfs and systemd developers say
everything is fine from their point of view.

> 
> Thanks 
> 
> --
> Christophe Yayon
> 
>> On 27 Jan 2018, at 07:48, Christophe Yayon  wrote:
>>
>> I think you are right, i do not see any systemd message when degraded option 
>> is missing and have to remount manually with degraded.
>>
>> It seems it is better to use mdadm for raid and btrfs over it as i 
>> understand. Even in recent kernel ?
>> I hav me to do some bench and compare...
>>
>> Thanks
>>
>> --
>> Christophe Yayon
>>
>>> On 27 Jan 2018, at 07:43, Andrei Borzenkov  wrote:
>>>
>>> 27.01.2018 09:40, Christophe Yayon пишет:
 Hi, 

 I am using archlinux with kernel 4.14, there is btrfs module in initrd.
 In fstab root is mounted via UUID. As far as I know the UUID is the same
 for all devices in raid array.
 The system boot with no problem with degraded and only 1/2 root device.
>>>
>>> Then your initramfs does not use systemd.
>>>
 --
 Christophe Yayon
 cyayon-l...@nbux.org



>> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>> 26.01.2018 17:47, Christophe Yayon пишет:
>> Hi Austin,
>>
>> Thanks for your answer. It was my opinion too as the "degraded"
>> seems to be flagged as "Mostly OK" on btrfs wiki status page. I am
>> running Archlinux with recent kernel on all my servers (because of
>> use of btrfs as my main filesystem, i need a recent kernel).> >
>> Your idea to add a separate entry in grub.cfg with
>> rootflags=degraded is attractive, i will do this...> >
>> Just a last question, i thank that it was necessary to add
>> "degraded" option in grub.cfg AND fstab to allow boot in degraded
>> mode. I am not sure that only grub.cfg is sufficient...> > Yesterday, i 
>> have done some test and boot a a system with only 1 of
>> 2 drive in my root raid1 array. No problem with systemd,>
> Are you using systemd in your initramfs (whatever
> implementation you are> using)? I just tested with dracut using systemd 
> dracut module and it
> does not work - it hangs forever waiting for device. Of course,
> there is> no way to abort it and go into command line ...
>
> Oh, wait - what device names are you using? I'm using mount by
> UUID and> this is where the problem starts - /dev/disk/by-uuid/xxx will
> not appear> unless all devices have been seen once ...
>
> ... and it still does not work even if I change it to root=/dev/sda1
> explicitly because sda1 will *not* be announced as "present" to
> systemd> until all devices have been seen once ...
>
> So no, it does not work with systemd *in initramfs*. Absolutely.


>>>
>>
> 



Re: degraded permanent mount option

2018-01-27 Thread Christophe Yayon
I just tested booting with a single drive (raid1 degraded), even with the degraded 
option in fstab and grub, and it is unable to boot! The boot process stops in the initramfs.

Is there a solution to boot with systemd and degraded array ?

Thanks 

--
Christophe Yayon

> On 27 Jan 2018, at 07:48, Christophe Yayon  wrote:
> 
> I think you are right, i do not see any systemd message when degraded option 
> is missing and have to remount manually with degraded.
> 
> It seems it is better to use mdadm for raid and btrfs over it as i 
> understand. Even in recent kernel ?
> I hav me to do some bench and compare...
> 
> Thanks
> 
> --
> Christophe Yayon
> 
>> On 27 Jan 2018, at 07:43, Andrei Borzenkov  wrote:
>> 
>> 27.01.2018 09:40, Christophe Yayon пишет:
>>> Hi, 
>>> 
>>> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
>>> In fstab root is mounted via UUID. As far as I know the UUID is the same
>>> for all devices in raid array.
>>> The system boot with no problem with degraded and only 1/2 root device.
>> 
>> Then your initramfs does not use systemd.
>> 
>>> --
>>> Christophe Yayon
>>> cyayon-l...@nbux.org
>>> 
>>> 
>>> 
> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
> 26.01.2018 17:47, Christophe Yayon пишет:
> Hi Austin,
> 
> Thanks for your answer. It was my opinion too as the "degraded"
> seems to be flagged as "Mostly OK" on btrfs wiki status page. I am
> running Archlinux with recent kernel on all my servers (because of
> use of btrfs as my main filesystem, i need a recent kernel).> >
> Your idea to add a separate entry in grub.cfg with
> rootflags=degraded is attractive, i will do this...> >
> Just a last question, i thank that it was necessary to add
> "degraded" option in grub.cfg AND fstab to allow boot in degraded
> mode. I am not sure that only grub.cfg is sufficient...> > Yesterday, i 
> have done some test and boot a a system with only 1 of
> 2 drive in my root raid1 array. No problem with systemd,>
 Are you using systemd in your initramfs (whatever
 implementation you are> using)? I just tested with dracut using systemd 
 dracut module and it
 does not work - it hangs forever waiting for device. Of course,
 there is> no way to abort it and go into command line ...
 
 Oh, wait - what device names are you using? I'm using mount by
 UUID and> this is where the problem starts - /dev/disk/by-uuid/xxx will
 not appear> unless all devices have been seen once ...
 
 ... and it still does not work even if I change it to root=/dev/sda1
 explicitly because sda1 will *not* be announced as "present" to
 systemd> until all devices have been seen once ...
 
 So no, it does not work with systemd *in initramfs*. Absolutely.
>>> 
>>> 
>> 
> 



Re: degraded permanent mount option

2018-01-26 Thread Christophe Yayon
I think you are right, i do not see any systemd message when degraded option is 
missing and have to remount manually with degraded.

It seems it is better to use mdadm for raid and btrfs on top of it, as i 
understand. Even with a recent kernel?
I have to do some benchmarks and compare...

Thanks

--
Christophe Yayon

> On 27 Jan 2018, at 07:43, Andrei Borzenkov  wrote:
> 
> 27.01.2018 09:40, Christophe Yayon пишет:
>> Hi, 
>> 
>> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
>> In fstab root is mounted via UUID. As far as I know the UUID is the same
>> for all devices in raid array.
>> The system boot with no problem with degraded and only 1/2 root device.
> 
> Then your initramfs does not use systemd.
> 
>> --
>>  Christophe Yayon
>>  cyayon-l...@nbux.org
>> 
>> 
>> 
>>> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>>> 26.01.2018 17:47, Christophe Yayon пишет:
 Hi Austin,
 
 Thanks for your answer. It was my opinion too as the "degraded"
 seems to be flagged as "Mostly OK" on btrfs wiki status page. I am
 running Archlinux with recent kernel on all my servers (because of
 use of btrfs as my main filesystem, i need a recent kernel).> >
 Your idea to add a separate entry in grub.cfg with
 rootflags=degraded is attractive, i will do this...> >
 Just a last question, i thank that it was necessary to add
 "degraded" option in grub.cfg AND fstab to allow boot in degraded
 mode. I am not sure that only grub.cfg is sufficient...> > Yesterday, i 
 have done some test and boot a a system with only 1 of
 2 drive in my root raid1 array. No problem with systemd,>
>>> Are you using systemd in your initramfs (whatever
>>> implementation you are> using)? I just tested with dracut using systemd 
>>> dracut module and it
>>> does not work - it hangs forever waiting for device. Of course,
>>> there is> no way to abort it and go into command line ...
>>> 
>>> Oh, wait - what device names are you using? I'm using mount by
>>> UUID and> this is where the problem starts - /dev/disk/by-uuid/xxx will
>>> not appear> unless all devices have been seen once ...
>>> 
>>> ... and it still does not work even if I change it to root=/dev/sda1
>>> explicitly because sda1 will *not* be announced as "present" to
>>> systemd> until all devices have been seen once ...
>>> 
>>> So no, it does not work with systemd *in initramfs*. Absolutely.
>> 
>> 
> 



Re: degraded permanent mount option

2018-01-26 Thread Andrei Borzenkov
27.01.2018 09:40, Christophe Yayon пишет:
> Hi, 
> 
> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
> In fstab root is mounted via UUID. As far as I know the UUID is the same
> for all devices in raid array.
> The system boot with no problem with degraded and only 1/2 root device.

Then your initramfs does not use systemd.

> --
>   Christophe Yayon
>   cyayon-l...@nbux.org
> 
> 
> 
> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>> 26.01.2018 17:47, Christophe Yayon пишет:
>>> Hi Austin,
>>>
>>> Thanks for your answer. It was my opinion too as the "degraded"
>>> seems to be flagged as "Mostly OK" on btrfs wiki status page. I am
>>> running Archlinux with recent kernel on all my servers (because of
>>> use of btrfs as my main filesystem, i need a recent kernel).> >
>>> Your idea to add a separate entry in grub.cfg with
>>> rootflags=degraded is attractive, i will do this...> >
>>> Just a last question, i thank that it was necessary to add
>>> "degraded" option in grub.cfg AND fstab to allow boot in degraded
>>> mode. I am not sure that only grub.cfg is sufficient...> > Yesterday, i 
>>> have done some test and boot a a system with only 1 of
>>> 2 drive in my root raid1 array. No problem with systemd,>
>> Are you using systemd in your initramfs (whatever
>> implementation you are> using)? I just tested with dracut using systemd 
>> dracut module and it
>> does not work - it hangs forever waiting for device. Of course,
>> there is> no way to abort it and go into command line ...
>>
>> Oh, wait - what device names are you using? I'm using mount by
>> UUID and> this is where the problem starts - /dev/disk/by-uuid/xxx will
>> not appear> unless all devices have been seen once ...
>>
>> ... and it still does not work even if I change it to root=/dev/sda1
>> explicitly because sda1 will *not* be announced as "present" to
>> systemd> until all devices have been seen once ...
>>
>> So no, it does not work with systemd *in initramfs*. Absolutely.
> 
> 



Re: degraded permanent mount option

2018-01-26 Thread Andrei Borzenkov
26.01.2018 17:47, Christophe Yayon пишет:
> Hi Austin,
> 
> Thanks for your answer. It was my opinion too as the "degraded" seems to be 
> flagged as "Mostly OK" on btrfs wiki status page. I am running Archlinux with 
> recent kernel on all my servers (because of use of btrfs as my main 
> filesystem, i need a recent kernel).
> 
> Your idea to add a separate entry in grub.cfg with rootflags=degraded is 
> attractive, i will do this...
> 
> Just a last question, i thank that it was necessary to add "degraded" option 
> in grub.cfg AND fstab to allow boot in degraded mode. I am not sure that only 
> grub.cfg is sufficient... 
> Yesterday, i have done some test and boot a a system with only 1 of 2 drive 
> in my root raid1 array. No problem with systemd,

Are you using systemd in your initramfs (whatever implementation you are
using)? I just tested with dracut using systemd dracut module and it
does not work - it hangs forever waiting for device. Of course, there is
no way to abort it and go into command line ...

Oh, wait - what device names are you using? I'm using mount by UUID and
this is where the problem starts - /dev/disk/by-uuid/xxx will not appear
unless all devices have been seen once ...

... and it still does not work even if I change it to root=/dev/sda1
explicitly because sda1 will *not* be announced as "present" to systemd
until all devices have been seen once ...

So no, it does not work with systemd *in initramfs*. Absolutely.


Re: degraded permanent mount option

2018-01-26 Thread Christophe Yayon
Hi Chris,

Thanks for this complete answer.

I have to do some benchmarks with mdadm raid and btrfs native raid...

Thanks

--
Christophe Yayon

> On 26 Jan 2018, at 22:54, Chris Murphy  wrote:
> 
>> On Fri, Jan 26, 2018 at 7:02 AM, Christophe Yayon  
>> wrote:
>> 
>> Just a little question about "degraded" mount option. Is it a good idea to 
>> add this option (permanent) in fstab and grub rootflags for raid1/10 array ? 
>> Just to allow the system to boot again if a single hdd fail.
> 
> No because it's going to open a window where a delayed member drive
> will mean the volume is mounted degraded, which will happen silently.
> And current behavior in such a case, any new writes go to single
> chunks. Again it's silent. When the delayed drive appears, it's not
> going to be added, the volume is still treated as degraded. And even
> when you remount to bring them all together in a normal mount, Btrfs
> will not automatically sync the drives, so you will still have some
> single chunk writes on one drive not the other. So you have a window
> of time where there can be data loss if a real failure occurs, and you
> need degraded mounting. Further, right now Btrfs will only do one
> degraded rw mount, and you *must* fix that degradedness before it is
> umounted or else you will only ever be able to mount it again ro.
> There are unmerged patches to work around this, so you'd need to
> commit to building your own kernel. I can't see any way of reliably
> using Btrfs in production for the described use case otherwise. You
> can't depend on getting the delayed or replacement drive restored, and
> the volume made healthy again, because ostensibly the whole point of
> the setup is having good uptime and you won't have that assurance
> unless you carry these patches.
> 
> Also note that there are two kinds of degraded writes. a.) drive was
> missing at mount time, and volume is mounted degraded, for raid1
> volumes you get single chunks written; to sync once the missing drive
> appears you do a btrfs balance -dconvert=raid1,soft
> -mconvert=raid1,soft which should be fairly fast; b.) if the drive
> goes missing after a normal mount, Btrfs continues to write out raid1
> chunks; to sync once the missing drive appears you have to do a full
> scrub or balance of the entire volume there's no shortcut.
> 
> Anyway, for the described use case I think you're better off with
> mdadm or LVM raid1 or raid10, and then format with Btrfs and DUP
> metadata (default mkfs) in which case you get full error detection and
> metadata error detection and correction, as well as the uptime you
> want.
> 
> -- 
> Chris Murphy



Re: degraded permanent mount option

2018-01-26 Thread Chris Murphy
On Fri, Jan 26, 2018 at 7:02 AM, Christophe Yayon  wrote:

> Just a little question about "degraded" mount option. Is it a good idea to 
> add this option (permanent) in fstab and grub rootflags for raid1/10 array ? 
> Just to allow the system to boot again if a single hdd fail.

No because it's going to open a window where a delayed member drive
will mean the volume is mounted degraded, which will happen silently.
And with current behavior in such a case, any new writes go to single
chunks. Again it's silent. When the delayed drive appears, it's not
going to be added, the volume is still treated as degraded. And even
when you remount to bring them all together in a normal mount, Btrfs
will not automatically sync the drives, so you will still have some
single chunk writes on one drive not the other. So you have a window
of time where there can be data loss if a real failure occurs, and you
need degraded mounting. Further, right now Btrfs will only do one
degraded rw mount, and you *must* fix that degradedness before it is
umounted or else you will only ever be able to mount it again ro.
There are unmerged patches to work around this, so you'd need to
commit to building your own kernel. I can't see any way of reliably
using Btrfs in production for the described use case otherwise. You
can't depend on getting the delayed or replacement drive restored, and
the volume made healthy again, because ostensibly the whole point of
the setup is having good uptime and you won't have that assurance
unless you carry these patches.

Also note that there are two kinds of degraded writes. a.) drive was
missing at mount time, and volume is mounted degraded, for raid1
volumes you get single chunks written; to sync once the missing drive
appears you do a btrfs balance -dconvert=raid1,soft
-mconvert=raid1,soft which should be fairly fast; b.) if the drive
goes missing after a normal mount, Btrfs continues to write out raid1
chunks; to sync once the missing drive appears you have to do a full
scrub or balance of the entire volume; there's no shortcut.
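
Written out with a placeholder mount point, roughly:

  # case a: mounted degraded, missing device is back -- convert any single
  # chunks back to raid1; the 'soft' filter only touches chunks that need it
  btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt

  # case b: device dropped out after a normal mount -- full resync needed
  btrfs scrub start -B /mnt          # or a full balance of the whole volume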

Anyway, for the described use case I think you're better off with
mdadm or LVM raid1 or raid10, and then format with Btrfs and DUP
metadata (default mkfs) in which case you get full error detection and
metadata error detection and correction, as well as the uptime you
want.
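
Roughly, with placeholder device names (recent mkfs.btrfs may already default
to DUP metadata on a single device, so the -m flag is just being explicit):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
  mkfs.btrfs -d single -m dup /dev/md0   # btrfs sees one device: the md array
  mount /dev/md0 /mnt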

-- 
Chris Murphy


Re: degraded permanent mount option

2018-01-26 Thread Austin S. Hemmelgarn

On 2018-01-26 09:47, Christophe Yayon wrote:

Hi Austin,

Thanks for your answer. It was my opinion too as the "degraded" seems to be flagged as 
"Mostly OK" on btrfs wiki status page. I am running Archlinux with recent kernel on all 
my servers (because of use of btrfs as my main filesystem, i need a recent kernel).

Your idea to add a separate entry in grub.cfg with rootflags=degraded is 
attractive, i will do this...

Just a last question, i thank that it was necessary to add "degraded" option in 
grub.cfg AND fstab to allow boot in degraded mode. I am not sure that only grub.cfg is 
sufficient...
Yesterday, i have done some test and boot a a system with only 1 of 2 drive in 
my root raid1 array. No problem with systemd, but i added rootflags and fstab 
option. I didn't test with only rootflags.
Hmm...  I'm pretty sure that you only need degraded in rootflags for a 
degraded boot without systemd involved.  Not sure about with systemd 
involved, though the fact that it worked with systemd at all is 
interesting, as last I knew systemd doesn't do any special casing for 
BTRFS and just looks at whether all the devices are registered with the 
kernel or not.


Also, as far as I know, `degraded` in the mount options won't cause any 
change in behavior if there is no device missing, so you're not really 
going to be running 'degraded' if you've got all your devices (though 
depending on how long it takes to scan devices, you may end up with some 
issues during boot when they're technically all present and working).



Re: degraded permanent mount option

2018-01-26 Thread Christophe Yayon
Hi Austin,

Thanks for your answer. It was my opinion too, as "degraded" seems to be 
flagged as "Mostly OK" on the btrfs wiki status page. I am running Archlinux 
with a recent kernel on all my servers (because i use btrfs as my main 
filesystem, i need a recent kernel).

Your idea to add a separate entry in grub.cfg with rootflags=degraded is 
attractive, i will do this...

Just a last question: i thought that it was necessary to add the "degraded" option 
in grub.cfg AND fstab to allow booting in degraded mode. I am not sure that 
grub.cfg alone is sufficient... 
Yesterday, i did some tests and booted a system with only 1 of 2 drives in 
my root raid1 array. No problem with systemd, but i added both the rootflags and 
the fstab option. I didn't test with only rootflags.

Thanks. 


-- 
  Christophe Yayon
  cyayon-l...@nbux.org

On Fri, Jan 26, 2018, at 15:18, Austin S. Hemmelgarn wrote:
> On 2018-01-26 09:02, Christophe Yayon wrote:
> > Hi all,
> > 
> > I don't know if it the right place to ask. Sorry it's not...
> No, it's just fine to ask here.  Questions like this are part of why the 
> mailing list exists.
> > 
> > Just a little question about "degraded" mount option. Is it a good idea to 
> > add this option (permanent) in fstab and grub rootflags for raid1/10 array 
> > ? Just to allow the system to boot again if a single hdd fail.
> Some people will disagree with me on this, but I would personally 
> suggest not doing this.  I'm of the opinion that running an array 
> degraded for any period of time beyond the bare minimum required to fix 
> it is a bad idea, given that:
> * It's not a widely tested configuration, so you are statistically more 
> likely to run into previously unknown bugs.  Even aside from that, there 
> are probably some edge cases that people have not yet found.
> * There are some issues with older kernel versions trying to access the 
> array after it's been mounted writable and degraded when it's only two 
> devices in raid1 mode.  This in turn is a good example of the above 
> point about not being widely tested, as it took quite a while for this 
> problem to come up on the mailing list.
> * Running degraded is liable to be slower, because the filesystem has to 
> account for the fact that the missing device might reappear at any 
> moment.  This is actually true of any replication system, not just BTRFS.
> * For a 2 device raid1 volume, there is no functional advantage to 
> running degraded with one device compared to converting to just use a 
> single device (this is only true of BTRFS because of the fact that it's 
> trivial to convert things, while for MD and LVM it is extremely 
> complicated to do so online).
> 
> Additionally, adding the `degraded` mount option won't actually let you 
> mount the root filesystem if you're using systemd as an init system, 
> because systemd refuses to mount BTRFS volumes which have devices missing.
> 
> Assuming that the systemd thing isn't an issue for you, I would suggest 
> instead creating a separate GRUB entry with the option set in rootflags. 
>   This will allow you to manually boot the system if the array is 
> degraded, but will make sure you notice during boot (in my case, I don't 
> even do that, but I'm also reasonably used to tweaking kernel parameters 
> from GRUB prior to booting the system that it would end up just wasting 
> space).
> > 
> > Of course, i have some cron jobs to check my array health.
> It's good to hear that you're taking the initiative to monitor things, 
> however this fact doesn't really change my assessment above.


Re: degraded permanent mount option

2018-01-26 Thread Austin S. Hemmelgarn

On 2018-01-26 09:02, Christophe Yayon wrote:

Hi all,

I don't know if it the right place to ask. Sorry it's not...
No, it's just fine to ask here.  Questions like this are part of why the 
mailing list exists.


Just a little question about "degraded" mount option. Is it a good idea to add 
this option (permanent) in fstab and grub rootflags for raid1/10 array ? Just to allow 
the system to boot again if a single hdd fail.
Some people will disagree with me on this, but I would personally 
suggest not doing this.  I'm of the opinion that running an array 
degraded for any period of time beyond the bare minimum required to fix 
it is a bad idea, given that:
* It's not a widely tested configuration, so you are statistically more 
likely to run into previously unknown bugs.  Even aside from that, there 
are probably some edge cases that people have not yet found.
* There are some issues with older kernel versions trying to access the 
array after it's been mounted writable and degraded when it's only two 
devices in raid1 mode.  This in turn is a good example of the above 
point about not being widely tested, as it took quite a while for this 
problem to come up on the mailing list.
* Running degraded is liable to be slower, because the filesystem has to 
account for the fact that the missing device might reappear at any 
moment.  This is actually true of any replication system, not just BTRFS.
* For a 2 device raid1 volume, there is no functional advantage to 
running degraded with one device compared to converting to just use a 
single device (this is only true of BTRFS because of the fact that it's 
trivial to convert things, while for MD and LVM it is extremely 
complicated to do so online).


Additionally, adding the `degraded` mount option won't actually let you 
mount the root filesystem if you're using systemd as an init system, 
because systemd refuses to mount BTRFS volumes which have devices missing.


Assuming that the systemd thing isn't an issue for you, I would suggest 
instead creating a separate GRUB entry with the option set in rootflags. 
 This will allow you to manually boot the system if the array is 
degraded, but will make sure you notice during boot (in my case, I don't 
even do that, but I'm also reasonably used to tweaking kernel parameters 
from GRUB prior to booting the system, so it would end up just wasting 
space).
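
Something along these lines, e.g. in /etc/grub.d/40_custom, with the
kernel/initrd paths and UUID as placeholders; it's just a copy of the normal
entry with rootflags=degraded appended to the linux line:

  menuentry 'Linux (degraded btrfs root)' {
      # identical to the normal entry except for rootflags=degraded
      linux  /boot/vmlinuz-linux root=UUID=<fs-uuid> rw rootflags=degraded
      initrd /boot/initramfs-linux.img
  }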


Of course, i have some cron jobs to check my array health.
It's good to hear that you're taking the initiative to monitor things, 
however this fact doesn't really change my assessment above.



degraded permanent mount option

2018-01-26 Thread Christophe Yayon
Hi all,

I don't know if this is the right place to ask. Sorry if it's not...

Just a little question about the "degraded" mount option. Is it a good idea to add 
this option (permanently) in fstab and grub rootflags for a raid1/10 array? Just 
to allow the system to boot again if a single hdd fails.

Of course, i have some cron jobs to check my array health.

Thanks.

-- 
  Christophe Yayon
  cyayon-l...@nbux.org