Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote:

>> I.e.: every shared segment should be accounted within quota (at least once).
> I think what you mean to say here is that every shared extent should be 
> accounted to quotas for every location it is reflinked from.  IOW, that 
> if an extent is shared between two subvolumes each with it's own quota, 
> they should both have it accounted against their quota.

Yes.

>> Moreover - if there were per-subvolume RAID levels someday, the data
>> should be accounted in relation to the "default" (filesystem) RAID level,
>> i.e. having a RAID0 subvolume on a RAID1 fs should account half of the
>> data, and twice the data in the opposite scenario (like a "dup" profile on
>> a single-drive filesystem).
>
> This is irrelevant to your point here.  In fact, it goes against it, 
> you're arguing for quotas to report data like `du`, but all of 
> chunk-profile stuff is invisible to `du` (and everything else in 
> userspace that doesn't look through BTRFS ioctls).

My point is the user's point of view, not some system tool like du. Consider this:
1. the user wants higher (than default) protection of some data,
2. the user wants more storage space with less protection.

Ad. 1 - requesting better redundancy is similar to cp --reflink=never
- there are functional differences, but the cost is similar: trading
  space for safety,

Ad. 2 - many would like to have .cache, .ccache, tmp or some build
system directory with faster writes and no redundancy at all. This
requires per-file/directory data profile attributes though.

Since we agreed that transparent data compression is the user's storage
bonus, gains from reduced redundancy should also benefit the user.


Disclaimer: all the above statements relate to the concept and common
understanding of quotas, not to be confused with qgroups.

-- 
Tomasz Pala 


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
On Fri, Aug 10, 2018 at 15:55:46 +0800, Qu Wenruo wrote:

>> The first thing about virtually every mechanism should be
>> discoverability and reliability. I expect my quota not to change without
>> my interaction. Never. How did you cope with this?
>> If not - how are you going to explain such weird behaviour to users?
> 
> Read the manual first.
> Not every feature is suitable for every use case.

I, the sysadm, must RTFM.
My users won't comprehend this and, moreover, they won't even care.

> IIRC lvm thin is pretty much the same for the same case.

LVM doesn't pretend to be user-oriented; it operates at the system scope.
And LVM didn't name its thin provisioning "quotas".

> For 4 disk with 1T free space each, if you're using RAID5 for data, then
> you can write 3T data.
> But if you're also using RAID10 for metadata, and you're using default
> inline, we can use small files to fill the free space, resulting 2T
> available space.
> 
> So in this case how would you calculate the free space? 3T or 2T or
> anything between them?

The answer is pretty simple: 3T. Rationale:
- this is the space I actually can put in a single data stream,
- people are aware that there is metadata overhead with any object;
  after all, metadata are also data,
- while filling the fs with small files, the free space available would
  self-adjust after every single file put, so after uploading 1T of such
  files df should report 1.5T free. There would be nothing weirder (than
  now) about 1T of data actually eating 1.5T of storage.

No crystal-ball calculations, just KISS; since one _can_ put a 3T file
(non-sparse, incompressible, bulk-written) on the filesystem, the free
space is 3T.
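For illustration, the two bounds from the example above in a couple of
lines of shell arithmetic (assuming 4 devices with 1T of unallocated
space each):

disks=4; per_disk=1                        # TiB of unallocated space per device
raid5_data=$(( (disks - 1) * per_disk ))   # one device's worth goes to parity -> 3T
raid10_meta=$(( disks * per_disk / 2 ))    # mirroring halves raw capacity     -> 2T
echo "one big file (RAID5 data):         ${raid5_data}T"
echo "only inline small files (RAID10):  ${raid10_meta}T"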

> Only yourself know what the heck you're going to use the that 4 disks
> with 1T free space each.
> Btrfs can't look into your head and know what you're thinking.

It shouldn't. I expect raw data - there is 3TB of unallocated space for
the current data profile.

> That's the design from the very beginning of btrfs, yelling at me makes
> no sense at all.

Sorry if you perceive me as "yelling" - I honestly have to blame that on
my non-native English. I just want to clarify some terminology and
expectations. They are irrelevant to the underlying technical solutions,
but the literal *description* of the solution you provide should match
users' expectations of that terminology.

> I have tried to explain what btrfs quota does and it doesn't, if it
> doesn't fit you use case, that's all.
> (Whether you have ever tried to understand is another problem)

I am (more than before) aware what btrfs quotas are not.

So, my only expectation (except for worldwide peace and other
unrealistic ones) would be to stop using "quotas", "subvolume quotas"
and "qgroups" interchangeably in the btrfs context, as IMvHO these are
not plain, well-known "quotas".

-- 
Tomasz Pala 


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
On Fri, Aug 10, 2018 at 07:03:18 +0300, Andrei Borzenkov wrote:

>> So - the limit set on any user
> 
> Does btrfs support per-user quota at all? I am aware only of per-subvolume 
> quotas.

Well, this is a kind of deceptive word usage in "post-truth" times.

In this case both "user" and "quota" are not valid...
- by "user" I meant the general word, not a unix user account; such a user
  might possess some container running a full-blown guest OS,
- by "quota" btrfs means - I guess - dataset quotas?


In fact: https://btrfs.wiki.kernel.org/index.php/Quota_support
"Quota support in BTRFS is implemented at a subvolume level by the use of quota 
groups or qgroup"

- what the hell is a "quota group" and how does it differ from a qgroup? According to
btrfs-quota(8):

"The quota groups (qgroups) are managed by the subcommand btrfs qgroup(8)"

- they are the same... just completely different from traditional "quotas".


My suggestion would be to completely remove the standalone word "quota"
from the btrfs documentation - there is no "quota", just "subvolume quota"
or "qgroup" supported.

-- 
Tomasz Pala 


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Tomasz Pala
ially when his data
becomes "exclusive" one day without any known reason), misnamed ...and
not reflecting anything valuable, unless the problems with extent
fragmentation are already resolved somehow?

So IMHO current quotas are:
- not discoverable for the user (shared->exclusive transition of my data
  by someone else's action),
- not reliable for the sysadmin (an offensive write pattern by any user can
  allocate virtually any amount of space despite quotas).

-- 
Tomasz Pala 


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-09 Thread Tomasz Pala
On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:

> 2) Different limitations on exclusive/shared bytes
>Btrfs can set different limit on exclusive/shared bytes, further
>complicating the problem.
> 
> 3) Btrfs quota only accounts data/metadata used by the subvolume
>It lacks all the shared trees (mentioned below), and in fact such
>shared tree can be pretty large (especially for extent tree and csum
>tree).

I'm not sure about the implications, but just to clarify some things:

when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplicating technique - these are
purely bonuses for the system owner, so that he can do more resource
overbooking.

So - the limit set on any user should enforce the maximum, absolute space
he has allocated, including the shared stuff. I could even imagine that
creating a snapshot might immediately "eat" the available quota. In a
way, the quota returned would match (give or take) `du`-reported usage,
unless "do not account reflinks within a single qgroup" were easy to implement.

I.e.: every shared segment should be accounted within quota (at least once).

And the numbers accounted should reflect the uncompressed sizes.


Moreover - if there were per-subvolume RAID levels someday, the data
should be accounted in relation to the "default" (filesystem) RAID level,
i.e. having a RAID0 subvolume on a RAID1 fs should account half of the
data, and twice the data in the opposite scenario (like a "dup" profile on
a single-drive filesystem).


In short: values representing quotas are user-oriented ("the numbers one
bought"), not storage-oriented ("the numbers they actually occupy").

-- 
Tomasz Pala 


Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Tomasz Pala
On Fri, May 18, 2018 at 13:10:02 -0400, Austin S. Hemmelgarn wrote:

> Personally though, I think the biggest issue with what was done was not 
> the memory consumption, but the fact that there was no switch to turn it 
> on or off.  Making defrag unconditionally snapshot aware removes one of 
> the easiest ways to forcibly unshare data without otherwise altering the 

The "defrag only not-snapshotted data" mode would be enough for many
use cases and wouldn't require more RAM. One could run this before
taking a snapshot and merge _at least_ the new data.
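A hedged sketch of that routine with today's tools (paths are
illustrative; note that the current defrag may unshare extents that are
already reflinked, which is exactly the limitation discussed here):

btrfs filesystem defragment -r /data
btrfs subvolume snapshot -r /data /snapshots/data-$(date +%F)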

And even with the current approach it should be possible to interleave
defragmentation with some kind of naive deduplication; "naive" in the
sense of comparing blocks only within the same in-subvolume paths.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: your mail

2018-02-18 Thread Tomasz Pala
On Sun, Feb 18, 2018 at 10:28:02 +0100, Tomasz Pala wrote:

> I've already noticed this problem on February 10th:
> [btrfs-progs] coreutils-like -i parameter, splitting permissions for various 
> tasks
> 
> In short: not possible. Regular user can only create subvolumes.

Not possible "oficially". Axel Burri has replied with more helpful approach:
https://github.com/digint/btrfs-progs-btrbk

Unfortunately this issue was not picked up by any developer, so for now
we can only wait for splitting libbtrfsutil so this task could be easier.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: your mail

2018-02-18 Thread Tomasz Pala
On Sun, Feb 18, 2018 at 08:14:25 +, Tomasz Kłoczko wrote:

> For some reasons btrfs pool each volume is not displayed in mount and
> df output, and I cannot find how to display volumes/snapshots usage
> using btrfs command.

In general: not possible without enabling quotas, which in turn impact
snapshot performance significantly.

btrfs quota enable /
btrfs quota rescan /
btrfs qgroup sh --sort=excl /

> So now I have many volumes and snapshots in my home directory, but to
> maintain all this I must use root permission. As non-root working in
> my own home which is separated btrfs volume it would be nice to have
> the possibility to delegate permission to create, destroy,
> send/receive, mount/umount etc. snapshots, volumes like it os possible
> on zfs.

I've already noticed this problem on February 10th:
[btrfs-progs] coreutils-like -i parameter, splitting permissions for various 
tasks

In short: not possible. A regular user can only create subvolumes.

> BTW: someone maybe started working on something like .zfs hidden
> directory functions which is in each zfs volume mountpoint?

In the btrfs world this is done differently - don't put the main (working)
subvolume in the root; instead mount some branch by default, keeping all
the subvolumes next to it. I.e. don't do:

@working_subvolume
@working_subvolume/snapshots

but:

@root_of_the_fs
@root_of_the_fs/working_subvolume
@root_of_the_fs/snapshots

In fact this is a manual workaround for the problem you've mentioned.
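A hedged sketch of such a layout; the device and subvolume names are
purely illustrative:

mount -o subvolid=5 /dev/sdX /mnt/top             # the real top of the filesystem
btrfs subvolume create /mnt/top/@home             # working subvolume
btrfs subvolume create /mnt/top/@snapshots        # snapshots live next to it, not inside
mount -o subvol=@home /dev/sdX /home              # only this branch is mounted by default
btrfs subvolume snapshot -r /mnt/top/@home /mnt/top/@snapshots/home-$(date +%F)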

> Have few or few tenths snapshots is not so big deal but the same on
> scale few hundreds, thousands or more snapshots I think that would be
> really hard without something like hidden .btrfs/snapshots directory.

With a few hundred subvolumes btrfs would fail miserably.

> After few years not using btrfs (because previously was quite
> unstable) It is really good to see that now I'm not able to crash it.

It's not crashing with the LTS 4.4 and 4.9 kernels; many reports of various
crashes in 4.12, 4.14 and 4.15 have been posted here. It is really hard to
say which of the post-4.9 kernels have a reliable btrfs.

-- 
Tomasz Pala <go...@pld-linux.org>


[btrfs-progs] coreutils-like -i parameter, splitting permissions for various tasks

2018-02-10 Thread Tomasz Pala
There is a serious flaw in btrfs subcommand handling. Since all of them
are handled by a single 'btrfs' binary, there is no way to create any
protection against accidental data loss for (the only one I've found,
but still DANGEROUS) 'btrfs subvolume delete'.

There are several protections that are being used for various commands.
For example, with zsh having hist_ignore_space enabled I have:
alias kill=' kill'
alias halt=' halt'
alias init=' init'
alias poweroff=' poweroff'
alias reboot=' reboot'
alias shutdown=' shutdown'
alias telinit=' telinit'

so that these commands are never saved into my shell history.
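(For completeness, the leading space only keeps a command out of the
history when the option is actually enabled - one zsh line:)

setopt HIST_IGNORE_SPACE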

Another system-wide protection enabled by default might be a coreutils.sh
creating aliases:

alias cp=' cp --interactive --archive --backup=numbered --reflink=auto'
alias mv=' mv --interactive --backup=numbered'
alias rm=' rm --interactive --one-file-system --interactive=once'

All such countermeasures reduce the probability of fatal mistakes.


There is no 'prompt before doing ANYTHING irreversible' option for btrfs,
so everyone needs to take special care typing commands. Since snapshotting
and managing subvolumes is a daily routine, not anything special
(like creating storage pools or managing devices), this should be more
forgiving of user errors. Since there is no other (obvious)
solution, I propose making "subvolume delete" ask for confirmation by
default, unless used with a newly introduced option, like -y (--yes).
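Until something like that exists, a local stopgap is possible - a hedged
sketch of a shell wrapper (zsh/bash), not anything shipped in btrfs-progs:

# prompt before "btrfs subvolume delete"; pass everything else straight through
btrfs() {
    if [ "$1" = "subvolume" ] && [ "$2" = "delete" ]; then
        printf 'About to run: btrfs %s\nReally delete? [y/N] ' "$*"
        read -r answer
        [ "$answer" = "y" ] || return 1
    fi
    command btrfs "$@"
}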


Moreover, since there might be different admin roles on the system,
btrfs-progs should be split into separate tools, so one could have a
quota admin without permissions for managing devices, a backup admin
with access to all the subvolumes, or a maintenance admin that could issue
scrubs or rebalance volumes. For backward compatibility, these tools
could be invoked via a 'btrfs' wrapper binary.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 08:46:32 -0500, Austin S. Hemmelgarn wrote:

>> I personally think the degraded mount option is a mistake as this 
>> assumes that a lightly degraded system is not able to work which is false.
>> If the system can mount to some working state then it should mount 
>> regardless if it is fully operative or not. If the array is in a bad 
>> state you need to learn about it by issuing a command or something. The 
>> same goes for a MD array (and yes, I am aware of the block layer vs 
>> filesystem thing here).
> The problem with this is that right now, it is not safe to run a BTRFS 
> volume degraded and writable, but for an even remotely usable system 

Mounting read-only is still better than not mounting at all.

For example, my emergency.target has limited network access and starts
an ssh server so I can recover from this situation remotely.

> with pretty much any modern distro, you need your root filesystem to be 
> writable (or you need to have jumped through the hoops to make sure /var 
> and /tmp are writable even if / isn't).

Easy to handle with systemd. Not only that, but much more is planned:

http://0pointer.net/blog/projects/stateless.html
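For illustration, the shell equivalent of those "hoops" is tiny (a hedged
sketch; the mount options are just a sensible default):

mount -o remount,ro /
mount -t tmpfs -o mode=1777,nosuid,nodev tmpfs /tmp
mount -t tmpfs -o mode=1777,nosuid,nodev tmpfs /var/tmp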

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
Just one final word, as all was already said:

On Tue, Jan 30, 2018 at 11:30:31 -0500, Austin S. Hemmelgarn wrote:

>> In other words, is it:
>> - systemd that treats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires systemd to treat it BETTER than other filesystems?
> Or maybe it's both?  I'm more than willing to admit that what BTRFS does 
> expose currently is crap in terms of usability.  The reason it hasn't 
> changed is that we (that is, the BTRFS people and the systemd people) 
> can't agree on what it should look like.

Hard to agree with someone who refuses to do _anything_.

You can choose to follow whatever - MD, LVM, ZFS - invent something
totally different, write a custom daemon, or put the timeout logic inside
the kernel itself. It doesn't matter. You know the ecosystem: it is
udev that must be signalled somehow, and systemd WILL follow.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
> A sub-volume is a BTRFS-specific concept referring to a mostly 
> independent filesystem tree within a BTRFS volume that still depends on 
> the super-blocks, chunk-tree, and a couple of other internal structures 
> from the main filesystem.

LVM volumes also depend on VG metadata. That the main btrfs 'volume'
holds the other subvolumes is only a technical difference.

>> Great example - how does systemd mount distributed/network filesystems?
>> Does it mount them blindly, in a loop, or does it fire some checks against
>> _plausible_ availability?
> Yes, but availability there is a boolean value.

No, systemd won't try to mount remote filesystems until the network is up.

> In BTRFS it's tri-state 
> (as of right now, possibly four to six states in the future depending on 
> what gets merged), and the intermediate (not true or false) state can't 
> be checked in a trivial manner.

All udev needs is: "am I ALLOWED to force-mount this, even if degraded?".

And this 'permission' must change after a user-supplied timeout.

>> In other words, is it:
>> - systemd that treats btrfs WORSE than distributed filesystems, OR
>> - btrfs that requires systemd to treat it BETTER than other filesystems?
> Or maybe it's both?  I'm more than willing to admit that what BTRFS does 
> expose currently is crap in terms of usability.  The reason it hasn't 
> changed is that we (that is, the BTRFS people and the systemd people) 
> can't agree on what it should look like.

This can be done in ANY way that allows udev to work just like it works with MD.

>> ...provided there are some measures taken for the premature operation to be
>> repeated. There are none in the btrfs ecosystem.
> Yes, because we expect the user to do so, just like LVM, and MD, and 
> pretty much every other block layer you're claiming we should be 
> behaving like.

MD and LVM export their state, so the userspace CAN react. btrfs doesn't.

>> Other init systems either fail at mounting degraded btrfs just like
>> systemd does, or have buggy workarounds reimplemented in each of them
>> just to handle a thing that should be centrally organized.
>> 
> Really? So the fact that I can mount a 2-device volume with RAID1 
> profiles degraded using OpenRC without needing anything more than adding 
> rootflags=degraded to the kernel parameters must be a fluke then...

We are talking about an automatic fallback after a timeout, not about
manually casting magic spells! And since OpenRC doesn't read rootflags at all:

grep -iE 'rootflags|degraded|btrfs' openrc/**/*

it won't support this without some extra code.

> The thing is, it primarily breaks if there are hardware issues, 
> regardless of the init system being used, but at least the other init 
> systems _give you an error message_ (even if it's really the kernel 
> spitting it out) instead of just hanging there forever with no 
> indication of what's going on like systemd does.

If your systemd waits forever and you have no error messages, report a bug
to your distro maintainer, as he is probably the one to blame for 'fixing'
what wasn't broken.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 16:09:50 +0100, Tomasz Pala wrote:

>> BCP for over a 
>> decade has been to put multipathing at the bottom, then crypto, then 
>> software RAID, than LVM, and then whatever filesystem you're using. 
> 
> Really? Let's enumerate some caveats of this:
> 
> - crypto below software RAID means double-encryption (wasted CPU),
> 
> - RAID below LVM means you're stuck with the same RAID-profile for all
>   the VGs. What if I want 3-way RAID1+0 for crucial data, RAID1 for
>   system and RAID0 for various system caches (like ccache on software
>   builder machine) or transient LVM-level snapshots.
> 
>> - RAID below filesystem means losing btrfs-RAID extra functionality,
>   like recovering data from different mirror when CRC mismatch happens,
> 
>> - crypto below LVM means encrypting everything, including data that is
>   not sensitive - more CPU wasted,

And, what is much worse - encrypting everything with the same secret.
A BIG show-stopper.

I would shred such a BCP as ineffective and insecure for both data
integrity and confidentiality.

> - RAID below LVM means no way to use SSD acceleration of part of the HDD
>   space using MD write-mostly functionality.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Tue, Jan 30, 2018 at 10:05:34 -0500, Austin S. Hemmelgarn wrote:

>> Instead, they should move their legs continuously and if the train is not
>> at the station yet, just climb back and retry.
> No, that's really not a good analogy given the fact that that check for 
> the presence of a train takes a normal person milliseconds while the 
> event being raced against (the train departing) takes minutes.  In the 

OMG... preventing races by "this would always take longer"? Seriously?

> You're already looping forever _waiting_ for the volume to appear.  How 

udev is waiting for events, not systemd. Nobody will do some crazy
cross-layer shortcuts to overcome others' laziness.

> is that any different from lopping forever trying to _mount_ the volume 

Yes, because udev doesn't mount anything, ever. Not this binary, dude!

> instead given that failing to mount the volume is not going to damage 
> things. 

A failed premature attempt to mount prevents the system from booting even
WHEN the devices become ready - this is fatal. The system boots randomly
under racy conditions.

But hey, "the devices will always appear faster than the init's attempt
to do the mount"!

Have you ever had a hardware RAID controller? Never heard about
devices appearing after 5 minutes of warming up?

> The issue here is that systemd refuses to implement any method 
> of actually retrying things that fail during startup.

1. Such methods are trivial and I've already mentioned them a dozen times.
2. They should be implemented in btrfs upstream, not systemd upstream,
   but I personally would happily help with writing them here.
3. They require a full-circle path for 'allow-degraded' to be passed
   through the btrfs code.

>> mounting BEFORE the volume is complete is FATAL - since no userspace daemon
>> would ever retrigger the mount, the system won't come up. Provide a
>> btrfsd volume manager and systemd could probably switch to using it.
> And here you've lost any respect I might have had for you.

Going personal? Then thank you for the discussion and goodbye.

Please refrain from answering me, I'm not going to discuss this any
further with you.

> **YOU DO NOT NEED A DAEMON TO DO EVERY LAST TASK ON THE SYSTEM**

Sorry dude, but I won't repeat all the alternatives for the 5th time.

You *all* refuse to step into ANY possible solution mentioned.
You *all* expect systemd to do ALL the job, just like other init
systems were forced to, against good design principles.

Good luck getting degraded btrfs mounts under systemd.

> 
> This is one of the two biggest things I hate about systemd(the journal 
> is the other one for those who care).

The journal currently has *many* drawbacks, but this is not 'by design',
rather 'appropriate code missing for now'. The same applies to btrfs,
doesn't it?

> You don't need some special daemon to set the time,

Ever heard about NTP?

> or to set the hostname,

FUD - no such daemon

> or to fetch account data,

FUD

> or even to track who's logged in

FUD

> As much as it may surprise the systemd developers, people got on just 
> fine handling setting the system time, setting the hostname, fetching 
> account info, tracking active users, and any number of myriad other 
> tasks before systemd decided they needed to have their own special daemon.
> 

Sure, in a myriad of scattered distro-specific files. The only
reason systemd stepped in for some of these is that nobody else could
introduce and enforce a Linux-wide consensus. And if anyone had succeeded,
there would be some Austins blaming them for 'turning the good old
trash yard into a coherent de facto standard.'

> In this particular case, you don't need a daemon because the kernel does 
> the state tracking. 

Sure, MD doesn't require a daemon and neither does LVM. But they
do provide one - I know, they are all wrong.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 21:44:23 -0700, Chris Murphy wrote:

> Btrfs is orthogonal to systemd's willingness to wait forever while
> making no progress. It doesn't matter what it is, it shouldn't wait
> forever.

It times out after 90 seconds (by default) and then it fails the mount
entirely.

> It occurs to me there are such systemd service units specifically for
> waiting for example
> 
> systemd-networkd-wait-online.service, systemd-networkd-wait-online -
> Wait for network to come online
> 
>  chrony-wait.service - Wait for chrony to synchronize system clock
> 
> NetworkManager has a version of this. I don't see why there can't be a
> wait for Btrfs to normally mount,

Because mounting a degraded btrfs without -o degraded won't WAIT for
anything - it just immediately returns failure.

> just simply try to mount, it fails, wait 10, try again, wait 10 try again.

For the last time:

No
Such
Logic
In
Systemd
CORE

Every wait/repeat is done using UNITS - as you noticed
yourself. And these are plain, regular UNITS.

Is there anything that prevents YOU, Chris, from writing these UNITS for
btrfs?

I know what makes ME stop writing these units - it's the lack of feedback
from the btrfs.ko ioctl handler. Without it I am unable to write UNITS
handling fstab mount entries, because the logic would PROBABLY have to
be hardcoded inside systemd-fstab-generator.

And such logic MUST NOT be hardcoded - this MUST be user-configurable,
i.e. done at the UNITS level.

You might argue that some distros' SysV units or Gentoo's OpenRC have
support for this, and that if you want to change anything it is only a few
lines of shell code to be altered. But systemd-fstab-generator is a
compiled binary and so WON'T allow the behaviour to be user-configured.

> And then fail the unit so we end up at a prompt.

This can also be easily done, just like the emergency shell spawns
when configured - if only btrfs could accept and keep the information
that a volume is allowed to be mounted degraded.


OK, to be honest I _can_ write such rules now, keeping the
'allow-degraded' state somewhere else (in a file, for example).

But since this is a non-standardized side channel, such code won't
be accepted in systemd upstream, especially because it requires the
current udev rule to be slightly changed.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 14:00:53 -0500, Austin S. Hemmelgarn wrote:

> We already do so in the accepted standard manner.  If the mount fails 
> because of a missing device, you get a very specific message in the 
> kernel log about it, as is the case for most other common errors (for 
> uncommon ones you usually just get a generic open_ctree error).  This is 
> really the only option too, as the mount() syscall (which the mount 
> command calls) returns only 0 on success or -1 and an appropriate errno 
> value on failure, and we can't exactly go about creating a half dozen 
> new error numbers just for this (well, technically we could, but I very 
> much doubt that they would be accepted upstream, which defeats the purpose).

This is exactly why a separate communication channel, the ioctl, is
currently used. And I really don't understand why you fight against
expanding this ioctl's response.

> With what you're proposing for BTRFS however, _everything_ is a 
> complicated decision, namely:
> 1. Do you retry at all?  During boot, the answer should usually be yes, 
> but during normal system operation it should normally be no (because we 
> should be letting the user handle issues at that point).

This is exactly why I propose introducing an ioctl in btrfs.ko that
accepts userspace-configured expectations (as a per-volume policy).

> 2. How long should you wait before you retry?  There is no right answer 
> here that will work in all cases (I've seen systems which take multiple 
> minutes for devices to become available on boot), especially considering 
> those of us who would rather have things fail early.

A btrfs-last-resort@.timer, by analogy to mdadm-last-resort@.timer.
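A hedged sketch of what such a pair could look like (modeled on mdadm's
units; note that no command or ioctl exists today to allow a degraded
assembly - the first ExecStart below is a hypothetical placeholder for
exactly that missing piece):

# /etc/systemd/system/btrfs-last-resort@.timer (sketch)
[Unit]
Description=Timeout waiting for all devices of btrfs %I

[Timer]
OnActiveSec=30

# /etc/systemd/system/btrfs-last-resort@.service (sketch)
[Unit]
Description=Allow degraded assembly of btrfs %I

[Service]
Type=oneshot
# HYPOTHETICAL subcommand - the missing "tell the kernel degraded is OK" step
ExecStart=/sbin/btrfs device allow-degraded /dev/%i
# re-run the udev rule so SYSTEMD_READY can be re-evaluated
ExecStart=/bin/udevadm trigger --action=change /dev/%i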

> 3. If the retry fails, do you retry again?  How many times before it 
> just outright fails?  This is going to be system specific policy.  On 
> systems where devices may take a while to come online, the answer is 
> probably yes and some reasonably large number, while on systems where 
> devices are known to reliably be online immediately, it makes no sense 
> to retry more than once or twice.

All of this is the systemd timer/service's job.

> 4. If you are going to retry, should you try a degraded mount?  Again, 
> this is going to be system specific policy (regular users would probably 
> want this to be a yes, while people who care about data integrity over 
> availability would likely want it to be a no).

Just like above - user-configured in systemd timers/services easily.

> 5. Assuming you do retry with the degraded mount, how many times should 
> a normal mount fail before things go degraded?  This ties in with 3 and 
> has the same arguments about variability I gave there.

As above.

> 6. How many times do you try a degraded mount before just giving up? 
> Again, similar variability to 3.
> 7. Should each attempt try first a regular mount and then a degraded 
> one, or do you try just normal a couple times and then switch to 
> degraded, or even start out trying normal and then start alternating? 
> Any of those patterns has valid arguments both for and against it, so 
> this again needs to be user configurable policy.
> 
> Altogether, that's a total of 7 policy decisions that should be user 
> configurable. 

All of them are easy to implement if btrfs.ko could accept a per-volume
'allow-degraded' instruction and return 'try-degraded' in the ioctl.

> Having a config file other than /etc/fstab for the mount 
> command should probably be avoided for sanity reasons (again, BTRFS is a 
> filesystem, not a volume manager), so they would all have to be handled 
> through mount options.  The kernel will additionally have to understand 
> that those options need to be ignored (things do try to mount 
> filesystems without calling a mount helper, most notably the kernel when 
> it mounts the root filesystem on boot if you're not using an initramfs). 
>   All in all, this type of thing gets out of hand _very_ fast.

You need to think about the two separately:
1. tracking STATE - this is remembering the 'allow-degraded' option for now,
2. configured POLICY - this is to be handled by the init system.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
er we put this into words, it is btrfs that behaves differently.

>> The 'needless complication', as you named it, usually should be the default
>> to use. Avoiding LVM? Then take care of repartitioning. Avoiding mdadm?
>> No easy way to RAID the drive (there are device-mapper tricks, they are
>> just way more complicated). Even attaching SSD cache is not trivial
>> without preparations (for bcache being the absolutely necessary, much
>> easier with LVM in place).
> For a bog-standard client system, all of those _ARE_ overkill (and 
> actually, so is BTRFS in many cases too, it's just that we're the only 
> option for main-line filesystem-level snapshots at the moment).

Such standard systems don't have multi-device btrfs volumes either, so
they are outside the problem discussed here.

>>>> If btrfs pretends to be device manager it should expose more states,
>>>
>>> But it doesn't pretend to.
>> 
>> Why mounting sda2 requires sdb2 in my setup then?
> First off, it shouldn't unless you're using a profile that doesn't 
> tolerate any missing devices and have provided the `degraded` mount 
> option.  It doesn't in your case because you are using systemd.

I have written this previously (19-22 Dec, "Unexpected raid1 behaviour"):

1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into clean state (init=/bin/sh), (or remove btrfs-scan 
tool),
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works

mounting btrfs without "btrfs device scan" doesn't work at
all without udev rules (that mimic the behaviour of that command).

> Second, BTRFS is not a volume manager, it's a filesystem with 
> multi-device support.

What is the designatum difference between 'volume' and 'subvolume'?

> The difference is that it's not a block layer, 

As a de facto design choice only.

> despite the fact that systemd is treating it as such.   Yes, BTRFS has 
> failure modes that result in regular operations being refused based on 
> what storage devices are present, but so does every single distributed 
> filesystem in existence, and none of those are volume managers either.

Great example - how does systemd mount distributed/network filesystems?
Does it mount them blindly, in a loop, or does it fire some checks against
_plausible_ availability?

In other words, is it:
- systemd that treats btrfs WORSE than distributed filesystems, OR
- btrfs that requires systemd to treat it BETTER than other filesystems?

>> There is a term for such situation: broken by design.
> So in other words, it's broken by design to try to connect to a remote 
> host without pinging it first to see if it's online?

Trying to connect to a remote host without checking whether OUR network is
already up and whether the remote target MIGHT be reachable using OUR routes.

systemd checks LOCAL conditions: being online in the case of the network,
the hardware, or virtual devices.

> In all of those cases, there is no advantage to trying to figure out if 
> what you're trying to do is going to work before doing it, because every 

...provided there are some measures taken for the premature operation to be
repeated. There are none in the btrfs ecosystem.

> There's a name for the type of design you're saying we should have here, 
> it's called a time of check time of use (TOCTOU) race condition.  It's 
> one of the easiest types of race conditions to find, and also one of the 
> easiest to fix.  Ask any sane programmer, and he will say that _that_ is 
> broken by design.

Explained before.

>> And you still blame systemd for using BTRFS_IOC_DEVICES_READY?
> Given that it's been proven that it doesn't work and the developers 
> responsible for it's usage don't want to accept that it doesn't work?  Yes.

Remove it then.

>> Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.
>> 
> Or maybe we should just remove it completely, because checking it _IS 
> WRONG_,

That's right. But before committing upstream, check the consequences.
I've already described a few today, pointed at the source and gave some
possible alternative solutions.

> which is why no other init system does it, and in fact no 

Other init systems either fail at mounting degraded btrfs just like
systemd does, or have buggy workarounds reimplemented in each of them
just to handle a thing that should be centrally organized.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-30 Thread Tomasz Pala
On Mon, Jan 29, 2018 at 08:05:42 -0500, Austin S. Hemmelgarn wrote:

> Seriously, _THERE IS A RACE CONDITION IN SYSTEMD'S CURRENT HANDLING OF 
> THIS_.  It's functionally no different than prefacing an attempt to send 
> a signal to a process by checking if the process exists, or trying to 
> see if some other process is using a file that might be locked by 

Seriously, there is a race condition at train stations. People check whether
the train has stopped and opened the doors before they move their legs to
get in, but the train might already be gone - so this is pointless.

Instead, they should move their legs continuously and if the train is
not at the station yet, just climb back and retry.


See the difference? I hope now you see what the race condition is.
It is the condition where the CONSEQUENCES are fatal.


Mounting BEFORE the volume is complete is FATAL - since no userspace daemon
would ever retrigger the mount, the system won't come up. Provide a
btrfsd volume manager and systemd could probably switch to using it.

Mounting AFTER the volume is complete is FINE - and if the "pseudo-race" happens
and the volume disappears, then this was either some operator action, so the
umount SHOULD happen, or we are facing some MALFUNCTION, which is fatal
in itself, not because it is a "race condition".

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-29 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:

> systemd can't possibly need to know more information than a person
> does in the exact same situation in order to do the right thing. No
> human would wait 10 minutes, let alone literally the heat death of the
> planet for "all devices have appeared" but systemd will. And it does

We're already repeating ourselves - systemd waits for THE btrfs compound
device, not for ALL the block devices. Just like it 'waits' for someone to
plug a USB pendrive in.

It is a btrfs choice not to expose the compound device as a separate one
(like every other device manager does), it is a btrfs drawback that it
doesn't provide anything else except this IOCTL with its logic, it is a
btrfs drawback that there is nothing to push assembly into an "OK, going
degraded" state, it is a btrfs drawback that there are no states...

I've said it already - pretend the /dev/sda1 device doesn't
exist until assembled. If this overlapping usage was designed with
'easier mounting' in mind, it is simply bad design.

> that by its own choice, its own policy. That's the complaint. It's
> choosing to do something a person wouldn't do, given identical
> available information.

You are expecting systemd to mix in the functions of the kernel and udev.
There is NO concept of 'assembled stuff' in systemd AT ALL.
There is NO concept of 'waiting' in udev AT ALL.
If you want some crazy inter-layer shortcuts, just implement a btrfsd.

> There's nothing the kernel is doing that's
> telling systemd to wait for goddamn ever.

There's nothing the kernel is doing that's
telling udev there IS a degraded device assembled to be used.

There's nothing a userspace thing is doing that's
telling udev to mark a degraded device as mountable.

There is NO DEVICE to be mounted, so systemd doesn't mount it.

The difference is:

YOU think that the sda1 device is ephemeral, as it's covered by the sda1 btrfs
device that COULD BE mounted.

I think that there is a real sda1 block device, following the Linux rules of
system registration, which CAN be overtaken by an ephemeral btrfs compound device.
Can I mount that thing above the sda1 block device? ONLY when it's properly
registered in the system.

Does the btrfs compound device register in the system? - Yes, but only when
fully populated.

Just don't expect people to break their code with broken designs just
to overcome your own limitations. If you want systemd to mount a degraded
btrfs volume, just MAKE IT REGISTER in the system.

How can btrfs register in the system while degraded? Either by some
userspace daemon handling btrfs volume states (which are missing from
the kernel), or by some IOCTLs altering in-kernel state.


So for the last time: nobody will break his own code to patch missing
code from another (actively maintained) subsystem.

If you expect degraded mounts, there are 2 choices:

1. implement the degraded STATE _some_where_ - udev would handle falling
   back to a degraded mount after a specified timeout,

2. change this IOCTL to _always_ return 1 - udev would register any
   btrfs device, but you will get random behaviour of
   degraded/populated mounts. But you should expect that, since there is
   no concept of any state below.


Actually, this is ridiculous - you expect the degradation to be handled
in some 3rd-party software?! In the init system? When the only thing you've
got is the 'degraded' mount option?!
What next - moving the MD and LVM logic into systemd?

This is not systemd's job - there are
btrfs-specific kernel cmdline options to be parsed (allowing degraded
volumes), and there is tracking of volume health required.
Yes, a device manager needs to track its components, a RAID controller
needs to track the minimum required redundancy. It's not only about
mounting. But making degraded mounting work is easy; only this one
particular ioctl needs to be fixed:

1. too few devices counted => not_ready

2. enough devices counted to run degraded => ok_degraded

3. counted devices == all => ok


If btrfs DISTINGUISHED these states (in particular not_ready from ok_degraded), systemd would be able to use them.


You might ask why it is important for the state to be kept inside something
btrfs-related, like the kernel or a btrfsd, when a systemd timer
could do the same and 'just mount degraded'. The answer is simple:
a systemd timer is just a sane default CONFIGURATION that can be EASILY
changed by the system administrator. But somewhere, sometime, someone will
have a NEED for a totally different set of rules for handling degraded
volumes, just like MD or LVM provide. It would be totally irresponsible
to hardcode any mount-degraded rule inside systemd itself.

That is exactly why this must go through udev - udev is responsible
for handling devices in the Linux world. How can I register a btrfs device
in udev, since it overlaps the block device? I can't - the ioctl
is one-way and doesn't accept any userspace feedback.

-- 
Tomasz Pala <go...@pld-linux.org>

Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 13:28:55 -0700, Chris Murphy wrote:

>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> "not ready" is rather vague terminology but yes that's how systemd
> ends up using the ioctl this rule depends on, even though the rule has
> nothing to do with readiness per se. If all devices for a volume

If you avoid using THIS ioctl, then you have nothing to fire the rule
on at all. One way or another, it is btrfs that must emit _some_ event or
be polled _somehow_.

> aren't found, we can correctly conclude a normal mount attempt *will*
> fail. But that's all we can conclude. What I can't parse in all of
> this is if the udev rule is a one shot, if the ioctl is a one shot, if
> something is constantly waiting for "not all devices are found" to
> transition to "all devices are found" or what. I can't actually parse

It's not one shot. This works like this:

sda1 appears -> udev catches event -> udev detects btrfs and IOCTLs => not ready
sdb1 appears -> udev catches event -> udev detects btrfs and IOCTLs => ready

The end.

If there were some other device appearing after assembly, like /dev/md1,
or if there were some event generated by btrfs code itself, udev could
catch this and follow. Now, if you unplug sdb1, there's no such event at
all.

Since this IOCTL is the *only* thing that udev can rely on, it cannot be
removed from the logic. So even if you create a timer to force assembly,
you must do it by influencing the IOCTL response.

Or by creating some other IOCTL for this purpose, or some userspace
daemon, or whatever.

> the two critical lines in this rule. I
> 
> # let the kernel know about this btrfs filesystem, and check if it is complete
> IMPORT{builtin}="btrfs ready $devnode"

This sends the IOCTL.

> # mark the device as not ready to be used by the system
> ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
  ^^this is IOCTL response being checked

and SYSTEMD_READY set to 0 prevents systemd from mounting.

> I think the Btrfs ioctl is a one shot. Either they are all present or not.

The rules are called once per (block) device.
So when btrfs has scanned all the devices and returns READY, the device would
finally be systemd-ready. It is trivial to re-trigger the udev rule (udevadm
trigger), but there is no way to force btrfs to return READY after any timeout.
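(For illustration, re-triggering and inspecting the result by hand - the
device path is just an example:)

udevadm trigger --action=change /dev/sda1
udevadm info --query=property --name=/dev/sda1 | grep ID_BTRFS_READY
# stays at 0 until every device of the volume has been scanned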

> The waiting is a policy by systemd udev rule near as I can tell.

There is no problem in waiting or re-triggering. This can be done in ~10
lines of rules. The problem is that the IOCTL won't EVER return READY until
ALL the components are present.

It's as simple as that: there MUST be some mechanism at the device-manager
level that tells whether a compound device is mountable, degraded or not;
upper layers (systemd-mount) do not care about degradation - handling
redundancy/mirrors/chunks/stripes/spares is not their job.
It (systemd) can (easily!) handle an expiration timer to push a pending
compound device to be force-assembled, but currently there is no way to push.


If the IOCTL were extended to return TRYING_DEGRADED (when
instructed to do so after an expired timeout), systemd could handle
additional per-filesystem fstab options, like x-systemd.allow-degraded.

Then it would be possible to have a best-effort policy for the rootfs (to make
the machine boot), and a stricter one for crucial data (do not mount it
when there is no redundancy; wait for operator intervention).
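For illustration only - x-systemd.allow-degraded does not exist, it is the
hypothetical per-filesystem knob argued for here (x-* mount options are
ignored by the kernel and only interpreted by userspace):

# /etc/fstab (sketch; UUIDs are placeholders)
UUID=<rootfs-uuid>  /      btrfs  defaults,x-systemd.allow-degraded=30s  0 0
UUID=<data-uuid>    /data  btrfs  defaults                               0 0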

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 13:02:08 -0700, Chris Murphy wrote:

>> Tell me please, if you mount -o degraded btrfs - what would
>> BTRFS_IOC_DEVICES_READY return?
> 
> case BTRFS_IOC_DEVICES_READY:
> ret = btrfs_scan_one_device(vol->name, FMODE_READ,
> &btrfs_fs_type, &fs_devices);
> if (ret)
> break;
> ret = !(fs_devices->num_devices == fs_devices->total_devices);
> break;
> 
> 
> All it cares about is whether the number of devices found is the same
> as the number of devices any of that volume's supers claim make up
> that volume. That's it.
>
>> This is not "outsmarting" nor "knowing better"; on the contrary, this is
>> "FOLLOWING the kernel-returned data". The umounting case is simply a bug
>> in btrfs.ko, which should switch to the READY state *if* someone has tried
>> and apparently succeeded in mounting the not-ready volume.
> 
> Nope. That is not what the ioctl does.

So who is to blame for creating utterly useless code? Userspace
shouldn't depend on some statistics (and the number of devices is nothing
more than that), but on overall _availability_.

I do not care whether there are 2, 5 or 100 devices. I do care whether there
are ENOUGH devices to run regularly (including N-way mirroring and hot spares)
and, if not, whether there are ENOUGH devices to run degraded. Having ALL the
devices is just the edge case.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 01:00:16 +0100, Tomasz Pala wrote:

> It can't mount degraded, because the "missing" device might go online a
> few seconds ago.

s/ago/after/

>> The central problem is the lack of a timer and time out.
> 
>> You have mdadm-last-resort@.timer/.service above; if btrfs doesn't lack
>> anything, as you all state here, it should be easy to make this work.
>> Go ahead, please.

And just to make it even easier - this is how you can react to events
inside udev (this is to eliminate the btrfs-scan tool being required, as it sucks):

https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

One could even try to trick systemd by SETTING (note the single '=')

ENV{ID_BTRFS_READY}="0"

- which would probably break as soon as btrfs.ko emits the next 'changed' event.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-28 Thread Tomasz Pala
On Sun, Jan 28, 2018 at 11:06:06 +0300, Andrei Borzenkov wrote:

>> All systemd has to do is leave the mount alone that the kernel has 
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

And even after a successful manual mount (with -o degraded), btrfs.ko
insists that the device is not ready.

That schizophrenia makes systemd umount it immediately, because this
is the only proper way to handle missing devices (only the failed ones
should go read-only). And there is really nothing systemd can do about this
until the underlying code stops lying - unless we're going back to the 1990s,
when devices were never unplugged or detached during system uptime. But even
floppies could be ejected without a system reboot.

BTRFS is no exception here - when it is marked as 'not available',
don't expect it to be kept in use. Just fix the code to match reality.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 15:22:38 +, Duncan wrote:

>> manages to mount degraded btrfs without problems.  They just don't try
>> to outsmart the kernel.
> 
> No kidding.
> 
> All systemd has to do is leave the mount alone that the kernel has 
> already done, instead of insisting it knows what's going on better than 
> the kernel does, and immediately umounting it.

Tell me please, if you mount -o degraded btrfs - what would
BTRFS_IOC_DEVICES_READY return?

This is not "outsmarting" nor "knowing better", on the contrary, this is 
"FOLLOWING the
kernel-returned data". The umounting case is simply a bug in btrfs.ko
that should change to READY state *if* someone has tried and apparently
succeeded mounting the not-ready volume.

Otherwise - how should any part of the system behave when you detach some
drive? Insist that "the kernel has already mounted it" and ignore the kernel
screaming "the device is (not there yet/gone)"?


Just update the internal state after a successful mount and this
particular problem is gone. Unless there is some race condition, in which
case the state should be changed before the mount is announced to userspace.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 14:12:01 -0700, Chris Murphy wrote:

> doesn't count devices itself. The Btrfs systemd udev rule defers to
> Btrfs kernel code by using BTRFS_IOC_DEVICES_READY. And it's totally
> binary. Either they are all ready, in which case it exits 0, and if
> they aren't all ready it exits 1.
> 
> But yes, mounting whether degraded or not is sufficiently complicated
> that you just have to try it. I don't get the point of wanting to know
> whether it's possible without trying. Why would this information be

If you want to blindly try it, just tell btrfs.ko to flip the IOCTL bit.

No shortcuts please - do it legitimately, where it belongs.

>> Ie, the thing systemd can safely do, is to stop trying to rule everything,
>> and refrain from telling the user whether he can mount something or not.
> 
> Right. Open question is whether the timer and timeout can be
> implemented in the systemd world and I don't see why not, I certainly

It can. The reasons why it's not already there follow:

1. no one has created udev rules and systemd units for btrfs-progs yet (that
   part is trivial),
2. btrfs is not degraded-safe yet (the rules would have to check that the
   filesystem won't get stuck in read-only mode, for example; this is NOT
   trivial),
3. there is no way to tell the kernel that we want a degraded mount (probably
   some new IOCTL is needed) - this is the path the timer would use to trigger
   the udev event releasing the systemd mount.

Let me repeat this so it is clear: this is NOT going to work
as some systemd shortcut that just runs "mount -o degraded"; this must go
through the kernel IOCTL -> udev -> systemd path, i.e.:

timer expires -> executes an IOCTL saying "OK, give me degraded /dev/blah" ->
BTRFS_IOC_DEVICES_READY returns "READY" (or a new value "DEGRADED") -> udev
catches the event and changes SYSTEMD_READY -> systemd mounts the volume.


This is really simple. All you need to do is pass "degraded" to
btrfs.ko, so that BTRFS_IOC_DEVICES_READY returns "go ahead".

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
ferent purposes [*],
2. lack the timing-out/degraded logic implemented somewhere.

> issues a command to start the array anyway, and only then do you find
> out if there are enough devices to start it. I don't understand the
> value of knowing whether it is possible. Just try to mount it degraded
> and then if it fails we fail, nothing can be done automatically it's
> up to an admin.

It can't just mount degraded right away, because the "missing" device might
still come online a few seconds later.

> And even if you had this "degraded mount possible" state, you still
> need a timer. So just build the timer.

Exactly! This "timer" is btrfs-specific daemon that should be shipped
with btrfs-tools. Well, maybe not the actual daemon, as btrfs handles
incremental assembly on it's own, just the appropriate units and
signalling.

For mdadm there is --incremental used for gradually assemble via udev rule:
https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/udev-md-raid-assembly.rules
(this also fires timer)

and the systemd part for timing-out and degraded fallback:
https://git.kernel.org/pub/scm/utils/mdadm/mdadm.git/tree/systemd
mdadm-last-resort@.timer
mdadm-last-resort@.service

There is appropriate code in LVM as well, using lvmetad, but this one is easier.

So, let's step through your proposal:

> If all devices ready ioctl is true, the timer doesn't start, it means
> all devices are available, mount normally.

sure

> If all devices ready ioctl is false, the timer starts, if all devices
> appear later the ioctl goes to true, the timer is belayed, mount
> normally.

sure

> If all devices ready ioctl is false, the timer starts, when the timer
> times out, mount normally which fails and gives us a shell to
> troubleshoot at.
> OR
> If all devices ready ioctl is false, the timer starts, when the timer
> times out, mount with -o degraded which either succeeds and we boot or
> it fails and we have a troubleshooting shell.

Don't mix layers - just imagine your /dev/sda1 is not there and you simply
cannot even try to mount it; this should work like this:

If all devices ready ioctl is false, the timer starts, when the timer
times out, TELL THE KERNEL THAT WE WANT DEGRADED MOUNT. This in turn
should switch IOCTL response to "OK, go degraded" which in turn would
make udev rule to raise the flag[*] and then systemd could mount this.

It is important that the kernel be instructed this way, and not the other
way around, as this also gives the chance to pass the degraded option on
the kernel cmdline.

> The central problem is the lack of a timer and time out.

You've got mdadm-last-resort@.timer/.service above; if btrfs doesn't lack
anything, as you all state here, it should be easy to make this work.
Go ahead, please.
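
For illustration only, a btrfs analogue of those mdadm units could look
roughly like below - the unit names are invented and the ExecStart is a
placeholder, because the kernel-side "accept degraded" knob it would have
to call does not exist yet; a udev rule would also have to pull the timer
in for not-yet-ready devices, just like mdadm's rules do:

# btrfs-last-resort@.timer (hypothetical)
[Timer]
OnActiveSec=30

# btrfs-last-resort@.service (hypothetical)
[Service]
Type=oneshot
# would tell btrfs.ko to report %i as ready-degraded; no such interface today
ExecStart=/usr/sbin/btrfs-accept-degraded /dev/%i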

>> Unless there is *some* signalling from btrfs, there is really not much
>> systemd can *safely* do.
> 
> That is not true. It's not how mdadm works anyway.

Yes it does. You can't mount an md array until /dev/mdX appears, which happens
when the array gets fully assembled *OR* the timer expires and the kernel gets
instructed to run the array as degraded, which results in /dev/mdX appearing.
There is NO additional logic in systemd.

It is NOT systemd that assembles a degraded md array; it is mdadm that
tells the kernel to assemble it, and systemd mounts the READY md device.
Moreover, systemd gives you a set of tools that you can use for the timers.

[*] the udev flag is required to distinguish the /dev/sda1 block device being
ready from the /dev/sda1 btrfs volume being ready. If a separate device node
were created, there would be no need for this entire IOCTL.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 14:26:41 +0100, Adam Borowski wrote:

> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try to
> outsmart the kernel.

Yes. They are stupid enough to fail miserably with any more complicated
setups, like stacking volume managers, crypto layer, network attached
storage etc.
Recently I've started mdadm on top of bunch of LVM volumes, with others
using btrfs and others prepared for crypto. And you know what? systemd
assembled everything just fine.

So, with an argument just like yours:

It's quite obvious who's the culprit: every single remaining filesystem
manages to mount under systemd without problems. They just expose
information about their state.

>> This is not a systemd issue, but apparently btrfs design choice to allow
>> using any single component device name also as volume name itself.
> 
> And what other user interface would you propose? The only alternative I see
> is inventing a device manager (like you're implying below that btrfs does),
> which would needlessly complicate the usual, single-device, case.

The 'needless complication', as you named it, usually should be the default.
Avoiding LVM? Then take care of repartitioning. Avoiding mdadm? There is no
easy way to RAID the drive (there are device-mapper tricks, they are just way
more complicated). Even attaching an SSD cache is not trivial without
preparations (they are absolutely necessary for bcache, and it is much easier
with LVM in place).

>> If btrfs pretends to be device manager it should expose more states,
> 
> But it doesn't pretend to.

Why does mounting sda2 require sdb2 in my setup, then?

>> especially "ready to be mounted, but not fully populated" (i.e.
>> "degraded mount possible"). Then systemd could _fallback_ after timing
>> out to degraded mount automatically according to some systemd-level
>> option.
> 
> You're assuming that btrfs somehow knows this itself.

"It's quite obvious who's the culprit: every single volume manager keeps
track of it's component devices".

>  Unlike the bogus
> assumption systemd does that by counting devices you can know whether a
> degraded or non-degraded mount is possible, it is in general not possible to
> know whether a mount attempt will succeed without actually trying.

There is a term for such situation: broken by design.

> Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
> naive counting of this kind, it had to be replaced by actually checking
> whether at least one copy of every block group is actually present.

And you still blame systemd for using BTRFS_IOC_DEVICES_READY?

[...]
> just slow to initialize (USB...).  So, systemd asks sda how many devices
> there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
> even ask for UUIDs -- all devices are present.  So, mount will succeed,
> right?

Systemd doesn't count anything, it asks BTRFS_IOC_DEVICES_READY as
implemented in btrfs/super.c.

> Ie, the thing systemd can safely do, is to stop trying to rule everything,
> and refrain from telling the user whether he can mount something or not.

Just change the BTRFS_IOC_DEVICES_READY handler to always return READY.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: degraded permanent mount option

2018-01-27 Thread Tomasz Pala
On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:

>> I just tested to boot with a single drive (raid1 degraded), even with 
>> degraded option in fstab and grub, unable to boot ! The boot process stop on 
>> initramfs.
>> 
>> Is there a solution to boot with systemd and degraded array ?
> 
> No. It is finger pointing. Both btrfs and systemd developers say
> everything is fine from their point of view.

Treating a btrfs volume as ready by systemd would open a window of
opportunity in which the volume would be mounted degraded _despite_ all the
components being (meaning: "about to be") available - just like Chris Murphy
wrote; provided there is -o degraded somewhere.

This is not a systemd issue, but apparently btrfs design choice to allow
using any single component device name also as volume name itself.

IF a volume has the degraded flag, then it is btrfs' job to mark it as ready:

>>>>>> ... and it still does not work even if I change it to root=/dev/sda1
>>>>>> explicitly because sda1 will *not* be announced as "present" to
>>>>>> systemd> until all devices have been seen once ...

...so this scenario would obviously and magically start working.

As for the regular by-UUID mounts: these links are created by udev WHEN
underlying devices appear. Does btrfs volume appear? No.

If btrfs pretends to be device manager it should expose more states,
especially "ready to be mounted, but not fully populated" (i.e.
"degraded mount possible"). Then systemd could _fallback_ after timing
out to degraded mount automatically according to some systemd-level
option.

Unless there is *some* signalling from btrfs, there is really not much
systemd can *safely* do.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-22 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 17:08:28 -0700, Chris Murphy wrote:

>>>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>>>> safely add "degraded" to the mount options? My primary concern is the
>> [...]
> 
> Well it only does rw once, then the next degraded is ro - there are
> patches dealing with this better but I don't know the state. And
> there's no resync code that I'm aware of, absolutely it's not good
> enough to just kick off a full scrub - that has huge performance
> implications and I'd consider it a regression compared to
> functionality in LVM and mdadm RAID by default with the write intent
> bitmap.  Without some equivalent short cut, automatic degraded means a

I read about 'scrub' all the time here, so let me ask this
directly, as this is also not documented clearly:

1. is a full scrub required after ANY desync (like: a degraded mount
followed by re-adding the old device)?

2. if the scrub is omitted - is it possible that btrfs returns invalid data
(from the desynced and re-added drive)?

3. is the scrub required to be scheduled on a regular basis? By 'required'
I mean due to design/implementation issues/quirks, _not_ related to possible
hardware malfunctions.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-22 Thread Tomasz Pala
On Fri, Dec 22, 2017 at 14:04:43 -0700, Chris Murphy wrote:

> I'm pretty sure degraded boot timeout policy is handled by dracut. The

Well, the last time I checked, dracut on a systemd system couldn't even
generate a systemd-less image.

> kernel doesn't just automatically assemble an md array as soon as it's
> possible (degraded) and then switch to normal operation as other

MD devices are explicitly listed in mdadm.conf (for mdadm --assemble
--scan) or kernel command line or metadata of autodetected partitions (fd).

> devices appear. I have no idea how LVM manages the delay policy for
> multiple devices.

I *guess* it's not about waiting, but simply being executed after the
devices are ready.

And there is a VERY long history of various init systems having problems
to boot systems using multi-layer setups (LVM/MD under or above LUKS,
not to mention remote ones that need networking to be set up).

All of this works reasonably well under systemd - except for btrfs, which
uses a single device node to stand for an entire group of devices. That is
convenient for a human (no need to switch between /dev/mdX and /dev/sdX),
but impossible for userspace tools to figure out automatically.
There is only the probe IOCTL, which doesn't handle degraded mode.

> I don't think the delay policy belongs in the kernel.

That is exactly why the systemd waits for appropriate udev state.

> It's pie in the sky, and unicorns, but it sure would be nice to have
> standardization rather than everyone rolling their own solution. The

There was a de facto standard I think - expose component devices or
require them to be specified. Apparently no such thing in btrfs, so
it must be handled in btrfs-way.

Also note that MD can be assembled by kernel itself, while btrfs cannot
(so initrd is required for rootfs).

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-22 Thread Tomasz Pala
On Thu, Dec 21, 2017 at 07:27:23 -0500, Austin S. Hemmelgarn wrote:

> No, it isn't.  You can just make the damn mount call with the supplied 
> options.  If it succeeds, the volume was ready, if it fails, it wasn't, 
> it's that simple, and there's absolutely no reason that systemd can't 
> just do that in a loop until it succeeds or a timeout is reached.  That 

There is no such loop, so if the mount happened before all the required
devices showed up, it would either simply fail or, if 'degraded' were in
fstab, just start degraded.

> any of these issues with the volume being completely unusable in a 
> degraded state.
> 
> Also, it's not 'up to the filesystem', it's 'up to the underlying 
> device'.  LUKS, LVM, MD, and everything else that's an actual device 
> layer is what systemd waits on.  XFS, ext4, and any other filesystem 
> except BTRFS (and possibly ZFS, but I'm not 100% sure about that) 
> provides absolutely _NOTHING_ to wait on.  Systemd just chose to handle 

You wait for all the devices to settle. One might have a dozen drives,
including some attached via network, and it might take time for them to
become available. Since systemd knows nothing about the underlying
components, it simply waits for btrfs itself to announce it is ready.

> BTRFS like a device layer, and not a filesystem, so we have this crap to 

As btrfs handles multiple devices in its "lower part", it effectively is a
device layer. Mounting /dev/sda happens to mount various other /dev/sd*
devices that are _not_ explicitly exposed, so there is really no
alternative - except for the 'mount loop', which is a no-go.

> deal with, as well as the fact that it makes it impossible to manually 
> mount a BTRFS volume with missing or failed devices in degraded mode 
> under systemd (because it unmounts it damn near instantly because it 
> somehow thinks it knows better than the user what the user wants to do).

This seems to be some distro-specific misconfiguration, didn't happen to
me on plain systemd/udev. What is the reproducing scenario?

>> This integration issue was so far silently ignored both by btrfs and
>> systemd developers. 
> It's been ignored by BTRFS devs because there is _nothing_ wrong on this 
> side other than the naming choice for the ioctl.  Systemd is _THE ONLY_ 
> init system which has this issue, every other one works just fine.

Not true - mounting btrfs without "btrfs device scan" doesn't work at
all without the udev rules (which mimic the behaviour of that command).
Let me repeat the example from Dec 19th:

1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into a clean state (init=/bin/sh), or remove the btrfs-scan tool,
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works

> As far as the systemd side, I have no idea why they are ignoring it, 
> though I suspect it's the usual spoiled brat mentality that seems to be 
> present about everything that people complain about regarding systemd.

Explanation above. This is the point where _you_ need to stop ignoring
the fact that you simply cannot just try mounting devices in a loop, as
this would render any NAS/FC/iSCSI-backed or more complicated system
unusable, or hide problems in case of temporary connection issues.

systemd waits for the _underlying_ device - unless btrfs exposes a list of
the _actual_ devices to wait for, there is nothing systemd can do except
wait for btrfs itself.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-20 Thread Tomasz Pala
Errata:

On Wed, Dec 20, 2017 at 09:34:48 +0100, Tomasz Pala wrote:

> /dev/sda -> 'not ready'
> /dev/sdb -> 'not ready'
> /dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready'
> /dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready'

The last line should start with /dev/sdd.

> After such timeout, I'd like to tell the kernel: "no more devices, give
> me all the remaining btrfs volumes in degraded mode if possible". By

Actually "if possible" means both:
- if technically possible (i.e. required data is available, like half of RAID1),
- AND if allowed for specific volume as there might be different policies.

For example - one might allow the rootfs to be started in degraded-rw mode so
that the system can boot up, /home in degraded read-only mode so that users
have access to their files, and not mount /srv degraded at all.
A failed mount can be made non-critical with the 'nofail' fstab flag.
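
An illustrative fstab sketch of such per-volume policies (UUIDs invented;
note this only affects the options used once the mount is finally
attempted - there is still no automatic degraded fallback behind it):

UUID=1111-... /     btrfs  defaults,degraded                             0 0
UUID=2222-... /home btrfs  defaults,degraded,ro,nofail                   0 0
UUID=3333-... /srv  btrfs  defaults,nofail,x-systemd.device-timeout=120  0 0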

> "give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
> fire it's rules. And if there would be anything for udev to distinguish
> 'ready' from 'ready-degraded' one could easily compose some notification
> scripting on top of it, including sending e-mail to sysadmin.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-20 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 16:59:39 -0700, Chris Murphy wrote:

>> Sth like this? I got such problem a few months ago, my solution was
>> accepted upstream:
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
> 
> I can't parse this commit. In particular I can't tell how long it
> waits, or what triggers the end to waiting.

The point is - it doesn't wait at all. Instead, every 'ready' btrfs
device triggers event on all the pending devices. Consider 3-device
filesystem consisting of /dev/sd[abd] with /dev/sdc being different,
standalone btrfs:

/dev/sda -> 'not ready'
/dev/sdb -> 'not ready'
/dev/sdc -> 'ready', triggers /dev/sda -> 'not ready' and /dev/sdb - still 'not ready'
/dev/sdc -> kernel says 'ready', triggers /dev/sda - 'ready' and /dev/sdb -> 'ready'

This way all the parts of a volume are marked as ready, so systemd won't
refuse mounting using legacy device nodes like /dev/sda.


This particular solution depends on kernel returning 'btrfs ready',
which would obviously not work for degraded arrays unless the btrfs.ko
handles some 'missing' or 'mount_degraded' kernel cmdline options
_before_ actually _trying_ to mount it with -o degraded.

And there is a logical problem with this - _which_ array components
should be ignored? Consider:

volume1: /dev/sda /dev/sdb
volume2: /dev/sdc /dev/sdd-broken

If /dev/sdd is missing from the system, it would never be scanned, so
/dev/sdc would be pending. It cannot be assembled just in time of
scanning alone, because the same would happen with /dev/sda and there
would be desync with /dev/sdb, which IS available - a few moments later.

This is the place for the timeout you've mentioned - there should be
*some* decent timeout allowing all the devices to show up (udev waits
for 90 seconds by default or x-systemd.device-timeout=N from fstab).

After such timeout, I'd like to tell the kernel: "no more devices, give
me all the remaining btrfs volumes in degraded mode if possible". By
"give me btrfs vulumes" I mean "mark them as 'ready'" so the udev could
fire it's rules. And if there would be anything for udev to distinguish
'ready' from 'ready-degraded' one could easily compose some notification
scripting on top of it, including sending e-mail to sysadmin.

Is there anything that would make the kernel do the above?

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-19 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 15:47:03 -0500, Austin S. Hemmelgarn wrote:

>> Sth like this? I got such problem a few months ago, my solution was
>> accepted upstream:
>> https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f
>> 
>> Rationale is in referred ticket, udev would not support any more btrfs
>> logic, so unless btrfs handles this itself on kernel level (daemon?),
>> that is all that can be done.
> Or maybe systemd can quit trying to treat BTRFS like a volume manager 
> (which it isn't) and just try to mount the requested filesystem with the 
> requested options?

Tried that before ("just mount my filesystem, stupid") - it is a no-go.
The source of the problem is not systemd treating BTRFS differently, but
the btrfs kernel logic that it relies on. Just to show it:

1. create 2-volume btrfs, e.g. /dev/sda and /dev/sdb,
2. reboot the system into a clean state (init=/bin/sh), or remove the btrfs-scan tool,
3. try
mount /dev/sda /test - fails
mount /dev/sdb /test - works
4. reboot again and try in reversed order
mount /dev/sdb /test - fails
mount /dev/sda /test - works

THIS readiness is exposed via udev to systemd. And it must be used for
multi-layer setups to work (consider stacked LUKS, LVM, MD, iSCSI, FC etc).

In short: until *something* scans all the btrfs component devices, so that
the kernel marks the volume ready, systemd won't even try to mount it.
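
A minimal sketch of that dependency (device names illustrative):

mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
(boot into a clean state, without udev rules or any prior scan)
mount /dev/sda /test - fails, the kernel has not seen /dev/sdb yet
btrfs device scan - registers all btrfs member devices with the kernel
mount /dev/sda /test - now succeeds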

> Then you would just be able to specify 'degraded' in 
> your mount options, and you don't have to care that the kernel refuses 
> to mount degraded filesystems without being explicitly asked to.

Exactly. But since LP refused to try mounting despite the kernel's "not-ready"
state - it is the kernel that must emit 'ready'. So the
question is: how can I make the kernel mark a degraded array as "ready"?

The obvious answer is: do it via kernel command line, just like mdadm
does:
rootflags=device=/dev/sda,device=/dev/sdb
rootflags=device=/dev/sda,device=missing
rootflags=device=/dev/sda,device=/dev/sdb,degraded

If only btrfs.ko recognized this, the kernel would be able to assemble a
multi-volume btrfs itself. Not only would this allow automated degraded
mounts, it would also allow using initrd-less kernels on such volumes.

>> It doesn't have to be default, might be kernel compile-time knob, module
>> parameter or anything else to make the *R*aid work.
> There's a mount option for it per-filesystem.  Just add that to all your 
> mount calls, and you get exactly the same effect.

If only they were passed...

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-19 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 15:11:22 -0500, Austin S. Hemmelgarn wrote:

> Except the systems running on those ancient kernel versions are not 
> necessarily using a recent version of btrfs-progs.

Still, it is much easier to update userspace tools than the kernel (consider
binary drivers for various hardware).

> So in other words, spend the time to write up code for btrfs-progs that 
> will then be run by a significant minority of users because people using 
> old kernels usually use old userspace, and people using new kernels 
> won't have to care, instead of working on other bugs that are still 
> affecting people?

I am aware of the dilemma and the answer is: it depends.
It depends on the expected usefulness of such infrastructure with regard to
_future_ changes and possible bugs.
In the case of stable/mature/frozen projects this doesn't make much sense,
as the possible incompatibilities would be very rare.
Whether it makes sense for btrfs? I don't know - it's not mature, but if the
quirk rate were too high to track the appropriate kernel versions, it might
really be better to officially state "DO USE 4.14+ kernel, REALLY".

This could be accomplished very easily - when releasing a new btrfs-progs,
check the currently available LTS kernel and use it as the base reference
for the warning.

After all, "giving users a hurt me button is not ethical programming."

>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>> safely add "degraded" to the mount options? My primary concern is the
>> machine UPTIME. I care less about the data, as they are backed up to
>> some remote location and loosing day or week of changes is acceptable,
>> brain-split as well, while every hour of downtime costs me a real money.
> In which case you shouldn't be relying on _ANY_ kind of RAID by itself, 
> let alone BTRFS.  If you care that much about uptime, you should be 
> investing in a HA setup and going from there.  If downtime costs you 

I got this handled and don't use btrfs there - the question remains:
in a situation as described above, is it safe now to add "degraded"?

To rephrase the question: can degraded RAID1 run permanently as rw
without some *internal* damage?

>> Anyway, users shouldn't look through syslog, device status should be
>> reported by some monitoring tool.
> This is a common complaint, and based on developer response, I think the 
> consensus is that it's out of scope for the time being.  There have been 
> some people starting work on such things, but nobody really got anywhere 
> because most of the users who care enough about monitoring to be 
> interested are already using some external monitoring tool that it's 
> easy to hook into.

I agree, the btrfs code should only emit events, so
SomeUserspaceGUIWhatever could display blinking exclamation mark.

>> Well, the question is: either it is not raid YET, or maybe it's time to 
>> consider renaming?
> Again, the naming is too ingrained.  At a minimum, you will have to keep 
> the old naming, and at that point you're just wasting time and making 
> things _more_ confusing because some documentation will use the old 

True, but once you realize the documentation is already flawed, it gets
easier. I still don't know, though: is it going to be RAID some day, or
won't it ever be, "by design"?

>> Ha! I got this disabled on every bus (although for different reasons)
>> after boot completes. Lucky me:)
> Security I'm guessing (my laptop behaves like that for USB devices for 
> that exact reason)?  It's a viable option on systems that are tightly 

Yes, machines are locked and only authorized devices are allowed during
boot.

> IOW, if I lose a disk in a two device BTRFS volume set up for 
> replication, I'll mount it degraded, and convert it from the raid1 
> profile to the single profile and then remove the missing disk from the 
> volume.

I was about to do the same with my r/o-stuck btrfs system; unfortunately, I
unplugged the wrong cable...

>> Writing accurate documentation requires deep undestanding of internals.
[...]
> Writing up something like that is near useless, it would only be valid 
> for upstream kernels (And if you're using upstream kernels and following 
> the advice of keeping up to date, what does it matter anyway?  The 
[...]
> kernel that fixes the issues it reports.), because distros do whatever 
> the hell they want with version numbers (RHEL for example is notorious 
> for using _ancient_ version numbers bug having bunches of stuff 
> back-ported, and most other big distros that aren't Arch, Gentoo, or 
> Slackware derived do so too to a lesser degree), and it would require 
> constant curation to keep up to date.  Only for long-term known issues 

OK, you've convinced me that kernel-vs-feature list is overhead.

So maybe other approach: just like sys

Re: Unexpected raid1 behaviour

2017-12-19 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 12:47:33 -0700, Chris Murphy wrote:

> The more verbose man pages are, the more likely it is that information
> gets stale. We already see this with the Btrfs Wiki. So are you

True. The same applies to git documentation (3rd paragraph):

https://stevebennett.me/2012/02/24/10-things-i-hate-about-git/

Fortunately this CAN be done properly - some of the best
documentation I've seen is systemd's.

What I don't like about documentation is lack of objectivity:

$ zgrep -i bugs /usr/share/man/man8/*btrfs*.8.gz | grep -v bugs.debian.org

Nothing. The old-school manuals all had BUGS section even if it was
empty. Seriously, nothing appropriate to be put in there? Documentation
must be symmetric - if it mentions feature X, it must mention at least the
most common caveats.

> volunteering to do the btrfs-progs work to easily check kernel
> versions and print appropriate warnings? Or is this a case of
> complaining about what other people aren't doing with their time?

This is definitely the second case. You see, I've had my issues with btrfs; I
already know where to use it and where not. I've learned the HARD way and
still haven't fully recovered (some dangling r/o, some ENOSPC due to
fragmentation, etc.). What I /MIGHT/ offer the community is to share my
opinions and suggestions. And it's all up to you what you do with
this. Either you blame me for complaining or you ignore me - you
should realize that _I_do_not_care_, because I already know the things that
I write. At least some other guy, some other day, will read this thread and my
opinions might save HIS day. After all, using btrfs should be preceded
by research.

No offence, just trying to be honest with you. Because the other thing
that I've learned the hard way in my life is to listen to the regular users
of my products and appreciate any feedback, even if it doesn't suit me.

>> Now, if the current kernels won't toggle degraded RAID1 as ro, can I
>> safely add "degraded" to the mount options? My primary concern is the
[...]
> Btrfs simply is not ready for this use case. If you need to depend on
> degraded raid1 booting, you need to use mdadm or LVM or hardware raid.
> Complaining about the lack of maturity in this area? Get in line. Or
> propose a design and scope of work that needs to be completed to
> enable it.

I thought the work was already done, since current kernels handle degraded
RAID1 without switching to r/o - don't they? Or is something else missing?

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-19 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 12:35:20 -0700, Chris Murphy wrote:

> with a read only file system. Another reason is the kernel code and
> udev rule for device "readiness" means the volume is not "ready" until
> all member devices are present. And while the volume is not "ready"
> systemd will not even attempt to mount. Solving this requires kernel
> and udev work, or possibly a helper, to wait an appropriate amount of

Sth like this? I got such problem a few months ago, my solution was
accepted upstream:
https://github.com/systemd/systemd/commit/0e8856d25ab71764a279c2377ae593c0f2460d8f

Rationale is in referred ticket, udev would not support any more btrfs
logic, so unless btrfs handles this itself on kernel level (daemon?),
that is all that can be done.

> time. I also think it's a bad idea to implement automatic degraded
> mounts unless there's an API for user space to receive either a push
[...]
> There is no amount of documentation that makes up for these
> deficiencies enough to enable automatic degraded mounts by default. I
> would consider it a high order betrayal of user trust to do it.

It doesn't have to be default, might be kernel compile-time knob, module
parameter or anything else to make the *R*aid work.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-19 Thread Tomasz Pala
On Tue, Dec 19, 2017 at 10:31:40 -0800, George Mitchell wrote:

> I have significant experience as a user of raid1. I spent years using 
> software raid1 and then more years using hardware (3ware) raid1 and now 
> around 3 years using btrfs raid1. I have not found btrfs raid1 to be 
> less reliable than any of the previous implementations of raid.  I have 

You are aware that in order to prove a point one needs only one
example? Degraded r/o is such an example, QED.
It doesn't matter how long you rode on top of any RAID implementation,
unless you saw it in action, i.e. had an actual drive malfunction. Did you
have a broken drive under btrfs raid?

> a failure, you don't just plug things back in and expect it to be fixed 
> without seriously investigating what has gone wrong and potential 
> unexpected consequences.  I have found that even with hardware raid you 
> can find ways to screw things up to the point that you lose your data.  

Everything could be screwed beyond comprehension, but we're talking
about PRIMARY objectives. In case of RAID1+ it seems to be obvious:

https://en.oxforddictionaries.com/definition/redundancy

- unplugging ANY SINGLE drive MUST NOT render system unusable.
This is really as simple as that.

> I have had situations where I reconnected a drive on hardware raid1 only 
> to find that the array would not sync and from there on I ended up 
> having to directly attach one of the drives and recover the partition 

I had a situation where replugging a drive started a sync of older data
over the newer. So what? This doesn't change a thing - the drive
reappearance or resync is the RECOVERY part. RECOVERY scenarios are an
entirely different thing from REDUNDANCY itself. The RECOVERY phase in some
implementations could be an entirely off-line process and it would still be
RAID. Remove the REDUNDANCY part and it's not RAID anymore.

If one names a thing an apple, one shouldn't be surprised when others
compare it to apples, not oranges.

> table with test disk in order to regain access to my data.  So NO FORM 
> of raid is a replacement for backups and NO FORM of raid is a 
> replacement for due diligence in recovery from failure mode.  Raid gives 

And who said it is?

> you a second chance when things go wrong, it does not make failures 
> transparent which is seemingly what we sometimes expect from raid.  And 

Wouldn't want to worry you, but properly managed RAIDs make I/J-of-K
trivial-failures transparent. Just like ECC protects N/M bits transparently.

Investigating the reasons is sysadmin's job, just like other
maintenance, including restoring protection level.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-19 Thread Tomasz Pala
.
> This is a matter of opinion.

Sure! And the particular opinion depends on system being affected. I'd
rather not have any brain-split scenario under my database servers, but
also won't mind data loss on BGP router as long as it keeps running and
is fully operational.

> I still contend that running half a two 
> device array for an extended period of time without reshaping it to be a 
> single device is a bad idea for cases other than BTRFS.  The fewer 
> layers of code you're going through, the safer you are.

I create a single-device degraded MD RAID1 when I attach one disk for
deployment (usually test machines) that is going to be converted into a
dual-disk (production) setup in the future - attaching a second disk to the
array is much easier and faster than messing with device nodes (or labels or
anything). The same applies to LVM: it's better to have it in place even when
not used at the moment. In the case of btrfs there is no need for such
preparations, as devices are added without renaming.

However, sometimes the systems end up without a second disk attached -
due to their low importance, power usage, or the need to stay quiet.


One might ask, why don't I attach second disk before initial system
creation - the answer is simple: I usually use the same drive models in
RAID1, but it happens that drives bought from the same production lot
fail simultaneously, so this approach mitigates the problem and gives
more time to react.

> Patches would be gratefully accepted.  It's really not hard to update 
> the documentation, it's just that nobody has had the time to do it.

Writing accurate documentation requires a deep understanding of the internals.
Me, for example - I know some of the results: "don't do this", "if X happens, Y
should be done", "Z doesn't work yet, but there were some patches", "V
was fixed in some recent kernel, but no idea which commit it was
exactly", "W was severely broken in kernel I.J.K", etc. Not the hard data
that could be posted without creating the impression that it's all
about compiling a list of complaints. Not to mention I'm absolutely not familiar
with current patches, WIP and many, many other corner cases or usage
scenarios. In fact, not only the internals, but also the motivation and design
principles must be well understood to write a piece of documentation.

Otherwise some "fake news" propaganda is being created, just like
https://suckless.org/sucks/systemd or other systemd-haters that haven't
spent a day in their life for writing SysV init scripts or managing a
bunch of mission critical machines with handcrafted supervisors.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-19 Thread Tomasz Pala
ted,
3. I was about to fix the volume, accidentally the machine has rebooted.
   Which should do no harm if I had a RAID1.
4. As already said before, using r/w degraded RAID1 is FULLY ACCEPTABLE,
   as long as you accept "no more redundancy"...
4a. ...or had an N-way mirror and there is still some redundancy if N>2.


Since we agree that btrfs RAID != common RAID, as there are/were
different design principles and some features are in a WIP state at best,
the current behaviour should be better documented. That's it.


-- 
Tomasz Pala <go...@pld-linux.org>


Re: Unexpected raid1 behaviour

2017-12-18 Thread Tomasz Pala
On Mon, Dec 18, 2017 at 08:06:57 -0500, Austin S. Hemmelgarn wrote:

> The fact is, the only cases where this is really an issue is if you've 
> either got intermittently bad hardware, or are dealing with external 

Well, RAID1+ is all about failing hardware.

> storage devices.  For the majority of people who are using multi-device 
> setups, the common case is internally connected fixed storage devices 
> with properly working hardware, and for that use case, it works 
> perfectly fine.

If you're talking about "RAID"-0 or storage pools (volume management),
that is true.
But if you imply that RAID1+ "works perfectly fine as long as the hardware
works fine", this is fundamentally wrong. If the hardware needed to work
properly for the RAID to work properly, no one would need the RAID in
the first place.

> that BTRFS should not care.  At the point at which a device is dropping 
> off the bus and reappearing with enough regularity for this to be an 
> issue, you have absolutely no idea how else it's corrupting your data, 
> and support of such a situation is beyond any filesystem (including ZFS).

Support for exactly such situations is what RAID provides. So don't blame
people for expecting this to be handled, as long as you call the
filesystem feature 'RAID'.

If this feature is not going to mitigate hardware hiccups by design (as
opposed to "not implemented yet, needs some time", which is perfectly
understandable), just don't call it 'RAID'.

All the features currently working, like bit-rot mitigation for
duplicated data (dup/raid*) using checksums, are something different
from RAID itself. RAID means "survive the failure of N devices/controllers"
- I got one "RAID1" stuck in r/o after a degraded mount, not nice... Not
_expected_ to happen after a single disk failure (without any reappearing).

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-15 Thread Tomasz Pala
On Tue, Dec 12, 2017 at 08:50:15 +0800, Qu Wenruo wrote:

> Even without snapshot, things can easily go crazy.
> 
> This will write 128M file (max btrfs file extent size) and write it to disk.
> # xfs_io -f -c "pwrite 0 128M" -c "sync" /mnt/btrfs/file
> 
> Then, overwrite the 1~128M range.
> # xfs_io -f -c "pwrite 1M 127M" -c "sync" /mnt/btrfs/file
> 
> Guess your real disk usage, it's 127M + 128M = 255M.
> 
> The point here, if there is any reference of a file extent, the whole
> extent won't be freed, even it's only 1M of a 128M extent.

OK, /this/ is scary. I guess nocow prevents this behaviour?
I have chattr'ed +C the file eating my space and it ceased.

> Are you pre-allocating the file before write using tools like dd?

I have no idea; this could be checked in the source of
http://pam-abl.sourceforge.net/
But this is plain Berkeley DB (5.3 in my case)... which scares me even
more:

$  rpm -q --what-requires 'libdb-5.2.so()(64bit)' 'libdb-5.3.so()(64bit)' | wc -l
14

#  ipoldek desc -B db5.3
Package:db5.3-5.3.28.0-4.x86_64
Required(by):   apache1-base, apache1-mod_ssl, apr-util-dbm-db,
bogofilter, 
c-icap, c-icap-srv_url_check, courier-authlib, 
courier-authlib-authuserdb, courier-imap, courier-imap-common, 
cyrus-imapd, cyrus-imapd-libs, cyrus-sasl, cyrus-sasl-sasldb, 
db5.3-devel, db5.3-utils, dnshistory, dsniff, evolution-data-server, 
evolution-data-server-libs, exim, gda-db, ggz-server, 
heimdal-libs-common, hotkeys, inn, inn-libs, isync, jabberd, jigdo, 
jigdo-gtk, jnettop, libetpan, libgda3, libgda3-devel, libhome, libqxt, 
libsolv, lizardfs-master, maildrop, moc, mutt, netatalk, nss_updatedb, 
ocaml-dbm, opensips, opensmtpd, pam-pam_abl, pam-pam_ccreds, perl-BDB, 
perl-BerkeleyDB, perl-BerkeleyDB, perl-DB_File, perl-URPM, 
perl-cyrus-imapd, php4-dba, php52-dba, php53-dba, php54-dba, php55-dba, 
php56-dba, php70-dba, php70-dba, php71-dba, php71-dba, php72-dba, 
php72-dba, postfix, python-bsddb, python-modules, python3-bsddb3, 
redland, ruby-modules, sendmail, squid-session_acl, 
squid-time_quota_acl, squidGuard, subversion-libs, swish-e, tomoe-svn, 
webalizer-base, wwwcount

OK, not many user applications here, as they mostly use sqlite.
I wonder how that db library behaves:

$  find . -name \*.sqlite | xargs ls -gGhS | head -n1
-rw-r--r-- 1  15M 2017-12-08 12:14 
./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite

$  ~/fiemap ./.mozilla/firefox/*.default/extension-data/ublock0.sqlite | head -n1
File ./.mozilla/firefox/vni9ojqi.default/extension-data/ublock0.sqlite has 128 extents:


At least every $HOME/{.{,c}cache,tmp} should be +C...

> And if possible, use nocow for this file.

Actually, it should be officially advised to use +C for the entire /var tree
and every other tree that might be exposed to hostile write patterns, like
/home or /tmp (if held on btrfs).

I'd say that, from a security point of view, nocow should be the default
unless overridden for a mount or a specific file... Currently, if I mount
with nodatacow, there is no way to whitelist trusted users or a secure
location, and until btrfs-specific options can be handled per
subvolume, there is really no alternative.
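
A short sketch of how this is usually applied in practice - nodatacow only
takes effect on files that have no data yet, so the practical way is to
flag the directory and let newly created files inherit it:

#  chattr +C /var/lib/some-db-dir     (path illustrative)
#  lsattr -d /var/lib/some-db-dir     - should now show the 'C' attribute

Existing files keep their CoW extents until they are re-created (copied over).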


-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-11 Thread Tomasz Pala
On Mon, Dec 11, 2017 at 07:44:46 +0800, Qu Wenruo wrote:

>> I could debug something before I'll clean this up, is there anything you
>> want to me to check/know about the files?
> 
> fiemap result along with btrfs dump-tree -t2 result.

fiemap attached, but dump-tree requires unmounted fs, doesn't it?

>> - I've lost 3.6 GB during the night with reasonably small
>> amount of writes, I guess it might be possible to trash entire
>> filesystem within 10 minutes if doing this on purpose.
> 
> That's a little complex.
> To get into such situation, snapshot must be used and one must know
> which file extent is shared and how it's shared.

A hostile user might assume that any of his own files that are old enough
have been snapshotted. Unless snapshots are not used at all...

The 'obvious' solution would be for quotas to limit the data size including
extents lost due to fragmentation, but this is not the real solution as
users don't care about fragmentation. So we're back to square one.

> But as I mentioned, XFS supports reflink, which means file extent can be
> shared between several inodes.
> 
> From the message I got from XFS guys, they free any unused space of a
> file extent, so it should handle it quite well.

Forgive my ignorance, as I'm not familiar with the details, but isn't the
problem 'solvable' by reusing space freed from the same extent for any
single (i.e. the same) inode? This would certainly increase the
fragmentation of a file, but reduce extent usage significantly.


Still, I don't comprehend the cause of my situation. If - after doing a
defrag (after snapshotting whatever was already trashed) - btrfs
decides to allocate new extents for the file, why doesn't it use them
efficiently, as long as I'm not taking snapshots anymore?

I'm attaching a second fiemap, of the same file from the last snapshot taken.
According to this one-liner:

for i in `awk '{print $3}' fiemap`; do grep $i fiemap_old; done

the current file doesn't share any physical locations with the old one.
But it still grows, so what does this situation have to do with snapshots anyway?

Oh, and BTW - 900+ extents for ~5 GB taken means there is about 5.5 MB
occupied per extent. How is that possible?

-- 
Tomasz Pala <go...@pld-linux.org>
File log.14 has 933 extents:
#   Logical  Physical Length   Flags
0:   00297a001000 1000 
1:  1000 00297aa01000 1000 
2:  2000 002979ffe000 1000 
3:  3000 00297d1fc000 1000 
4:  4000 00297e5f7000 1000 
5:  5000 00297d1fe000 1000 
6:  6000 00297c7f4000 1000 
7:  7000 00297dbf9000 1000 
8:  8000 00297eff3000 1000 
9:  9000 0029821c7000 1000 
10: a000 002982bbf000 1000 
11: b000 0029803e 1000 
12: c000 00297b40 1000 
13: d000 002979601000 1000 
14: e000 002980dd5000 1000 
15: f000 0029821be000 1000 
16: 0001 00298715f000 1000 
17: 00011000 002985d71000 1000 
18: 00012000 00298537f000 1000 
19: 00013000 00298676 1000 
20: 00014000 00298498d000 1000 
21: 00015000 0029821b4000 1000 
22: 00016000 0029817c7000 1000 
23: 00017000 00298a2fa000 1000 
24: 00018000 002988f1f000 1000 
25: 00019000 00298d47f000 1000 
26: 0001a000 00298c0af000 1000 
27: 0001b000 00298a2ee000 1000 
28: 0001c000 00298a2eb000 1000 
29: 0001d000 0029905f2000 1000 
30: 0001e000 00298f22a000 1000 
31: 0001f000 00298de66000 1000 
32: 0002 00298ace3000 1000 
33: 00021000 00298a2e9000 1000 
34: 00022000 00298a2e7000 1000 
35: 00023000 00298b6c3000 1000 
36: 00024000 002990fd5000 1000 
37: 00025000 002992d6c000 1000 
38: 00026000 0029954db000 1000 
39: 00027000 002993747000 1000 
40: 00028000 002992d

Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Sun, Dec 10, 2017 at 12:27:38 +0100, Tomasz Pala wrote:

> I have found a directory - pam_abl databases, which occupy 10 MB (yes,
> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after

#  df
Filesystem  Size  Used Avail Use% Mounted on
/dev/sda2        64G   61G  2.8G  96% /

#  btrfs fi du .
 Total   Exclusive  Set shared  Filename
 0.00B   0.00B   -  ./1/__db.register
  10.00MiB10.00MiB   -  ./1/log.01
  16.00KiB   0.00B   -  ./1/hosts.db
  16.00KiB   0.00B   -  ./1/users.db
 168.00KiB   0.00B   -  ./1/__db.001
  40.00KiB   0.00B   -  ./1/__db.002
  44.00KiB   0.00B   -  ./1/__db.003
  10.28MiB10.00MiB   -  ./1
 0.00B   0.00B   -  ./__db.register
  16.00KiB16.00KiB   -  ./hosts.db
  16.00KiB16.00KiB   -  ./users.db
  10.00MiB10.00MiB   -  ./log.13
 0.00B   0.00B   -  ./__db.001
 0.00B   0.00B   -  ./__db.002
 0.00B   0.00B   -  ./__db.003
  20.31MiB20.03MiB   284.00KiB  .

#  btrfs fi defragment log.13 
#  df
/dev/sda2        64G   54G  9.4G  86% /


6.6 GB / 10 MB = 660:1 overhead within 1 day of uptime.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: ERROR: failed to repair root items: Input/output error

2017-12-10 Thread Tomasz Pala
On Sun, Dec 10, 2017 at 15:18:32 +, constantine wrote:

> I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is
> within warranty.
> I can't mount it read-write ("no rw mounting  after error").

There is a data-corruption issue with this controller!
The same as 840 EVO - just google this.

In short: either use a recent kernel (AFAIR 4.0.5+ for the 840 EVO, and
somewhat newer for the blacklisting of the entire Samsung 8* SSD family) or
disable NCQ.

Using queued TRIM on this drive leads to data loss! The firmware zeroes the
first 512 bytes of a block, sorry.

If only you had a smaller drive, as the 850s up to 512 GB have a different
controller...

> checksum verify failed on 103009173504 found 25334496 wanted 3500
> bytenr mismatch, want=103009173504, have=889192478
> ERROR: failed to repair root items: Input/output error
> 
> What do these errors mean?
> What should I do to fix the filesystem and be able to mount it read-write?

You probably can't fix this - there is data missing on the bare metal, so you
should recover using backups. If you don't have any, you need to perform manual
data recovery procedures (like photorec), with little chance of restoring
complete files due to the nature of the data loss (the beginnings of blocks).

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote:

>> 1. is there any switch resulting in 'defrag only exclusive data'?
> 
> IIRC, no.

I have found a directory - the pam_abl databases - which occupies 10 MB (yes,
TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after
defrag. After defragging, the files were not snapshotted again and I've lost
3.6 GB again, so I have this fully reproducible.
There are 7 files, one of which makes up 99% of the space (10 MB). None of
them has nocow set, so they're riding all-btrfs.

I could debug something before I clean this up - is there anything you
want me to check/know about the files?

The fragmentation impact is HUGE here; a 1000:1 ratio is almost a DoS
condition which could be triggered by a malicious user within a few hours
or faster - I've lost 3.6 GB during the night with a reasonably small
amount of writes, and I guess it might be possible to trash the entire
filesystem within 10 minutes when doing this on purpose.

>> 3. I guess there aren't, so how could I accomplish my target, i.e.
>>reclaiming space that was lost due to fragmentation, without breaking
>>spanshoted CoW where it would be not only pointless, but actually harmful?
> 
> What about using old kernel, like v4.13?

Unfortunately (I guess you had 3.13 in mind), I need the new ones and
will be pushing towards 4.14.

>> 4. How can I prevent this from happening again? All the files, that are
>>written constantly (stats collector here, PostgreSQL database and
>>logs on other machines), are marked with nocow (+C); maybe some new
>>attribute to mark file as autodefrag? +t?
> 
> Unfortunately, nocow only works if there is no other subvolume/inode
> referring to it.

This shouldn't be my case anymore after defrag (==breaking links).
I guess no easy way to check refcounts of the blocks?

> But in my understanding, btrfs is not suitable for such conflicting
> situation, where you want to have snapshots of frequent partial updates.
> 
> IIRC, btrfs is better for use case where either update is less frequent,
> or update is replacing the whole file, not just part of it.
> 
> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which
> is pointing to /usr/bin and /usr/lib) , but not for /var or /run.

That is something coherent with my conclusions after 2 years on btrfs,
however I didn't expect a single file to eat 1000 times more space than it
should...


I wonder how many other filesystems were trashed like this - I'm short
of ~10 GB on other system, many other users might be affected by that
(telling the Internet stories about btrfs running out of space).

The problem is not that I need to defrag a file; the problem is that I don't know:
1. whether I need to defrag,
2. *what* I should defrag,
nor do I have a tool that would defrag smartly - only the exclusive data or, in
general, only the blocks that are worth defragging, i.e. where the space released
from extents is greater than the space lost to inter-snapshot duplication.
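
As a crude stopgap one can at least rank files by extent count and defrag
only the worst offenders - a rough sketch, assuming filefrag from
e2fsprogs (it says nothing about sharing and may misparse paths
containing colons):

find /var -xdev -type f -exec filefrag {} + 2>/dev/null \
    | sort -t: -k2 -rn | head -n 20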

I can't just defrag entire filesystem since it breaks links with snapshots.
This change was a real deal-breaker here...

Any way to feed the deduplication code with snapshots, maybe? There are
directories and files in the same layout, so this could be fast-tracked to
check and deduplicate.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-10 Thread Tomasz Pala
On Sun, Dec 03, 2017 at 01:45:45 +, Duncan wrote:

> OTOH, it's also quite possible that people chose btrfs at least partly
> for other reasons, say the "storage pool" qualities, and would rather

Well, to name some:

1. filesystem-level backups via snapshot/send/receive - much cleaner and
faster than rsyncs or other old-fashioned methods. This obviously requires the 
CoW-once feature;

- caveat: for btrfs-killing usage patterns all the snapshots but the
  last one need to be removed;


2. block-level checksums with RAID1-awareness - in contrast to mdadm
RAIDx, which returns a random data copy from the underlying devices, this is
much less susceptible to bit rot;

- caveats: requires CoW enabled, RAID1 reading is dumb (even/odd PID
  instead of real balancing), no N-way mirroring nor write-mostly flag.


3. compression - there is no real alternative, however:

- caveat: requires CoW enabled, which makes it not suitable for
  ...systemd journals, which compress with great ratio (c.a. 1:10),
  nor for various databases, as they will be nocowed sooner or later;


4. storage pools you've mentioned - they are actually not much superior to an
LVM-based approach; until one can create a subvolume with a different
profile (e.g. 'disable RAID1 for /var/log/journal') it is still better
to create separate filesystems, meaning one has to use LVM or (the hard
way) partitioning.


Some of the drawbacks above are inherent to CoW and so shouldn't be
expected to be fixed internally, as the needs are conflicting, but their
impact might be nullified by some housekeeping.

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 17:28:12 +0100, Tomasz Pala wrote:

>> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
> [...]
>> Now make various small changes to the file, say under 16 KiB each.  These
>> will each be COWed elsewhere as one might expect. by default 16 KiB at
>> a time I believe (might be 4 KiB, as it was back when the default leaf
> 
> I got ~500 small files (100-500 kB) updated partially in regular
> intervals:
> 
> # du -Lc **/*.rrd | tail -n1
> 105Mtotal
> 
>> But here's the kicker.  Even without a snapshot locking that original 100
>> MiB extent in place, if even one of the original 16 KiB blocks isn't
>> rewritten, that entire 100 MiB extent will remain locked in place, as the
>> original 16 KiB blocks that have been changed and thus COWed elsewhere
>> aren't freed one at a time, the full 100 MiB extent only gets freed, all
>> at once, once no references to it remain, which means once that last
>> block of the extent gets rewritten.

OTOH - should this happen with nodatacow files? As I mentioned before,
these files are chattr'ed +C (however, this was not their initial state
due to https://bugzilla.kernel.org/show_bug.cgi?id=189671 ).
Am I wrong in thinking that in such a case they should occupy at most twice
their size? Or is there maybe some tool that could show me the real
space wasted by a file, including extent counts etc.?
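
For the record, the closest I've found so far is combining a few generic
commands - a sketch, with load/load.rrd as an example path; this shows the
extent count and the exclusive/shared split, though not the full "wasted
space" picture:

# filefrag -v load/load.rrd      # list of physical extents and their count
# btrfs fi du load/load.rrd      # total / exclusive / set-shared bytes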

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
On Fri, 01 Dec 2017 18:57:08 -0800, Duncan wrote:

> OK, is this supposed to be raid1 or single data, because the above shows
> metadata as all raid1, while some data is single tho most is raid1, and
> while old mkfs used to create unused single chunks on raid1 that had to
> be removed manually via balance, those single data chunks aren't unused.

It is supposed to be RAID1; the single data chunks were leftovers from my
previous attempts to gain some space by converting to the single profile,
which failed miserably BTW (would it have been smarter with the "soft" option?),
but I've already managed to clear this up.

> Assuming the intent is raid1, I'd recommend doing...
>
> btrfs balance start -dconvert=raid1,soft /

Yes, this was the way to go. It also reclaimed the 8 GB. I assume the
failing -dconvert=single somehow locked that 8 GB, so this issue should
be addressed in btrfs-tools by reporting such a locked-out region. You've
already noted that the single-profile data itself occupied much less.

So this was the first issue; the second is the running overhead that
accumulates over time. Since yesterday, when I had 19 GB free, I've
already lost 4 GB. The scenario you've described is very probable:

> btrfs balance start -dusage=N /
[...]
> allocated value toward usage.  I too run relatively small btrfs raid1s
> and would suggest trying N=5, 20, 40, 70, until the spread between

There was no effect above N=10 (for both dusage and musage).

> consuming your space either, as I'd suspect they might if the problem were
> for instance atime updates, so while noatime is certainly recommended and

I have been using noatime by default for years, so that is not the source of the problem here.

> The other possibility that comes to mind here has to do with btrfs COW
> write patterns...

> Suppose you start with a 100 MiB file (I'm adjusting the sizes down from
[...]
> Now make various small changes to the file, say under 16 KiB each.  These
> will each be COWed elsewhere as one might expect. by default 16 KiB at
> a time I believe (might be 4 KiB, as it was back when the default leaf

I have ~500 small files (100-500 kB) that get partially updated at regular
intervals:

# du -Lc **/*.rrd | tail -n1
105M    total

> But here's the kicker.  Even without a snapshot locking that original 100
> MiB extent in place, if even one of the original 16 KiB blocks isn't
> rewritten, that entire 100 MiB extent will remain locked in place, as the
> original 16 KiB blocks that have been changed and thus COWed elsewhere
> aren't freed one at a time, the full 100 MiB extent only gets freed, all
> at once, once no references to it remain, which means once that last
> block of the extent gets rewritten.
>
> So perhaps you have a pattern where files of several MiB get mostly
> rewritten, taking more space for the rewrites due to COW, but one or
> more blocks remain as originally written, locking the original extent
> in place at its full size, thus taking twice the space of the original
> file.
>
> Of course worst-case is rewrite the file minus a block, then rewrite
> that minus a block, then rewrite... in which case the total space
> usage will end up being several times the size of the original file!
>
> Luckily few people have this sort of usage pattern, but if you do...
>
> It would certainly explain the space eating...

Has anyone investigated how this relates to RRD rewrites? I don't use
rrdcached; I never thought that 100 MB of data might trash an entire
filesystem...
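
A side note on mitigation rather than an answer: batching the updates through
rrdcached should collapse many tiny CoW writes into one larger write per file
per flush interval - a sketch only, with illustrative paths and intervals
(see rrdcached(1) for the exact semantics of the flags):

# rrdcached -w 1800 -z 900 -f 3600 \
    -j /var/lib/rrdcached/journal -B -b /var/lib/rrdcached/db \
    -l unix:/run/rrdcached.sock

The collector then has to be pointed at the daemon socket instead of writing
the .rrd files directly.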

best regards,
-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-02 Thread Tomasz Pala
OK, I seriously need to address that, as during the night I lost
3 GB again:

On Sat, Dec 02, 2017 at 10:35:12 +0800, Qu Wenruo wrote:

>> #  btrfs fi sh /
>> Label: none  uuid: 17a3de25-6e26-4b0b-9665-ac267f6f6c4a
>> Total devices 2 FS bytes used 44.10GiB
   Total devices 2 FS bytes used 47.28GiB

>> #  btrfs fi usage /
>> Overall:
>> Used: 88.19GiB
   Used: 94.58GiB
>> Free (estimated): 18.75GiB  (min: 18.75GiB)
   Free (estimated): 15.56GiB  (min: 15.56GiB)
>> 
>> #  btrfs dev usage /
- output not changed

>> #  btrfs fi df /
>> Data, RAID1: total=51.97GiB, used=43.22GiB
   Data, RAID1: total=51.97GiB, used=46.42GiB
>> System, RAID1: total=32.00MiB, used=16.00KiB
>> Metadata, RAID1: total=2.00GiB, used=895.69MiB
>> GlobalReserve, single: total=131.14MiB, used=0.00B
   GlobalReserve, single: total=135.50MiB, used=0.00B
>> 
>> # df
>> /dev/sda2        64G   45G   19G  71% /
   /dev/sda2        64G   48G   16G  76% /
>> However the difference is on active root fs:
>> 
>> -0/291   24.29GiB   9.77GiB
>> +0/291   15.99GiB  76.00MiB
   0/291   19.19GiB   3.28GiB
> 
> Since you have already showed the size of the snapshots, which hardly
> goes beyond 1G, it may be possible that extent booking is the cause.
> 
> And considering it's all exclusive, defrag may help in this case.

I'm going to try defrag here, but I have a bunch of questions first;
as defrag would break CoW, I don't want to defrag files that span
multiple snapshots unless they carry huge overhead:
1. is there any switch meaning 'defrag only exclusive data'?
2. is there any switch meaning 'defrag only extents fragmented more than X'
   or 'defrag only fragments that could actually be freed'?
3. I guess there aren't, so how could I accomplish my goal, i.e.
   reclaiming space that was lost to fragmentation, without breaking
   snapshotted CoW where it would be not only pointless but actually harmful?
4. How can I prevent this from happening again? All the files that are
   written constantly (the stats collector here, PostgreSQL databases and
   logs on other machines) are marked nocow (+C); maybe some new
   attribute to mark a file as autodefrag? +t?

For example, the largest file from stats collector:
 Total   Exclusive  Set shared  Filename
 432.00KiB   176.00KiB   256.00KiB  load/load.rrd

but most of them have 'Set shared' == 0.

5. The stats collector has been running from the beginning and, according to
the quota output, was not the issue until something happened. If the problem
was triggered by (I'm guessing) a low-space condition, and it results in even
more space being lost, then there is a dangerous positive feedback loop that
makes any filesystem unstable ("once you run out of space, you won't recover").
Does this mean btrfs is simply not suitable (yet?) for a frequent-update usage
pattern, like RRD files?

6. Or maybe some extra steps should be taken just before taking a snapshot?
I guess 'defrag exclusive' would be perfect here - reclaiming space
before it gets locked inside a snapshot.
The rationale behind this is obvious: since snapshot-aware defrag was
removed, allow defragging only the data exclusive to the source subvolume.
This would of course result in partial file defragmentation, but that
should be enough for pathological cases like mine.
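
In the absence of such a switch, one crude approximation of 'defrag only
exclusive data' might be to defragment just the files modified since the last
snapshot, right before creating the next one - a sketch with illustrative
paths, using a timestamp marker file touched right after each snapshot; note
it still breaks sharing for any unmodified extents those particular files keep
in older snapshots:

# find / -xdev -type f -newer /snapshots/.last-snapshot-stamp -print0 | \
    xargs -0r btrfs filesystem defragment -v
# btrfs subvolume snapshot -r / /snapshots/snapshot-171202
# touch /snapshots/.last-snapshot-stamp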

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
one --- ---
0/364   22.63GiB   72.03MiB  none  none  ---  ---
0/285   10.78GiB   75.95MiB  none  none  ---  ---
0/291   15.99GiB   76.24MiB  none  none  ---  ---  <- this one (default rootfs) got fixed
0/323   21.35GiB   95.85MiB  none  none  ---  ---
0/369   23.26GiB   96.12MiB  none  none  ---  ---
0/324   21.36GiB  104.46MiB  none  none  ---  ---
0/327   21.36GiB  115.42MiB  none  none  ---  ---
0/368   23.27GiB  118.25MiB  none  none  ---  ---
0/295   11.20GiB  148.59MiB  none  none  ---  ---
0/298   12.38GiB  283.41MiB  none  none  ---  ---
0/260   12.25GiB    3.22GiB  none  none  ---  ---  <- 170712, initial snapshot, OK
0/312   17.54GiB    4.56GiB  none  none  ---  ---  <- 170811, definitely less excl
0/388   21.69GiB    7.16GiB  none  none  ---  ---  <- this one has <100M exclusive


So one block of data was released, but there are probably two more
stuck here. If the 4.5G and 7G were freed I would have 45-4.5-7=33.5G used,
which would roughly agree with the 25G of data I've counted manually.

Any ideas on how to look inside these two snapshots?
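
The only way I can think of to peek inside is to dump the metadata of an
incremental stream between the suspicious snapshot and its predecessor - a
sketch, assuming both are read-only; --no-data keeps the stream small and
'receive --dump' merely lists the operations without applying them:

# btrfs send --no-data -p snapshot-170712 snapshot-170811 | \
    btrfs receive --dump > /tmp/diff-170811.txt

Every file the stream touches shows up there, so at least the paths
responsible for the exclusive data should become visible.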

> Rescan and --sync are important to get the correct number.
> (while rescan can take a long long time to finish)

#  time btrfs quota rescan -w /
quota rescan started
btrfs quota rescan -w /  0.00s user 0.00s system 0% cpu 30.798 total

> And further more, please ensure that all deleted files are really deleted.
> Btrfs delay file and subvolume deletion, so you may need to sync several
> times or use "btrfs subv sync" to ensure deleted files are deleted.

Yes, I was aware of that. However, I've never had to wait after a rebalance...

regards,
-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 09:05:50 +0800, Qu Wenruo wrote:

>> qgroupid      rfer       excl
>> --------      ----       ----
>> 0/260     12.25GiB    3.22GiB   from 170712 - first snapshot
>> 0/312     17.54GiB    4.56GiB   from 170811
>> 0/366     25.59GiB    2.44GiB   from 171028
>> 0/370     23.27GiB   59.46MiB   from 18 - prev snapshot
>> 0/388     21.69GiB    7.16GiB   from 171125 - last snapshot
>> 0/291     24.29GiB    9.77GiB   default subvolume
> 
> You may need to manually sync the filesystem (trigger a transaction
> commitment) to update qgroup accounting.

The data I've pasted were just calculated.

>> # btrfs quota enable /
>> # btrfs qgroup show /
>> WARNING: quota disabled, qgroup data may be out of date
>> [...]
>> # btrfs quota enable /   - for the second time!
>> # btrfs qgroup show /
>> WARNING: qgroup data inconsistent, rescan recommended
> 
> Please wait the rescan, or any number is not correct.

Here I was pointing out that the first "quota enable" resulted in a "quota
disabled" warning until I enabled it a second time.

> It's highly recommended to read btrfs-quota(8) and btrfs-qgroup(8) to
> ensure you understand all the limitation.

I probably won't understand them all, but this is not really my concern,
as I don't use quotas as such. There is simply no other way I am aware of
that could show me per-subvolume stats. Well, no straightforward way, as
the hard way I'm using (btrfs send) confirms the problem.

You could simply discard all the quota results I've posted and the underlying
problem would still be there: the 25 GB of data I have occupies 52 GB.
At least one recent snapshot, taken after some minor (<100 MB) changes
from a subvolume that has undergone only minor changes since then,
grew to occupy 8 GB during one night while the entire system was idling.

This was cross-checked against file metadata (comparing mtimes) and 'du'
results.


As a last resort I've rebalanced the disk (once again), this time with
-dconvert=raid1 (to get rid of the residual single chunks).

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
On Sat, Dec 02, 2017 at 08:27:56 +0800, Qu Wenruo wrote:

> I assume there is program eating up the space.
> Not btrfs itself.

Very doubtful. I've encountered an ext3 space-"eating" problem once that
couldn't be found with lsof on a 3.4.75 kernel, but the space came back after
killing Xorg. The system I'm having the problem on now is very recent; the
space doesn't return after a reboot/emergency and doesn't add up with the files.

>> Now, the weird part for me is exclusive data count:
>> 
>> # btrfs sub sh ./snapshot-171125
>> [...]
>> Subvolume ID:   388
>> # btrfs fi du -s ./snapshot-171125 
>>  Total   Exclusive  Set shared  Filename
>>   21.50GiB   63.35MiB   20.77GiB  snapshot-171125
> 
> That's the difference between how sub show and quota works.
> 
> For quota, it's per-root owner check.

Just to be clear: I've enabled quota _only_ to see subvolume usage on the
spot, including exclusive data - the more detailed approach I've described
in the e-mail I sent a minute ago.

> Means even a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extent belongs to other subvolume, then it's
> counted as shared.

Good to know, but this is an almost UID0-only system. There are system
users (vendor-provided) and 2 ssh accounts for su, but nobody uses this
machine for daily work. The quota values were the last tool I could find
to debug this.

> For fi du, it's per-inode owner check. (The exact behavior is a little
> more complex, I'll skip such corner case to make it a little easier to
> understand).
> 
> That's to say, if one file extent is shared by different inodes, then
> it's counted as shared, no matter if these inodes belong to different or
> the same subvolume.
> 
> That's to say, "fi du" has a looser condition for "shared" calculation,
> and that should explain why you have 20+G shared.

There shouldn't be many multi-inode extents inside a single subvolume, as this
is a mostly fresh system, with no containers and no deduplication; snapshots
are taken from the same running system before or after some more important
change is made. By 'change' I mean mostly altering text config files (plus
etckeeper's git metadata), so the volume of differences is extremely low.
Actually, most of the diffs between subvolumes come from updating distro
packages. Not many reflink copies were made on this partition, and only one
kernel source was compiled (.ccache files removed today). So this partition
is about as clean as it could be after almost 5 months in use.

Actually, I should rephrase the problem:

"a snapshot has taken 8 GB of space even though nothing has altered the source subvolume"

-- 
Tomasz Pala <go...@pld-linux.org>


Re: exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
hot.

Well, even btrfs send -p snapshot-170712 snapshot-171125 | pv > /dev/null
5.68GiB 0:03:23 [28.6MiB/s]

I've created a new snapshot right now to compare it with 171125:
75.5MiB 0:00:43 [1.73MiB/s]


OK, I could even compare all the snapshots in sequence:

# for i in snapshot-17*; do btrfs prop set "$i" ro true; done
# p=''; for i in snapshot-17*; do [ -n "$p" ] && btrfs send -p "$p" "$i" | pv > /dev/null; p="$i"; done
 1.7GiB 0:00:15 [ 114MiB/s]
1.03GiB 0:00:38 [27.2MiB/s]
 155MiB 0:00:08 [19.1MiB/s]
1.08GiB 0:00:47 [23.3MiB/s]
 294MiB 0:00:29 [ 9.9MiB/s]
 324MiB 0:00:42 [7.69MiB/s]
82.8MiB 0:00:06 [12.7MiB/s]
64.3MiB 0:00:05 [11.6MiB/s]
 137MiB 0:00:07 [19.3MiB/s]
85.3MiB 0:00:13 [6.18MiB/s]
62.8MiB 0:00:19 [3.21MiB/s]
 132MiB 0:00:42 [3.15MiB/s]
 102MiB 0:00:42 [2.42MiB/s]
 197MiB 0:00:50 [3.91MiB/s]
 321MiB 0:01:01 [5.21MiB/s]
 229MiB 0:00:18 [12.3MiB/s]
 109MiB 0:00:11 [ 9.7MiB/s]
 139MiB 0:00:14 [9.32MiB/s]
 573MiB 0:00:35 [15.9MiB/s]
64.1MiB 0:00:30 [2.11MiB/s]
 172MiB 0:00:11 [14.9MiB/s]
98.9MiB 0:00:07 [14.1MiB/s]
  54MiB 0:00:08 [6.17MiB/s]
78.6MiB 0:00:02 [32.1MiB/s]
15.1MiB 0:00:01 [12.5MiB/s]
20.6MiB 0:00:00 [  23MiB/s]
20.3MiB 0:00:00 [  23MiB/s]
 110MiB 0:00:14 [7.39MiB/s]
62.6MiB 0:00:11 [5.67MiB/s]
65.7MiB 0:00:08 [7.58MiB/s]
 731MiB 0:00:42 [  17MiB/s]
73.7MiB 0:00:29 [ 2.5MiB/s]
 322MiB 0:00:53 [6.04MiB/s]
 105MiB 0:00:35 [2.95MiB/s]
95.2MiB 0:00:36 [2.58MiB/s]
74.2MiB 0:00:30 [2.43MiB/s]
75.5MiB 0:00:46 [1.61MiB/s]

This is 9.3 GB of total diffs between all the snapshots I got.
Plus 15 GB of initial snapshot means there is about 25 GB used,
while df reports twice the amount, way too much for overhead:
/dev/sda2        64G   52G   11G  84% /


# btrfs quota enable /
# btrfs qgroup show /
WARNING: quota disabled, qgroup data may be out of date
[...]
# btrfs quota enable /  - for the second time!
# btrfs qgroup show /
WARNING: qgroup data inconsistent, rescan recommended
[...]
0/428    15.96GiB   19.23MiB  newly created (now) snapshot



Assuming the qgroup output is bogus and the space isn't physically
occupied (which is consistent with the btrfs fi du output and my expectation),
the question remains: why is that bogus excl amount removed from the available
space as reported by df or btrfs fi df/usage? And how can I reclaim it?


[~/test]#  btrfs device usage /
/dev/sda2, ID: 1
   Device size:64.00GiB
   Device slack:  0.00B
   Data,single: 1.07GiB
   Data,RAID1: 55.97GiB
   Metadata,RAID1:  2.00GiB
   System,RAID1:   32.00MiB
   Unallocated: 4.93GiB

/dev/sdb2, ID: 2
   Device size:64.00GiB
   Device slack:      0.00B
   Data,single:   132.00MiB
   Data,RAID1: 55.97GiB
   Metadata,RAID1:  2.00GiB
   System,RAID1:   32.00MiB
   Unallocated: 5.87GiB

-- 
Tomasz Pala <go...@pld-linux.org>


exclusive subvolume space missing

2017-12-01 Thread Tomasz Pala
Hello,

I have a problem with btrfs running out of space (not THE Internet-wide,
well-known issue of interpreting the free-space numbers).

The problem is: something eats the space while nothing is running that would
justify this. There were 18 GB of free space available; suddenly it
dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
with a rebalance -dusage=5 -musage=5 (or something about that), but it is
being eaten right now, just as I'm writing this e-mail:

/dev/sda2        64G   63G  452M 100% /
/dev/sda2        64G   63G  365M 100% /
/dev/sda2        64G   63G  316M 100% /
/dev/sda2        64G   63G  287M 100% /
/dev/sda2        64G   63G  268M 100% /
/dev/sda2        64G   63G  239M 100% /
/dev/sda2        64G   63G  230M 100% /
/dev/sda2        64G   63G  182M 100% /
/dev/sda2        64G   63G  163M 100% /
/dev/sda2        64G   64G  153M 100% /
/dev/sda2        64G   64G  143M 100% /
/dev/sda2        64G   64G   96M 100% /
/dev/sda2        64G   64G   88M 100% /
/dev/sda2        64G   64G   57M 100% /
/dev/sda2        64G   64G   25M 100% /

while my rough calculations show that there should be at least 10 GB of
free space. After enabling quotas this is somewhat confirmed:

# btrfs qgroup sh --sort=excl / 
qgroupid      rfer       excl
--------      ----       ----
0/5       16.00KiB   16.00KiB
[30 snapshots with about 100 MiB excl]
0/333     24.53GiB  305.79MiB
0/298     13.44GiB  312.74MiB
0/327     23.79GiB  427.13MiB
0/331     23.93GiB  930.51MiB
0/260     12.25GiB    3.22GiB
0/312     19.70GiB    4.56GiB
0/388     28.75GiB    7.15GiB
0/291     30.60GiB    9.01GiB  <- this is the running one

This is about 30 GB of total excl (I didn't find a switch to sum this up). I
know I can't just add up 'excl' to get the usage, so I tried to pinpoint the
exact files that occupy space in 0/388 exclusively (this is the last
snapshot taken; all of the snapshots are created from the running fs).
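
For the record, summing the excl column by hand - a sketch, assuming this
btrfs-progs accepts --raw for qgroup show, so the mixed KiB/MiB/GiB units
don't need parsing:

# btrfs qgroup show --raw / | awk 'NR > 2 { sum += $3 } END { printf "%.2f GiB\n", sum / 2^30 }'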


Now, the weird part for me is exclusive data count:

# btrfs sub sh ./snapshot-171125
[...]
Subvolume ID:   388
# btrfs fi du -s ./snapshot-171125 
 Total   Exclusive  Set shared  Filename
  21.50GiB   63.35MiB   20.77GiB  snapshot-171125


How is that possible? This doesn't even remotely relate to the 7.15 GiB
from qgroup. The same difference shows up in the totals: 28.75 - 21.50 = 7.25 GiB.
And the same happens with other snapshots: much more exclusive data is
shown in qgroup than is actually found in the files. So if not in files,
where is that space being wasted? Metadata?

btrfs-progs-4.12 running on Linux 4.9.46.

best regards,
-- 
Tomasz Pala <go...@pld-linux.org>