Re: applications hang on a btrfs spanning two partitions

2019-01-17 Thread Duncan
Marc Joliet posted on Tue, 15 Jan 2019 23:40:18 +0100 as excerpted:

> Am Dienstag, 15. Januar 2019, 09:33:40 CET schrieb Duncan:
>> Marc Joliet posted on Mon, 14 Jan 2019 12:35:05 +0100 as excerpted:
>> > Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
>> > 
>> >> ... noatime ...
>> > 
>> > The one reason I decided to remove noatime from my systems' mount
>> > options is because I use systemd-tmpfiles to clean up cache
>> > directories, for which it is necessary to leave atime intact
>> > (since caches are often Write Once Read Many).
>> 
>> Thanks for the reply.  I hadn't really thought of that use, but it
>> makes sense...

I really enjoy these "tips" subthreads.  As I said I hadn't really 
thought of that use, and seeing and understanding other people's 
solutions helps when I later find reason to review/change my own. =:^)

One example is an ssd brand reliability discussion from a couple years 
ago.  I had the main system on ssds then and wasn't planning on an 
immediate upgrade, but later on, I got tired of the media partition and a 
main system backup being on slow spinning rust, and dug out that ssd 
discussion to help me decide what to buy.  (Samsung 1 TB evo 850s, FWIW.)

> Specifically, I mean ~/.cache/ (plus a separate entry for ~/.cache/
> thumbnails/, since I want thumbnails to live longer):

Here, ~/.cache -> tmp/cache/ and ~/tmp -> /tmp/tmp-$USER/, plus 
XDG_CACHE_HOME=$HOME/tmp/cache/, with /tmp being tmpfs.
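(Spelled out, that layout amounts to roughly the following -- a sketch 
only; the login-time mkdir and where exactly the export lives are 
illustrative, not a paste of my actual scripts:)

  # one-time symlinks in the home dir
  ln -sfn /tmp/tmp-$USER ~/tmp
  ln -sfn tmp/cache ~/.cache
  # recreated at each login (e.g. from a profile script), since /tmp is tmpfs
  mkdir -p /tmp/tmp-$USER/cache
  # and the environment, e.g. in ~/.profile
  export XDG_CACHE_HOME="$HOME/tmp/cache"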

So as I said, user cache is on tmpfs.

Thumbnails... I actually did an experiment with the .thumbnails backed up 
elsewhere and empty, and found that with my ssds anyway, rethumbnailing 
was close enough to having them cached that it didn't really matter to my 
visual browsing experience.  So not only do I not mind thumbnails being 
on tmpfs, I actually have gwenview, my primary images browser, set to 
delete its thumbnails dir on close.

> I haven't bothered configuring /var/cache/, other than making it a
> subvolume so it's not a part of my snapshots (overriding the systemd
> default of creating it as a directory).  It appears to me that it's
> managed just fine by pre-existing tmpfiles.d snippets and by the
> applications that use it cleaning up after themselves (except for
> portage, see below).

Here, /var/cache/ is on /, which remains mounted read-only by default.  
The only things using it are package-updates related, and I obviously 
have to mount / rw for package updates, so it works fine.  (My sync 
script mounts the dedicated packages filesystem containing the repos, 
ccache, distdir, and binpkgs, and remounts / rw, and it's the first 
thing I run when doing an update, so I don't even have to worry about doing 
the mounts manually.)
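
(Roughly, the script is of this shape -- a simplified sketch with 
placeholder mountpoint, not the actual script:)

  #!/bin/sh
  # mount the dedicated packages filesystem (repos, ccache, distdir, binpkgs)
  mount /mnt/packages
  # make / writable for the duration of the update
  mount -o remount,rw /
  # then sync the repos
  emerge --sync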

>> FWIW systemd here too, but I suppose it depends on what's being cached
>> and particularly on the expense of recreation of cached data.  I
>> actually have many of my caches (user/browser caches, etc) on tmpfs and
>> reboot several times a week, so much of the cached data is only
>> trivially cached as it's trivial to recreate/redownload.
> 
> While that sort of tmpfs hackery is definitely cool, my system is,
> despite its age, fast enough for me that I don't want to bother with
> that (plus I like my 8 GB of RAM to be used just for applications and
> whatever Linux decides to cache in RAM).  Also, modern SSDs live long
> enough that I'm not worried about wearing them out through my daily
> usage (which IIRC was a major reason for you to do things that way).

16 gigs RAM here, and except for building chromium (in tmpfs), I seldom 
fill it even with cache -- most of the time several gigs remain entirely 
empty.  With 8 gig I'd obviously have to worry a bit more about what I 
put in tmpfs, but given that I have the RAM space, I might as well use it.

When I set up this system I was upgrading from a 4-core (original 2-socket 
dual-core 3-digit Opterons, purchased in 2003, which ran until the caps 
started dying in 2011), this system being a 6-core fx-series, and based 
on the experience with the quad-core, I figured 12 gig RAM for the 6-
core.  But with pairs of RAM sticks for dual-channel, powers of two 
worked better, so it was 8 gig or 16 gig.  And given that I had worked 
with 8 gig on the quad-core, I knew that would be OK, but 12 gig would 
mean less cache dumping, so 16 gig it was.

And my estimate was right on.  Since 2011, I've typically run up to ~12 
gigs RAM used including cache, leaving ~4 gigs of the 16 entirely unused 
most of the time, tho I do use the full 16 gig sometimes when doing 
updates, since I have PORTAGE_TMPDIR set to tmpfs.
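
(For reference, that's just the usual pair of settings; the size and 
exact paths here are illustrative, not my exact values:)

  # /etc/fstab: build in RAM
  tmpfs  /var/tmp/portage  tmpfs  size=12G,noatime  0 0
  # /etc/portage/make.conf
  PORTAGE_TMPDIR="/var/tmp"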

Of course, since my purchase in 2011 I've upgraded to SSDs, and RAM-based 
storage cache isn't as important.

Re: applications hang on a btrfs spanning two partitions

2019-01-15 Thread Duncan
Marc Joliet posted on Mon, 14 Jan 2019 12:35:05 +0100 as excerpted:

> Am Montag, 14. Januar 2019, 06:49:58 CET schrieb Duncan:
> [...]
>> Unless you have a known reason not to[1], running noatime with btrfs
>> instead of the kernel-default relatime is strongly recommended,
>> especially if you use btrfs snapshotting on the filesystem.
> [...]
> 
> The one reason I decided to remove noatime from my systems' mount
> options is because I use systemd-tmpfiles to clean up cache directories,
> for which it is necessary to leave atime intact (since caches are often
> Write Once Read Many).

Thanks for the reply.  I hadn't really thought of that use, but it makes 
sense...
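
(For readers who haven't used it: the sort of entry being described is an 
age-based cleanup rule, something like the following -- paths and ages 
are purely illustrative, not Marc's actual configuration:)

  # /etc/tmpfiles.d/home-cache.conf
  # Type  Path                        Mode  User  Group  Age
  e       /home/*/.cache              -     -     -      4w
  e       /home/*/.cache/thumbnails   -     -     -      26w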

FWIW systemd here too, but I suppose it depends on what's being cached 
and particularly on the expense of recreation of cached data.  I actually 
have many of my caches (user/browser caches, etc) on tmpfs and reboot 
several times a week, so much of the cached data is only trivially cached 
as it's trivial to recreate/redownload.

OTOH, running gentoo, my ccache and binpkg cache are seriously CPU-cycle 
expensive to recreate, so you can bet those are _not_ tmpfs, but OTTH, 
they're not managed by systemd-tmpfiles either.  (Ccache manages its own 
cache, and together with the source-tarballs cache, the git-managed repo 
trees, and the binpkgs, I have a dedicated packages btrfs containing all 
of them, so I eclean binpkgs and distfiles whenever the 24-gig space 
(48-gig total, 24-gig each on pair-device btrfs raid1) gets too close to 
full, then btrfs balance with -dusage= to reclaim partial chunks to 
unallocated.)
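
(The periodic cleanup amounts to something like this -- the eclean 
switches, dusage threshold, and mountpoint are illustrative, not my exact 
invocations:)

  eclean-dist --deep                             # trim old source tarballs
  eclean-pkg --deep                              # trim old binary packages
  btrfs balance start -dusage=50 /mnt/packages   # reclaim mostly-empty data chunks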

Anyway, if you're not regularly snapshotting, relatime is reasonably 
fine, tho I'd still keep the atime effects in mind and switch to noatime 
if you end up in a recovery situation that requires writable mounting.  
(Losing a device in btrfs raid1 and mounting writable in order to 
replace it and rebalance comes to mind as one example of a writable-mount 
recovery scenario where noatime until full replace/rebalance/scrub 
completion would prevent unnecessary writes until the raid1 is safely 
complete and scrub-verified again.)
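
(For concreteness, the sort of sequence meant, with placeholder device 
names, devid, and mountpoint -- adapt to the actual situation:)

  # mount degraded and writable, but noatime, to avoid gratuitous writes
  mount -o degraded,rw,noatime /dev/sdb /mnt
  # replace the missing device (devid 2 here is an assumption), then verify
  btrfs replace start 2 /dev/sdc /mnt
  btrfs scrub start /mnt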

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: applications hang on a btrfs spanning two partitions

2019-01-13 Thread Duncan
Florian Stecker posted on Sat, 12 Jan 2019 11:19:14 +0100 as excerpted:

> $ mount | grep btrfs
> /dev/sda8 on / type btrfs
> (rw,relatime,ssd,space_cache,subvolid=5,subvol=/)

Unlikely to be apropos to the problem at hand, but FYI...

Unless you have a known reason not to[1], running noatime with btrfs 
instead of the kernel-default relatime is strongly recommended, 
especially if you use btrfs snapshotting on the filesystem.

The reasoning is that even tho relatime reduces the default access-time 
updates to once a day, it still likely-unnecessarily turns otherwise read-
only operations into read-write operations, and atimes are metadata, 
which btrfs always COWs (copy-on-writes), meaning atime updates can 
trigger cascading metadata block-writes and much larger than 
anticipated[2] write-amplification, potentially hurting performance, yes, 
even for relatime, depending on your usage.

In addition, if you're using snapshotting and not using noatime, it can 
easily happen that a large portion of the change between one snapshot and 
the next is simply atime updates, thus making the space referenced 
exclusively by individual affected snapshots far larger than it would 
otherwise be.
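
Switching is just a mount-option change; for example (the fstab line is 
illustrative -- keep your existing options and simply add noatime):

  # takes effect immediately on the running system:
  mount -o remount,noatime /
  # and persistently, in /etc/fstab (UUID is a placeholder):
  UUID=xxxxxxxx-xxxx  /  btrfs  rw,noatime,ssd,space_cache,subvol=/  0 0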

---
[1] mutt is AFAIK the only widely used application that still depends on 
atime updates, and it only does so in certain modes, not with mbox-format 
mailboxes, for instance.  So unless you're using it, or your backup 
solution happens to use atime, chances are quite high that noatime won't 
disrupt your usage at all.

[2] Larger than anticipated write-amplification:  Especially when you 
/thought/ you were only reading the files and hadn't considered the atime 
update that read could trigger, thus effectively generating infinite 
write amplification because the read access did an atime update and 
turned what otherwise wouldn't be a write operation at all into one!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Balance of Raid1 pool, does not balance properly.

2019-01-08 Thread Duncan
Karsten Vinding posted on Tue, 08 Jan 2019 20:40:12 +0100 as excerpted:

> Hello.
> 
> I have a Raid1 pool consisting of 6 drives, 3 3TB disks and 3 2TB disks.
> 
> Until yesterday it consisted of 3 2TB disks, 2 3TB disks and one 1TB
> disk.
> 
> I replaced the 1TB disk as the pool was close to full.
> 
> Replacement went well, and I ended up with 5 almost full disks, and 1
> 3TB disk that was one third full.
> 
> So I kicked off a balance, expecting it to balance the data as evenly as
> possible on the 6 disks (btrfs balance start poolname).
> 
> The balance ran fine but I ended up with this:
> 
> Total devices 6 FS bytes used 5.66TiB
>      devid    9 size 2.73TiB used 2.69TiB path /dev/sdf
>      devid   10 size 1.82TiB used 1.78TiB path /dev/sdb
>      devid   11 size 1.82TiB used 1.73TiB path /dev/sdc
>      devid   12 size 1.82TiB used 1.73TiB path /dev/sdd
>      devid   13 size 2.73TiB used 2.65TiB path /dev/sde
>      devid   15 size 2.73TiB used 817.87GiB path /dev/sdg
> 
> The sixth drive sdg, is still only one third full.
> 
> How do I force BTRFS to distribute the data more evenly across the
> disks?
> 
> The way BTRFS has done it now, will bring problems, when I write more
> data to the array.

After doing the btrfs replace to the larger device, did you resize to the 
full size of the larger device as noted in the btrfs-replace manpage (but 
before you do please post btrfs device usage from before, and then again 
after the resize, as below)?  I ask because that's an easy-to-forget step 
that you don't specifically mention doing.

If you didn't, that's your problem -- the filesystem on that device is 
still the size of the old device, and needs to be resized to the larger 
size of the new one, after which a balance should work as expected.
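
If that turns out to be it, the fix is a one-liner, e.g. (assuming the 
replaced disk really is devid 15 as listed above, with /pool as a 
placeholder mountpoint):

  btrfs device usage /pool        # before, for the record
  btrfs filesystem resize 15:max /pool
  btrfs device usage /pool        # after, to confirm the extra space shows up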

Note that there is a very recently reported bug in the way btrfs 
filesystem usage reports the size in this case, adding the device slack 
to unallocated altho it can't actually be allocated by the filesystem at 
all, as the filesystem size doesn't cover that space on that device.  I 
thought the bug didn't extend to btrfs filesystem show, which would 
indicate that you did the resize and just didn't mention it, but I'm 
asking as that's otherwise the most likely reason for the listed behavior.

I /believe/ btrfs device usage indicates the extra space in its device 
slack line, but the reporter had already increased the size by the time 
of posting and hadn't run btrfs device usage before that, and the non-dev 
list regulars in the discussion didn't know for sure and didn't have a 
replaced but not-yet-resized filesystem device to check, so we haven't 
actually verified whether it displays correctly or not yet.

Thus the request for the btrfs device usage output, to verify all that 
for both your case and the previous similar thread...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Undelete files

2019-01-01 Thread Duncan
Jesse Emeth posted on Sun, 30 Dec 2018 16:58:12 +0800 as excerpted:

> Hi Duncan
> 
> The backup is irrelevant in this case. I have a backup of this
> particular problem.
> I've had BTRFS on my OS system blow up several times.
> There are several snapshots of this within the subvolume.
> However, such snapshots are not helpful unless they are snapshots
> copied elsewhere with restore/rsync etc.

How can backups and snapshots not be helpful in terms of a problem where 
you'd be using undelete?  Undelete implies the filesystem is fine and 
that you're just trying to get a few files that you mistakenly deleted 
back, which in fact was the claim, and both backups and snapshots should 
allow you to do just that, get your deleted files back.
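
(If a snapshot containing the files does exist, getting them back really 
is just a copy out of it; paths here are placeholders:)

  cp -a --reflink=auto /mnt/snapshots/2019-01-10/home/user/somedir \
     /home/user/somedir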

> I had spoken to someone expressing my concerns with BTRFS on IRC.
> He wanted me to present this so that such problems could be rectified.
> I also wanted to learn more about BTRFS to see if my determinations
> about its inadequacies were incorrect.
> 
> Thus I want to follow this through to see if what is actually a very
> very small problem related to just a non essential small Firefox cache
> directory can actually be fixed.
> At present this very very small problem brings down the entire volume
> and all subvolumes with no way to mount any of it rw or easily fix the
> issue.
> That is not sane for such a small issue.

That's not a file undelete issue.  That's an entire filesystem issue.  
Quite a different beast, and not one that I directly addressed in my 
reply (altho the data value vs. backups stuff applies to fat-fingering 
such as mistaken deletes, filesystem problems, hardware problems, and 
natural disasters, all four), because both the title and the content 
suggested a file undelete issue, which /was/ addressed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Checksum errors

2019-01-01 Thread Duncan
,000+ (calculated from raw used value against the percentage value for 
cooked, I didn't have all the different ways of reporting it on mine that 
you have), so it took me quite a while to work thru them even tho I was 
chewing them up rather regularly, toward the end sometimes several 
hundred at a time.

But while the "cooked" values are standardized to 253 (254/255 are 
reserved) or sometimes 100 (percentage) maximum, the raw values differ 
between manufacturers.  I'm pretty sure mine (Corsair Neutron brand) were 
the number of 512-byte sectors, so a couple K per MB, and I had tens of 
MB of reserve, thus explaining the 5-digit raw used numbers while still 
saying 80+ percent good cooked, but yours may be counting in 2 MiB 
erase-blocks or some such, thus the far lower raw numbers.  Or perhaps 
Samsung simply recognized that such a huge reserve wasn't particularly 
practical, people replaced the drive before it got /that/ bad, and put 
those would-be reserves to higher usable capacity instead.

Regardless, while the ssd may continue to be usable as cache for some 
time, I'd strongly suggest rotating it out of normal use for anything you 
value, or at LEAST increasing your number of backups and/or pairing it 
with something else in btrfs raid1, as I had already done with mine when 
I noticed it going bad, so I could continue to use it and watch it 
degrade over time.

I'd definitely *NOT* recommend trusting that ssd in single or raid0 mode 
for anything of value that's not backed up, period.  Whatever those raw 
events are measuring, 50% on the cooked value is waaayyy too low to 
continue to trust it, tho as a cache device or similar, where a block 
going out occasionally isn't a big deal, it may continue to be useful for 
years.

FWIW, with my tens of thousands of reserve blocks and the device in btrfs 
raid1 with a known good device, I was able to use routine btrfs scrubs to 
clean up the damage for quite some time, IIRC 8 months or so, until it 
just got so bad I was doing scrubs and finding and correcting sometimes 
hundreds of errors on every reboot.  As I actually had a third ssd I had 
planned to put in something else and never did get it there, I finally 
decided I had had enough, and after one final scrub, I did a btrfs 
replace of the old device with the new one.  But AFAIK it had only gotten 
down to a cooked value of 85 or so, even then.  And there's no way I'd 
have considered the ssd usable at anything under say 92 cooked, as blocks 
were simply erroring out too often, had I not had btrfs raid1 mode and 
been able to scrub away the errors.

Meanwhile, FWIW the other devices, both the good one of the original pair 
and the replacement for the bad one (same make and model as the bad one), 
are still going today.  One of them has a 5/reallocated-sector-count raw 
value of 17, still 100% on the cooked value; the other says 0 raw / 253 
cooked.  (For many values including this one, a cooked value of 253 means 
entirely clean; with a single "event" it drops to 100%, and it goes from 
there based on calculated percentage.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Undelete files

2018-12-29 Thread Duncan
Duncan posted on Sun, 30 Dec 2018 04:11:20 + as excerpted:

> Adrian Bastholm posted on Sat, 29 Dec 2018 23:22:46 +0100 as excerpted:
> 
>> Hello all,
>> Is it possible to undelete files on BTRFS ? I just deleted a bunch of
>> folders and would like to restore them if possible.
>> 
>> I found this script
>> https://gist.github.com/Changaco/45f8d171027ea2655d74 but it's not
>> finding stuff ..
> 
> That's an undelete-automation wrapper around btrfs restore...
> 
>> ./btrfs-undelete /dev/sde1 ./foto /home/storage/BTRFS_RESTORE/
>> Searching roots...
>> Trying root 389562368... (1/70)
>> ...
>> Trying root 37339136... (69/70)
>> Trying root 30408704... (70/70)
>> Didn't find './foto'
> 
> That script is the closest thing to a direct undelete command that btrfs
> has.  However, there's still some chance...
> 
> ** IMPORTANT **  If you still have the filesystem mounted read-write,
> remount it read-only **IMMEDIATELY**, because every write reduces your
> chance at recovering any of the deleted files.
> 
> (More in another reply, but I want to get this sent with the above
> ASAP.)


First a question:  Any chance you have a btrfs snapshot of the deleted 
files you can mount and recover from?  What about backups?

Note that a number of distros using btrfs have automated snapshotting 
setup, so it's possible you have a snapshot with the files safely 
available, and don't even know it.  Thus the snapshotting question (more 
on backups below).  It could be worth checking...
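
(A quick way to check is to list the snapshot subvolumes on the 
filesystem; the mountpoint is a placeholder:)

  btrfs subvolume list -s /mnt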


Assuming no snapshot and no backup with those files...

Disclaimer:  I'm not a dev, just a btrfs user and list regular myself.  
Thus, the level of direct technical help I can give is limited, and much 
of what remains is more about what to do differently to prevent a next 
time, tho there are some additional hints about the current situation 
further down...


Well the first thing in this case to note is the sysadmin's (yes, that's 
you... and me, and likely everyone here: [1]) first rule of backups:  The 
true value of data isn't defined by any arbitrary claims, but by the 
number of backups of that data it is considered valuable enough to have.  
Thus, in the most literal way possible, not having a backup is simply 
defining the data as not worth the time/trouble/hassle to make one, and 
not having a second and third and... backup is likewise, simply defining 
the value of the data as not worth that one more level of backup.  
(Likewise, not having an /updated/ backup is simply defining the value of 
data in the delta between the current working copy and the last backup as 
of trivial value, because as soon as it's worth more than the time/
trouble/resources required to update the backup, by definition, the 
backup will be updated in accordance with the value of the data in that 
delta.)

Thus, the fact that we're assuming no backup now means that we 
already defined the data as of trivial value, not worth the time/trouble/
resources necessary to make even a single backup.

Which means no matter what the loss or why, hardware, software or 
"wetware" failure (the latter aka fat-fingering, as here), or even 
disaster such as flood or fire, when it comes to our data we can *always* 
rest easy, because we *always* save what was of most value, either the 
data if we defined it as such by the backups we had of it, or the time/
trouble/resources that would have otherwise gone into the backup, if we 
judged the data to be of lower value than that one more level of backup.

Which means there's a strict limit to the value of the data possibly 
lost, and thus a strict limit to the effort we're likely willing to put 
into recovery after that data loss risk factor appears to have evaluated 
to 1, before the recovery effort too becomes not worth the trouble.  
After all, if it /was/ worth the trouble, it would have also been worth 
the trouble to do that backup in the first place, and the fact that we 
don't have it means it wasn't worth that trouble.

At least for me, looking at it from this viewpoint significantly lowers 
my stress during disaster recovery situations.  There's simply not that 
much at risk, nor can there be, even in the event of losing 
"everything" (well, data-wise anyway, hardware, or for that matter, my 
life and/or health, family and friends, etc, unfortunately that's not as 
easy to backup as data!) to a fire or the like, since if there was more 
at risk, there's be backups (offsite backups in the fire/flood sort of 
case) we could fall back on should it come to that.


That said, before-the-fact, it's an unknown risk-factor, while after-the-
fact, that previously unknown risk-factor has evaluated to 100% chance of 
(at least apparent) data loss!  It's actually rather l

Re: Undelete files

2018-12-29 Thread Duncan
Adrian Bastholm posted on Sat, 29 Dec 2018 23:22:46 +0100 as excerpted:

> Hello all,
> Is it possible to undelete files on BTRFS ? I just deleted a bunch of
> folders and would like to restore them if possible.
> 
> I found this script
> https://gist.github.com/Changaco/45f8d171027ea2655d74 but it's not
> finding stuff ..

That's an undelete-automation wrapper around btrfs restore...

> ./btrfs-undelete /dev/sde1 ./foto /home/storage/BTRFS_RESTORE/
> Searching roots...
> Trying root 389562368... (1/70)
> ...
> Trying root 37339136... (69/70)
> Trying root 30408704... (70/70)
> Didn't find './foto'

That script is the closest thing to a direct undelete command that btrfs 
has.  However, there's still some chance...

** IMPORTANT **  If you still have the filesystem mounted read-write, 
remount it read-only **IMMEDIATELY**, because every write reduces your 
chance at recovering any of the deleted files.
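
(That is, something like the following, with the mountpoint as a 
placeholder:)

  mount -o remount,ro /mnt/point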

(More in another reply, but I want to get this sent with the above ASAP.)



-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Broken chunk tree - Was: Mount issue, mount /dev/sdc2: can't read superblock

2018-12-29 Thread Duncan
Tomáš Metelka posted on Sun, 30 Dec 2018 01:48:23 +0100 as excerpted:

> Ok, I've got it:-(
> 
> But just a few questions: I've tried (with btrfs-progs v4.19.1) to
> recover files through btrfs restore -s -m -S -v -i ... and following
> events occurred:
> 
> 1) Just 1 "hard" error:
> ERROR: cannot map block logical 117058830336 length 1073741824: -2 Error
> copying data for /mnt/...
> (file which absence really doesn't pain me:-))
> 
> 2) For 24 files I got a "too much loops" warning (I mean this: "if
> (loops >= 0 && loops++ >= 1024) { ..."). I've always answered yes but
> I'm afraid these files are corrupted (at least 2 of them seems
> corrupted).
> 
> How much bad is this? Does the error mentioned in #1 mean that it's the
> only file which is totally lost? I can live without those 24 + 1 files
> so if #1 and #2 would be the only errors then I could say the recovery
> was successful ... but I'm afraid things aren't such easy:-)

In this context, the biggest thing to know about btrfs restore is that 
because it's meant as a recovery measure -- if it comes to using restore, 
the assumption is that the priority is recovery of /anything/, even if 
the checksums don't match, a chance of recovering possibly bad data being 
considered better than rejecting possibly bad data entirely -- it doesn't 
worry too much about checksums.  (AFAIK it ignores data checksums 
entirely; I'm not sure whether it checks metadata checksums or not, but 
it probably ignores failures in at least some cases if it does, because 
that's the point of a tool like this.)

Which means that anything recovered using btrfs restore doesn't have the 
usual btrfs checksum validation guarantees, and could very possibly be 
corrupt.

However, that's mitigated by the fact that most filesystems don't even 
have built-in checksumming and validation in the first place, so the data 
on them could go bad even in normal operation, and unless it was 
obviously corrupted into not working, you'd likely never even know it, so 
btrfs restore ignoring checksums simply returns the data to the state 
it's /normally/ in on most filesystems, completely unverified.

But if you happen to have checksums independently stored somewhere, or 
even just ordinary unvalidated backups you can compare against, and 
you're worried about the possibility of undiscovered corruption due to 
the restore, and/or you were using btrfs in part /because/ of its built-
in checksum verification, it could be worth doing that verification run 
against your old checksum database or backups.
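
(One simple way to do such a verification run, assuming an rsync-style 
backup and with placeholder paths; -c forces full-content checksum 
comparison and -n makes it a report-only dry run:)

  rsync -rvnc /backup/path/ /restored/path/
  # or, cruder but universal:
  diff -qr /backup/path /restored/path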

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs fi usage -T shows unallocated column as total in drive, not total available to btrfs

2018-12-27 Thread Duncan
Chris Murphy posted on Thu, 27 Dec 2018 16:37:55 -0700 as excerpted:

[Context is btrfs reports when btrfs is smaller on a device than the 
device it is on.  In this specific case it's due to btrfs replace to a 
larger device, before using btrfs filesystem resize to increase the size 
to that of the newer/larger device.]

> OK let me see if I get this right. You're saying it's confusing that
> 'btrfs fi sh' "devid size" does not change when doing a device replace;
> whereas 'btrfs fi us' device specific "unallocated" does change, even
> though you haven't yet done a resize.
> 
> I kinda sorta agree. While "unallocated" becomes 6.53TiB for this
> device, the idea it's unallocated suggests it could be allocated, which
> before a resize it cannot be allocated.

"It depends what the definition of "unallocated" is."[1]

Arguably, just as "unallocated" includes space not yet allocated to data/
metadata/system chunks, it could be argued it should include space on the 
device not yet allocated to the filesystem as well.  Clearly, that's what 
the coder of the btrfs filesystem usage functionality thought.

By that view, "unallocated" includes "not yet allocated to the filesystem 
itself also, but available on the block device the filesystem is on, to 
be allocated to the filesystem should the admin decide to do so."

OTOH, as the OP says it's still confusing, and as pointed out in a reply, 
it's btrfs _filesystem_ usage we're talking about here, not btrfs 
_device_ usage, and at minimum, _filesystem_ usage including space on the 
device that's not yet allocated to that filesystem is indeed confusing/
unintuitive, and arguably actually incorrect, particularly if the btrfs 
device usage report lists that space under its "device slack" line, 
which as admins we don't actually know at this point (it doesn't appear 
to be documented except presumably in the code itself).  And arguably, if 
btrfs filesystem usage is to report it at all, it should be under a 
separate (additional) line, presumably device slack, if that's what the 
device usage version does with that line.

---
[1] Quote paraphrases a famous US political/legal quote from some years 
ago...  OT as to the merits, but if you wish the background, 
s/unallocated/is/ and google it using the search engine of your choice.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs fi usage -T shows unallocated column as total in drive, not total available to btrfs

2018-12-26 Thread Duncan
Chris Murphy posted on Wed, 26 Dec 2018 17:36:19 -0700 as excerpted:


> I'm not really following this. An fs resize is implied by any device
> add, remove or replace command. In the case of replace, it will
> efficiently copy the device being replaced to the designated drive, and
> then once that succeeds resize the file system to reflect the size of
> the replacement device. I'm also confused why devid 4 seems to be
> present before and after your device replace, so I have to wonder if
> your copy paste really worked out as intended? And also, what version of
> kernel and btrfs-progs are you using?

I thought... yes...

Just checked the btrfs-replace manpage (v4.19.1) and it says:

Note
the filesystem has to be resized to fully take advantage of a larger 
target device; this can be achieved with btrfs filesystem
resize <devid>:max /path

So it does *not* auto-resize after the replace.


Also, I'm not positive on this, and I don't see it mentioned in the 
manpage, but I /think/ replace (unlike add/remove) keeps the same devid 
for the new device.

(And IIRC one of the devs commented that there's a devid 0 during the 
replace itself, but I'm unsure whether that's the source or the 
destination -- that is, whether the old ID is switched to the new device 
at the beginning of the replace, so the old one temporarily gets the 0 
until it's deleted at the end, or at the end, so the new one temporarily 
gets the 0 until the ID is transferred.  That was in the context of a 
draft patch that didn't yet account for the possibility of devid 0 during 
replace, and the comment was pointing out the possibility.)

If that's correct then the devid 4 could indeed be the old device at 
first (when it refers to sda and has 164.5 GiB unallocated), but the new 
device later (when it refers to sdu and has 6.53 TiB unallocated), even 
before the resize, that being the point of confusion (6.53 TB unallocated 
even tho it can't actually use it as it hasn't been resized yet) that 
triggered the original post in the first place.

To address that point, I suppose ideally there'd be another line for when 
the filesystem's smaller than the available device size -- "device space 
outside the filesystem" or some such.


Tho you are correct that fi show and fi df's output don't correspond 
exactly to fi usage without some sort of decoder ring to translate 
between them, and even with the decoder ring the numbers come out, but 
slightly different things are reported.


Meanwhile, while I normally want the filesystem usage info and thus use 
that command, for something like this I'd be specifically interested in 
the specific device's usage, and thus would use btrfs device usage, in 
place of or in addition to btrfs filesystem usage.

It'd be interesting to see what device usage (as opposed to filesystem 
usage) did with the unreachable space in terms of reporting -- maybe it 
has that separate line, tho I doubt it, but if not, does it count it or 
not?  But that wasn't posted, and presumably the query wasn't run while 
in the still-unresized state, so I guess it's a bit late now to get it...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: SATA/SAS mixed pool

2018-12-13 Thread Duncan
Adam Borowski posted on Thu, 13 Dec 2018 08:29:05 +0100 as excerpted:

> On Wed, Dec 12, 2018 at 09:31:02PM -0600, Nathan Dehnel wrote:
>> Is it possible/safe to replace a SATA drive in a btrfs RAID10 pool with
>> an SAS drive?
> 
> For btrfs, a block device is a block device, it's not "racist".
> You can freely mix and/or replace.  If you want to, say, extend an SD
> card with NBD to remote spinning rust, it works well -- tested :p

FWIW (mostly for other readers not so much this particular case) the 
known exception/caveat to that is USB block devices, which do tend to 
have problems, tho some hardware is fine.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: HELP unmountable partition after btrfs balance to RAID0

2018-12-07 Thread Duncan
Thomas Mohr posted on Thu, 06 Dec 2018 12:31:15 +0100 as excerpted:

> We wanted to convert a file system to a RAID0 with two partitions.
> Unfortunately we had to reboot the server during the balance operation
> before it could complete.
> 
> Now the following happens:
> 
> A mount attempt of the array fails with following error code:
> 
> btrfs recover yields roughly 1.6 out of 4 TB.

[Just another btrfs user and list regular, not a dev.  A dev may reply to 
your specific case, but meanwhile, for next time...]

That shouldn't be a problem.  Because with raid0 a failure of any of the 
components will take down the entire raid, making it less reliable than a 
single device, raid0 (in general, not just btrfs) is considered only 
useful for data of low enough value that its loss is no big deal, either 
because it's truly of little value (internet cache being a good example), 
or because backups are kept available and updated for whenever the raid0 
array fails.  Because with raid0, it's always a question of when it'll 
fail, not if.

So loss of a filesystem being converted to raid0 isn't a problem, because 
the data on it, by virtue of being in the process of conversion to raid0, 
is defined as of throw-away value in any case.  If it's of higher value 
than that, it's not going to be raid0 (or in the process of conversion to 
it) in the first place.

Of course that's simply an extension of the more general first sysadmin's 
rule of backups, that the true value of data is defined not by arbitrary 
claims, but by the number of backups of that data it's worth having.  
Because "things happen", whether it's fat-fingering, bad hardware, buggy 
software, or simply someone tripping over the power cable or running into 
the power pole outside at the wrong time.

So no backup is simply defining the data as worth less than the time/
trouble/resources necessary to make that backup.

Note that you ALWAYS save what was of most value to you, either the time/
trouble/resources to do the backup, if your actions defined that to be of 
more value than the data, or the data, if you had that backup, thereby 
defining the value of the data to be worth backing up.

Similarly, failure of the only backup isn't a problem because by virtue 
of there being only that one backup, the data is defined as not worth 
having more than one, and likewise, having an outdated backup isn't a 
problem, because that's simply the special case of defining the data in 
the delta between the backup time and the present as not (yet) worth the 
time/hassle/resources to make/refresh that backup.

(And FWIW, the second sysadmin's rule of backups is that it's not a 
backup until you've successfully tested it recoverable in the same sort 
of conditions you're likely to need to recover it in.  Because so many 
people have /thought/ they had backups, that turned out not to be, 
because they never tested that they could actually recover the data from 
them.  For instance, if the backup tools you'll need to recover the 
backup are on the backup itself, how do you get to them?  Can you create 
a filesystem for the new copy of the data and recover it from the backup 
with just the tools and documentation available from your emergency boot 
media?  Untested backup == no backup, or at best, backup still in 
process!)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: unable to fixup (regular) error

2018-11-26 Thread Duncan
Alexander Fieroch posted on Mon, 26 Nov 2018 11:23:00 +0100 as excerpted:

> Am 26.11.18 um 09:13 schrieb Qu Wenruo:
>> The corruption itself looks like some disk error, not some btrfs error
>> like transid error.
> 
> You're right! SMART has an increased value for one harddisk on
> reallocated sector count. Sorry, I missed to check this first...
> 
> I'll try to salvage my data...

FWIW as a general note about raid0 for updating your layout...

Because raid0 is less reliable than a single device (failure of any 
device of the raid0 is likely to take it out, and failure of any one of N 
is more likely than failure of any specific single device), admins 
generally consider it useful only for "throw-away" data, that is, data 
that can be lost without issue either because it really /is/ "throw-
away" (internet cache being a common example), or because it is 
considered a "throw-away" copy of the "real" data stored elsewhere, with 
that "real" copy being either the real working copy of which the raid0 is 
simply a faster cache, or with the raid0 being the working copy, but with 
sufficiently frequent backup updates that if the raid0 goes, it won't 
take anything of value with it (read as the effort to replace any data 
lost will be reasonably trivial, likely only a few minutes or hours, at 
worst perhaps a day's worth of work, depending on how many people's work 
is involved and how much their time is considered to be worth).

So if it's raid0, you shouldn't be needing to worry about trying to 
recover what's on it, and probably shouldn't even be trying to run a 
btrfs check on it at all as it's likely to be more trouble and take more 
time than the throw-away data on it is worth.  If something goes wrong 
with a raid0, just declare it lost, blow it away and recreate fresh, 
restoring from the "real" copy if necessary.  Because for an admin, 
really with any data but particularly for a raid0, it's more a matter of 
when it'll die than if.

If that's inappropriate for the value of the data and status of the 
backups/real-copies, then you should really be reconsidering whether 
raid0 of any sort is appropriate, because it almost certainly is not.


For btrfs, what you might try instead of raid0 is raid1 metadata at 
least, with raid0 or single mode data if there's not room enough to do 
raid1 data as well.  Raid1 metadata would very likely have saved the 
filesystem in this case, with some loss of files possible depending on 
where the damage is, because the second copy of the metadata from the 
good device would be used to fill in for, and attempt to repair (tho if 
the bad device is actively getting worse it might be a losing battle), 
any metadata damage on the bad device, giving you a far better chance of 
saving the filesystem as a whole.
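
(For reference, both forms are straightforward; device names and 
mountpoint are placeholders:)

  # at mkfs time: raid1 metadata, single-mode data, across three devices
  mkfs.btrfs -m raid1 -d single /dev/sdb /dev/sdc /dev/sdd
  # or convert the metadata of an existing multi-device filesystem in place
  btrfs balance start -mconvert=raid1 /mnt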

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Duncan
Adam Borowski posted on Sun, 04 Nov 2018 20:55:30 +0100 as excerpted:

> On Sun, Nov 04, 2018 at 06:29:06PM +0000, Duncan wrote:
>> So do consider adding noatime to your mount options if you haven't done
>> so already.  AFAIK, the only /semi-common/ app that actually uses
>> atimes these days is mutt (for read-message tracking), and then not for
>> mbox, so you should be safe to at least test turning it off.
> 
> To the contrary, mutt uses atimes only for mbox.

Figures that I'd get it reversed.
 
>> And YMMV, but if you do use mutt or something else that uses atimes,
>> I'd go so far as to recommend finding an alternative, replacing either
>> btrfs (because as I said, relatime is arguably enough on a traditional
>> non-COW filesystem) or whatever it is that uses atimes, your call,
>> because IMO it really is that big a deal.
> 
> Fortunately, mutt's use could be fixed by teaching it to touch atimes
> manually.  And that's already done, for both forks (vanilla and
> neomutt).

Thanks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Duncan
Sebastian Ochmann posted on Sun, 04 Nov 2018 14:15:55 +0100 as excerpted:

> Hello,
> 
> I have a btrfs filesystem on a single encrypted (LUKS) 10 TB drive which
> stopped working correctly.

> Kernel 4.18.16 (Arch Linux)

I see upgrading to 4.19 seems to have solved your problem, but this is 
more about something I saw in the trace that has me wondering...

> [  368.267315]  touch_atime+0xc0/0xe0

Do you have any atime-related mount options set?

FWIW, noatime is strongly recommended on btrfs.

Now I'm not a dev, just a btrfs user and list regular, and I don't know 
if that function is called and just does nothing when noatime is set, so 
you may well already have it set and this is "much ado about nothing", 
but the chance that it's relevant, if not for you, perhaps for others 
that may read it, begs for this post...

The problem with atime, access time, is that it turns most otherwise read-
only operations into read-and-write operations in order to update the 
access time.  And on copy-on-write (COW) based filesystems such as btrfs, 
that can be a big problem, because updating that tiny bit of metadata 
will trigger a rewrite of the entire metadata block containing it, which 
will trigger an update of the metadata for /that/ block in the parent 
metadata tier... all the way up the metadata tree, ultimately to its 
root, the filesystem root and the superblocks, at the next commit 
(normally every 30 seconds or less).

Not only is that a bunch of otherwise unnecessary work for a bit of 
metadata barely anything actually uses, but forcing most read operations 
to read-write obviously compounds the risk for all of those would-be read-
only operations when a filesystem already has problems.

Additionally, if your use-case includes regular snapshotting, with atime 
on, on mostly read workloads with few writes (other than atime updates), 
it may actually be the case that most of the changes in a snapshot are 
actually atime updates, making recurring snapshots far larger 
than they'd be otherwise.

Now a few years ago the kernel did change the default to relatime, 
basically updating the atime for any particular file only once a day, 
which does help quite a bit, and on traditional filesystems it's arguably 
a reasonably sane default, but COW makes atime tracking enough more 
expensive that setting noatime is still strongly recommended on btrfs, 
particularly if you're doing regular snapshotting.

So do consider adding noatime to your mount options if you haven't done 
so already.  AFAIK, the only /semi-common/ app that actually uses atimes 
these days is mutt (for read-message tracking), and then not for mbox, so 
you should be safe to at least test turning it off.

And YMMV, but if you do use mutt or something else that uses atimes, I'd 
go so far as to recommend finding an alternative, replacing either btrfs 
(because as I said, relatime is arguably enough on a traditional non-COW 
filesystem) or whatever it is that uses atimes, your call, because IMO it 
really is that big a deal.

Meanwhile, particularly after seeing that in the trace, if the 4.19 
update hadn't already fixed it, I'd have suggested trying a read-only 
mount, both as a test, and assuming it worked, at least allowing you to 
access the data without the lockup, which would have then been related to 
the write due to the atime update, not the actual read.

Actually, a read-only mount test is always a good troubleshooting step 
when the trouble is a filesystem that either won't mount normally, or 
will, but then locks up when you try to access something.  It's far less 
risky than a normal writable mount, and at minimum it provides you the 
additional test data of whether it worked or not, plus if it does, a 
chance to access the data and make sure your backups are current, before 
actually trying to do any repairs.
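
(Something like the following, with placeholder device-mapper name and 
mountpoint, is what I'd have suggested as that first test:)

  mount -o ro,noatime /dev/mapper/bigdisk /mnt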

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: BTRFS did it's job nicely (thanks!)

2018-11-03 Thread Duncan
waxhead posted on Fri, 02 Nov 2018 20:54:40 +0100 as excerpted:

> Note that I tend to interpret the btrfs de st / output as if the error
> was NOT fixed even if (seems clearly that) it was, so I think the output
> is a bit misleading... just saying...

See the btrfs-device manpage, stats subcommand, -z|--reset option, and 
device stats section:

-z|--reset
Print the stats and reset the values to zero afterwards.

DEVICE STATS
The device stats keep persistent record of several error classes related 
to doing IO. The current values are printed at mount time and
updated during filesystem lifetime or from a scrub run.


So stats keeps a count of historic errors and is only reset when you 
specifically reset it, *NOT* when the error is fixed.
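
(For reference -- the mountpoint is a placeholder:)

  btrfs device stats /mnt        # print the persistent per-device error counters
  btrfs device stats -z /mnt     # print them and then zero them, e.g. after a scrub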

(There's actually a recent patch, I believe in the current dev kernel 
4.20/5.0, that will reset a device's stats automatically for the btrfs 
replace case when it's actually a different device afterward anyway.  
Apparently, it doesn't even do /that/ automatically yet.  Keep that in 
mind if you replace that device.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Duncan
[...]  The newer default 
is 16 KiB, while the old default was the (minimum for amd64/x86) 4 KiB, 
and the maximum is 64 KiB.  See the mkfs.btrfs manpage for the details as 
there's a tradeoff, smaller sizes increase (metadata) fragmentation but 
decrease lock contention, while larger sizes pack more efficiently and 
are less fragmented but updating is more expensive.  The change in 
default was because 16 KiB was a win over the old 4 KiB for most use-
cases, but the 32 or 64 KiB options may or may not be, depending on use-
case, and of course if you're bottlenecking on locks, 4 KiB may still be 
a win.


Among all those, I'd be especially interested in what thread_pool=n does 
or doesn't do for you, both because it specifically mentions 
parallelization and because I've seen little discussion of it.

space_cache=v2 may also be a big boost for you, if your filesystems are 
the size the 6-device raid0 implies and are at all reasonably populated.

(Metadata) nodesize may or may not make a difference, tho I suspect if so 
it'll be mostly on writes (but I'm not familiar with the specifics there 
so could be wrong).  I'd be interested to see if it does.

In general I can recommend the no_holes and skinny_metadata features but 
you may well already have them, and the noatime mount option, which you 
may well already be using as well.  Similarly, I ensure that all my btrfs 
are mounted from first mount with autodefrag, so it's always on as the 
filesystem is populated, but I doubt you'll see a difference from that in 
your benchmarks unless you're specifically testing an aged filesystem 
that would be heavily fragmented on its own.
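
For concreteness, here's roughly how those knobs are spelled at mkfs and 
mount time (device names and values are placeholders, and whether any of 
them actually helps is workload-dependent, so treat this as a sketch to 
benchmark against rather than a recommendation):

  mkfs.btrfs -d raid0 -m raid1 -n 16k -O no-holes,skinny-metadata /dev/sd[b-g]
  mount -o noatime,autodefrag,thread_pool=8,space_cache=v2 /dev/sdb /mnt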


There's one guy here who has done heavy testing on the ssd stuff and 
knows btrfs on-device chunk allocation strategies very well, having come 
up with a utilization visualization utility and been the force behind the 
relatively recent (4.16-ish) changes to the ssd mount option's allocation 
strategy.  He'd be the one to talk to if you're considering diving into 
btrfs' on-disk allocation code, etc.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Understanding BTRFS RAID0 Performance

2018-10-05 Thread Duncan
Wilson, Ellis posted on Thu, 04 Oct 2018 21:33:29 + as excerpted:

> Hi all,
> 
> I'm attempting to understand a roughly 30% degradation in BTRFS RAID0
> for large read I/Os across six disks compared with ext4 atop mdadm
> RAID0.
> 
> Specifically, I achieve performance parity with BTRFS in terms of
> single-threaded write and read, and multi-threaded write, but poor
> performance for multi-threaded read.  The relative discrepancy appears
> to grow as one adds disks.

[...]

> Before I dive into the BTRFS source or try tracing in a different way, I
> wanted to see if this was a well-known artifact of BTRFS RAID0 and, even
> better, if there's any tunables available for RAID0 in BTRFS I could
> play with.  The man page for mkfs.btrfs and btrfstune in the tuning
> regard seemed...sparse.

This is indeed well known for btrfs at this point, as it hasn't been 
multi-read-thread optimized yet.  I'm personally more familiar with the 
raid1 case, where which one of the two copies gets the read is simply 
even/odd-PID-based, but AFAIK raid0 isn't particularly optimized either.

The recommended workaround is (as you might expect) btrfs on top of 
mdraid.  In fact, while it doesn't apply to your case, btrfs raid1 on top 
of mdraid0s is often recommended as an alternative to btrfs raid10, as 
that gives you the best of both worlds -- the data and metadata integrity 
protection of btrfs checksums and fallback (with writeback of the correct 
version) to the other copy if the first copy read fails checksum 
verification, with the much better optimized mdraid0 performance.  So it 
stands to reason that the same recommendation would apply to raid0 -- 
just do single-mode btrfs on mdraid0, for better performance than the as 
yet unoptimized btrfs raid0.
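
(A sketch of those two layerings, with placeholder device names; the 
mdadm geometry is illustrative only:)

  # btrfs single on one mdraid0 across all six disks:
  mdadm --create /dev/md0 --level=0 --raid-devices=6 /dev/sd[b-g]
  mkfs.btrfs -d single -m dup /dev/md0
  # or the raid10 alternative mentioned above, btrfs raid1 over two mdraid0s:
  mdadm --create /dev/md1 --level=0 --raid-devices=3 /dev/sd[b-d]
  mdadm --create /dev/md2 --level=0 --raid-devices=3 /dev/sd[e-g]
  mkfs.btrfs -d raid1 -m raid1 /dev/md1 /dev/md2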

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Transaction aborted error -28 clone_finish_inode_update

2018-10-05 Thread Duncan
David Goodwin posted on Thu, 04 Oct 2018 17:44:46 +0100 as excerpted:

> While trying to run/use bedup ( https://github.com/g2p/bedup )  I
> hit this :
> 
> 
> [Thu Oct  4 15:34:51 2018] [ cut here ]
> [Thu Oct  4 15:34:51 2018] BTRFS: Transaction aborted (error -28)
> [Thu Oct  4 15:34:51 2018] WARNING: CPU: 0 PID: 28832 at
> fs/btrfs/ioctl.c:3671 clone_finish_inode_update+0xf3/0x140 

> [Thu Oct  4 15:34:51 2018] CPU: 0 PID: 28832 Comm: bedup Not tainted
> 4.18.10-psi-dg1 #1

[snipping a bunch of stuff that I as a non-dev list regular can't do much 
with anyway]

> [Thu Oct  4 15:34:51 2018] BTRFS: error (device xvdg) in
> clone_finish_inode_update:3671: errno=-28 No space left
> [Thu Oct  4 15:34:51 2018] BTRFS info (device xvdg): forced readonly 

> % btrfs fi us /filesystem/
> Overall:
>     Device size:           7.12TiB
>     Device allocated:      6.80TiB
>     Device unallocated:  330.93GiB
>     Device missing:          0.00B
>     Used:                  6.51TiB
>     Free (estimated):    629.87GiB    (min: 629.87GiB)
>     Data ratio:               1.00
>     Metadata ratio:           1.00
>     Global reserve:      512.00MiB    (used: 0.00B)
> 
> Data+Metadata,single: Size:6.80TiB, Used:6.51TiB
>     /dev/xvdf       1.69TiB
>     /dev/xvdg       3.12TiB
>     /dev/xvdi       1.99TiB
> 
> System,single: Size:32.00MiB, Used:780.00KiB
>     /dev/xvdf      32.00MiB
> 
> Unallocated:
>     /dev/xvdf     320.97GiB
>     /dev/xvdg     949.00MiB
>     /dev/xvdi       9.03GiB
> 
> 
> I kind of think there is sufficient free space. at least globally
> within the filesystem.
> 
> Does it require balancing to redistribute the unallocated space better?
> Or is something misbehaving?

The latter, but unfortunately there's not much you can do about it at 
this point but wait for fixes, unless you want to split up that huge 
filesystem into several smaller ones.

In general, btrfs has at least four kinds of "space" that it can run out 
of, tho in your case it appears you're running mixed-mode so data and 
metadata space are combined into one.

* Unallocated space:  This is space that remains entirely unallocated in 
the filesystem.  It matters most when the balance between data and 
metadata space gets off.

This isn't a problem for you as in single mode space can be allocated 
from any device and you have one with hundreds of gigs unallocated.  It 
also tends to be less of a problem on mixed-bg mode, which you're 
running, as there's no distinction in mixed-mode between data and 
metadata.

* Data chunk space:
* Metadata chunk space:

Because you're running mixed-bg mode, there's no distinction between 
these two, but for normal mode, running out of one or the other while all 
the free space is allocated to chunks of the other type, can be a problem.

* Global reserve:  Taken from metadata, the global reserve is space the 
system won't normally use, that it tries to keep clear in order to be 
able to finish transactions once they're started, as btrfs' copy-on-write 
semantics means even deleting stuff requires a bit of additional space 
temporarily.

This seems to actually be where the problem is, because currently, 
certain btrfs operations such as reflinking/cloning/snapshotting (that 
is, just what you were doing) don't really calculate the needed space 
correctly and use arbitrary figures, which can be *wildly* off, while 
conversely a bare half-gig of global-reserve for a huge 7+ TiB filesystem 
seems rather proportionally small.  (Consider that my small pair-device 
btrfs raid1 root filesystem, 8 GiB per device, 16 GiB total, has a 16 MiB 
reserve; proportionally, your 7+ TiB filesystem would have a 7+ GiB 
reserve, but it only has half a GiB.)

So relatively small btrfs' don't tend to run into the problem, because 
they have proportionally larger reserves to begin with.  Plus they 
probably don't have proportionally as many snapshots/reflinks/etc, 
either, so the problem simply doesn't trigger for them.

Now I'm not a dev and my own use-case doesn't include either snapshotting 
or deduping, so I haven't paid that much attention to the specifics, but 
I have seen some recent patches on-list that based on the explanations 
should go some way toward fixing this problem by using more realistic 
figures for global-reserve calculations.  At this point those patches 
would be for 4.20 (which might be 5.0), or possibly 4.21, but the devs 
are indeed working on the problem and it should get better within a 
couple kernel cycles.

Alternatively perhaps the global reserve size could be bumped up on such 
large filesystems, but let's see if the more realistic operations-reserve 
calculations can fix things, first, as arguably that shouldn't be 
necessary once the calculations aren't so arbitrarily wild.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: What to do with damaged root fllesystem (opensuse leap 42.2)

2018-10-05 Thread Duncan
s and the greatest 
chance at fixing things or for restore, scraping files off the damaged 
filesystem.

So before doing the btrfs restore, you should find a current btrfs-progs, 
4.17.1 ATM, to do it with, as that should give you the best results.  Try 
Fedora Rawhide or Arch (or the Gentoo I run), as they tend to have more 
current versions.

Then you need some place to put the scraped files, a writable filesystem 
with enough space to put what you're trying to restore.

Once you have some place to put the scraped files, with luck, it's a 
simple case of running...

btrfs restore <options> <device> <path>

... where ...

<device> is the damaged filesystem

<path> is the path on the writable filesystem where you want to dump the 
restored files

and <options> can include various options as found in the btrfs-restore 
manpage, like -m/--metadata if you want to try to restore owner/times/
perms for the files, -s/--symlink if you want to try to restore symlinks, 
-x/--xattr if you want to try to restore xattrs, etc.

You may want to do a dry-run with -D/--dry-run first, to get some idea of 
whether it's looking like it can restore many of the files or not, and 
thus, of the sort of free space you may need on the writable filesystem 
to store the files it can restore.
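
To make that concrete (the device node and dump path below are just 
placeholders, substitute your own), a dry-run followed by the real thing 
might look like:

$ btrfs restore -D /dev/sdX1 /mnt/rescue
$ btrfs restore -m -x /dev/sdX1 /mnt/rescue

... with -D being the dry-run, -m and -x adding metadata and xattrs as 
described above, plus whichever of the other options you decide you want.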


If a simple btrfs restore doesn't seem to get anything, there is an 
advanced mode as well, with a link to the wiki page covering it in the 
btrfs-restore manpage, but it does get quite technical, and results may 
vary.  You will likely need help with that if you decide to try it, but 
as they say, that's a bridge we can cross when/if we get to it, no need 
to deal with it just yet.

Meanwhile, again, don't worry too much about whether you can recover 
anything here or not, because in any case you already have what was most 
important to you, either backups you can restore from if you considered 
the data worth having them, or the time and trouble you would have put 
into those backups, if you considered saving that more important than 
making the backups.  So losing the data on the filesystem, whether from 
filesystem error as seems to be the case here, due to admin fat-fingering 
(the infamous rm -rf .* or alike), or due to physical device loss if the 
disks/ssds themselves went bad, can never be a big deal, because the 
maximum value of the data in question is always strictly limited to that 
of the point at which having a backup is more important than the time/
trouble/resources you save(d) by not having one.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs problems

2018-09-21 Thread Duncan
Adrian Bastholm posted on Thu, 20 Sep 2018 23:35:57 +0200 as excerpted:

> Thanks a lot for the detailed explanation.
> About "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid. three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would you recommend running the sausage method
> instead, with "-d single" for safety?  I'm guessing that if one of the
> drives dies the data is completely lost.  Another variant I was
> considering is running a raid1 mirror on two of the drives and maybe a
> subvolume on the third, for less important stuff.

Agreed with CMurphy's reply, but he didn't mention...

As I wrote elsewhere recently (I don't remember if it was in a reply to 
you before you tried zfs and came back, or to someone else), I'll repeat 
it here, more briefly this time...

Keep in mind that on btrfs, it's possible (and indeed the default with 
multiple devices) to run data and metadata at different raid levels.

IMO, as long as you're following an appropriate backup policy that backs 
up anything valuable enough to be worth the time/trouble/resources of 
doing so, so if you /do/ lose the array you still have a backup of 
anything you considered valuable enough to worry about (and that caveat 
is always the case, no matter where or how it's stored, value of data is 
in practice defined not by arbitrary claims but by the number of backups 
it's considered worth having of it)...

With that backups caveat, I'm now confident /enough/ about raid56 mode to 
be comfortable cautiously recommending it for data, tho I'd still /not/ 
recommend it for metadata, which I'd recommend should remain the multi-
device default raid1 level.

That way, you're only risking a limited amount of raid5 data to the not 
yet as mature and well tested raid56 mode, while the metadata remains 
protected by the more mature raid1 mode.  If something does go wrong, 
it's much more likely to be only a few files lost instead of the entire 
filesystem, as is at risk if your metadata is raid56 as well; the 
metadata, including checksums, will be intact, so scrub should tell you 
which files are bad, and if those few files are valuable they'll be on 
the backup and easy enough to restore, compared to restoring the entire 
filesystem.  And for most use-cases, metadata should be relatively small 
compared to data, so duplicating metadata as raid1 while doing raid5 for 
data should go much easier on the capacity needs than raid1 for both 
would.

Tho I'd still recommend raid1 data as well for higher maturity and tested 
ability to use the good copy to rewrite the bad one if one copy goes bad 
(in theory, raid56 mode can use parity to rewrite as well, but that's not 
yet as well tested and there's still the narrow degraded-mode crash write 
hole to worry about), if it's not cost-prohibitive for the amount of data 
you need to store.  But for people on a really tight budget or who are 
storing double-digit TB of data or more, I can understand why they prefer 
raid5, and I do think raid5 is stable enough for data now, as long as the 
metadata remains raid1, AND they're actually executing on a good backup 
policy.
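
In mkfs terms, with the three devices from your example, the raid5-data/
raid1-metadata layout is simply the first line below, while the second is 
the after-the-fact alternative for an existing multi-device filesystem 
(mountpoint being a placeholder), not something to run on top of the 
first:

$ mkfs.btrfs -d raid5 -m raid1 /dev/sdb /dev/sdc /dev/sdd
$ btrfs balance start -dconvert=raid5 -mconvert=raid1 /mnt/point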

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-09-21 Thread Duncan
Axel Burri posted on Fri, 21 Sep 2018 11:46:37 +0200 as excerpted:

> I think you got me wrong here: There will not be binaries with the same
> filename. I totally agree that this would be a bad thing, no matter if
> you have bin/sbin merged or not, you'll end up in either having a
> collision or (even worse) rely on the order in $PATH.
> 
> With this "separated" patchset, you can install a binary
> "btrfs-subvolume-show", which has the same functionality as "btrfs
> subvolume show" (note the whitespace/dash), ending up with:
> 
> /sbin/btrfs
> /usr/bin/btrfs-subvolume-show
> /usr/bin/btrfs-subvolume-list

I did get you wrong (and had even understood the separately named 
binaries from an earlier post, too, but forgot).

Thanks. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-09-20 Thread Duncan
Axel Burri posted on Thu, 20 Sep 2018 00:02:22 +0200 as excerpted:

> Now not everybody wants to install these with fscaps or setuid, but it
> might also make sense to provide "/usr/bin/btrfs-subvolume-{show,list}",
> as they now work for a regular user. Having both root/user binaries
> concurrently is not an issue (e.g. in gentoo the full-featured btrfs
> command is in "/sbin/").

That's going to be a problem for distros (or users like me with advanced 
layouts, on gentoo too FWIW) that have the bin/sbin merge, where one is a 
symlink to the other.

FWIW I have both the /usr merge (tho reversed for me, so /usr -> . 
instead of having to have /bin and /sbin symlinks to /usr/bin) and the 
bin/sbin merge, along with, since I'm on amd64-nomultilib, the lib/lib64 
merge.  So:

$$ dir -gGd /bin /sbin /usr /lib /lib64
drwxr-xr-x 1 35688 Sep 18 22:56 /bin
lrwxrwxrwx 1 5 Aug  7 00:29 /lib -> lib64
drwxr-xr-x 1 78560 Sep 18 22:56 /lib64
lrwxrwxrwx 1 3 Mar 11  2018 /sbin -> bin
lrwxrwxrwx 1 1 Mar 11  2018 /usr -> .


Of course that last one (/usr -> .) leads to /share and /include hanging 
directly off of / as well, but it works.

But in that scheme /bin, /sbin, /usr/bin and /usr/sbin, are all the same 
dir, so only one executable of a particular name can exist therein.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-20 Thread Duncan
Tomasz Chmielewski posted on Wed, 19 Sep 2018 10:43:18 +0200 as excerpted:

> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.
> 
> 
> Now, an interesting thing.
> 
> When the filesystem is mounted with these options in fstab:
> 
> defaults,noatime,discard
> 
> 
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.
> 
> 
> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.
> 
> 
> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?

The other replies are good but I've not seen this pointed out yet...

Perhaps you are accounting for this already, but you don't /say/ you 
are, while you do mention repeatedly toggling the space-cache options, 
which would trigger it, so you /need/ to account for it...

I'm not sure about space_cache=v2 (it's probably more efficient even if 
it does have to do the same), but I'm quite sure that space_cache=v1 
takes some time after the initial mount to scan the filesystem and 
actually create the map of available free space that is the space_cache.

Now you said ssds, which should be reasonably fast, but you also say 3-
device btrfs raid1, with each device ~2TB, and the filesystem ~40% full, 
which should be ~2 TB of data, which is likely somewhat fragmented so 
it's likely rather more than 2 TB of data chunks to scan for free space, 
and that's going to take /some/ time even on SSDs!

So if you're toggling settings like that in your tests, be sure to let 
the filesystem rebuild its cache that you just toggled and give it time 
to complete that and quiesce, before you start trying to measure write 
amplification.

Otherwise it's not write-amplification you're measuring, but the churn 
from the filesystem still trying to reset its cache after you toggled it!


Also, while 4.17 is well after the ssd mount option (usually auto-
detected, check /proc/mounts, mount output, or dmesg, to see if the ssd 
mount option is being added) fixes that went in in 4.14, if the 
filesystem has been in use for several kernel cycles and in particular 
before 4.14, with the ssd mount option active, and you've not rebalanced 
since then, you may well still have serious space fragmentation from 
that, which could increase the amount of data in the space_cache map 
rather drastically, thus increasing the time it takes to update the 
space_cache, particularly v1, after toggling it on.
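
Checking whether the ssd option got auto-added is quick, something like:

$ grep btrfs /proc/mounts
$ dmesg | grep -i btrfs

... looking for ssd in the option list of the first, and at the mount-
time btrfs messages in the second.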

A balance can help correct that, but it might well be easier and should 
result in a better layout to simply blow the filesystem away with a 
mkfs.btrfs and start over.


Meanwhile, as Remi already mentioned, you might want to reconsider nocow 
on btrfs raid1, since nocow defeats checksumming, so scrub, which 
verifies checksums, simply skips those files, and if the two copies get 
out of sync for some reason...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Duncan
Chris Murphy posted on Tue, 18 Sep 2018 13:34:14 -0600 as excerpted:

> I've run into some issue where grub2-mkconfig and grubby, can change the
> grub.cfg, and then do a really fast reboot without cleanly unmounting
> the volume - and what happens? Can't boot. The bootloader can't do log
> replay so it doesn't see the new grub.cfg at all. If all you do is mount
> the volume and unmount, log replay happens, the fs metadata is all fixed
> up just fine, and now the bootloader can see it.
> This same problem can happen with the kernel and initramfs
> installations.
> 
> (Hilariously the reason why this can happen is because of a process
> exempting itself from being forcibly killed by systemd *against* the
> documented advice of systemd devs that you should only do this for
> processes not on rootfs; but as a consequence of this process doing the
> wrong thing, systemd at reboot time ends up doing an unclean unmount and
> reboot because it won't kill the kill exempt process.)

That's... interesting!

FWIW here I use grub2, but as many admins I'm quite comfortable with 
bash, and the high-level grub2 config mechanisms simply didn't let me do 
what I needed to do.  So I had to learn the lower-level grub bash-like 
scripting language to do what I wanted to do, and I even go so far as to 
install-mask some of the higher level stuff so it doesn't get installed 
at all, and thus can't somehow run and screw up my config.

So I edit my grub scripts (and grubenv) much like I'd edit any other 
system script (and its separate config file where I have them) I might 
need to update, then save my work, and with both a bios-boot partition 
setup for grub-core and an entirely separate /boot that's not routinely 
mounted unless I'm updating it, I normally unmount it when I'm done, 
before I actually reboot.

So I've never had systemd interfere.

(And of course I have backups.  In fact, on my main personal system, with 
both the working root and its primary backup being btrfs pair-device 
raid1 on separate devices, I have four physical ssds installed, with a 
bios-boot partition with grub installed and a separate dedicated (btrfs 
dup mode) /boot on each of all four, so I have a working grub and /boot 
and three backups, each of which I can point the bios at and have tested 
separately as bootable.  So if upgrading grub or anything on /boot goes 
wrong I find that out testing the working copy, and boot one of the 
backups to resolve the problem before eventually upgrading all three 
backups after the working copy upgrade is well tested.)

> So *already* we have file systems that are becoming too complicated for
> the bootloader to reliably read, because they cannot do journal relay,
> let alone have any chance of modifying (nor would I want them to do
> this). So yeah I'm, very rapidly becoming opposed to grubenv on anything
> but super simple volumes like maybe ext4 without a journal (extents are
> nice); or even perhaps GRUB should just implement its own damn file
> system and we give it its own partition - similar to BIOS Boot - but
> probably a little bigger

You realize that solution is already standardized as EFI and its standard 
FAT filesystem, right?

=:^)

>>> but is the bootloader overwrite of gruvenv going to recompute parity
>>> and write to multiple devices? Eek!
>>
>> Recompute the parity should not be a big deal. Updating all the
>> (b)trees would be a too complex goal.
> 
> I think it's just asking for trouble. Sometimes the best answer ends up
> being no, no and definitely no.

Agreed.  I actually /like/ the fact that at the grub prompt I can rely on 
everything being read-only, and if that SuSE patch to put grubenv in the 
reserved space and make it writable gets upstreamed, I really hope 
there's a build-time configure option to disable the feature, because IMO 
grub doesn't /need/ to save state at that point, and allowing it to do so 
is effectively needlessly playing a risky Russian Roulette game with my 
storage devices.  Were it actually needed that'd be different, but it's 
not needed, so any risk is too much risk.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs panic problem

2018-09-17 Thread Duncan
ountable filesystem.

So for routine operation, it's no big deal if userspace is a bit old, at 
least as long as it's new enough to have all the newer command formats, 
etc, that you need, and for comparing against others when posted.  But 
once things go bad on you, you really want the newest btrfs-progs in 
order to give you the best chance at either fixing things, or worst-
case, at least retrieving the files off the dead filesystem.  So using 
the older distro btrfs-progs for routine running should be fine, but 
unless your backups are complete and frequent enough that if something 
goes wrong it's easiest to simply blow the bad version away with a fresh 
mkfs and start over, you'll probably want at least a reasonably current 
btrfs-progs on your rescue media at least.  Since the userspace version 
numbers are synced to the kernel cycle, a good rule of thumb is keep your 
btrfs-progs version to at least that of the oldest recommended LTS kernel 
version, as well, so you'd want at least btrfs-progs 4.9 on your rescue 
media, for now, and 4.14, coming up, since when the new kernel goes LTS 
that'll displace 4.9 and 4.14 will then be the second-back LTS.
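
(Checking what a rescue image actually ships is trivial, of course:

$ btrfs --version
$ uname -r

... the first being the progs version, the second the kernel it's paired 
with.)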

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: state of btrfs snapshot limitations?

2018-09-14 Thread Duncan
early, all at (nearly) the same 
time, and then simply deleting all in the appropriate directory beyond 
some cap time, instead of the thinning logic of the above traditional 
model, wouldn't actually be much less efficient in terms of snapshot 
taking, because snapshotting is /designed/ to be fast, while at the same 
time it would significantly simplify the logic of the deletion scripts 
since they could simply delete everything older than X, instead of having 
to do conditional thinning logic.

So your scheme with period slotting and capping as opposed to simply 
timestamping and thinning, is a new thought to me, but I like the idea 
for its simplicity, and as I said, it shouldn't really "cost" more, 
because taking snapshots is fast and relatively cost-free. =:^)
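
As a sketch of just how simple the delete-beyond-the-cap script gets, 
assuming snapshots live flat under /snaps with date-sortable names and a 
GNU userland (all assumptions, adjust to your own layout), keeping the 
newest 100:

$ ls -d /snaps/* | head -n -100 | xargs -r -n1 btrfs subvolume delete

... no thinning conditionals at all, just sort order and a cutoff.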

I'd still recommend taking it easy on the yearly, tho, perhaps beyond a 
year or two, preferring physical media swapping and archiving at the 
yearly level if yearly archiving is found necessary at all.  And 
depending on your particular needs, physical-swap archiving at six months 
or even quarterly might actually be appropriate, especially given that 
(with spinning rust at least, I guess ssds retain best with periodic 
power-up) on-the-shelf archiving should be more dependable as a last-
resort backup.

Or do similar online with for example Amazon Glacier (never used 
personally, tho I actually have the site open for reference as I write 
this and at US $0.004 per gig per month... so say $100 for a TB for 2 
years or a couple hundred gig for a decade, $10/yr with a much better 
chance at actually being able to use it after a fire/flood/etc that'd 
take out anything local, tho actually retrieving it would cost a bit 
too... I'm actually thinking perhaps I should consider it... obviously 
I'd well encrypt first... until now I'd always done onsite backup only, 
figuring if I had a fire or something that'd be the last thing I'd be 
worried about, but now I'm actually considering...)

OK, so I guess the bottom-line answer is "it depends."  But the above 
should give you more data to plug in for your specific use-case.

But if it's pure backup, where you don't expect to expand to more 
devices in-place, you can blow it away rather than ever consider check 
--repair, AND you can do a couple of filesystems so as to keep your daily 
snapshots separate from the more frequent backups and thus avoid snapshot 
deletion, then you may actually be able to do the 365 dailies for 2-3 
years, then swap out filesystems and devices without deleting snapshots, 
thus avoiding any of the maintenance-scaling issues that are the big 
limitation, and have it work just fine.

OTOH, if your use-case is a bit more conventional, with more 
maintenance to have to worry about scaling, capping to 100 snapshots 
remains a reasonable recommendation, and if you need quotas as well and 
can't afford to disable them even temporarily for a balance, you may find 
under 50 snapshots to be your maintenance pain tolerance threshold.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-11 Thread Duncan
often aren't even aware of the 
tradeoffs they're taking on those solutions, so... I suppose when it's 
all said and done the only people aware of the issues on btrfs are likely 
going to be the highly technical and case-optimizer crowds, too.  
Everyone else will probably just use the defaults and not even be aware 
of the tradeoffs they're making by doing so, as is already the case on 
mdraid and zfs.

---
[1] As I'm no longer running either mdraid or parity-raid, I've not 
followed this extremely closely, but writing this actually spurred me to 
google the problem and see when and how mdraid fixed it.  So the links 
are from that. =:^)

[2] Journalling/journaling, one or two Ls?  The spellcheck flags both and 
last I tried googling it the answer was inconclusive.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: List of known BTRFS Raid 5/6 Bugs?

2018-09-08 Thread Duncan
's merged as well.  
Don't just jump on it immediately after merge unless you're deliberately 
doing so to help test for bugs and get them fixed and the feature 
stabilized as soon as possible.  Wait a few kernel cycles, follow the 
list to see how the feature's stability is coming, and /then/ use it, 
after factoring in its remaining then still new and less mature 
additional risk into your backup risks profile, of course.

Time?  Not a dev but following the list and obviously following the new 3-
way-mirroring, I'd say probably not 4.20 (5.0?) for the new mirroring 
modes, so 4.21/5.1 more reasonably likely (if all goes well, could be 
longer), probably another couple cycles (if all goes well) after that for 
the parity-raid logging code built on top of the new mirroring modes, so 
perhaps a year (~5 kernel cycles) to introduction for it.  Then wait 
however many cycles until you think it has stabilized.  Call that another 
year.  So say about 10 kernel cycles or two years.  It could be a bit 
less than that, say 5-7 cycles, if things go well and you take it before 
I'd really consider it stable enough to recommend, but given the 
historically much longer than predicted development and stabilization 
times for raid56 already, it could just as easily end up double that, 4-5 
years out, too.

But raid56 logging mode for write-hole mitigation is indeed actively 
being worked on right now.  That's what we know at this time.

And even before that, right now, raid56 mode should already be reasonably 
usable, especially if you do data raid5/6 and metadata raid1, as long as 
your backup policy and practice is equally reasonable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Re-mounting removable btrfs on different device

2018-09-06 Thread Duncan
Remi Gauvin posted on Thu, 06 Sep 2018 20:54:17 -0400 as excerpted:

> I'm trying to use a BTRFS filesystem on a removable drive.
> 
> The first drive drive was added to the system, it was /dev/sdb
> 
> Files were added and device unmounted without error.
> 
> But when I re-attach the drive, it becomes /dev/sdg (kernel is fussy
> about re-using /dev/sdb).
> 
> btrfs fi show: output:
> 
> Label: 'Archive 01'  uuid: 221222e7-70e7-4d67-9aca-42eb134e2041
>   Total devices 1 FS bytes used 515.40GiB
>   devid1 size 931.51GiB used 522.02GiB path /dev/sdg1
> 
> This causes BTRFS to fail mounting the device [errors snipped]

> I've seen some patches on this list to add a btrfs device forget option,
> which I presume would help with a situation like this.  Is there a way
> to do that manually?

Without the mentioned patches, the only way (other than reboot) is to 
remove and reinsert the btrfs kernel module (assuming it's a module, not 
built-in), thus forcing it to forget state.

Of course if other critical mounted filesystems (such as root) are btrfs, 
or if btrfs is a kernel-built-in not a module and thus can't be removed, 
the above doesn't work and a reboot is necessary.  Thus the need for 
those patches you mentioned.
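
For the record, the manual forget amounts to this, assuming nothing else 
btrfs is mounted and with the mountpoint being yours, not mine:

$ umount /mnt/archive
$ modprobe -r btrfs
$ modprobe btrfs
$ btrfs device scan

... with the modprobe -r failing (harmlessly) if btrfs is built-in or 
still in use, which is exactly the case where you need those patches or a 
reboot instead.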

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: IO errors when building RAID1.... ?

2018-08-31 Thread Duncan
Chris Murphy posted on Fri, 31 Aug 2018 13:02:16 -0600 as excerpted:

> If you want you can post the output from 'sudo smartctl -x /dev/sda'
> which will contain more information... but this is in some sense
> superfluous. The problem is very clearly a bad drive, the drive
> explicitly report to libata a write error, and included the sector LBA
> affected, and only the drive firmware would know that. It's not likely a
> cable problem or something like. And that the write error is reported at
> all means it's persistent, not transient.

Two points:

1) Does this happen to be an archive/SMR (shingled magnetic recording) 
device?  If so that might be the problem as such devices really aren't 
suited to normal usage (they really are designed for archiving), and 
btrfs' COW patterns can exacerbate the issue.  It's quite possible that 
the original install didn't load up the IO as heavily as the balance-
convert does, so the problem appears with convert but not for install.

2) Assuming it's /not/ an SMR issue, and smartctl doesn't say it's dying, 
I'd suggest running badblocks -w (make sure the device doesn't have 
anything valuable on it!) on the device -- note that this will take 
a while, probably a couple of days or perhaps longer, as it writes four 
different patterns to the entire device one at a time, reading everything 
back to verify the pattern was written correctly, so it's actually going 
over the entire device 8 times, alternating write and read, but it should 
settle the issue of the reliability of the device.
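
For reference, the destructive test is simply the below, and it really 
does destroy whatever is on the device, so triple-check that sdX:

$ badblocks -wsv /dev/sdX

... -w for the write test, -s and -v for progress and verbosity.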

Or if you'd rather spend the money than the time and it's not still 
under warranty, just replace it, or at least buy a new one to use while 
you run the tests on that one.  I fully understand that tying up the 
thing running tests on it for days straight may not be viable.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: How to erase a RAID1 (+++)?

2018-08-31 Thread Duncan
Alberto Bursi posted on Fri, 31 Aug 2018 14:54:46 + as excerpted:

> I just keep around a USB drive with a full Linux system on it, to act as
> "recovery". If the btrfs raid fails I boot into that and I can do
> maintenance with a full graphical interface and internet access so I can
> google things.

I do something very similar, except my "recovery boot" is my backup 
(normally including, for root, two levels of backup/recovery, three for 
some things).

I've actually gone so far as to have /etc/fstab be a symlink to one of 
several files, depending on what version of root vs. the off-root 
filesystems I'm booting, with a set of modular files that get assembled 
by scripts to build the fstabs as appropriate.  So updating fstab is a 
process of updating the modules, then running the scripts to create the 
actual fstabs, and after I update a root backup the last step is changing 
the symlink to point to the appropriate fstab for that backup, so it's 
correct if I end up booting from it.
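
The symlink swap at the end is nothing fancy, just something like this, 
with the file name being whatever the module scripts generated for that 
particular backup:

$ ln -sfn fstab.root-backup1 /etc/fstab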

Meanwhile, each root, working and two backups, is its own set of two 
device partitions in btrfs raid1 mode.  (One set of backups is on 
separate physical devices, covering the device death scenario, the other 
is on different partitions on the same, newer and larger pair of physical 
devices as the working set, so it won't cover device death but still 
covers fat-fingering, filesystem fubaring, bad upgrades, etc.)

/boot is separate and there's four of those (working and three backups), 
one each on each device of the two physical pairs, with the bios able to 
point to any of the four.  I run grub2, so once the bios loads that, I 
can interactively load kernels from any of the other three /boots and 
choose to boot any of the three roots.

And I build my own kernels, with an initrd attached as an initramfs to 
each, and test that they boot.  So selecting a kernel by definition 
selects its attached initramfs as well, meaning the initr*s are backed up 
and selected with the kernels.

(As I said earlier it'd sure be nice to be able to do away with the 
initr*s again.  I was actually thinking about testing that today, which 
was supposed to be a day off, but got called in to work, so the test will 
have to wait once again...)

What's nice about all that is that just as you said, each recovery/backup 
is a snapshot of the working system at the time I took the backup, so 
it's not a limited recovery boot at all, it has the same access to tools, 
manpages, net, X/plasma, browsers, etc, that my normal system does, 
because it /is/ my normal system from whenever I took the backup.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: How to erase a RAID1 (+++)?

2018-08-30 Thread Duncan
purposes, go 
right ahead!  That's what btrfs raid1 is for, after all.  But if you were 
planning on mounting degraded (semi-)routinely, please do reconsider, 
because it's just not ready for that at this point, and you're going to 
run into all sorts of problems trying to do it on an ongoing basis due to 
the above issues.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread Duncan
Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:

> Thinking again, this is totally acceptable. If the requirement was a
> good health disk, then I think I must check the disk health by myself.
> I may believe that the disk is in a good state, or make a quick test or
> make some very detailed tests to be sure.

For testing you might try badblocks.  It's most useful on a device that 
doesn't have a filesystem on it you're trying to save, so you can use the 
-w write-test option.  See the manpage for details.

The -w option should force the device to remap bad blocks where it can as 
well, and you can take your previous smartctl read and compare it to a 
new one after the test.

Hint if testing multiple spinning-rust devices:  Try running multiple 
tests at once.  While this might have been slower on old EIDE, at least 
with spinning rust, on SATA and similar you should be able to test 
multiple devices at once without them slowing down significantly, because 
the bottleneck is the spinning rust, not the bus, controller or CPU.  I 
used badblocks years ago to test my new disks before setting up mdraid on 
them, and with full disk tests on spinning rust taking (at the time) 
nearly a day a pass and four passes for the -w test, the multiple tests 
at once trick saved me quite a bit of time!
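
Bracketed with smartctl and run in parallel, that looks something like 
the below (device names illustrative, and again -w destroys whatever is 
on them):

$ smartctl -x /dev/sdb > sdb-before.txt
$ smartctl -x /dev/sdc > sdc-before.txt
$ badblocks -wsv /dev/sdb > sdb-bad.txt 2>&1 &
$ badblocks -wsv /dev/sdc > sdc-bad.txt 2>&1 &
$ wait
$ smartctl -x /dev/sdb > sdb-after.txt
$ smartctl -x /dev/sdc > sdc-after.txt

... then compare the before/after reports, particularly the reallocated 
and pending sector counts.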

It's not a great idea to do the test on new SSDs as it's unnecessary 
wear, writing the entire device four times with different patterns each 
time for a -w, but it might be worthwhile to try it on an ssd you're just 
trying to salvage, forcing it to swap out any bad sectors it encounters 
in the process.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs-convert missing in btrfs-tools v4.15.1

2018-08-23 Thread Duncan
Nicholas D Steeves posted on Thu, 23 Aug 2018 14:15:18 -0400 as excerpted:

>> It's in my interest to ship all tools in distros, but there's also only
>> that much what the upstream community can do. If you're going to
>> reconsider the status of btrfs-convert in Debian, please let me know.
> 
> Yes, I'd be happy to advocate for its reinclusion if the answer to 4/5
> of the following questions is "yes".  Does SUSE now recommend the use of
> btrfs-convert to its enterprise customers?  The following is a
> frustrating criteria, but: Can a random desktop user run btrfs-convert
> against their ext4 rootfs and expect the operation to succeed?  Is
> btrfs-convert now sufficiently trusted that it can be recommended with
> the same degree of confidence as a backup, mkfs.btrfs, then restore to
> new filesystem approach?  Does the user of a btrfs volume created with
> btrfs-convert have an equal or lesser probability of encountering bugs
> compared to one who used mkfs.btrfs?

Just a user and list regular here, and gentoo not debian, but for what it 
counts...

I'd personally never consider or recommend a filesystem converter over 
the backup, mkfs-to-new-fs, restore-to-new-fs, method, for three reasons.

1) Regardless of how stable a filesystem converter is and what two 
filesystems the conversion is between, "things" /do/ occasionally happen, 
thus making it irresponsible to use or recommend use of such a converter 
without a suitably current and tested backup, "just in case."

(This is of course a special case of the sysadmin's first rule of 
backups, that the true value of data is defined not by any arbitrary 
claims, but by the number of backups of that data it's considered worth 
the time/trouble/resources to make/have.  If the data value is trivial 
enough, sure, don't bother with the backup, but if it's of /that/ low a 
value, so low it's not worth a backup even when doing something as 
theoretically risky as a filesystem conversion, why is it worth the time 
and trouble to bother converting it in the first place, instead of just 
blowing it away and starting clean?)

2) Once a backup is considered "strongly recommended", as we've just 
established that it should be in 1 regardless of the stability of the 
converter, just using the existing filesystem as that backup and starting 
fresh with a mkfs for the new filesystem and copying things over is 
simply put the easiest, simplest and cleanest method to change 
filesystems.

3) (Pretty much)[1] Regardless of the filesystems in question, a fresh 
mkfs and clean sequential transfer of files from the old-fs/backup to the 
new one is pretty well guaranteed to be better optimized than conversion 
from an existing filesystem of a different type, particularly one that 
has been in normal operation for awhile and thus has operational 
fragmentation of both data and free-space.  That's in addition to being 
less bug-prone, even for a "stable" converter.


Restating: So (1) doing a conversion without a backup is irresponsible, 
(2) the easiest backup and conversion method is directly using the old fs 
as the backup, and copying over to the freshly mkfs-ed new filesystem, 
and (3) a freshly mkfs-ed filesystem and sequential copy of files to it 
from backup, whether that be the old filesystem or not, is going to be 
more efficient and less bug-prone than an in-place conversion.

Given the above, why would /anyone/ /sane/ consider using a converter?  
It simply doesn't make sense, even if the converter were as stable as the 
most stable filesystems we have.


So as a distro btrfs package maintainer, do what you wish in terms of the 
converter, but were it me, I might actually consider replacing it with an 
executable that simply printed out some form of the above argument, with 
a pointer to the sources should they still be interested after having 
read that argument.[2] Then, if people really are determined to 
unnecessarily waste their time to get a less efficient filesystem, 
possibly risking their data in the process of getting it, they can always 
build the converter from sources themselves.
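
(The stand-in could literally be a few-line script, sketched here, not 
anything actually shipped anywhere:

#!/bin/sh
echo "btrfs-convert is not shipped in this package." >&2
echo "A fresh mkfs.btrfs plus a copy from the old filesystem (your" >&2
echo "backup) is recommended instead; build btrfs-progs from source" >&2
echo "if you still want the converter." >&2
exit 1

... much as fsck.btrfs just points people at btrfs check.)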

---
[1] I debated omitting the qualifier as I know of no exceptions, but I'm 
not a filesystem expert and while I'm a bit skeptical, I suppose it's 
possible that they might exist.

[2] There's actually btrfs precedent for this in the form of the 
executable built as fsck.btrfs, which does nothing (successfully) but 
possibly print a message referring people to btrfs check, if run in 
interactive mode.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: lazytime mount option—no support in Btrfs

2018-08-22 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 22 Aug 2018 07:30:09 -0400 as
excerpted:

>> Meanwhile, since broken rootflags requiring an initr* came up let me
>> take the opportunity to ask once again, does btrfs-raid1 root still
>> require an initr*?  It'd be /so/ nice to be able to supply the
>> appropriate rootflags=device=...,device=... and actually have it work
>> so I didn't need the initr* any longer!

> Last I knew, specifying appropriate `device=` options in rootflags works
> correctly without an initrd.

Just to confirm, that's with multi-device btrfs rootfs?  Because it used 
to work when the btrfs was single-device, but not multi-device.

(For multi-device, or at least raid1, one also had to add degraded, or 
it would refuse to mount despite all the appropriate device= entries in 
rootflags, thus of course risking all the problems that running raid1 
degraded operationally can bring.  I never figured out for sure whether 
btrfs was smart enough to eventually pick up the other devices after a 
later scan or not, but either way it was a risk I wasn't willing to 
take.)
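
For reference, the sort of kernel command line I mean, with illustrative 
device paths, and with ,degraded appended to the rootflags being the 
reluctant workaround mentioned above:

root=/dev/sda2 rootfstype=btrfs rootflags=device=/dev/sda2,device=/dev/sdb2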

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 21 Aug 2018 13:01:00 -0400 as
excerpted:

> Otherwise, the only option for people who want it set is to patch the
> kernel to get noatime as the default (instead of relatime).  I would
> look at pushing such a patch upstream myself actually, if it weren't for
> the fact that I'm fairly certain that it would be immediately NACK'ed by
> at least Linus, and probably a couple of other people too.

What about making default-noatime a kconfig option, presumably set to 
default-relatime by default?  That seems to be the way many legacy-
incompatible changes work.  Then it's up to the distro for most users, 
as it in fact already is, only if the distro set noatime as the default 
they'd at least be using an upstream option instead of patching it 
themselves, making it upstream code that can be accounted for, instead of 
downstream code that... who knows?

Meanwhile, I'd be interested in seeing your local patch.  I'm local-
patching noatime-default here too, but not being a dev, I'm not entirely 
sure I'm doing it "correctly", tho AFAICT it does seem to work.  FWIW, 
here's what I'm doing (posting inline so may be white-space damaged, and 
IIRC I just recently manually updated the line numbers so they don't 
reflect the code at the 2014 date any more, but as I'm not sure of the 
"correctness" it's not intended to be applied in any case):

--- fs/namespace.c.orig 2014-04-18 23:54:42.167666098 -0700
+++ fs/namespace.c  2014-04-19 00:19:08.622741946 -0700
@@ -2823,8 +2823,9 @@ long do_mount(const char *dev_name, cons
goto dput_out;
 
/* Default to relatime unless overriden */
-   if (!(flags & MS_NOATIME))
-   mnt_flags |= MNT_RELATIME;
+   /* JED: Make that noatime */
+   if (!(flags & MS_RELATIME))
+   mnt_flags |= MNT_NOATIME;
 
/* Separate the per-mountpoint flags */
if (flags & MS_NOSUID)
@@ -2837,6 +2837,8 @@ long do_mount(const char *dev_name, cons
mnt_flags |= MNT_NOATIME;
if (flags & MS_NODIRATIME)
mnt_flags |= MNT_NODIRATIME;
+   if (flags & MS_RELATIME)
+   mnt_flags |= MNT_RELATIME;
if (flags & MS_STRICTATIME)
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)

Sane, or am I "doing it wrong!"(TM), or perhaps doing it correctly, but 
missing a chunk that should be applied elsewhere?


Meanwhile, since broken rootflags requiring an initr* came up let me take 
the opportunity to ask once again, does btrfs-raid1 root still require an 
initr*?  It'd be /so/ nice to be able to supply the appropriate 
rootflags=device=...,device=... and actually have it work so I didn't 
need the initr* any longer!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Duncan
#x27;s" quotas shouldn't be 
affected, because that's not what btrfs quotas manage.  There are other 
(non-btrfs) tools for that.

>>> In short: values representing quotas are user-oriented ("the numbers
>>> one bought"), not storage-oriented ("the numbers they actually
>>> occupy").

Btrfs quotas are storage-oriented, and if you're using them, at least 
directly, for user-oriented, you're using the proverbial screwdriver as a 
proverbial hammer.

> What is VFS disk quotas and does Btrfs use that at all? If not, why not?
> It seems to me there really should be a high level basic per directory
> quota implementation at the VFS layer, with a single kernel interface as
> well as a single user space interface, regardless of the file system.
> Additional file system specific quota features can of course have their
> own tools, but all of this re-invention of the wheel for basic directory
> quotas is a mystery to me.

As mentioned above and by others, btrfs quotas don't use vfs quotas (or 
the reverse, really, it'd be vfs quotas using information exposed by 
btrfs quotas... if it worked that way), because there's an API mismatch: 
their intended usage and the information they convey and control are 
different, and (AFAIK) were never intended or claimed to be the same.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: recover broken partition on external HDD

2018-08-06 Thread Duncan
ime/hassle/resources, before you ever 
lost the data, and the data loss isn't a big deal because it, by 
definition of not having a backup, can be of only trivial value not worth 
the hassle.

There's no #3.  The data was either defined as worth a backup by virtue 
of having one, and can be restored from there, or it wasn't, but no big 
deal because the time/trouble/resources that would have otherwise gone 
into that backup was defined as more important, and was saved before the 
data was ever lost in the first place.

Thus, while the loss of the data due to fat-fingering the placement of 
that ZFS (and all sysadmins come to appreciate the real risk of fat-
fingering, after a few events of their own) might be a bit of a bother, 
it's not worth spending huge amounts of time trying to recover, because 
the data was either worth having a backup, in which case you simply 
recover from it, or it wasn't, in which case it's not worth spending huge 
amounts of time trying to recover, either.

Of course there's still the pre-disaster weighed risk that something will 
go wrong vs. the post-disaster it DID go wrong, now how do I best get 
back to normal operation question, but in the context of the backups rule 
above resolving that question is more a matter of whether it's most 
efficient to spend a little time trying to recover the existing data with 
no guarantee of full success, or to simply jump directly into the wipe 
and restore from known-good (because tested!) backups, which might take 
more time, but has a (near) 100% chance at recovery to the point of the 
backup.  (The slight chance of failure to recover from tested backups is 
what multiple levels of backups cover for, with the value of the 
data and the weighed risk balanced against the value of the time/hassle/
resources necessary to do that one more level of backup.)

So while it might be worth a bit of time to quick-test recovery of the 
damaged data, it very quickly becomes not worth the further hassle, 
because either the data was already defined as not worth it due to not 
having a backup, or restoring from that backup will be faster and less 
hassle, with a far greater chance of success, than diving further into 
the data recovery morass, with ever more limited chances of success.

Live by that sort of policy from now on, and the results of the next 
failure, whether it be hardware, software, or wetware (another fat-
fingering, again, this is coming from someone, me, who has had enough of 
their own!), won't be anything to write the list about, unless of course 
it's a btrfs bug and quite apart from worrying about your data, you're 
just trying to get it fixed so it won't continue to happen.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: BTRFS and databases

2018-08-01 Thread Duncan
MegaBrutal posted on Wed, 01 Aug 2018 05:45:15 +0200 as excerpted:

> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
> 
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW nature
> that is elsewhere a blessing, with databases it's a drawback). But are
> there any advantages of still sticking to BTRFS for a database albeit
> CoW is disabled, or should I just return to the old and reliable ext4
> for those applications?

Good question, on which I might expect some honest disagreement on the 
answer.

Personally, I tend to hate nocow with a passion, and would thus recommend 
putting databases and similar write-pattern (VM images...) files on their 
own dedicated non-btrfs (ext4, etc) if at all reasonable.

But that comes from a general split partition-favoring viewpoint, where 
doing another partition/lvm-volume and putting a different filesystem on 
it is no big deal, as it's just one more partition/volume to manage of 
(likely) several.

Some distros/companies/installations have policies strongly favoring 
btrfs for its "storage pool" features, trying to keep things simple and 
flexible by using just the one solution and one big btrfs and throwing 
everything onto it, often using btrfs subvolumes where others would use 
separate partitions/volumes with independent filesystems.  For these 
folks, the flexibility of being able to throw it all on one filesystem 
with subvolumes overrides the down sides of having to deal with nocow and 
its conditions, rules and additional risk.

And a big part of that flexibility, along with being a feature in its own 
right, is btrfs built-in multi-device, without having to resort to an 
additional multi-device layer such as lvm or mdraid.


So if you're using btrfs for multi-device or other features that nocow 
doesn't affect, it's plausible that you'd prefer nocow on btrfs to 
/having/ to do partitioning/lvm/mdraid and set up that separate non-btrfs 
just for your database (or vm image) files.

But from your post you're perfectly fine with partitioning and the like 
already, and won't consider it a heavy imposition to deal with a separate 
non-btrfs, ext4 or whatever, and in that case, at least here, I'd 
strongly recommend you do just that, avoiding the nocow that I honestly 
see as a compromise best left to those that really need it because they 
aren't prepared to deal with the hassle of setting up the separate 
filesystem along with all that entails.
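
The separate-filesystem route really is mundane.  A sketch assuming LVM 
and a MySQL datadir, with names and sizes purely illustrative:

$ lvcreate -L 50G -n mysql vg0
$ mkfs.ext4 /dev/vg0/mysql
$ mount /dev/vg0/mysql /var/lib/mysql

... plus the matching fstab entry, and you never have to think about the 
nocow rules again.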

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: csum failed on raid1 even after clean scrub?

2018-08-01 Thread Duncan
Sterling Windmill posted on Mon, 30 Jul 2018 21:06:54 -0400 as excerpted:

> Both drives are identical, Seagate 8TB external drives

Are those the "shingled" SMR drives, normally sold as archive drives and 
first commonly available in the 8TB size, and often bought for their 
generally better price-per-TB without fully realizing the implications?

There have been bugs regarding those drives in the past, and while I 
believe those bugs were fixed and AFAIK current status is no known SMR-
specific bugs, they really are /not/ particularly suited to btrfs usage 
even for archiving, and definitely not to general use-cases (that is, 
pretty much anything but the straight-up archiving they are sold for).

Of course USB connections are notorious for being unreliable in terms of 
btrfs usage as well, and I'd really hate to think what a combination of 
SMR on USB might wreak.

If they're not SMR then carry-on! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: File permissions lost during send/receive?

2018-07-24 Thread Duncan
Marc Joliet posted on Tue, 24 Jul 2018 22:42:06 +0200 as excerpted:

> On my system I get:
> 
> % sudo getcap /bin/ping /sbin/unix_chkpwd
> /bin/ping = cap_net_raw+ep
> /sbin/unix_chkpwd = cap_dac_override+ep
> 
>> (getcap on unix_chkpwd returns nothing, but while I use kde/plasma I
>> don't normally use the lockscreen at all, so for all I know that's
>> broken here too.)

OK, after remerging pam, I get the same for unix_chkpwd (tho here I have 
sbin merge so it's /bin/unix_chkpwd with sbin -> bin), so indeed, it must 
have been the same problem for you with it, that I've simply not run into 
since whatever killed the filecaps here, because I don't use the 
lockscreen.

But if I start using the lockscreen again and it fails, I know one not-so-
intuitive thing to check, now. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: File permissions lost during send/receive?

2018-07-24 Thread Duncan
Andrei Borzenkov posted on Tue, 24 Jul 2018 20:53:15 +0300 as excerpted:

> 24.07.2018 15:16, Marc Joliet пишет:
>> Hi list,
>> 
>> (Preemptive note: this was with btrfs-progs 4.15.1, I have since
>> upgraded to 4.17.  My kernel version is 4.14.52-gentoo.)
>> 
>> I recently had to restore the root FS of my desktop from backup (extent
>> tree corruption; not sure how, possibly a loose SATA cable?). 
>> Everything was fine,
>> even if restoring was slower than expected.  However, I encountered two
>> files with permission problems, namely:
>> 
>> - /bin/ping, which caused running ping as a normal user to fail due to
>> missing permissions, and
>> 
>> - /sbin/unix_chkpwd (part of PAM), which prevented me from unlocking
>> the KDE Plasma lock screen; I needed to log into a TTY and run
>> "loginctl unlock- session".
>> 
>> Both were easily fixed by reinstalling the affected packages (iputils
>> and pam), but I wonder why this happened after restoring from backup.
>> 
>> I originally thought it was related to the SUID bit not being set,
>> because of the explanation in the ping(8) man page (section
>> "SECURITY"), but cannot find evidence of that -- that is, after
>> reinstallation, "ls -lh" does not show the sticky bit being set, or any
>> other special permission bits, for that matter:
>> 
>> % ls -lh /bin/ping /sbin/unix_chkpwd
>> -rwx--x--x 1 root root 60K 22. Jul 14:47 /bin/ping*
>> -rwx--x--x 1 root root 31K 23. Jul 00:21 /sbin/unix_chkpwd*
>> 
>> (Note: no ACLs are set, either.)
>> 
>> 
> What "getcap /bin/ping" says? You may need to install package providing
> getcap (libcap-progs here on openSUSE).

sys-libs/libcap on gentoo.  Here's what I get:

$ getcap /bin/ping
/bin/ping = cap_net_raw+ep

(getcap on unix_chkpwd returns nothing, but while I use kde/plasma I 
don't normally use the lockscreen at all, so for all I know that's broken 
here too.)

As hinted, it's almost certainly a problem with filecaps.  While I'll 
freely admit to not fully understanding how file-caps work, and my use-
case doesn't use send/receive, I do recall filecaps are what ping uses 
these days instead of SUID/SGID (on gentoo it'd be iputils' filecaps and 
possibly caps USE flags controlling this for ping), and also that btrfs 
send/receive did have a recent bugfix related to the extended-attributes 
normally used to record filecaps, so the symptoms match the bug and 
that's probably what you were seeing.
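
(If it happens again, rather than remerging, you could presumably just 
put the cap back by hand with libcap's setcap, using the value getcap 
reports on a known-good copy:

$ setcap cap_net_raw+ep /bin/ping

... tho remerging remains the cleaner fix, since the package manager then 
knows about it.)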

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs filesystem corruptions with 4.18. git kernels

2018-07-20 Thread Duncan
Alexander Wetzel posted on Fri, 20 Jul 2018 23:28:42 +0200 as excerpted:

> A btrfs subvolume is used as the rootfs on a "Samsung SSD 850 EVO mSATA
> 1TB" and I'm running Gentoo ~amd64 on a Thinkpad W530. Discard is
> enabled as mount option and there were roughly 5 other subvolumes.

Regardless of what your trigger problem is, running with the discard 
mount option considerably increases your risks in at least two ways, and 
may cost performance as well:

1) Btrfs normally has a feature that tracks old root blocks, which are 
COWed out at each commit.  Should something be wrong with the current 
one, btrfs can fall back to an older one using the usebackuproot 
(formerly recovery, but that clashed with the (no)recovery standard 
option a used on other OSs so they renamed it usebackuproot) mount 
option.  This won't always work, but when it does it's one of the first-
line recovery/repair options, as it tends to mean losing only 30-90 
seconds (first thru third old roots) worth of writes, while being quite 
likely to get you the working filesystem as it was at that commit.

But once the root goes unused, with discard, it gets marked for discard, 
and depending on the hardware/firmware implementation, it may be 
discarded immediately.  If it is, that means no backup roots available 
for recovery should the current root be bad for whatever reason, which 
pretty well takes out your first and best three chances of a quick fix 
without much risk.

2) In the past there have been bugs that triggered on discard.  AFAIK 
there are no such known bugs at this time, but beyond the risk of point 
one, there is the additional risk of bugs that trigger on discard itself, 
and due to the nature of the discard feature, these sorts of bugs have a 
much higher chance than normal of being data-eating bugs.

3) Depending on the device, the discard mount option may or may not have 
negative performance implications as well.

So while the discard mount option is there, it's definitely not 
recommended, unless you really are willing to deal with that extra risk 
and the loss of the backuproot safety-nets, and of course have 
additionally researched its effects on your hardware to make sure it's 
not actually slowing you down (which granted, on good mSATA, it may not 
be, as those are new enough to have a higher likelihood of actually 
having working queued-trim support).

The discard mount option alternative is a scheduled timer/cron job (like 
the one systemd has, just activate it) that does a periodic (weekly for 
systemd's timer) fstrim.  That lowers the risk to the few commits 
immediately after the fstrim job runs -- as long as you don't crash 
during that time, you'll have backup roots available as the current root 
will have moved on since then, creating backups again as it did so.

Or just leave a bit of extra room on the ssd untouched (ideally trimmed 
before partitioning and then left unpartitioned, so the firmware knows 
it's clean and can use it at its convenience), so the ssd can use that 
extra room for its wear-leveling, and don't do trim/discard at all.

FWIW I actually do both of these here, leaving significant space on the 
device unpartitioned, and enabling that systemd fstrim timer job, as well.
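
For anyone wanting to copy that setup, a rough sketch (device name is a 
placeholder, and blkdiscard wipes the device, so only run it on a blank 
one before partitioning):

$ blkdiscard /dev/sdc            # trim the entire, still-empty, device
$ systemctl enable --now fstrim.timer
$ systemctl list-timers fstrim.timer

Then partition, leaving some percentage at the end unpartitioned for the 
firmware to use.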

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:

>> As implemented in BTRFS, raid1 doesn't have striping.
> 
> The argument is that because there's only two copies, on multi-device
> btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> alternate device pairs, it's effectively striped at the macro level,
> with the 1 GiB device-level chunks effectively being huge individual
> device strips of 1 GiB.
> 
> At 1 GiB strip size it doesn't have the typical performance advantage of
> striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> strips/chunks.

I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually 
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing 
phrases.  I've seen the argument made on-list, and while I understand it 
and agree with it to some extent, I'm still a bit uncomfortable with it 
and don't normally make it myself, because I too agree it's stretching 
things a bit; this thread is a noted exception, tho originally I simply 
repeated what someone else had already said in-thread.  But it does 
appear to be a useful conceptual equivalency for some, and I do see the 
similarity.

Perhaps it's a case of coder's view (no code doing it that way, it's just 
a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
(code or not, accidental or not, it's a reasonably accurate high-level 
description of how it ends up working most of the time with equivalent 
sized devices).)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Duncan
Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:

> On 07/17/2018 11:12 PM, Duncan wrote:
>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>> excerpted:
>> 
>>> On 07/15/2018 04:37 PM, waxhead wrote:
>> 
>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>> parity are mutually exclusive.
>> 
>> I can't agree.  I don't know whether you meant that in the global
>> sense,
>> or purely in the btrfs context (which I suspect), but either way I
>> can't agree.
>> 
>> In the pure btrfs context, while striping and mirroring/pairing are
>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>> flexible enough to allow both together and the feature may at some
>> point be added, so it makes sense to have a layout notation format
>> flexible enough to allow it as well.
> 
> When I say orthogonal, It means that these can be combined: i.e. you can
> have - striping (RAID0)
> - parity  (?)
> - striping + parity  (e.g. RAID5/6)
> - mirroring  (RAID1)
> - mirroring + striping  (RAID10)
> 
> However you can't have mirroring+parity; this means that a notation
> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
> too verbose.

Yes, you can have mirroring+parity: conceptually it's simply raid5/6 on 
top of mirroring, or mirroring on top of raid5/6, much as raid10 is 
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
on top of raid0.

While it's not possible today on (pure) btrfs (it's possible today with 
md/dm-raid or hardware-raid handling one layer), it's theoretically 
possible both for btrfs and in general, and it could be added to btrfs in 
the future, so a notation with the flexibility to allow parity and 
mirroring together does make sense, and having just that sort of 
flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.  I 
can see a case being made for it if one layer is hardware/firmware raid, 
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61 
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't 
arguably be at least as good a match.  Perhaps one of the other experts 
in such things here might help with that.

>>> Question #2: historically RAID10 is requires 4 disks. However I am
>>> guessing if the stripe could be done on a different number of disks:
>>> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
>>> that every 64k, the data are stored on a different disk
>> 
>> As someone else pointed out, md/lvm-raid10 already work like this. 
>> What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
>> much works this way except with huge (gig size) chunks.
> 
> As implemented in BTRFS, raid1 doesn't have striping.

The argument is that because there's only two copies, on multi-device 
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to 
alternate device pairs, it's effectively striped at the macro level, with 
the 1 GiB device-level chunks effectively being huge individual device 
strips of 1 GiB.

At 1 GiB strip size it doesn't have the typical performance advantage of 
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB 
strips/chunks.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-17 Thread Duncan
Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:

> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

I can't agree.  I don't know whether you meant that in the global sense, 
or purely in the btrfs context (which I suspect), but either way I can't 
agree.

In the pure btrfs context, while striping and mirroring/pairing are 
orthogonal today, Hugo's whole point was that btrfs is theoretically 
flexible enough to allow both together and the feature may at some point 
be added, so it makes sense to have a layout notation format flexible 
enough to allow it as well.

In the global context, just to complete things and mostly for others 
reading (I feel a bit like a simpleton explaining to the expert here): 
just as raid10 is shorthand for raid1+0, aka raid0 layered on top of 
raid1 (normally preferred to raid01, aka raid0+1, aka raid1 on top of 
raid0, due to rebuild characteristics, tho raid01 in the form of btrfs 
raid1 on top of whatever raid0 is sometimes recommended here due to 
btrfs' data-integrity characteristics and less-optimized performance), 
there's also raid51 and raid15, raid61 and raid16, etc, with or without 
the + symbols.  These involve mirroring and parity conceptually at two 
different levels, altho they can be combined in a single implementation 
just as raid10 and raid01 commonly are, and can be used for higher 
reliability, with differing rebuild and performance characteristics 
between the two forms depending on which is the top layer.

> Question #1: for "parity" profiles, does make sense to limit the maximum
> disks number where the data may be spread ? If the answer is not, we
> could omit the last S. IMHO it should.

As someone else already replied, btrfs doesn't currently have the ability 
to specify spread limit, but the idea if we're going to change the 
notation is to allow for the flexibility in the new notation so the 
feature can be added later without further notation changes.

Why might it make sense to specify spread?  At least two possible reasons:

a) (stealing an already posted example) Consider a multi-device layout 
with two or more device sizes.  Someone may want to limit the spread in 
order to keep performance and risk consistent as the smaller devices fill 
up and further allocation is limited to a lower number of devices.  If 
that lower number is specified as the spread from the start, behavior 
stays consistent between the case where all devices still have room and 
the case where only some do.

b) Limiting spread can change the risk and rebuild performance profiles.  
Stripes of full width mean every stripe has a strip on each device, so 
knock a device out and (assuming parity or mirroring) replace it, and all 
stripes are degraded and must be rebuilt.  With less than maximum spread, 
some stripes won't be striped to the replaced device, and won't be 
degraded or need rebuilding, tho assuming the same overall fill, a larger 
percentage of the stripes that /do/ need rebuilding will be on the 
replaced device.  So the trade-off is more "objects" (stripes/chunks/
files) affected but less of each one (full spread), versus less of the 
total affected but more of each affected object (limited spread).

> Question #2: historically RAID10 is requires 4 disks. However I am
> guessing if the stripe could be done on a different number of disks:
> What about RAID1+Striping on 3 (or 5 disks) ? The key of striping is
> that every 64k, the data are stored on a different disk

As someone else pointed out, md/lvm-raid10 already work like this.  What 
btrfs calls raid10 is somewhat different, but btrfs raid1 pretty much 
works this way except with huge (gig size) chunks.
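
For illustration, md can do the odd-device-count raid10 directly; 
something like this (placeholder device names) creates a 3-device raid10 
with the default near-2 layout:

$ mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=3 \
    /dev/sda1 /dev/sdb1 /dev/sdc1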

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-08 Thread Duncan
Andrei Borzenkov posted on Fri, 06 Jul 2018 07:28:48 +0300 as excerpted:

> 03.07.2018 10:15, Duncan wrote:
>> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as
>> excerpted:
>> 
>>> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>>> bit dangerous to do it while writes are happening).
>>>
>>> Could you please elaborate? Do you mean btrfs can trim data before new
>>> writes are actually committed to disk?
>> 
>> No.
>> 
>> But normally old roots aren't rewritten for some time simply due to
>> odds (fuller filesystems will of course recycle them sooner), and the
>> btrfs mount option usebackuproot (formerly recovery, until the
>> norecovery mount option that parallels that of other filesystems was
>> added and this option was renamed to avoid confusion) can be used to
>> try an older root if the current root is too damaged to successfully
>> mount.

>> But other than simply by odds not using them again immediately, btrfs
>> has
>> no special protection for those old roots, and trim/discard will
>> recover them to hardware-unused as it does any other unused space, tho
>> whether it simply marks them for later processing or actually processes
>> them immediately is up to the individual implementation -- some do it
>> immediately, killing all chances at using the backup root because it's
>> already zeroed out, some don't.
>> 
>> 
> How is it relevant to "while writes are happening"? Will trimming old
> tress immediately after writes have stopped be any different? Why?

Define "while writes are happening" vs. "immediately after writes have 
stopped".  How soon is "immediately", and does the writes stopped 
condition account for data that has reached the device-hardware write 
buffer (so is no longer being transmitted to the device across the bus) 
but not been actually written to media, or not?

On a reasonably quiescent system, multiple empty write cycles are likely 
to have occurred since the last write barrier, and anything in-process is 
likely to have made it to media even if software is missing a write 
barrier it needs (software bug) or the hardware lies about honoring the 
write barrier (hardware bug, allegedly sometimes deliberate, in order to 
improve normal-operation performance metrics, on hardware willing to 
gamble with your data that a crash won't happen in a critical moment, a 
somewhat rare occurrence).

On an IO-maxed system, data and write-barriers are coming down as fast as 
the system can handle them, and write-barriers become critical.  Crash 
after something was supposed to reach media but didn't -- either because 
of a missing write barrier, or because the hardware/firmware lied about 
the barrier and claimed data was on-media when it wasn't -- and the btrfs 
atomic-cow guarantee of a consistent state at each commit goes out the 
window.

At this point it becomes useful to have a number of previous "guaranteed 
consistent state" roots to fall back on, with the /hope/ being that at 
least /one/ of them is usably consistent.  If all but the last one are 
wiped due to trim...

When the system isn't write-maxed the write will have almost certainly 
made it regardless of whether the barrier is there or not, because 
there's enough idle time to finish the current write before another one 
comes down the pipe, so the last-written root is almost certain to be 
fine regardless of barriers, and the history of past roots doesn't matter 
even if there's a crash.

If "immediately after writes have stopped" is strictly defined as a 
condition when all writes including the btrfs commit updating the current 
root and the superblock pointers to the current root have completed, with 
no new writes coming down the pipe in the mean time that might have 
delayed a critical update if a barrier was missed, then trimming old 
roots in this state should be entirely safe, and the distinction between 
that state and the "while writes are happening" is clear.

But if "immediately after writes have stopped" is less strictly defined, 
then the distinction between that state and "while writes are happening" 
remains blurry at best, and having old roots around to fall back on in 
case a write-barrier was missed (for whatever reason, hardware or 
software) becomes a very good thing.

Of course the fact that trim/discard itself is an instruction written to 
the device in the combined command/data stream complicates the picture 
substantially.  If those write barriers get missed, who knows what state 
the new root is in, and if the old ones got erased... 

Re: unsolvable technical issues?

2018-07-03 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 02 Jul 2018 07:49:05 -0400 as
excerpted:

> Notably, most Intel systems I've seen have the SATA controllers in the
> chipset enumerate after the USB controllers, and the whole chipset
> enumerates after add-in cards (so they almost always have this issue),
> while most AMD systems I've seen demonstrate the exact opposite
> behavior,
> they enumerate the SATA controller from the chipset before the USB
> controllers, and then enumerate the chipset before all the add-in cards
> (so they almost never have this issue).

Thanks.  That's a difference I wasn't aware of, and would (because I tend 
to favor amd) explain why I've never seen a change in enumeration order 
unless I've done something like unplug my sata cables for maintenance and 
forget which ones I had plugged in where -- random USB stuff left plugged 
in doesn't seem to matter, even choosing different boot media from the 
bios doesn't seem to matter by the time the kernel runs (I'm less sure 
about grub).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-03 Thread Duncan
Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:

> 02.07.2018 21:35, Austin S. Hemmelgarn wrote:
>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>> bit dangerous to do it while writes are happening).
> 
> Could you please elaborate? Do you mean btrfs can trim data before new
> writes are actually committed to disk?

No.

But normally old roots aren't rewritten for some time simply due to odds 
(fuller filesystems will of course recycle them sooner), and the btrfs 
mount option usebackuproot (formerly recovery, until the norecovery mount 
option that parallels that of other filesystems was added and this option 
was renamed to avoid confusion) can be used to try an older root if the 
current root is too damaged to successfully mount.

But other than simply by odds not using them again immediately, btrfs has 
no special protection for those old roots, and trim/discard will recover 
them to hardware-unused as it does any other unused space, tho whether it 
simply marks them for later processing or actually processes them 
immediately is up to the individual implementation -- some do it 
immediately, killing all chances at using the backup root because it's 
already zeroed out, some don't.

In the context of the discard mount option, that can mean there's never 
any old roots available ever, as they've already been cleaned up by the 
hardware due to the discard option telling the hardware to do it.

But even not using that mount option, and simply doing the trims 
periodically, as done weekly by for instance the systemd fstrim timer and 
service units, or done manually if you prefer, obviously potentially 
wipes the old roots at that point.  If the system's effectively idle at 
the time, not much risk as the current commit is likely to represent a 
filesystem in full stasis, but if there's lots of writes going on at that 
moment *AND* the system happens to crash at just the wrong time, before 
additional commits have recreated at least a bit of root history, again, 
you'll potentially be left without any old roots for the usebackuproot 
mount option to try to fall back to, should it actually be necessary.
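
For the curious, whether any backup roots currently survive on-media can 
be checked with something like (device name is an example):

$ btrfs inspect-internal dump-super -f /dev/sdb1 | grep backup_tree_root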

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs send/receive vs rsync

2018-06-30 Thread Duncan
Marc MERLIN posted on Fri, 29 Jun 2018 09:24:20 -0700 as excerpted:

>> If instead of using a single BTRFS filesystem you used LVM volumes
>> (maybe with Thin provisioning and monitoring of the volume group free
>> space) for each of your servers to backup with one BTRFS filesystem per
>> volume you would have less snapshots per filesystem and isolate
>> problems in case of corruption. If you eventually decide to start from
>> scratch again this might help a lot in your case.
> 
> So, I already have problems due to too many block layers:
> - raid 5 + ssd - bcache - dmcrypt - btrfs
> 
> I get occasional deadlocks due to upper layers sending more data to the
> lower layer (bcache) than it can process. I'm a bit warry of adding yet
> another layer (LVM), but you're otherwise correct than keeping smaller
> btrfs filesystems would help with performance and containing possible
> damage.
> 
> Has anyone actually done this? :)

So I definitely use (and advocate!) the split-em-up strategy, and I use 
btrfs, but that's pretty much all the similarity we have.

I'm all ssd, having left spinning rust behind.  My strategy avoids 
unnecessary layers like lvm (tho crypt can arguably be necessary), 
preferring direct on-device (gpt) partitioning for simplicity of 
management and disaster recovery.  And my backup and recovery strategy is 
an equally simple mkfs and full-filesystem-fileset copy to an identically 
sized filesystem, with backups easily bootable/mountable in place of the 
working copy if necessary, and multiple backups so if disaster takes out 
the backup I was writing at the same time as the working copy, I still 
have a backup to fall back to.

So it's different enough I'm not sure how much my experience will help 
you.  But I /can/ say the subdivision is nice, as it means I can keep my 
root filesystem read-only by default for reliability, my most-at-risk log 
filesystem tiny for near-instant scrub/balance/check, and my also-at-risk 
home small as well, with the big media files being on a different 
filesystem that's mostly read-only, so less at risk and needing less 
frequent backups.  The tiny boot and large updates (distro repo, sources, 
ccache) are also separate, and mounted only for boot maintenance or 
updates.
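
To sketch the general idea in fstab terms (labels, mountpoints and 
options here are placeholders, not necessarily my actual layout):

LABEL=root   /         btrfs  ro,noatime          0 0
LABEL=log    /var/log  btrfs  rw,noatime          0 0
LABEL=home   /home     btrfs  rw,noatime          0 0
LABEL=media  /mm       btrfs  ro,noatime,noauto   0 0
LABEL=pkg    /pkg      btrfs  rw,noatime,noauto   0 0
LABEL=boot   /boot     btrfs  ro,noatime,noauto   0 0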

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Duncan
cause while a "regular user" may not know it because it's not his /job/ 
to know it, if there's anything an admin knows *well* it's that the 
working copy of data **WILL** be damaged.  It's not a matter of if, but 
of when, and of whether it'll be a fat-finger mistake, or a hardware or 
software failure, or wetware (theft, ransomware, etc), or wetware (flood, 
fire and the water that put it out damage, etc), tho none of that 
actually matters after all, because in the end, the only thing that 
matters was how the value of that data was defined by the number of 
backups made of it, and how quickly and conveniently at least one of 
those backups can be retrieved and restored.


Meanwhile, an admin worth the label will also know the relative risk 
associated with the various options they might use, including nocow.  
Knowing that nocow downgrades the stability rating of the storage to 
approximately the same degree raid0 does, they'll already be aware that 
in such a case the working copy can only be considered "throw-away" level 
should problems occur.  They will thus not consider their working copy to 
be a permanent copy at all, just a temporary garbage copy, only slightly 
more reliable than one stored on tmpfs, and will instead treat the first 
backup of it as the true working copy, with an additional level of backup 
beyond what they'd normally keep to account for that fact.

So in case of problems people can simply restore nocow files from a near-
line stable working copy, much as they'd do after reboot or a umount/
remount cycle for a file stored in tmpfs.  And if they didn't have even a 
stable working copy let alone a backup... well, much like that file in 
tmpfs, what did they expect?  They *really* defined that data as of no 
more than trivial value, didn't they?


All that said, making the NOCOW warning labels a bit more bold-print 
couldn't hurt.  And having scrub in the nocow case at least compare 
copies and report differences would simply make it easier for people to 
know they need to reach for that near-line stable working copy, or mkfs 
and start from scratch if they defined the data's value as not worth the 
trouble of (in this case) even a stable working copy, let alone a backup, 
so that'd be a good thing too. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unsolvable technical issues?

2018-06-29 Thread Duncan
Hugo Mills posted on Mon, 25 Jun 2018 16:54:36 + as excerpted:

> On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote:
> [snip]
>> I hope I am not asking for too much (but I know I probably am), but I
>> suggest that having a small snippet of information on the status page
>> showing a little bit about what is either currently the development
>> focus , or what people are known for working at would be very valuable
>> for users and it may of course work both ways, such as exciting people
>> or calming them down. ;)
>> 
>> For example something simple like a "development focus" list...
>> 2018-Q4: (planned) Renaming the grotesque "RAID" terminology
>> 2018-Q3: (planned) Magical feature X
>> 2018-Q2: N-Way mirroring
>> 2018-Q1: Feature work "RAID"5/6
>> 
>> I think it would be good for people living their lives outside as it
>> would perhaps spark some attention from developers and perhaps even
>> media as well.
> 
> I started doing this a couple of years ago, but it turned out to be
> impossible to keep even vaguely accurate or up to date, without going
> round and bugging the developers individually on a per-release basis. I
> don't think it's going to happen.

In addition, anything like quarter, kernel cycle, etc, has been 
repeatedly demonstrated to be entirely broken beyond "current", because 
roadmapped tasks have rather consistently taken longer, sometimes /many/ 
/times/ longer (by a factor of 20+ in the case of raid56), than first 
predicted.

But in theory it might be doable, with just a roughly ordered list, no 
dates beyond "current focus", and with suitably big disclaimers about 
other things (generally bugs in otherwise more stable features, but 
occasionally a quick sub-feature that's seen as easier to introduce at 
the current state than it would be later, etc) possibly getting priority 
and temporarily displacing roadmapped items.

In fact, this last one is the big reason why raid56 has taken so long to 
even somewhat stabilize -- the devs kept finding bugs in already semi-
stable features that took priority... for kernel cycle after kernel 
cycle.  The quotas/qgroups feature, already introduced and intended to be 
at least semi-stable was one such culprit, requiring repeated rewrite and 
kernel cycles worth of bug squashing.  A few critical under the right 
circumstances compression bugs, where compression was supposed to be an 
already reasonably stable feature, were another, tho these took far less 
developer bandwidth than quotas.  Getting a reasonably usable fsck was a 
bunch of little patches.  AFAIK that one wasn't actually an original 
focus and was intended to be back-burnered for some time, but once btrfs 
hit mainline, users started demanding it, so the priority was bumped.  
And of course having it has been good for finding and ultimately fixing 
other bugs as well, so it wasn't a bad thing, but the hard fact is the 
repairing fsck has taken, all told, I'd guess about the same number of 
developer cycles as quotas, and those developer cycles had to come from 
stuff that had been roadmapped for earlier.

As a bit of an optimist, I'd be inclined to argue that OK, we've gotten 
btrfs into far better shape stability-wise now, and going forward the 
focus can return to the stuff that was roadmapped for earlier but 
displaced, so one might hope things will move faster again.  But really, 
who knows?  That's arguably what the devs thought when they mainlined 
btrfs, too, and yet it took all this much longer to mature and stabilize 
since then.  Still, it /has/ to happen at /some/ point, right?  And I 
know for a fact that btrfs is far more stable now than it 
was... because things like ungraceful shutdowns that used to at minimum 
trigger (raid1 mode) scrub fixes on remount and scrub, now... don't -- 
btrfs is now stable enough that the atomic COW is doing its job and 
things "just work", where before, they required scrub repair at best, and 
occasional blow away and restore from backups.  So I can at least /hope/ 
that the worst of the plague of bugs is behind us, and people can work on 
what they intended to do most (say 80%) of the time now, spending say a 
day's worth a week (20%) on bugs, instead of the reverse, 80% (4 days a 
week) on bugs and if they're lucky, a day a week on what they were 
supposed to be focused on, which is what we were seeing for awhile.

Plus the tools to do the debugging, etc, are far more mature now, another 
reason bugs should hopefully take less time now.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: unsolvable technical issues?

2018-06-29 Thread Duncan
Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:

> On 2018-06-24 16:22, Goffredo Baroncelli wrote:
>> On 06/23/2018 07:11 AM, Duncan wrote:
>>> waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:
>>>
>>>> According to this:
>>>>
>>>> https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
>>>> section 1.2
>>>>
>>>> It claims that BTRFS still have significant technical issues that may
>>>> never be resolved.
>>>
>>> I can speculate a bit.
>>>
>>> 1) When I see btrfs "technical issue that may never be resolved", the
>>> #1 first thing I think of, that AFAIK there are _definitely_ no plans
>>> to resolve, because it's very deeply woven into the btrfs core by now,
>>> is...
>>>
>>> [1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
>>> Universally Unique quite literally, assuming they really *are*
>>> unique, at least on that system[.]  Because
>>> btrfs uses this supposedly unique ID to ID devices that belong to the
>>> filesystem, it can get *very* mixed up, with results possibly
>>> including dataloss, if it sees devices that don't actually belong to a
>>> filesystem with the same UUID as a mounted filesystem.
>> 
>> As partial workaround you can disable udev btrfs rules and then do a
>> "btrfs dev scan" manually only for the device which you need.

> You don't even need `btrfs dev scan` if you just specify the exact set
> of devices in the mount options.  The `device=` mount option tells the
> kernel to check that device during the mount process.

Not that lvm does any better in this regard[1], but has btrfs ever solved 
the bug where only one device= in the kernel commandline's rootflags= 
would take effect, effectively forcing initr* on people (like me) who 
would otherwise not need them and prefer to do without them, if they're 
using a multi-device btrfs as root?

Not to mention the fact that as kernel people will tell you, device 
enumeration isn't guaranteed to be in the same order every boot, so 
device=/dev/* can't be relied upon and shouldn't be used -- but of course 
device=LABEL= and device=UUID= and similar won't work without userspace, 
basically udev (if they work at all, IDK if they actually do).

Tho in practice from what I've seen, device enumeration order tends to be 
dependable /enough/ for at least those without enterprise-level numbers 
of devices to enumerate.  True, it /does/ change from time to time with a 
new kernel, but anybody sane keeps a tested-dependable old kernel around 
to boot to until they know the new one works as expected, and that sort 
of change is seldom enough that users can boot to the old kernel and 
adjust their settings for the new one as necessary when it does happen.  
So as "don't do it that way because it's not reliable" as it might indeed 
be in theory, in practice, just using an ordered /dev/* in kernel 
commandlines does tend to "just work"... provided one is ready for the 
occasion when that device parameter might need a bit of adjustment, of 
course.
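
For reference, the syntax in question looks something like this (device 
names are examples), first as a normal mount and then as the rootflags 
form that reportedly only honored a single device=:

$ mount -o device=/dev/sda2,device=/dev/sdb2 /dev/sda2 /mnt

root=/dev/sda2 rootflags=device=/dev/sda2,device=/dev/sdb2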

> Also, while LVM does have 'issues' with cloned PV's, it fails safe (by
> refusing to work on VG's that have duplicate PV's), while BTRFS fails
> very unsafely (by randomly corrupting data).

And IMO that "failing unsafe" is both serious and common enough that it 
easily justifies adding the point to a list of this sort, thus my putting 
it #1.

>>> 2) Subvolume and (more technically) reflink-aware defrag.
>>>
>>> It was there for a couple kernel versions some time ago, but
>>> "impossibly" slow, so it was disabled until such time as btrfs could
>>> be made to scale rather better in this regard.

> I still contend that the biggest issue WRT reflink-aware defrag was that
> it was not optional.  The only way to get the old defrag behavior was to
> boot a kernel that didn't have reflink-aware defrag support.  IOW,
> _everyone_ had to deal with the performance issues, not just the people
> who wanted to use reflink-aware defrag.

Absolutely.

Which of course suggests making it optional, with a suitable warning as 
to the speed implications with lots of snapshots/reflinks, when it does 
get enabled again (and as David mentions elsewhere, there's apparently 
some work going into the idea once again, which potentially moves it from 
the 3-5 year range, at best, back to a 1/2-2-year range, time will tell).

>>> 3) N-way-mirroring.
>>>
>> [...]
>> This is not an issue, but a not implemented feature
> If you're looking 

Re: unsolvable technical issues?

2018-06-22 Thread Duncan
, since it'll 
use some of that code", since at least 3.5, when raid56 was supposed to 
be introduced in 3.6.  I know because this is the one I've been most 
looking forward to personally, tho my original reason, aging but still 
usable devices that I wanted extra redundancy for, has long since itself 
been aged out of rotation.

Of course we know the raid56 story and thus the implied delay here, if 
it's even still roadmapped at all now, and as with reflink-aware-defrag, 
there's no hint yet as to when we'll actually see this at all, let alone 
see it in a reasonably stable form, so at least in the practical sense, 
it's arguably "might never be resolved."

4) (Until relatively recently, and still in terms of scaling) Quotas.

Until relatively recently, quotas could arguably be added to the list.  
They were rewritten multiple times, and until recently, appeared to be 
effectively eternally broken.

While that has happily changed recently and (based on the list, I don't 
use 'em personally) quotas actually seem at least somewhat usable these 
days (altho less critical bugs are still being fixed), AFAIK quota 
scalability while doing btrfs maintenance remains a serious enough issue 
that the recommendation is to turn them off before doing balances, and 
the same would almost certainly apply to reflink-aware defrag (turn 
quotas off before defragging) were it available, as well.  That 
scalability issue alone could arguably be a "technical issue that may 
never be resolved", and while quotas themselves appear to be reasonably 
functional now, it could arguably justify them still being on the list.
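
In other words, something along these lines around maintenance runs 
(mountpoint is a placeholder; re-enabling kicks off a qgroup rescan):

$ btrfs quota disable /mnt
$ btrfs balance start -dusage=70 /mnt
$ btrfs quota enable /mnt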


And of course that's avoiding the two you mentioned, tho arguably they 
could go on the "may in practice never be resolved, at least not in the 
non-bluesky lifetime" list as well.


As for stratis, supposedly they're deliberately taking existing 
technology already proven in multi-layer form and simply exposing it in 
unified form.  They claim this dramatically lessens the required new code 
and shortens time-to-stability to something reasonable, in contrast to 
the roughly a decade btrfs has taken already without yet reaching a full 
feature set and full stability.  IMO they may well have a point, tho 
AFAIK they're still new and immature themselves and (I believe) haven't 
reached full stability either, so it's a point that has yet to be fully 
demonstrated.

We'll see how they evolve.  I do actually expect them to move faster than 
btrfs, but I also expect the interface may not be as smooth and unified 
as they'd like to present it, as I expect some hiccups to remain in 
smoothing over the layering issues.  Also, because they've deliberately 
chosen to go with existing technology where possible in order to reach 
stability faster, by the same token they're deliberately limiting the 
evolution to increments over existing technology, and I expect there's 
some stuff btrfs will do better as a result... at least until btrfs (or a 
successor) becomes stable enough for them to integrate (parts of?) it as 
existing demonstrated-stable technology.

The other difference, AFAIK, is that stratis is specifically a 
corporation making it a/the main money product, whereas btrfs was always 
something the btrfs devs used at their employers (oracle, facebook), who 
have other things as their main product.  As such, stratis is much more 
likely to prioritize things like raid status monitors, hot-spares, etc, 
that can be part of the product they sell, where they've been lower 
priority for btrfs.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID56

2018-06-20 Thread Duncan
Gandalf Corvotempesta posted on Wed, 20 Jun 2018 11:15:03 +0200 as
excerpted:

> Il giorno mer 20 giu 2018 alle ore 10:34 Duncan <1i5t5.dun...@cox.net>
> ha scritto:
>> Parity-raid is certainly nice, but mandatory, especially when there's
>> already other parity solutions (both hardware and software) available
>> that btrfs can be run on top of, should a parity-raid solution be
>> /that/ necessary?
> 
> You can't be serious. hw raid as much more flaws than any sw raid.

I didn't say /good/ solutions, I said /other/ solutions.
FWIW, I'd go for mdraid at the lower level, were I to choose, here.

But for a 4-12-ish device solution, I'd probably go btrfs raid1 on a pair 
of mdraid-0s.  That gets you btrfs raid1 data integrity and recovery from 
the other mirror, while also being faster than the still-unoptimized 
btrfs raid10.  Beyond about a dozen devices, six per "side" of the btrfs 
raid1, the risk of multi-device breakdown before recovery starts to get 
too high for comfort, but six 8 TB devices in raid0 gives you up to 48 TB 
to work with.  More than that arguably should be broken down into smaller 
blocks anyway, because otherwise you're simply dealing with so much data 
that it'll take unreasonably long to do much of anything non-incremental 
with it, from any sort of fsck or btrfs maintenance, to trying to copy or 
move the data anywhere (including for backup/restore purposes), to... 
whatever.
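
A minimal sketch of that layout, with placeholder device names and six 
devices total:

$ mdadm --create /dev/md0 --level=0 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1
$ mdadm --create /dev/md1 --level=0 --raid-devices=3 /dev/sdd1 /dev/sde1 /dev/sdf1
$ mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1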

Actually, I'd argue that point is reached well before 48 TB, but the 
point remains, at some point it's just too much data to do much of 
anything with, too much to risk losing all at once, too much to backup 
and restore all at once as it just takes too much time to do it, just too 
much...  And that point's well within ordinary raid sizes with a dozen 
devices or less, mirrored, these days.

Which is one of the reasons I'm so skeptical about parity-raid being 
mandatory "nowadays".  Maybe it was in the past, when disks were (say) 
half a TB or less and mirroring a few TB of data was resource-
prohibitive, but now?

Of course we've got a guy here who works with CERN and deals with their 
annual 50ish petabytes of data (49 in 2016, see wikipedia's CERN 
article), but that's simply problems on a different scale.

Even so, I'd say it needs broken up into manageable chunks, and 50 PB is 
"only" a bit over 1000 48 TB filesystems worth.  OK, say 2000, so you're 
not filling them all absolutely full.

Meanwhile, I'm actually an N-way-mirroring proponent, here, as opposed to 
a parity-raid proponent.  And at that sort of scale, you /really/ don't 
want to have to restore from backups, so 3-way or even 4-5 way mirroring 
makes a lot of sense.  Hmm... 2.5 dozen for 5-way-mirroring, 2000 times, 
2.5*12*2000=... 60K devices!  That's a lot of hard drives!  And a lot of 
power to spin them.  But I guess it's a rounding error compared to what 
CERN uses for the LHC.

FWIW, N-way-mirroring has been on the btrfs roadmap, since at least 
kernel 3.6, for "after raid56".  I've been waiting awhile too; no sign of 
it yet so I guess I'll be waiting awhile longer.  So as they say, 
"welcome to the club!"  I'm 51 now.  Maybe I'll see it before I die.  
Imagine, I'm in my 80s in the retirement home and get the news btrfs 
finally has N-way-mirroring in mainline.  I'll be jumping up and down and 
cause a ruckus when I break my hip!  Well, hoping it won't be /that/ 
long, but... =;^]

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs balance did not progress after 12H

2018-06-20 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 19 Jun 2018 12:58:44 -0400 as
excerpted:

> That said, I would question the value of repacking chunks that are
> already more than half full.  Anything above a 50% usage filter
> generally takes a long time, and has limited value in most cases (higher
> values are less likely to reduce the total number of allocated chunks).
> With `-duszge=50` or less, you're guaranteed to reduce the number of
> chunk if at least two match, and it isn't very time consuming for the
> allocator, all because you can pack at least two matching chunks into
> one 'new' chunk (new in quotes because it may re-pack them into existing
> slack space on the FS). Additionally, `-dusage=50` is usually sufficient
> to mitigate the typical ENOSPC issues that regular balancing is supposed
> to help with.

While I used to agree (50% for best efficiency, perhaps 66 or 70% if 
you're really pressed for space), now that the allocator can repack into 
existing chunks more efficiently than it used to (at least in ssd mode, 
which all my storage is now), I've seen higher values result in practical/
noticeable recovery of space to unallocated as well.

In fact, I routinely use usage=70 these days, and sometimes higher, to 99 
or even 100%[1].  But of course I'm on ssd so it's far faster, and I 
partition things up with the biggest partitions being under 100 GiB, so 
even full unfiltered balances are normally under 10 minutes and normal 
filtered balances under a minute, to the point that I usually issue the 
balance command and actually wait for completion.  That's a far different 
ball game than issuing a balance command on a multi-TB hard drive and 
expecting it to take hours or even days.  In that case, yeah, a 50% cap 
arguably makes sense, tho he was using 60, which still shouldn't (sans 
bugs like we seem to have here) be /too/ bad.

---
[1] usage=100: -musage=1..100 is the only way I've found to balance 
metadata without rebalancing system as well, with the unfortunate penalty 
for rebalancing system on small filesystems being an increase of the 
system chunk size from 8 MB original mkfs.btrfs size to 32 MB... only a 
few KiB used! =:^(
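
Concretely, the filters discussed above look like this (mountpoint is a 
placeholder; data and metadata can be filtered separately):

$ btrfs balance start -dusage=70 /mnt
$ btrfs balance start -musage=70 /mnt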

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID56

2018-06-20 Thread Duncan
on that 
I'm not sure has been settled yet.

> Based on official BTRFS status page, RAID56 is the only "unstable" item
> marked in red.
> No interested from Suse in fixing that?

As the above should make clear, it's _not_ a question as simple as 
"interest"!

> I think it's the real missing part for a feature-complete filesystem.
> Nowadays parity raid is mandatory, we can't only rely on mirroring.

"Nowdays"?  "Mandatory"?

Parity-raid is certainly nice, but mandatory, especially when there's 
already other parity solutions (both hardware and software) available 
that btrfs can be run on top of, should a parity-raid solution be /that/ 
necessary?  Of course btrfs isn't the only next-gen fs out there, either, 
there's other solutions such as zfs available too, if btrfs doesn't have 
the features required at the maturity required.

So I'd like to see the supporting argument to parity-raid being mandatory 
for btrfs, first, before I'll take it as a given.  Nice, sure.  
Mandatory?  Call me skeptical.

---
[1] "Still cautious" use:  In addition to the raid56-specific reliability 
issues described above, as well as to cover Waxhead's referral to my 
usual backups advice:

Sysadmin's[2] first rule of data value and backups:  The real value of 
your data is not defined by any arbitrary claims, but rather by how many 
backups you consider it worth having of that data.  No backups simply 
defines the data as of such trivial value that it's worth less than the 
time/trouble/resources necessary to do and have at least one level of 
backup.

With such a definition, data loss can never be a big deal, because even 
in the event of data loss, what was defined as of most importance, the 
time/trouble/resources necessary to have a backup (or at least one more 
level of backup, in the event there were backups but they failed too), 
was saved.  So regardless of whether the data was recoverable or not, you 
*ALWAYS* save what you defined as most important, either the data if you 
had a backup to retrieve it from, or the time/trouble/resources necessary 
to make that backup, if you didn't have it because saving that time/
trouble/resources was considered more important than making that backup.

Of course the sysadmin's second rule of backups is that it's not a 
backup, merely a potential backup, until you've tested that you can 
actually recover the data from it in similar conditions to those under 
which you'd need to recover it.  IOW, boot to the backup or to the 
recovery environment, and be sure the backup's actually readable and can 
be recovered from using only the resources available in the recovery 
environment, then reboot back to the normal or recovered environment and 
be sure that what you recovered from the recovery environment is actually 
bootable or readable in the normal environment.  Once that's done, THEN 
it can be considered a real backup.

"Still cautious use" is simply ensuring that you're following the above 
rules, as any good admin will be regardless, and that those backups are 
actually available and recoverable in a timely manner should that be 
necessary.  IOW, an only backup "to the cloud" that's going to take a 
week to download and recover to, isn't "still cautious use", if you can 
only afford a few hours down time.  Unfortunately, that's a real life 
scenario I've seen people say they're in here more than once.

[2] Sysadmin:  As used here, "sysadmin" simply refers to the person who 
has the choice of btrfs, as compared to say ext4, in the first place, 
that is, the literal admin of at least one system, regardless of whether 
that's administering just their own single personal system, or thousands 
of systems across dozens of locations in some large corporation or 
government institution.

[3] Raid56 mode reliability implications:  For raid56 data, this isn't 
/that/ big of a deal, tho depending on what's in the rest of the stripe, 
it could still affect files not otherwise written in some time.  For 
metadata, however, it's a huge deal, since an incorrectly reconstructed 
metadata stripe could take out much or all of the filesystem, depending 
on what metadata was actually in that stripe.  This is where waxhead's 
recommendation to use raid1/10 for metadata even if using raid56 for data 
comes in.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Bug 199931] New: systemd/rtorrent file data corruption when using echo 3 >/proc/sys/vm/drop_caches

2018-06-08 Thread Duncan
Marc Lehmann posted on Wed, 06 Jun 2018 21:06:35 +0200 as excerpted:

> Not sure what exactly you mean with btrfs mirroring (there are many
> btrfs features this could refer to), but the closest thing to that that
> I use is dup for metadata (which is always checksummed), data is always
> single. All btrfs filesystems are on lvm (not mirrored), and most (but
> not all) are encrypted. One affected fs is on a hardware raid
> controller, one is on an ssd. I have a single btrfs fs in that box with
> raid1 for metadata, as an experiment, but I haven't used it for testing
> yet.

On the off chance, tho it doesn't sound like it from your description...

You're not doing LVM snapshots of the volumes with btrfs on them, 
correct?  Btrfs depends on filesystem GUIDs being just that, globally 
unique, using them to find the possibly multiple devices of a multi-
device btrfs (normal single-device filesystems don't have the issue, as 
they don't have to deal with multi-device the way btrfs does).  As a 
result, btrfs can get very confused, with data-loss potential, if it sees 
multiple copies of a device with the same filesystem GUID.  That can 
happen if lvm snapshots (which obviously carry the same filesystem GUID 
as the original) are taken and both the snapshot and the source are 
exposed to btrfs device scan (which udev auto-triggers when the new 
device appears), with one of them mounted.

Presumably you'd consider lvm snapshotting a form of mirroring, and 
you've already said you're not doing that in any form, but I mention it 
just in case, because this is a rather obscure trap people using lvm can 
find themselves in without a clue as to the danger, and the resulting 
symptoms could be rather hard to troubleshoot if this possibility wasn't 
considered.
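
For anyone wanting to see the trap for themselves, a sketch with made-up 
VG/LV names; both devices will report the same filesystem UUID, which is 
exactly what confuses btrfs if both end up scanned:

$ lvcreate -s -n btrfs-snap -L 10G vg0/btrfs-lv
$ blkid /dev/vg0/btrfs-lv /dev/vg0/btrfs-snap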

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID-1 refuses to balance large drive

2018-05-28 Thread Duncan
Brad Templeton posted on Sun, 27 May 2018 11:22:07 -0700 as excerpted:

> BTW, I decided to follow the original double replace strategy suggested 
--
> replace 6TB with 8TB and replace 4TB with 6TB.  That should be sure to
> leave the 2 large drives each with 2TB free once expanded, and thus able
> to fully use all space.
> 
> However, the first one has been going for 9 hours and is "189.7% done" 
> and still going.   Some sort of bug in calculating the completion
> status, obviously.  With luck 200% will be enough?

IIRC there was an over-100% completion-status bug fixed, I'd guess about 
18 months to two years ago now, long enough that it would have slipped 
regulars' minds, so nobody would have thought about it even knowing 
you're still on 4.4 -- that being one of the reasons we don't do as well 
supporting stuff that old.

If it is indeed the same bug, anything even half modern should have it 
fixed.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RAID-1 refuses to balance large drive

2018-05-26 Thread Duncan
try to support, it's the last two kernel
release series in each of the current and LTS tracks.  So as the first
release back from current 4.16, 4.15, tho EOLed upstream, is still
reasonably supported for the moment here, tho people should be
upgrading to 4.16 by now as 4.17 should be out in a couple weeks or
so and 4.15 would be out of the two-current-kernel-series window at that
time.

Meanwhile, the two latest LTS series are as already stated 4.14, and the
earlier 4.9.  4.4 is the one previous to that and it's still mainline
supported in general, but it's out of the two LTS-series window of best
support here, and truth be told, based on history, even supporting the
second newest LTS series starts to get more difficult at about a year and
a half out, 6 months or so before the next LTS comes out.  As it happens
that's about where 4.9 is now, and 4.14 has had about 6 months to
stabilize now, so for LTS I'd definitely recommend 4.14, now.

Of course that doesn't mean that we /refuse/ to support 4.4, we still
try, but it's out of primary focus now and in many cases, should you
have problems, the first recommendation is going to be try something
newer and see if the problem goes away or presents differently.  Or
as mentioned, check with your distro if it's a distro kernel, since
in that case they're best positioned to support it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root raveled during balance

2018-05-23 Thread Duncan
ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:

>> IMHO the best course of action would be to disable checksumming for you
>> vm files.
>> 
>> 
> Do you mean '-o nodatasum' mount flag? Is it possible to disable
> checksumming for singe file by setting some magical chattr? Google
> thinks it's not possible to disable csums for a single file.

You can use nocow (-C), but of course that has its own restrictions as 
well as the nocow effects: it must be set on files while they're still 
zero-length, which for existing data is easiest done by setting it on the 
containing dir and copying the files (no reflink) in.  Also, nocow 
becomes cow1 after a snapshot: the snapshot locks the existing copy in 
place, so a changed block /must/ be written elsewhere the first time it's 
written after the snapshot, while repeated writes between snapshots 
remain nocow.

But if you're disabling checksumming anyway, nocow's likely the way to go.
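
A sketch of the dir-then-copy-in approach (paths are placeholders; new 
files created in a +C dir inherit the attribute):

$ mkdir /var/lib/vm-images
$ chattr +C /var/lib/vm-images
$ cp --reflink=never guest.img /var/lib/vm-images/
$ lsattr /var/lib/vm-images/guest.img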

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: property: Set incompat flag of lzo/zstd compression

2018-05-15 Thread Duncan
Su Yue posted on Tue, 15 May 2018 16:05:01 +0800 as excerpted:


> 
> On 05/15/2018 03:51 PM, Misono Tomohiro wrote:
>> Incompat flag of lzo/zstd compression should be set at:
>>  1. mount time (-o compress/compress-force)
>>  2. when defrag is done 3. when property is set
>> 
>> Currently 3. is missing and this commit adds this.
>> 
>> 
> If I don't misunderstand, compression property of an inode is only apply
> for *the* inode, not the whole filesystem.
> So the original logical should be okay.

But the inode is on the filesystem, and if it's compressed with lzo/zstd, 
the incompat flag should be set to avoid mounting with an earlier kernel 
that doesn't understand that compression and would therefore, if we're 
lucky, simply fail to read the data compressed in that file/inode.  (If 
we're unlucky it could blow up with kernel memory corruption, as in James 
Harvey's current case of unexpected, corrupted compressed data in a nocow 
file; being nocow, it has no csum validation to fail and abort the 
decompression, and shouldn't be compressed at all.)

So better to set the incompat flag and refuse to mount at all on kernels 
that don't have the required compression support.
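
For reference, the three paths in question, sketched as commands (the 
paths and the zstd choice are just examples, and assume a kernel/progs 
new enough for zstd):

  # 1. mount time
  mount -o compress=zstd /dev/sdb1 /mnt/data
  # 2. defrag with recompression
  btrfs filesystem defragment -czstd /mnt/data/bigfile
  # 3. the per-inode property this patch covers
  btrfs property set /mnt/data/bigfile compression zstd
  btrfs property get /mnt/data/bigfile compression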

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-11 Thread Duncan
Darrick J. Wong posted on Fri, 11 May 2018 17:06:34 -0700 as excerpted:

> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
>> Right now we return EINVAL if a process does not have permission to dedupe a
>> file. This was an oversight on my part. EPERM gives a true description of
>> the nature of our error, and EINVAL is already used for the case that the
>> filesystem does not support dedupe.
>> 
>> Signed-off-by: Mark Fasheh 
>> ---
>>  fs/read_write.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 77986a2e2a3b..8edef43a182c 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -2038,7 +2038,7 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
>>  info->status = -EINVAL;
>>  } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
>>   uid_eq(current_fsuid(), dst->i_uid))) {
>> -info->status = -EINVAL;
>> +info->status = -EPERM;
> 
> Hmm, are we allowed to change this aspect of the kabi after the fact?
> 
> Granted, we're only trading one error code for another, but will the
> existing users of this care?  xfs_io won't and I assume duperemove won't
> either, but what about bees? :)

From the 0/2 cover-letter:

>>> This has also popped up in duperemove, mostly in the form of cryptic
>>> error messages. Because this is a code returned to userspace, I did
>>> check the other users of extent-same that I could find. Both 'bees'
>>> and 'rust-btrfs' do the same as duperemove and simply report the error
>>> (as they should).

> --D
> 
>>  } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>>  info->status = -EXDEV;
>>  } else if (S_ISDIR(dst->i_mode)) {
>> -- 
>> 2.15.1
>>

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56 - 6 parity raid

2018-05-02 Thread Duncan
Goffredo Baroncelli posted on Wed, 02 May 2018 22:40:27 +0200 as
excerpted:

> Anyway, my "rant" started when Ducan put near the missing of parity
> checksum and the write hole. The first might be a performance problem.
> Instead the write hole could lead to a loosing data. My intention was to
> highlight that the parity-checksum is not related to the reliability and
> safety of raid5/6.

Thanks for making that point... and to everyone else for the vigorous 
thread debating it, as I'm learning quite a lot! =:^)

From your first reply:

>> Why the fact that the parity is not checksummed is a problem ?
>> I read several times that this is a problem. However each time the
>> thread reached the conclusion that... it is not a problem.

I must have missed those threads, or at least, missed that conclusion 
from them (maybe believing they were about something rather narrower, or 
conflating... for instance), because AFAICT, this is the first time I've 
seen the practical merits of checksummed parity actually debated, at 
least in terms I as a non-dev can reasonably understand.  To my mind it 
was settled (or I'd have worded my original claim rather differently) and 
only now am I learning different.

And... to my credit... given the healthy vigor of the debate, it seems 
I'm not the only one that missed them...

But I'm surely learning of it now, and indeed, I had somewhat conflated 
parity-checksumming with the in-place-stripe-read-modify-write atomicity 
issue.  I'll leave the parity-checksumming debate (now that I know it at 
least remains debatable) to those more knowledgeable than myself, but in 
addition to what I've learned of it, I've definitely learned that I can't 
properly conflate it with the in-place stripe-rmw atomicity issue, so 
thanks!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: RAID56 - 6 parity raid

2018-05-02 Thread Duncan
Gandalf Corvotempesta posted on Wed, 02 May 2018 19:25:41 + as
excerpted:

> On 05/02/2018 03:47 AM, Duncan wrote:
>> Meanwhile, have you looked at zfs? Perhaps they have something like
>> that?
> 
> Yes, i've looked at ZFS and I'm using it on some servers but I don't
> like it too much for multiple reasons, in example:
> 
> 1) is not officially in kernel, we have to build a module every time
> with DKMS

FWIW zfs is excluded from my choice domain as well, due to the well known 
license issues.  Regardless of strict legal implications, because Oracle 
has copyrights they could easily solve that problem and the fact that 
they haven't strongly suggests they have no interest in doing so.  That 
in turn means they have no interest in people like me running zfs, which 
means I have no interest in it either.

But because it does remain effectively the nearest to btrfs features and 
potential features "working now" solution out there, for those who simply 
_must_ have it and/or find it a more acceptable solution than cobbling 
together a multi-layer solution out of a standard filesystem on top of 
device-mapper or whatever, it's what I and others point to when people 
wonder about missing or unstable btrfs features.

> I'm new to BTRFS (if fact, i'm not using it) and I've seen in the status
> page that "it's almost ready".
> The only real missing part is a stable, secure and properly working
> RAID56,
> so i'm thinking why most effort aren't directed to fix RAID56 ?

Well, they are.  But finding and fixing corner-case bugs takes time and 
early-adopter deployments, and btrfs doesn't have the engineering 
resources to simply assign to the problem that Sun had with zfs.

Despite that, as I stated, to the best of my/list knowledge the current 
btrfs raid56 code is now reasonably ready, tho it'll take another year or 
two without serious bug reports to actually demonstrate that.  It simply 
has the well known write hole that applies to all parity-raid unless 
specific measures are taken, such as partial-stripe-write logging (slow), 
writing a full stripe even if it's partially empty (wastes space and 
needs periodic maintenance to reclaim it), or variable stripe widths 
(needs periodic maintenance and is more complex than always writing full 
stripes even if they're partially empty), the latter two avoiding the 
problem by avoiding the in-place read-modify-write cycle entirely.

So to a large degree what's left is simply time for testing to 
demonstrate stability on the one hand, and a well known problem with 
parity-raid in general on the other.  There's the small detail that said 
well-known write hole has additional implementation-detail implications 
on btrfs, but at it's root it's the same problem all parity-raid has, and 
people choosing parity-raid as a solution are already choosing to either 
live with it or ameliorate it in some other way (tho some parity-raid 
solutions have that amelioration built-in).

> There are some environments where a RAID1/10 is too expensive and a
> RAID6 is mandatory,
> but with the current state of RAID56, BTRFS can't be used for valuable
> data

Not entirely true.  Btrfs, even btrfs raid56 mode, _can_ be used for 
"valuable" data, it simply requires astute /practical/ definitions of 
"valuable", as opposed to simple claims that don't actually stand up in 
practice.

Here's what I mean:  The sysadmin's first rule of backups defines 
"valuable data" by the number of backups it's worth making of that data.  
If there's no backups, then by definition the data is worth less than the 
time/hassle/resources necessary to have that backup, because it's not a 
question of if, but rather when, something's going to go wrong with the 
working copy and it won't be available any longer.

Additional layers of backup and whether one keeps geographically 
separated off-site backups as well are simply extensions of the first-
level-backup case/rule.  The more valuable the data, the more backups 
it's worth having of it, and the more effort is justified in ensuring 
that single or even multiple disasters aren't going to leave no working 
backup.

With this view, it's perfectly fine to use btrfs raid56 mode for 
"valuable" data, because that data is backed up and that backup can be 
used as a fallback if necessary.  True, the "working copy" might not be 
as reliable as it is in some cases, but statistically, that simply brings 
the 50% chance of failure rate (or whatever other percentage chance you 
choose) closer, to say once a year, or once a month, rather than perhaps 
once or twice a decade.  Working copy failure is GOING to happen in any 
case, it's just a matter of playing the chances.

Re: RAID56 - 6 parity raid

2018-05-01 Thread Duncan
Gandalf Corvotempesta posted on Tue, 01 May 2018 21:57:59 + as
excerpted:

> Hi to all I've found some patches from Andrea Mazzoleni that adds
> support up to 6 parity raid.
> Why these are wasn't merged ?
> With modern disk size, having something greater than 2 parity, would be
> great.

1) Btrfs parity-raid was known to be seriously broken until quite 
recently (and still has the common parity-raid write-hole, which is more 
serious on btrfs because btrfs otherwise goes to some lengths to ensure 
data/metadata integrity via checksumming and verification, and the parity 
isn't checksummed, risking even old data due to the write hole, but there 
are a number of proposals to fix that), and piling even more not well 
tested patches on top was _not_ the way toward a solution.

2) Btrfs features in general have taken longer to merge and stabilize 
than one might expect, and parity-raid has been a prime example, with the 
original roadmap calling for parity-raid merge back in the 3.5 timeframe 
or so... partial/runtime (not full recovery) code was finally merged ~3 
years later in (IIRC) 3.19, took several development cycles for the 
initial critical bugs to be worked out but by 4.2 or so was starting to 
look good, then more bugs were found and reported, that took several more 
years to fix, tho IIRC LTS-4.14 has them.

Meanwhile, consider that N-way-mirroring was fast-path roadmapped for 
"right after raid56 mode, because some of its code depends on that", so 
was originally expected in 3.6 or so...  As someone who had been wanting 
to use /that/, I personally know the pain of "still waiting".

And that was "fast-pathed".

So even if the multi-way-parity patches were on the "fast" path, it's 
only "now" (for relative values of now, for argument say by 4.20/5.0 or 
whatever it ends up being called) that such a thing could be reasonably 
considered.


3) AFAIK none of the btrfs devs have flat rejected the idea, but btrfs 
remains development opportunity rich and implementing dev poor... there's 
likely 20 years or more of "good" ideas out there.  And the N-way-parity-
raid patches haven't hit any of the current devs' (or their employers') 
"personal itch that needs to be scratched" interest points, so while it 
certainly does remain a "nice idea", given the implementation timeline 
history for even "fast-pathed" ideas, realistically we're looking at at 
least a decade out.  But with the practical projection horizon no more 
than 5-7 years out (beyond that other, unpredicted, developments, are 
likely to change things so much that projection is effectively 
impossible), in practice, a decade out is "bluesky", aka "it'd be nice to 
have someday, but it's not a priority, and with current developer 
manpower, it's unlikely to happen any time in the practically projectable 
future."

4) Of course all that's subject to no major new btrfs developer (or 
sponsor) making it a high priority, but even should such a developer (and/
or sponsor) appear, they'd probably need to spend at least two years 
coming up to speed with the code first, fixing normal bugs and improving 
the existing code quality, then post the updated and rebased N-way-parity 
patches for discussion, and get them roadmapped for merge probably some 
years later due to other then-current project feature dependencies.

So even if the N-way-parity patches became some new developer's (or 
sponsor's) personal itch to scratch, by the time they came up to speed 
and the code was actually merged, there's no realistic projection that it 
would be in under 5 years, plus another couple to stabilize, so at least 
7 years to properly usable stability.  So even then, we're already at the 
5-7 years practical projectability limit.


Meanwhile, have you looked at zfs?  Perhaps they have something like 
that?  And there's also a new(?) one, stratis, AFAIK commercially 
sponsored and device-mapper based, that I saw an article on recently, tho 
I've seen/heard no kernel-community discussion on it (there's a good 
chance followup here will change that if it's worth discussing, as 
there's several folks here for whom knowing about such things is part of 
their job) and no other articles (besides the pt 1 of the series 
mentioned below), so for all I know it's pie-in-the-sky or still new 
enough it'd be 5-7 years before it can be used in practice, as well.  But 
assuming it's a viable project, presumably it would get support if device-
mapper did/has.

The stratis article I saw (apparently part 2 in a series):
https://opensource.com/article/18/4/stratis-lessons-learned

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: NVMe SSD + compression - benchmarking

2018-04-29 Thread Duncan
Brendan Hide posted on Sat, 28 Apr 2018 09:30:30 +0200 as excerpted:

> My real worry is that I'm currently reading at 2.79GB/s (see result
> above and below) without compression when my hardware *should* limit it
> to 2.0GB/s. This tells me either `sync` is not working or my benchmark
> method is flawed.

No answer but a couple additional questions/suggestions:

* Tarfile:  Just to be sure, you're using an uncompressed tarfile, not a 
(compressed tarfile) tgz/tbz2/etc, correct?

* How does hdparm -t and -T compare?  That's read-only and bypasses the 
filesystem, so it should at least give you something to compare the 2.79 
GB/s to, both from-raw-device (-t) and cached/memory-only (-T); see the 
sketch below.  The hdparm (8) manpage has the details.

* And of course try the compressed tarball too, since it should be easy 
enough and should give you compressable vs. uncompressable numbers for 
sanity checking.
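
For the hdparm and cache-related suggestions above, a minimal sketch of 
what I'd run (the device and tarfile paths are just placeholders):

  # raw sequential read from the device, bypassing the filesystem
  hdparm -t /dev/nvme0n1
  # cached read, effectively a RAM benchmark, for the upper bound
  hdparm -T /dev/nvme0n1
  # flush and drop the page cache so the re-read actually hits the device
  sync
  echo 3 > /proc/sys/vm/drop_caches
  # read the test tarfile through the filesystem and time it
  dd if=/path/to/test.tar of=/dev/null bs=1M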

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: What is recommended level of btrfs-progs and kernel please

2018-04-29 Thread Duncan
David C. Partridge posted on Sat, 28 Apr 2018 15:09:07 +0100 as excerpted:

> To what level of btrfs-progs do you recommend I should upgrade once my
> corrupt FS is fixed?  What is the kernel pre-req for that?
> 
> Would prefer not to build from source ... currently running Ubuntu
> 16.04LTS

The way it works is as follows:

In normal operation, the kernel does most of the work, with commands such 
as balance and scrub simply making the appropriate calls to the kernel to 
do the real work.  So the kernel version is what's critical in normal 
operation.  (IIRC, the receive side of btrfs send/receive is an 
exception, userspace is doing the work there, tho the kernel does it on 
the send side.)

This list is mainline and forward-looking development focused, so 
recommended kernels, the ones people here are most familiar with, tend to 
be relatively new.  The two support tracks are current and LTS, and we 
try to support the latest two kernels of each.  On the current kernel 
track, 4.16 is the latest, so the 4.16 and 4.15 series are currently 
supported.  On the LTS track, 4.14 is the newest LTS series and is 
recommended, with 4.9 the previous one, still supported, tho as it gets 
older and memories of what was going on at the time fade, it gets harder 
to support.

That doesn't mean we don't try to help people with older kernels, but 
truth is, the best answer may well be "try it with a newer kernel and see 
if the problem persists".

Similarly for distro kernels, particularly older ones.  We track mainline 
and in general[1] have little idea what patches specific distros may have 
backported... or not.  With newer kernels there's not so much to backport, 
and hopefully none of their added patches actually interferes, but 
particularly outside the mainline LTS series kernels, and older than the 
second newest LTS series kernel for the real LTS distros, the distros 
themselves are choosing what to backport and support, and thus are in a 
better position to support those kernels than we on this list will be.


But when something goes wrong and you need to use the debugging tools or 
btrfs check or restore, it's the btrfs userspace (btrfs-progs) that is 
doing the work, so it becomes the most critical when you have a problem 
you are trying to find/repair/restore-from.

So in normal operation, userspace isn't critical, and the biggest problem 
is simply keeping it current enough that the output remains comparable to 
current output.  With btrfs userspace release numbering following that of 
the kernel, for operational use, a good rule of thumb is to keep 
userspace updated to at least the version of the oldest supported LTS 
kernel series, as mentioned 4.9 at present, thus keeping it at least 
within approximately two years of current.

But once something goes wrong, the newest available userspace, or close 
to it, has the latest fixes, and generally provides the best chance at a 
fix with the least hassle or chance of further breakage instead.  So 
there, basically something within the current track, above, thus 
currently at least a 4.15 if not a 4.16 userspace (btrfs-progs) is your 
best bet.

And often the easiest way to get that if your distro doesn't make it 
directly available, is to make it a point to keep around the latest 
LiveRescue (often install/rescue combined) image of a distro such as 
Fedora or Arch that stays relatively current.  That's often the newest or 
close enough, and if it's not, it at least gives you a way to get back 
online to fetch something newer after booting the rescue image, if you 
have to.

---
[1] In general:  I think one regular btrfs dev works with SuSE, and one 
non-dev but well-practiced support list regular is most familiar with 
Fedora, tho of course Fedora doesn't tend to be /too/ outdated.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: status page

2018-04-25 Thread Duncan
with the least chance of 
introducing new bugs so the testing and bugfixing cycle should be shorter 
as well, but ouch, that logged-write penalty on top of the read-modify-
write penalty that short-stripe-writes on parity-raid already incurs, 
will really do a number to performance!  But it /should/ finally fix the 
write hole risk, and it'd be the fastest way to do it on top of existing 
code, with the least risk of additional bugs because it's the least new 
code to write.


What I personally suspect will happen is this last solution in the 
shorter term, tho it'll still take some years to be written and tested to 
stability, with the possibility of someone undertaking a btrfs parity-
raid-g2 project implementing the first/cleanest possibility in the longer 
term, say a decade out (which effectively means "whenever someone with 
the skills and motivation decides to try it, could be 5 years out if they 
start today and devote the time to it, could be 15 years out, or never, 
if nobody ever decides to do it).  I honestly don't see the intermediate 
possibilities as worth the trouble, as they'd take too long for not 
enough payback compared to the solutions at either end, but of course, 
someone might just come along that likes and actually implements that 
angle instead.  As always with FLOSS, the one actually doing the 
implementation is the one who decides (subject to maintainer veto, of 
course, and possible distro and ultimate mainlining of the de facto 
situation override of the maintainer, as well).


A single paragraph summary answer?

Current raid56 status-quo is semi-stable, and subject to testing over 
time, is likely to remain there for some time, with the known parity-raid 
write-hole caveat as the biggest issue.  There's discussion of attempts 
to mitigate the write-hole, but the final form such mitigation will take 
remains to be settled, and the shortest-to-stability alternative, logged 
partial-stripe-writes, has serious performance negatives, but that might 
be acceptable given that parity-raid already has read-modify-write 
performance issues so people don't choose it for write performance in any 
case.  That'd be probably 3 years out to stability at the earliest.  
There's a cleaner alternative but it'd be /much/ farther out as it'd 
involve a pretty heavy rewrite along with the long testing and bugfix 
cycle that implies, so ~10 years out if ever, for that.  And there's a 
couple intermediate alternatives as well, but unless something changes I 
don't really see them going anywhere.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs progs release 4.16.1

2018-04-25 Thread Duncan
David Sterba posted on Wed, 25 Apr 2018 13:02:34 +0200 as excerpted:

> On Wed, Apr 25, 2018 at 06:31:20AM +0000, Duncan wrote:
>> David Sterba posted on Tue, 24 Apr 2018 13:58:57 +0200 as excerpted:
>> 
>> > btrfs-progs version 4.16.1 have been released.  This is a bugfix
>> > release.
>> > 
>> > Changes:
>> > 
>> >   * remove obsolete tools: btrfs-debug-tree, btrfs-zero-log,
>> >   btrfs-show-super, btrfs-calc-size
>> 
>> Cue the admin-side gripes about developer definitions of micro-upgrade
>> explicit "bugfix release" that allow disappearance of "obsolete tools".
>> 
>> Arguably such removals can be expected in a "feature release", but
>> shouldn't surprise unsuspecting admins doing a micro-version upgrade
>> that's specifically billed as a "bugfix release".
> 
> A major version release would be a better time for the removal, I agree
> and should have considered that.
> 
> However, the tools have been obsoleted for a long time (since 2015 or
> 2016) so I wonder if the deprecation warnings have been ignored by the
> admins all the time.

Indeed, in practice, anybody still using the stand-alone tools in a 
current version has been ignoring deprecation warnings for awhile, and 
the difference between 4.16.1 and 4.17(.0) isn't likely to make much of a 
difference to them.

It's just that from here anyway, if I did a big multi-version upgrade and 
saw tools go missing I'd expect it, and if I did an upgrade from 4.16 to 
4.17 I'd expect it and blame myself for not getting with the program 
sooner.  But on an upgrade from 4.16 to 4.16.1, furthermore, an explicit 
"bugfix release", I'd be annoyed with upstream when they went missing, 
because it's just not expected in such a minor release, particularly when 
it's an explicit "bugfix release".

>> (Further support for btrfs being "still stabilizing, not yet fully
>> stable and mature."  But development mode habits need to end
>> /sometime/, if stability is indeed a goal.)
> 
> What happened here was a bad release management decision, a minor one in
> my oppinion but I hear your complaint and will keep that in mind for
> future releases.

That's all I was after.  A mere trifle indeed in the filesystem context 
where there's a real chance that bugs can eat data, but equally trivially 
held off for a .0 release.  What's behind is done, but it can and should 
be used to inform the future, and I simply mentioned it here with the 
goal /of/ informing future release decisions.  To the extent that it does 
so, my post accomplished its purpose. =:^)

Seems my way of saying that ended up coming across way more negative than 
intended.  So I have some changes to make in the way I handle things in 
the future as well. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Btrfs progs release 4.16.1

2018-04-24 Thread Duncan
David Sterba posted on Tue, 24 Apr 2018 13:58:57 +0200 as excerpted:

> btrfs-progs version 4.16.1 have been released.  This is a bugfix
> release.
> 
> Changes:
> 
>   * remove obsolete tools: btrfs-debug-tree, btrfs-zero-log,
>   btrfs-show-super, btrfs-calc-size

Cue the admin-side gripes about developer definitions of micro-upgrade 
explicit "bugfix release" that allow disappearance of "obsolete tools".

Arguably such removals can be expected in a "feature release", but 
shouldn't surprise unsuspecting admins doing a micro-version upgrade 
that's specifically billed as a "bugfix release".

(Further support for btrfs being "still stabilizing, not yet fully stable 
and mature."  But development mode habits need to end /sometime/, if 
stability is indeed a goal.) 

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Recovery from full metadata with all device space consumed?

2018-04-20 Thread Duncan
>> Metadata, RAID1: total=3.00GiB, used=2.50GiB
>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>
>> All of the consumable space on the backing devices also seems to be in
>> use:
>>
>> # btrfs fi show /broken Label: 'mon_data'  uuid:
>> 85e52555-7d6d-4346-8b37-8278447eb590
>> Total devices 4 FS bytes used 69.50GiB
>> devid1 size 931.51GiB used 931.51GiB path /dev/sda1
>> devid2 size 931.51GiB used 931.51GiB path /dev/sdb1
>> devid3 size 931.51GiB used 931.51GiB path /dev/sdc1
>> devid4 size 931.51GiB used 931.51GiB path /dev/sdd1

As you suggest, all space on all devices is used.  While fi usage breaks 
out unallocated as its own line-item, both per device and overall, with
fi show/df you have to derive it from the difference between size and 
used on each device listed in the fi show report.

If (after getting it that way with balance) you keep fi show per-device 
used under say 250 or 500 MiB, that'll go to unallocated, as fi usage 
will make clearer.

Meanwhile, for fi df, that data line says 3.6+ TiB total data chunk 
allocations, but only 67 GiB used.  As I said, that's ***WAY*** out of 
whack, and getting it back into something a bit more normal and keeping 
it there, for under 100 GiB actually used, say under say 250 or 500 GiB 
total, with the rest returned to unallocated, dropping the used in the fi 
df report and increasing unallocated in fi usage, should keep you well 
out of trouble.

As for fi usage, While I use a bunch of much smaller filesystems here, 
all raid1 or dup, so it'll be of limited direct help, I'll post the 
output from one of mine, just so you can see how much easier it is to 
read the fi usage report:

$$ sudo btrfs filesystem usage /
Overall:
Device size:  16.00GiB
Device allocated:  7.02GiB
Device unallocated:8.98GiB
Device missing:  0.00B
Used:  4.90GiB
Free (estimated):  5.25GiB  (min: 5.25GiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID1: Size:3.00GiB, Used:2.24GiB
   /dev/sda5   3.00GiB
   /dev/sdb5   3.00GiB

Metadata,RAID1: Size:512.00MiB, Used:209.59MiB
   /dev/sda5 512.00MiB
   /dev/sdb5 512.00MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/sda5   8.00MiB
   /dev/sdb5   8.00MiB

Unallocated:
   /dev/sda5   4.49GiB
   /dev/sdb5   4.49GiB

(FWIW there's also btrfs device usage, if you want a device-focused 
report.)

This is a btrfs raid1 both data and metadata, on a pair of 8 GiB devices, 
thus 16 GiB total.

Of that 8 GiB per device, a very healthy 4.49 GiB per device, over half 
the filesystem, remains entirely chunk-level unallocated and thus free to 
allocate to data or metadata chunks as needed.

Meanwhile, data chunk allocation is 3 GiB total per device, of which 2.24 
GiB is used.  Again, that's healthy, as data chunks are nominally 1 GiB 
so that's probably three 1 GiB chunks allocated, with 2.24 GiB of it used.

By contrast, your in-trouble fi usage report will show (near) 0 
unallocated and a ***HUGE*** gap between size/total and used for data, 
while you should be easily able to get per-device data totals down to say 
250 GiB or so (or down to 10 GiB or so with more work), with it all 
switching to unallocated, and then keep it healthy by doing a balance 
with -dusage= as necessary any time the numbers start getting out of line 
again.
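
A minimal sketch of that kind of filtered balance, using your /broken 
mountpoint (start with a low usage cutoff and raise it only as needed):

  # reclaim data chunks that are mostly empty first, they're cheapest to move
  btrfs balance start -dusage=10 /broken
  # if that doesn't free enough, raise the cutoff and repeat
  btrfs balance start -dusage=50 /broken
  # then check the result
  btrfs filesystem usage /broken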

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: remounted ro during operation, unmountable since

2018-04-15 Thread Duncan
Qu Wenruo posted on Sat, 14 Apr 2018 22:41:50 +0800 as excerpted:

>> sectorsize        4096
>> nodesize        4096
> 
> Nodesize is not the default 16K, any reason for this?
> (Maybe performance?)
> 
>>> 3) Extra hardware info about your sda
>>>     Things like SMART and hardware model would also help here.

>> Model Family: Samsung based SSDs Device Model: SAMSUNG SSD 830
>> Series
> 
> At least I haven't hear much problem about Samsung SSD, so I don't think
> it's the hardware to blamce. (Unlike Intel 600P)

830 model is a few years old, IIRC (I have 850s, and I think I saw 860s 
out in something I read probably on this list, but am not sure of it).  I 
suspect the filesystem was created with an old enough btrfs-tools that 
the default nodesize was still 4K, either due to older distro, or simply 
due to using the filesystem that long.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs fails to mount after power outage

2018-04-12 Thread Duncan
Qu Wenruo posted on Thu, 12 Apr 2018 07:25:15 +0800 as excerpted:


> On 2018年04月11日 23:33, Tom Vincent wrote:
>> My btrfs laptop had a power outage and failed to boot with "parent
>> transid verify failed..." errors. (I have backups).
> 
> Metadata corruption, again.
> 
> I'm curious about what's the underlying disk?
> Is it plain physical device? Or have other layers like bcache/lvm?
> 
> And what's the physical device? SSD or HDD?

The last line of his message said progs 4.15, kernel 4.15.15, NVMe, so 
it's SSD.

Another important question, tho, if not for this instance, than for 
easiest repair the next time something goes wrong:

What mount options?  In particular, is the discard option used (and of 
course I'm assuming nothing as insane as nobarrier)?

Because as came up on a recent thread here...

Btrfs normally keeps a few generations of root blocks around and one 
method of recovery is using the usebackuproot (or the deprecated 
recovery) option to try to use them if the current root is bad.  But 
apparently nobody considered how discard and the backup roots would 
interact, and there's (currently) nothing keeping them from being marked 
for discard just as soon as the next new root becomes current.  Now some 
device firmware batches up discards as garbage-collection that can be 
done periodically, when the number of unwritten erase-blocks gets low, 
but others do discards basically immediately, meaning those backup roots 
are lost effectively immediately, making the usebackuproots recovery 
feature worthless. =:^(

Not a tradeoff that would occur to most people, obviously including the 
btrfs devs that setup btrfs discard behavior, considering whether to 
enable discard or not. =:^(

But it's definitely a tradeoff to consider once you /do/ know it!

Presumably that'll be fixed at some point, but not being a dev nor 
knowing how complex the fix might be, I won't venture a guess as to when, 
or whether it'd be considered stable-kernel backport material or not, 
when it happens.
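
Meanwhile, for the next time something goes wrong, a minimal sketch of 
the usebackuproot attempt described above (device and mountpoint are 
placeholders):

  # see what backup roots the superblock still lists, without mounting
  btrfs inspect-internal dump-super -f /dev/sdX | less
  # then try a read-only mount falling back to an older root
  mount -o ro,usebackuproot /dev/sdX /mnt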

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Out of space and incorrect size reported

2018-03-22 Thread Duncan
Shane Walton posted on Thu, 22 Mar 2018 00:56:05 + as excerpted:

>>>> btrfs fi df /mnt2/pool_homes
>>> Data, RAID1: total=240.00GiB, used=239.78GiB
>>> System, RAID1: total=8.00MiB, used=64.00KiB
>>> Metadata, RAID1: total=8.00GiB, used=5.90GiB
>>> GlobalReserve, single: total=512.00MiB, used=59.31MiB
>>> 
>>>> btrfs filesystem show /mnt2/pool_homes
>>> Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
>>> Total devices 2 FS bytes used 245.75GiB
>>> devid1 size 465.76GiB used 248.01GiB path /dev/sda
>>> devid2 size 465.76GiB used 248.01GiB path /dev/sdb
>>> 
>>> Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB
>>> almost full and limited to 240 GiB when there is I have 2x 500 GB HDD?

>>> What can I do to make this larger or closer to the full size of 465
>>> GiB (minus the System and Metadata overhead)?

By my read, Hugo answered correctly, but (I think) not the question you 
asked.

The upgrade was certainly a good idea, 4.4 being quite old now and not 
really supported well here now, as this is a development list and we tend 
to be focused on new, not long ago history, but it didn't change the 
report output as you expected, because based on your question you're 
misreading it and it doesn't say what you are interpreting it as saying.

BTW, you might like the output from btrfs filesystem usage a bit better 
as it's somewhat clearer than the previously required (usage is a 
relatively new subcommand that might not have been in 4.4 yet) btrfs fi 
df and btrfs fi show, but understanding how btrfs works and what the 
reported numbers mean is still useful.

Btrfs does two-stage allocation.  First, it allocates chunks of a 
specific type, normally data or metadata (system is special, normally 
only one chunk so no more allocated, and global reserve is actually 
reserved from metadata and counts as part of it) from unused/unallocated 
space (which isn't shown by show/df, but usage shows it separately), then 
when necessary, btrfs actually uses space from the chunks it allocated 
previously.

So what the above df line is saying is that 240 GiB of space have been 
allocated as data chunks, and 239.78 GiB of that, almost all of it, is 
used.

But you should still have 200+ GiB of unallocated space on each of the 
devices, as here shown by the individual device lines of the show command 
(465 total, 248 used), tho as I said, btrfs filesystem usage makes that 
rather clearer.

And btrfs should normally allocate additional space from that 200+ gigs 
unallocated, to data or metadata chunks, as necessary.  Further, because 
btrfs can't directly take chunks allocated as data and reallocate them as 
metadata, you *WANT* lots of unallocated space.  You do NOT want all that 
extra space allocated as data chunks, because then they wouldn't be 
available to allocate as metadata if needed.

Now with 200+ GiB of space on each of the two devices unallocated, you 
shouldn't yet be running into ENOSPC (error no space) errors.  If you 
are, that's a bug, and there have actually been a couple bugs like that 
recently, but that doesn't mean you want btrfs to unnecessarily allocate 
all that unallocated space as data space, which would be what it did if 
it reported all that as data.  Rather, you need btrfs to allocate data, 
and metadata, chunks as needed, and any space related errors you are 
seeing would be bugs related to that.

Now that you have a newer btrfs-progs and kernel, and have read my 
attempt at an explanation above, try btrfs filesystem usage and see if 
things are clearer.  If not, maybe Hugo or someone else can do better 
now, answering /that/ question.  And of course if with the newer 4.12 
kernel you're getting ENOSPC errors, please report that too, tho be aware 
that 4.14 is the latest LTS series, with 4.9 the LTS before that, and as 
a normal non-LTS series kernel 4.12 support has ended as well, so you 
might wish to either upgrade to a current 4.14 LTS or downgrade to the 
older 4.9 LTS, for best support.

Or of course you could go with a current non-LTS.  Normally the latest 
two release series on both the normal and LTS tracks are best supported, 
so with 4.15 out and 4.16 nearing release, that's 4.15 or 4.14 on the 
normal track now (becoming 4.16 and 4.15 once 4.16 releases), or on the 
LTS track the previously mentioned 4.14 and 4.9 series, tho at a year old 
plus, 4.9 is already getting rather harder to support, and 4.14 is old 
enough now that it's the preferred LTS choice.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree progra

Re: grub_probe/grub-mkimage does not find all drives in BTRFS RAID1

2018-03-22 Thread Duncan
but bad upgrades or fat-fingering my /boot, that I kept it!

But in addition to two-way raid1 redundancy on multiple devices, btrfs 
has the dup mode, two-way dup redundancy on a single device, so that's 
what I do with my /boot and its backups on other devices now, instead of 
making them raid1s across multiple devices.

So while most of my filesystems and their backups are btrfs raid1 both 
data and metadata across two physical devices (with another pair of 
physical devices for the btrfs raid1 backups), /boot and its backups are 
all btrfs dup mixed-bg-mode (so data and metadata mixed, easier to work 
with on small filesystems), giving me one primary /boot and three 
backups, and I can still select which one to boot from the hardware/BIOS 
(legacy not EFI mode, tho I do use GPT and have EFI-boot partitions 
reserved in case I decide to switch to EFI at some point).


So my suggestion would be to do something similar, multiple /boot, one 
per device, one as the working copy and the other(s) as backups, instead 
of btrfs raid1 across multiple devices.  If you still want to take 
advantage of btrfs' ability to error-correct from a second copy if the 
first fails checksum, as I do, btrfs dup mode is useful, but regardless, 
you'll then have a backup in case the working /boot entirely fails.  Tho 
of course with dup mode you can only use a bit under half the capacity.

Your btrfs fi show says 342 MB used (as data) of the 1 GiB, so dup mode 
should be possible as you'd have a bit under 500 MiB capacity then.  Your 
individual devices say nearly 700 MiB each used, but with only 342 MiB of 
that as data, the rest is likely partially used chunks that a filtered 
balance can take care of.  A btrfs fi usage report would tell the details 
(or btrfs fi df, combined with the show you've already posted).  At a 
GiB, creating the filesystem as mixed-mode is also recommended, tho that 
does make a filtered balance a bit more of a hassle since you have to use 
the same filters for both data and metadata because they're the same 
chunks.

FWIW, I started out with 256 MiB /boot here, btrfs dup mode so ~ 100 MiB 
usable, but after ssd upgrades and redoing the layout, now use 512 MiB 
/boots, for 200+ MiB usable.  That's better.  Your 1 GiB doubles that, so 
should be no trouble at all, even with dup, unless you're storing way 
more in /boot than I do.  (Being gentoo I do configure and run a rather 
slimmer custom initramfs and monolithic kernel configured for only the 
hardware and dracut initr* modules I need, and a fatter generic initr* 
and kernel modules would likely need more space, but your show output 
says it's only using 342 MiB for data, so as I said your 1 GiB for ~500 
MiB usable in dup mode should be quite reasonable.)
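
If you go that route, a rough sketch of the setup I'm describing (device 
name is a placeholder, and exactly which dup combinations mkfs accepts 
varies a bit with the progs version):

  # small /boot: dup data+metadata, mixed block groups
  mkfs.btrfs --mixed -d dup -m dup -L boot /dev/sdXn
  # with mixed chunks, filtered balances need matching data/metadata filters
  btrfs balance start -dusage=50 -musage=50 /boot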

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?

2018-03-15 Thread Duncan
Piotr Pawłow posted on Tue, 13 Mar 2018 08:08:27 +0100 as excerpted:

> Hello,
>> Put differently, 4.7 is missing a year and a half worth of bugfixes
>> that you won't have when you run it to try to check or recover that
>> btrfs that won't mount! Do you *really* want to risk your data on bugs
>> that were after all discovered and fixed over a year ago?
> 
> It is also missing newly introduced bugs. Right now I'm dealing with
> btrfs raid1 server that had the fs getting stuck and kernel oopses due
> to a regression:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=198861
> 
> I had to cherry-pick commit 3be8828fc507cdafe7040a3dcf361a2bcd8e305b and
> recompile the kernel to even start moving the data off the failing
> drive, as the fix is not in stable yet, and encountering any i/o error
> would break the kernel. And now it seems the fs is corrupted, maybe due
> to all the crashes earlier.
> 
> FYI in case you decide to switch to 4.15

In context I was referring to userspace as the 4.7 was userspace btrfs-
progs, not kernelspace.

For kernelspace he was on 4.9, which is the second-newest LTS (long-term-
stable) kernel series, and thus should continue to be at least somewhat 
supported on this list for another year or so, as we try to support the 
two newest kernels from both the current and LTS series.  Tho 4.9 does 
lack the newer raid1 per-chunk degraded-writable scanning feature, and 
AFAIK that won't be stable-backported as it's more a feature than a bugfix 
and as such, doesn't meet the requirements for stable-series backports.  
Which is why Adam recommended a newer kernel, since that was the 
particular problem needing addressed here.

But for someone on an older kernel, presumably because they like 
stability, I'd suggest the newer 4.14 LTS series kernel as an upgrade, 
not the only short-term supported 4.15 series... unless the intent is to 
continue staying current after that, with 4.16, 4.17, etc.  Which your 
point about newer kernels coming with newer bugs in addition to fixes 
supports as well.  Moving to the 4.14 LTS should get the real fixes and 
the longer stabilization time, tho not the feature adds, which would 
bring a higher chance of more bugs, as well.

And with 4.15 out for awhile now and 4.16 close, 4.14 should be 
reasonably stabilizing by now and should be pretty safe to move to.

But of course there's some risk of new bugs in addition to fixes for 
newer userspace versions too.  But since it's kernelspace that's the 
operational code and userspace is primarily recovery, and we know that 
older bugs ARE fixed in newer userspace, and assuming a sane backups 
policy which I stressed in the same post (if you don't have a backup, 
you're defining the data as of less value than the time/trouble/resources 
to create the backup, thus defining it as of relatively low/trivial value 
in the first place, because you're more willing to risk losing it than 
you are to spend the time/resources/hassle to ensure against that risk), 
the better chance at an updated userspace being able to fix problems with 
less risk of further damage really does justify considering updating to 
reasonably current userspace.  If there's any doubt, stay a version or 
two behind the latest release and watch for reports of problems with the 
latest, but certainly, with 4.15 userspace out and no serious reports of 
new damage from 4.14 userspace, the latter should now be a reasonably 
safe upgrade.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?

2018-03-11 Thread Duncan
Backups are fast enough now 
that as I predicted, I make them far more often.  So I'm walking my own 
talk, and am able to sleep much more comfortably now as I'm not worrying 
about that backup I put off and the chance fate might take me up on my 
formerly too-high-for-comfort "trivial" threshold definition.=:^)

(And as it happens, I'm actually running from a system/root filesystem 
backup ATM, as an upgrade didn't go well and x wouldn't start, so I 
reverted.  But my root/system filesystem is under 10 gigs, on SSD for the 
backup as well as the working copy, so a full backup copy of root takes 
only a few minutes and I made one before upgrading a few packages I had 
some doubts about due to previous upgrade issues with them, so the delta 
between working and that backup was literally the five package upgrades I 
was, it turned out, rightly worried about.  So that investment in ssds for 
backup has paid off.  While in this particular case simply taking a 
snapshot and recovering to it when the upgrade went bad would have worked 
just as well, having the independent filesystem backup on a different set 
of physical devices means I don't have to worry about loss of the 
filesystem or physical devices containing it, either! =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-10 Thread Duncan
Andrei Borzenkov posted on Sat, 10 Mar 2018 13:27:03 +0300 as excerpted:


> And "missing" is not the answer because I obviously may have more than
> one missing device.

"missing" is indeed the answer when using btrfs device remove.  See the 
btrfs-device manpage, which explains that if there's more than one device 
missing, either just the first one described by the metadata will be 
removed (if missing is only specified once), or missing can be specified 
multiple times.

raid6 with two devices missing is the only normal candidate for that 
presently, tho on-list we've seen aborted-add cases where it still worked 
as well, because while the metadata listed the new device it didn't 
actually have any data when it became apparent it was bad and thus needed 
to be removed again.
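
As a sketch of the remove-by-missing case described above (mountpoint is 
a placeholder):

  # raid1/raid10: only one device may be missing, so one "missing" suffices
  btrfs device remove missing /mnt
  # raid6 with two devices gone: per the manpage, "missing" may be repeated
  btrfs device remove missing missing /mnt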

Note that because btrfs raid1 and raid10 only does two-way-mirroring 
regardless of the number of devices, and because of the per-chunk (as 
opposed to per-device) nature of btrfs raid10, those modes can only 
expect successful recovery with a single missing device, altho as 
mentioned above we've seen on-list at least one case where an aborted 
device-add of device found to be bad after the add didn't actually have 
anything on it, so it could still be removed along with the device it was 
originally intended to replace.

Of course the N-way-mirroring mode, whenever it eventually gets 
implemented, will allow missing devices upto N-1, and N-way-parity mode, 
if it's ever implemented, similar, but N-way-mirroring was scheduled for 
after raid56 mode so it could make use of some of the same code, and that 
has of course taken years on years to get merged and stabilize, and 
there's no sign yet of N-way-mirroring patches, which based on the raid56 
case could take years to stabilize and debug after original merge, so the 
still somewhat iffy raid6 mode is likely to remain the only normal usage 
of multiple missing for years, yet.

For btrfs replace, the manpage says ID's the only way to handle missing, 
but getting that ID, as you've indicated, could be difficult.  For 
filesystems with only a few devices that haven't had any or many device 
config changes, it should be pretty easy to guess (a two device 
filesystem with no changes should have IDs 1 and 2, so if only one is 
listed, the other is obvious, and a 3-4 device fs with only one or two 
previous device changes, likely well remembered by the admin, should 
still be reasonably easy to guess), but as the number of devices and the 
number of device adds/removes/replaces increases, finding/guessing the 
missing one becomes far more difficult.
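
A sketch of the guess-the-devid approach for replace (the devid and new 
device here are hypothetical):

  # the devids still present are listed; the missing one is the gap in the sequence
  btrfs filesystem show /mnt
  # e.g. if a 4-device fs shows devids 1, 2 and 4, the missing one is 3
  btrfs replace start -B 3 /dev/sdnew /mnt
  btrfs replace status /mnt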

Of course the sysadmin's first rule of backups states in simple form that 
not having one == defining the value of the data as trivial, not worth 
the trouble of a backup, which in turn means that at some point before 
there's /too/ many device change events, it's likely going to be less 
trouble (particularly after factoring in reliability) to restore from 
backups to a fresh filesystem than it is to do yet another device change, 
and together with the current practical limits btrfs imposes on the 
number of missing devices, that tends to impose /some/ limit on the 
possibilities for missing device IDs, so the situation, while not ideal, 
isn't yet /entirely/ out of hand, either, because a successful guess 
based on available information should be possible without /too/ many 
attempts.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: spurious full btrfs corruption

2018-03-06 Thread Duncan
Christoph Anton Mitterer posted on Tue, 06 Mar 2018 01:57:58 +0100 as
excerpted:

> In the meantime I had a look of the remaining files that I got from the
> btrfs-restore (haven't run it again so far, from the OLD notebook, so
> only the results from the NEW notebook here:):
> 
> The remaining ones were multi-GB qcow2 images for some qemu VMs.
> I think I had non of these files open (i.e. VMs running) while in the
> final corruption phase... but at least I'm sure that not *all* of them
> were running.
> 
> However, all the qcow2 files from the restore are more or less garbage.
> During the btrfs-restore it already complained on them, that it would
> loop too often on them and whether I want to continue or not (I choose n
> and on another full run I choose y).
> 
> Some still contain a partition table, some partitions even filesystems
> (btrfs again)... but I cannot mount them.

Just a note on format choices FWIW, nothing at all to do with your 
current problem...

As my own use-case doesn't involve VMs I'm /far/ from an expert here, but 
if I'm screwing things up I'm sure someone will correct me and I'll learn 
something too, but it does /sound/ reasonable, so assuming I'm 
remembering correctly from a discussion here...

Tip: Btrfs and qcow2 are both copy-on-write/COW (it's in the qcow2 name, 
after all), and doing multiple layers of COW is both inefficient and a 
good candidate to test for corner-case bugs that wouldn't show up in 
more normal use-cases.  Assuming bug-free it /should/ work properly, of 
course, but equally of course, bug-free isn't an entirely realistic 
assumption. =8^0

... And you're putting btrfs on qcow2 on btrfs... THREE layers of COW!

The recommendation was thus to pick what layer you wish to COW at, and 
use something that's not COW-based at the other layers.  Apparently, qemu 
has raw-format as a choice as well as qcow2, and that was recommended as 
preferred for use with btrfs (and IIRC what the recommender was using 
himself).

But of course that still leaves cow-based btrfs on both the top and the 
bottom layers.  I suppose which of those is best to remain btrfs, while 
making the other say ext4 as widest used and hopefully safest general 
purpose non-COW alternative, depends on the use-case.

Of course keeping btrfs at both levels but nocowing the image files on 
the host btrfs is a possibility as well, but nocow on btrfs has enough 
limits and caveats that I consider it a second-class "really should have 
used a different filesystem for this but didn't want to bother setting up 
a dedicated one" choice, and as such, don't consider it a viable option 
here.
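
For the raw-format alternative mentioned above, the qemu side is a 
one-liner (path and size are just examples):

  # raw image, so the only COW layers left are the guest and host filesystems
  qemu-img create -f raw /var/lib/libvirt/images/guest.img 40G
  # versus the qcow2 default that stacks a third COW layer in between:
  # qemu-img create -f qcow2 /var/lib/libvirt/images/guest.qcow2 40G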

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs space used issue

2018-03-02 Thread Duncan
vinayak hegde posted on Thu, 01 Mar 2018 14:56:46 +0530 as excerpted:

> This will happen over and over again until we have completely
> overwritten the original extent, at which point your space usage will go
> back down to ~302g.  We split big extents with cow, so unless you've got
> lots of space to spare or are going to use nodatacow you should probably
> not pre-allocate virt images

Indeed.  Preallocation with COW doesn't make the sense it does on an 
overwrite-in-place filesystem.  Either nocow it and take the penalties 
that brings[1], or configure your app not to preallocate in the first 
place[2].

---
[1] On btrfs, nocow implies no checksumming or transparent compression, 
either.  Also, the nocow attribute needs to be set on the empty file, 
with the easiest way to do that being to set it on the parent directory 
before file creation, so it's inherited by any newly created files/
subdirs within it.

[2] Many apps that preallocate by default have an option to turn 
preallocation off.
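
To illustrate footnote [1], the usual directory-level setup looks like 
this (the directory name is just an example):

mkdir /srv/vm-images
chattr +C /srv/vm-images    # files created in here inherit nodatacow
lsattr -d /srv/vm-images    # should now show the 'C' attribute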

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs space used issue

2018-02-28 Thread Duncan
Austin S. Hemmelgarn posted on Wed, 28 Feb 2018 14:24:40 -0500 as
excerpted:

>> I believe this effect is what Austin was referencing when he suggested
>> the defrag, tho defrag won't necessarily /entirely/ clear it up.  One
>> way to be /sure/ it's cleared up would be to rewrite the entire file,
>> deleting the original, either by copying it to a different filesystem
>> and back (with the off-filesystem copy guaranteeing that it can't use
>> reflinks to the existing extents), or by using cp's --reflink=never
>> option.
>> (FWIW, I prefer the former, just to be sure, using temporary copies to
>> a suitably sized tmpfs for speed where possible, tho obviously if the
>> file is larger than your memory size that's not possible.)

> Correct, this is why I recommended trying a defrag.  I've actually never
> seen things so bad that a simple defrag didn't fix them however (though
> I have seen a few cases where the target extent size had to be set
> higher than the default of 20MB).

Good to know.  I knew larger target extent sizes could help, but between 
not being sure they'd entirely fix it and not wanting to get too far down 
into the detail when the copy-off-the-filesystem-and-back option is 
/sure/ to fix the problem, I decided to handwave that part of it. =:^)
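
For the archives, the two approaches look something like this (the paths 
and the target extent size are just examples):

# defrag with a larger-than-default target extent size
btrfs filesystem defragment -t 128M /path/to/file

# or the copy-off-the-filesystem-and-back method, via tmpfs
cp /path/to/file /tmp/file.copy
rm /path/to/file
cp /tmp/file.copy /path/to/file
rm /tmp/file.copy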

> Also, as counter-intuitive as it
> might sound, autodefrag really doesn't help much with this, and can
> actually make things worse.

I hadn't actually seen that here, but suspect I might, now, as previous 
autodefrag behavior on my system tended to rewrite the entire file[1], 
thereby effectively giving me the benefit of the copy-away-and-back 
technique without actually bothering, while that "bug" has now been fixed.

I sort of wish the old behavior remained an option, maybe 
radicalautodefrag or something, and must confess to being a bit concerned 
over the eventual impact here now that autodefrag does /not/ rewrite the 
entire file any more, but oh, well...  Chances are it's not going to be 
/that/ big a deal since I /am/ on fast ssd, and if it becomes one, I 
guess I can just setup say firefox-profile-defrag.timer jobs or whatever, 
as necessary.

---
[1] I forgot whether it was ssd behavior, or compression, or what, but 
something I'm using here apparently forced autodefrag to rewrite the 
entire file, and a recent "bugfix" changed that so it's more in line with 
the normal autodefrag behavior.  I rather preferred the old behavior, 
especially since I'm on fast ssd and all my large files tend to be write-
once no-rewrite anyway, but I understand the performance implications on 
large active-rewrite files such as gig-plus database and VM-image files, 
so...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs space used issue

2018-02-28 Thread Duncan
ing extents), or by using cp's --reflink=never option.  
(FWIW, I prefer the former, just to be sure, using temporary copies to a 
suitably sized tmpfs for speed where possible, tho obviously if the file 
is larger than your memory size that's not possible.)

Of course where applicable, snapshots and dedup keep reflink-references 
to the old extents, so they must be adjusted or deleted as well, to 
properly free that space.

---
[1] du: Because its purpose is different.  du's primary purpose is 
telling you in detail what space files take up, per-file and per-
directory, without particular regard to usage on the filesystem itself.  
df's focus, by contrast, is on the filesystem as a whole.  So where two 
files share the same extent due to reflinking, du should and does count 
that usage for each file, because that's what each file /uses/ even if 
they both use the same extents.


-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Ongoing Btrfs stability issues

2018-02-16 Thread Duncan
Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
excerpted:

> This will probably sound like an odd question, but does BTRFS think your
> storage devices are SSD's or not?  Based on what you're saying, it
> sounds like you're running into issues resulting from the
> over-aggressive SSD 'optimizations' that were done by BTRFS until very
> recently.
> 
> You can verify if this is what's causing your problems or not by either
> upgrading to a recent mainline kernel version (I know the changes are in
> 4.15, I don't remember for certain if they're in 4.14 or not, but I
> think they are), or by adding 'nossd' to your mount options, and then
> seeing if you still have the problems or not (I suspect this is only
> part of it, and thus changing this will reduce the issues, but not
> completely eliminate them).  Make sure and run a full balance after
> changing either item, as the aforementioned 'optimizations' have an
> impact on how data is organized on-disk (which is ultimately what causes
> the issues), so they will have a lingering effect if you don't balance
> everything.

According to the wiki, 4.14 does indeed have the ssd changes.

According to the bug, he's running 4.13.x on one server and 4.14.x on 
two.  So upgrading the one to 4.14.x should mean all will have that fix.

However, without a full balance it /will/ take some time to settle down 
(again, assuming btrfs was using ssd mode), so the lingering effect could 
still be creating problems on the 4.14 kernel servers for the moment.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: fatal database corruption with btrfs "out of space" with ~50 GB left

2018-02-14 Thread Duncan
Tomasz Chmielewski posted on Thu, 15 Feb 2018 16:02:59 +0900 as excerpted:

>> Not sure if the removal of 80G has anything to do with this, but this
>> seems that your metadata (along with data) is quite scattered.
>> 
>> It's really recommended to keep some unallocated device space, and one
>> of the method to do that is to use balance to free such scattered space
>> from data/metadata usage.
>> 
>> And that's why balance routine is recommened for btrfs.
> 
> The balance might work on that server - it's less than 0.5 TB SSD disks.
> 
> However, on multi-terabyte servers with terabytes of data on HDD disks,
> running balance is not realistic. We have some servers where balance was
> taking 2 months or so, and was not even 50% done. And the IO load the
> balance was adding was slowing the things down a lot.

Try a filtered balance.  Something along the lines of:

btrfs balance start -dusage=10 <mountpoint>

The -dusage number, a limit on the chunk usage percentage, can start 
small, even 0, and be increased as necessary, until btrfs fi usage 
reports data size (currently 411 GiB) closer to data usage (currently 
246.14 GiB), with the freed space returning to unallocated.

I'd shoot for reducing data size to under 300 GiB, thus returning over 
100 GiB to unallocated, while hopefully not requiring too high a -dusage 
percentage and thus too long a balance time.  You could get it down under 
250 gig size, but that would likely take a lot of rewriting for little 
additional gain, since with it under 300 gig size you should already have 
over 100 gig unallocated.

Balance time should be quite short for low percentages, with a big 
payback if there's quite a few chunks with little usage, because at 10%, 
the filesystem can get rid of 10 chunks while only rewriting the 
equivalent of a single full chunk.

Obviously as the chunk usage percentage goes up, the payback goes down, 
so at 50%, it can only clear two chunks while writing one, and at 66%, it 
has to write two chunks worth to clear three.  Above that (tho I tend to 
round up to 70% here) is seldom worth it until the filesystem gets quite 
full and you're really fighting to keep a few gigs of unallocated space.  
(As Qu indicated, you always want at least a gig of unallocated space, on 
at least two devices if you're doing raid1.)

If you really wanted you could do the same with -musage for metadata, 
except that's not so bad, only 9 gig size, 3 gig used.  But you could 
free 5 gigs or so, if desired.
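
Putting that together, something like this steps the filter up until 
enough space returns to unallocated (the mountpoint is a placeholder; add 
-musage similarly if you want to trim metadata chunks too):

for pct in 0 5 10 25 50; do
    btrfs balance start -dusage=$pct /mnt/data
    btrfs filesystem usage /mnt/data
done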


That's assuming there's no problem.  I see a followup indicating you're 
seeing problems in dmesg with a balance, however, and will let others 
deal with that.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Status of FST and mount times

2018-02-14 Thread Duncan
Qu Wenruo posted on Thu, 15 Feb 2018 09:42:27 +0800 as excerpted:

> The easiest way to get a basic idea of how large your extent tree is
> using debug tree:
> 
> # btrfs-debug-tree -r -t extent <device>
> 
> You would get something like:
> btrfs-progs v4.15 extent tree key (EXTENT_TREE ROOT_ITEM 0) 30539776
> level 0  <<<
> total bytes 10737418240 bytes used 393216 uuid
> 651fcf0c-0ffd-4351-9721-84b1615f02e0
> 
> That level is would give you some basic idea of the size of your extent
> tree.
> 
> For level 0, it could contains about 400 items for average.
> For level 1, it could contains up to 197K items.
> ...
> For leven n, it could contains up to 400 * 493 ^ (n - 1) items.
> ( n <= 7 )

So for level 2 (which I see on a couple of mine here, ran it out of 
curiosity):

400 * 493 ^ (2 - 1) = 400 * 493 = 197200

197K for both level 1 and level 2?  Doesn't look correct.

Perhaps you meant a simple power of n, instead of (n-1)?  That would 
yield ~97M for level 2, and would also yield the given numbers for levels 
0 and 1, whereas using n-1 yields less than a single entry for level 0 
and only 400 for level 1.
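
For the record, the arithmetic for the two readings at level 2, taking 
the ~400 items per leaf and ~493 keys per node from your numbers:

echo '400 * 493 ^ (2 - 1)' | bc    # 197200, ~197K, same as level 1
echo '400 * 493 ^ 2' | bc          # 97219600, ~97M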

Or the given numbers were for levels 1 and 2, not levels 0 and 1, with 
level 0 not holding anything.  But that wouldn't jibe with your level 0 
example, which presumably could never happen if level 0 couldn't hold 
even a single entry.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: fatal database corruption with btrfs "out of space" with ~50 GB left

2018-02-14 Thread Duncan
Tomasz Chmielewski posted on Wed, 14 Feb 2018 23:19:20 +0900 as excerpted:

> Just FYI, how dangerous running btrfs can be - we had a fatal,
> unrecoverable MySQL corruption when btrfs decided to do one of these "I
> have ~50 GB left, so let's do out of space (and corrupt some files at
> the same time, ha ha!)".

Ouch!

> Running btrfs RAID-1 with kernel 4.14.

Kernel 4.14... quite current... good.  But 4.14.0 first release, 4.14.x 
current stable, or somewhere (where?) in between?

And please post the output of btrfs fi usage for that filesystem.  
Without that (or fi sh and fi df, the pre-usage method of getting nearly 
the same info), it's hard to say where or what the problem was.
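
That is, something like (mountpoint is a placeholder):

btrfs filesystem usage /mountpoint
# or the older pair:
btrfs filesystem show /mountpoint
btrfs filesystem df /mountpoint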

Meanwhile, FWIW there was a recent metadata over-reserve bug that should 
be fixed in 4.15 and the latest 4.14 stable, but IDR whether it affected 
4.14.0 original or only the 4.13 series and early 4.14-rcs and was fixed 
by 4.14.0.  The bug seemed to trigger most frequently when doing balances 
or other major writes to the filesystem, on middle to large sized 
filesystems.  (My all under quarter-TB each btrfs didn't appear to be 
affected.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Status of FST and mount times

2018-02-14 Thread Duncan
d if you can avoid it because the btrfs check --repair 
fix is trivial, it's worth doing so.

Valid case, but there's nothing in your post indicating it's valid as 
/your/ case.

Of course the other possibility is live-failover, which is sure to be 
Facebook's use-case.  But with live-failover, the viability of btrfs 
check --repair more or less ceases to be of interest, because the 
failover happens (relative to the offline check or restore time) 
instantly.  Once the failed device/machine is taken out of service, it's 
far more effective to simply blow away the filesystem (if not replacing 
the device(s) entirely) and restore "at leisure" from backup, a 
relatively guaranteed procedure compared to the "no guarantees" of 
attempting to check --repair the filesystem out of trouble.

Which is very likely why the free-space-tree still isn't well supported 
by btrfs-progs, including btrfs check, several kernel (and thus -progs) 
development cycles later.  The people who really need the one (whichever 
one of the two)... don't tend to (or at least /shouldn't/) make use of 
the other so much.

It's also worth mentioning that btrfs raid0 mode, like single mode, 
hobbles the btrfs data and metadata integrity feature.  Checksums are 
still generated, stored and checked by default, so integrity problems can 
still be detected, but because raid0 (and single) includes no redundancy, 
there's no second copy (raid1/10) or parity redundancy (raid5/6) to 
rebuild the bad data from, so it's simply gone.  (Well, for data you can 
try btrfs restore of the otherwise inaccessible file and hope for the 
best, and for metadata you can try check --repair and again hope for the 
best, but...)  If you're using that feature of btrfs and want/need more 
than just detection of problems that can't be fixed due to lack of 
redundancy, there's a good chance you want a real redundancy raid mode on 
multi-device, or dup mode on a single device.

So bottom line... given the sacrificial lack of redundancy and 
reliability of raid0, btrfs or not, in an enterprise setting with tens of 
TB of data, why worry about the viability of btrfs check --repair on what 
the placement on raid0 decrees to be throw-away data anyway?  At first 
glance, one of the two must be wrong: either the raid0 mode, which 
declares those tens of TB of data to be of throw-away value, or the 
concern about the viability of btrfs check --repair, which indicates you 
don't actually consider that data throw-away after all.  Which one is 
wrong is your call, and there are certainly individual cases (one of 
which I even named) where concern about the viability of btrfs check 
--repair on raid0 might be valid, but your post has no real indication 
that yours is such a case, and honestly, that worries me!

> 2. There's another thread on-going about mount delays.  I've been
> completely blind to this specific problem until it caught my eye.  Does
> anyone have ballpark estimates for how long very large HDD-based
> filesystems will take to mount?  Yes, I know it will depend on the
> dataset.  I'm looking for O() worst-case approximations for
> enterprise-grade large drives (12/14TB), as I expect it should scale
> with multiple drives so approximating for a single drive should be good
> enough.

No input on that question here (my own use-case couldn't be more 
different, multiple small sub-half-TB independent btrfs raid1s on 
partitioned ssds), but another concern, based on real-world reports I've 
seen on-list:

12-14 TB individual drives?

While you /did/ say enterprise grade so this probably doesn't apply to 
you, it might apply to others that will read this.

Be careful that you're not trying to use the "archive application" 
targeted SMR drives for general purpose use.  Occasionally people will 
try to buy and use such drives in general purpose use due to their 
cheaper per-TB cost, and it just doesn't go well.  We've had a number of 
reports of that. =:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs - kernel warning

2018-02-04 Thread Duncan
Duncan posted on Fri, 02 Feb 2018 02:49:52 + as excerpted:

> As CMurphy says, 4.11-ish is starting to be reasonable.  But you're on
> the LTS kernel 4.14 series and userspace 4.14 was developed in parallel,
> so btrfs-progs-3.14 would be ideal.

Umm... obviously that should be 4.14.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs - kernel warning

2018-02-01 Thread Duncan
ast backup and the current state.  As soon as 
the change to your data since the last backup becomes more valuable than 
the time/trouble/resources necessary to update your backup, you will do 
so.  If you haven't, it simply means you're defining the changes since 
your last backup as of less value than the time/trouble/resources 
necessary to do that update.  So again, you can *always* rest easy in the 
face of filesystem or device problems: either you have the data backed 
up, or by definition of /not/ having it backed up it was self-evidently 
not worth the trouble yet, so you saved what was most important to you 
either way.

So think about your value definitions regarding your data and change them 
if you need to... while you still have the chance. =:^)

(And the implications of the above change how you deal with a broken 
filesystem too.  With either current backups or what you've literally 
defined as throw-away data due to it not being worth the trouble of 
backups, it makes little sense to spend more than a trivial amount of 
time trying to recover data from a messed up filesystem, especially given 
that there's no guarantee you'll get it all back undamaged even if you 
/do/ spend the time.  It's often simpler, faster, and more certain of 
success to simply blow away the defective filesystem with a fresh mkfs 
and restore the data from backups, since that way you know you'll have a 
fresh filesystem and known-good data from the backup, as opposed to no 
guarantees about /what/ you'll end up with trying to recover/repair the 
old filesystem.)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: degraded permanent mount option

2018-01-28 Thread Duncan
Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:

> 27.01.2018 18:22, Duncan пишет:
>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>> 
>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>>
>>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>>> process stop on initramfs.
>>>>>>
>>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>>
>>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>>> everything is fine from their point of view.
>>>
>>> It's quite obvious who's the culprit: every single remaining rc system
>>> manages to mount degraded btrfs without problems.  They just don't try
>>> to outsmart the kernel.
>> 
>> No kidding.
>> 
>> All systemd has to do is leave the mount alone that the kernel has
>> already done,
> 
> Are you sure you really understand the problem? No mount happens because
> systemd waits for indication that it can mount and it never gets this
> indication.

As Tomasz indicates, I'm talking about manual mounting (after the initr* 
drops to a maintenance prompt if it's root being mounted, or on manual 
mount later if it's an optional mount) here.  The kernel accepts the 
degraded mount and it's mounted for a fraction of a second, but systemd 
actually undoes the successful work of the kernel to mount it, so by the 
time the prompt returns and a user can check, the filesystem is unmounted 
again, with the only indication that it was mounted at all being the log.

He says that's because the kernel still says it's not ready, but that's 
for /normal/ mounting.  The kernel accepted the degraded mount and 
actually mounted the filesystem, but systemd undoes that.
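
For reference, the manual mount in question is nothing more than this 
(device and target are placeholders, e.g. from the initr* maintenance 
shell):

mount -o degraded /dev/sda2 /sysroot    # or wherever your initr* mounts root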

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: degraded permanent mount option

2018-01-27 Thread Duncan
Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:

> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>> 
>> >> I just tested to boot with a single drive (raid1 degraded), even
>> >> with degraded option in fstab and grub, unable to boot !  The boot
>> >> process stop on initramfs.
>> >> 
>> >> Is there a solution to boot with systemd and degraded array ?
>> > 
>> > No. It is finger pointing. Both btrfs and systemd developers say
>> > everything is fine from their point of view.
> 
> It's quite obvious who's the culprit: every single remaining rc system
> manages to mount degraded btrfs without problems.  They just don't try
> to outsmart the kernel.

No kidding.

All systemd has to do is leave the mount alone that the kernel has 
already done, instead of insisting it knows what's going on better than 
the kernel does, and immediately umounting it.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: bad key ordering - repairable?

2018-01-24 Thread Duncan
tube or whatever on fullscreen, and to now my second generation of 
ssds, a pair of 1 TB samsung evos, but this reminds me that at nearing 
six years old the main system's aging too, so I better start thinking of 
replacing it again...)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-23 Thread Duncan
ein posted on Tue, 23 Jan 2018 09:38:13 +0100 as excerpted:

> On 01/22/2018 09:59 AM, Duncan wrote:
>> 
>> And to tie up a loose end, xfs has somewhat different design principles
>> and may well not be particularly sensitive to the dirty_* settings,
>> while btrfs, due to COW and other design choices, is likely more
>> sensitive to them than the widely used ext* and reiserfs (my old choice
>> and the basis of my own settings, above).

> Excellent booklike writeup showing how /proc/sys/vm/ works, but I
> wonder, how can you explain why does XFS work in this case?

I can't, directly, which is why I glossed over it so fast above.  I do 
have some "educated guesswork", but that's _all_ it is, as I've not had 
reason to get particularly familiar with xfs and its quirks.  You'd have 
to ask the xfs folks if my _guess_ is anything approaching reality, but 
if you do please be clear that I explicitly said I don't know and that 
this is simply my best guess based on the very limited exposure to xfs 
discussions I've had.

So I'm not experience-familiar with xfs and other than what I've happened 
across in cross-list threads here, know little about it except that it 
was ported to Linux from other *ix.  I understand the xfs port to 
"native" is far more complete than that of zfs, for example.  
Additionally, I know from various vfs discussion threads cross-posted to 
this and other filesystem lists that xfs remains rather different than 
some -- apparently (if I've gotten it right) it handles "objects" rather 
than inodes and extents, for instance.

Apparently, if the vfs threads I've read are to be believed, xfs would 
have some trouble with a proposed vfs interface that would allow requests 
to write out and free N pages or N KiB of dirty RAM from the write 
buffers in ordered to clear memory for other usage, because it tracks 
objects rather than dirty pages/KiB of RAM.  Sure it could do it, but it 
wouldn't be an efficient enough operation to be worth the trouble for 
xfs.  So apparently xfs just won't make use of that feature of the 
proposed new vfs API, there's nothing that says it /has/ to, after all -- 
it's proposed to be optional, not mandatory.

Now that discussion was in a somewhat different context than the 
vm.dirty_* settings discussion here, but it seems reasonable to assume 
that if xfs would have trouble converting objects to the size of the 
memory they take in the one case, the /proc/sys/vm/dirty_* dirty writeback 
cache tweaking features may not apply to xfs, at least in a direct/
intuitive way, either.


Which is why I suggested xfs might not be particularly sensitive to those 
settings -- I don't know that it ignores them entirely, and it may use 
them in /some/ way, possibly indirectly, but the evidence I've seen does 
suggest that xfs may, if it uses those settings at all, not be as 
sensitive to them as btrfs/reiserfs/ext*.

Meanwhile, due to the extra work btrfs does with checksumming and cow, 
while AFAIK it uses the settings "straight", having them out of whack 
likely has a stronger effect on btrfs than it does on ext* and reiserfs 
(with reiserfs likely being slightly more strongly affected than ext*, 
but not to the level of btrfs).

And there has indeed been confirmation on-list that adjusting these 
settings *does* have a very favorable effect on btrfs for /some/ use-
cases.
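
Purely as an illustration of the kind of tweak being discussed (the 
values are examples only, not recommendations; tune them to your RAM 
size and workload):

sysctl -w vm.dirty_background_bytes=268435456    # background writeback at 256 MiB
sysctl -w vm.dirty_bytes=1073741824              # hard throttle at 1 GiB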

(In one particular case, the posting was to the main LKML, but on btrfs 
IIRC, and Linus got involved.  I don't believe that led to the 
/creation/ of the relatively new per-device throttling stuff, as I 
believe the patches were already around, but I suspect it may have led to 
their integration in mainline a few kernel cycles earlier than they 
otherwise would have been.  Because it's a reasonably well known "secret" 
that the 
default ratios are out of whack on modern systems, it's just not settled 
what the new defaults /should/ be, so in the absence of agreement or 
pressing problem, they remain as they are.  But Linus blew his top as 
he's known to do, he and others pointed the reporter at the vm.dirty_* 
settings tho Linus wanted to know why the defaults were so insane for 
today's machines, and tweaking those did indeed help.  Then a kernel 
cycle or two later the throttling options appeared in mainline, very 
possibly as a result of Linus "routing around the problem" to some 
extent.)


So in my head I have a picture of the possible continuum of vm.dirty_ 
effect that looks like this:

<- weak effect                                      strong ->

zfs.....xfs.....ext*.....reiserfs.....btrfs

zfs, no or almost no effect, because it uses non-native mechanism and is 
poorly adapted to Linux.

xfs, possibly some effect, but likely relatively light, becaus

Re: Periodic frame losses when recording to btrfs volume with OBS

2018-01-22 Thread Duncan
- the default is 
the venerable CFQ but deadline may well be better for a streaming use-
case, and now there's the new multi-queue stuff and the multi-queue kyber 
and bfq schedulers, as well -- and setting IO priority -- probably by 
increasing the IO priority of the streaming app.  The tool to use for the 
latter is called ionice.  Do note, however, that not all schedulers 
implement IO priorities.  CFQ does, but while I think deadline should 
work better for the streaming use-case, it's simpler code and I don't 
believe it implements IO priority.  Similarly for multi-queue, I'd guess 
the low-code-designed-for-fast-direct-PCIE-connected-SSD kyber doesn't 
implement IO priorities, while the more complex and general purpose 
suitable-for-spinning-rust bfq /might/ implement IO priorities.
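
For example, assuming the recording app shows up as an obs process and 
the active scheduler honors IO priorities:

ionice -c2 -n0 -p "$(pidof obs)"    # best-effort class, highest priority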

But I know less about that stuff and it's googlable, should you decide to 
try playing with it too.  I know what the dirty_* stuff does from 
personal experience. =:^)


And to tie up a loose end, xfs has somewhat different design principles 
and may well not be particularly sensitive to the dirty_* settings, while 
btrfs, due to COW and other design choices, is likely more sensitive to 
them than the widely used ext* and reiserfs (my old choice and the basis 
of my own settings, above).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs volume corrupt. btrfs-progs bug or need to rebuild volume?

2018-01-19 Thread Duncan
Rosen Penev posted on Fri, 19 Jan 2018 13:45:35 -0800 as excerpted:

> v2: Add proper subject

=:^)

> I've been playing around with a specific kernel on a specific device
> trying to figure out why btrfs keeps throwing csum errors after ~15
> hours. I've almost nailed it down to some specific CONFIG option in the
> kernel, possibly related to IRQs.
> 
> Anyway, I managed to get my btrfs RAID5 array corrupted to the point
> where it will just mount to read-only mode.

[...]

> This is with version 4.14 of btrfs-progs. Do I need a newer version or
> should I just reinitialize my array and copy everything back?
> 
> Log on mount attached below:

[...]

> Fri Jan 19 14:26:08 2018 kern.warn kernel:
> [168383.378239] CPU: 0 PID:
> 2496 Comm: kworker/u8:2 Tainted: GW   4.9.75 #0

Tho as the penultimate LTS kernel series 4.9 is still on the btrfs-list 
supported list in general... 4.9 still had known btrfs raid56 mode issues 
and is strongly negatively recommended for use with btrfs raid56 mode.  
Those weren't fixed until 4.12, which /finally/ brought raid56 mode into 
generally working and not negatively recommended state.

While, as an LTS series, applicable general btrfs bug fixes would be 
backported to 4.9, raid56 mode had never worked /well/ at that point, so 
I'm not sure the raid56 fixes were backported.

So you really need either kernel 4.12+, presumably the LTS 4.14 series 
since you're on LTS 4.9 series now, for btrfs raid56 mode, or don't use 
raid56 mode if you plan on staying with the 4.9 LTS, as it still had 
severe known issues back then and I haven't seen on-list confirmation 
that the 4.12 btrfs raid56 mode fixes were backported to 4.9-LTS.  

If you need/choose to stick with 4.9 and dump raid56 mode, the 
recommended alternative depends on the number of devices in the 
filesystem.

For a small number of devices in the filesystem, btrfs raid1 is 
effectively as stable as the still stabilizing and maturing btrfs itself 
is at this point and is recommended.

For a larger number of devices, btrfs raid1 is still a good choice 
because it /is/ the most mature, and btrfs raid10 is /reasonably/ stable 
too, tho IMO not quite as stable as raid1.  For better performance (btrfs 
raid10 isn't read-optimized yet) while keeping btrfs checksumming and 
error repair from the second copy when available, consider a layered 
approach, with btrfs raid1 on top of a pair of mdraid0s (or dmraid0s, or 
hardware raid0s).
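
A sketch of that layered layout, with purely hypothetical device names:

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
mdadm --create /dev/md1 --level=0 --raid-devices=2 /dev/sdc /dev/sdd
mkfs.btrfs -m raid1 -d raid1 /dev/md0 /dev/md1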

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: big volumes only work reliable with ssd_spread

2018-01-15 Thread Duncan
Stefan Priebe - Profihost AG posted on Mon, 15 Jan 2018 10:55:42 +0100 as
excerpted:

> since around two or three years i'm using btrfs for incremental VM
> backups.
> 
> some data:
> - volume size 60TB
> - around 2000 subvolumes
> - each differential backup stacks on top of a subvolume
> - compress-force=zstd
> - space_cache=v2
> - no quote / qgroup
> 
> this works fine since Kernel 4.14 except that i need ssd_spread as an
> option. If i do not use ssd_spread i always end up with very slow
> performance and a single kworker process using 100% CPU after some days.
> 
> With ssd_spread those boxes run fine since around 6 month. Is this
> something expected? I haven't found any hint regarding such an impact.

My understanding of the technical details is "limited" as I'm not a dev, 
and I expect you'll get a more technically accurate response later, but 
sometimes a first not particularly technical response can be helpful as 
long as it's not /wrong/.  (And if it is this is a good way to have my 
understanding corrected as well. =:^)  With that caveat, based on my 
understanding of what I've seen on-list...

The kernel v4.14 ssd mount-option changes apparently primarily affected 
data, not metadata.  Apparently, ssd_spread has a heavier metadata 
effect, and the v4.14 changes moved additional (I believe metadata) 
functionality to ssd_spread that had originally been part of ssd as 
well.  There has been some discussion of metadata tweaks for the ssd 
option similar to the 4.14 data tweaks, but they weren't deemed as 
demonstrably needed and required further discussion, so were put off 
until the effect of the 4.14 tweaks could be gauged in more widespread 
use, after which they were to be reconsidered if necessary.

Meanwhile, in the discussion I saw, Chris Mason mentioned that Facebook 
is using ssd_spread for various reasons there, so it's well-tested with 
their deployments, which I'd assume have many of the same qualities yours 
do, thus implying that your observations about ssd_spread are no accident.

In fact, if I interpreted Chris's comments correctly, they use ssd_spread 
on very large multi-layered non-ssd storage arrays, in part because the 
larger layout-alignment optimizations make sense there as well as on 
ssds.  That would appear to be precisely what you are seeing. =:^)  If 
that's the case, then arguably the option is misnamed and the ssd_spread 
name may well at some point be deprecated in favor of something more 
descriptive of its actual function and target devices.  Purely my own 
speculation here, but perhaps something like vla_spread (very-large-
array)?

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Hanging after frequent use of systemd-nspawn --ephemeral

2018-01-14 Thread Duncan
Qu Wenruo posted on Sun, 14 Jan 2018 10:27:40 +0800 as excerpted:

> Despite of that, did that really hangs?
> Qgroup dramatically increase overhead to delete a subvolume or balance
> the fs.
> Maybe it's just a little slow?

Same question about the "hang" here.

Note that btrfs is optimized to make snapshot creation fast, while 
snapshot deletion has to do more work to clean things up.  So even 
without qgroups enabled, deletion can take a bit of time (much longer 
than creation, which should be nearly instantaneous in human terms) if 
there's a lot of reflinks and the like to clean up.

And qgroups make btrfs do much more work to track all that, so as Qu 
says, they'll make snapshot deletion take even longer, and you probably 
want them disabled unless you actually need the feature for something 
you're doing.
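
If you don't actually need qgroups, turning them off is a one-liner 
(mountpoint and snapshot path are placeholders):

btrfs quota disable /mnt/pool
# subsequent deletions then skip the qgroup accounting work:
btrfs subvolume delete /mnt/pool/snapshots/old-snap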

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



  1   2   3   4   5   6   7   8   9   10   >