Re: System unable to mount partition after a power loss

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-07 01:43, Doni Crosby wrote:

This is qemu-kvm? What's the cache mode being used? It's possible the
usual write guarantees are thwarted by VM caching.

Yes, it is a Proxmox host running the system, so it is a QEMU VM; I'm
unsure about the caching situation.
On the note of QEMU and the cache mode, the only cache mode I've seen to 
actually cause issues for BTRFS volumes _inside_ a VM is 'cache=unsafe', 
but that causes problems for most filesystems, so it's probably not the 
issue here.


OTOH, I've seen issues with most of the cache modes other than 
'cache=writeback' and 'cache=writethrough' when dealing with BTRFS as 
the back-end storage on the host system, and most of the time such 
issues will manifest as both problems with the volume inside the VM 
_and_ the volume the disk images are being stored on.
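
For reference, here is roughly how you can check what cache mode a guest's
disks are actually using on a Proxmox host (the VM ID 100 is just an
example, and `qm` is Proxmox-specific; if nothing shows a cache= setting,
the hypervisor default is in effect):

qm config 100 | grep -i 'cache='
ps -eo args | grep -o 'cache=[a-z]*' | sort | uniq -c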


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-07 Thread Austin S. Hemmelgarn

On 2018-12-06 23:09, Andrei Borzenkov wrote:

06.12.2018 16:04, Austin S. Hemmelgarn wrote:


* On SCSI devices, a discard operation translates to a SCSI UNMAP
command.  As pointed out by Ronnie Sahlberg in his reply, this command
is purely advisory, may not result in any actual state change on the
target device, and is not guaranteed to wipe the data.  To actually wipe
things, you have to explicitly write bogus data to the given regions
(using either regular writes, or a WRITESAME command with the desired
pattern), and _then_ call UNMAP on them.


The WRITE SAME command has an UNMAP bit, and depending on the device and
kernel version, the kernel may actually issue either UNMAP or WRITE SAME
with the UNMAP bit set when doing a discard.

Good to know.  I've not looked at the SCSI code much, and actually 
didn't know about the UNMAP bit for the WRITE SAME command, so I just 
assumed that the kernel only used the UNMAP command.
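
For anyone following along, a couple of read-only ways to see what a given
device actually advertises for discard (sdX is a placeholder; zero values
generally mean no discard support):

lsblk -D /dev/sdX
cat /sys/block/sdX/queue/discard_granularity /sys/block/sdX/queue/discard_max_bytes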


Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Austin S. Hemmelgarn

On 2018-12-06 01:11, Robert White wrote:
(1) Automatic and selective wiping of unused and previously used disk 
blocks is a good security measure, particularly when there is an 
encryption layer beneath the file system.


(2) USB-attached devices _never_ support TRIM, and they are the most 
likely to fall into strangers' hands.
Not true on the first count.  Some really nice UAS devices do support 
SCSI UNMAP and WRITESAME commands.


(3) I vaguely recall that for some flash chips, bulk writes of full 
sectors of 0x00 or 0xFF (I don't remember which) were second-best 
to TRIM for letting the flash controllers defragment their internals.


So it would be dog-slow, but it would be neat if BTRFS had a mount 
option to convert any TRIM command from above into the write of a zero, 
0xFF, or trash block to the device below if that device doesn't support 
TRIM. Real TRIM support would override the block write.


Obviously doing an fstrim would involve a lot of slow device writes but 
only for people likely to do that sort of thing.


For testing purposes the destruction of unused pages in this manner 
might catch file system failures or coding errors.


(The other layer where this might be most appropriate is in cryptsetup 
et al, where it could lie about TRIM support, but that sort of stealth 
lag might be bad for filesystem-level operations. Doing it there would 
also lose the simpler USB use cases.)


...Just a thought...
First off, TRIM is an ATA command name, not the kernel's term.  `fstrim` 
inherited the ATA name, but in the kernel it's called a discard operation, 
and it's important to understand here that a discard operation 
can result in a number of different behaviors.


In particular, you have at least the following implementations:

* On SCSI devices, a discard operation translates to a SCSI UNMAP 
command.  As pointed out by Ronnie Sahlberg in his reply, this command 
is purely advisory, may not result in any actual state change on the 
target device, and is not guaranteed to wipe the data.  To actually wipe 
things, you have to explicitly write bogus data to the given regions 
(using either regular writes, or a WRITESAME command with the desired 
pattern), and _then_ call UNMAP on them.
* On dm-thinp devices, a discard operation results in simply unmapping 
the blocks in the region it covers.  The underlying blocks themselves 
are not wiped until they get reallocated (which may not happen when you 
write to that region of the dm-thinp device again), and may not even be 
wiped then (depending on how the dm-thinp device is configured).  Thus, 
the same behavior as for SCSI is required here.
* On SD/MMC devices, a discard operation results in an SD ERASE command 
being issued.  This one is non-advisory (that is, it's guaranteed to 
happen), and is supposed to guarantee an overwrite of the region with 
zeroes or ones.
* eMMC devices additionally define a discard operation independent of 
the SD ERASE command which unmaps the region in the translation layer, 
but does not wipe the blocks either on issuing the command or on 
re-allocating the low-level blocks.  Essentially, it's just a hint for 
the wear-leveling algorithm.
* NVMe provides two different discard operations, and I'm not sure which 
the kernel uses for NVMe block emulation.  They correspond almost 
exactly to the SCSI UNMAP and SD ERASE commands in terms of behavior.
* For ATA devices, a discard operation translates to an ATA TRIM 
command.  This command doesn't even require that the data read back from 
a region the command has been issued against be consistent between 
reads, let alone that it actually returns zeroes, and it is completely 
silent on how the device should actually implement the operation.  In 
practice, most drives that implement it actually behave like dm-thinp 
devices, unmapping the low-level blocks in the region and only clearing 
them when they get reallocated, while returning any data they want on 
subsequent reads to that logical region until a write happens.
* The MTD subsystem has support for discard operations in the various 
FTLs, and they appear from a cursory look at the code to behave like a 
non-advisory version of the SCSI UNMAP command (FWIW, MTDs are what the 
concept of a discard operation was originally implemented in Linux for).


Notice that the only implementations that are actually guaranteed to 
clear out the low-level physical blocks are the SD ERASE and one of the 
two NVMe options, and all others require you to manually wipe the data 
before issuing the discard operation to guarantee that no data is retained.


Given this, I don't think this should be done as a mechanism of 
intercepting or translating discard operations, but as something else 
entirely.  Perhaps as a block-layer that wipes the region then issues a 
discard for it to the lower level device if the device supports it?
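
For completeness, the "wipe first, then discard" sequence described above
can already be approximated from userspace with the util-linux tools.  A
minimal sketch (this destroys all data on the target; sdX is a placeholder):

blkdiscard --zeroout /dev/sdX   # explicitly overwrite everything with zeroes
blkdiscard /dev/sdX             # then issue the (possibly advisory) discard
fstrim -v /mnt                  # on a mounted filesystem, fstrim only does the discard half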


Re: btrfs progs always assume devid 1?

2018-12-05 Thread Austin S. Hemmelgarn

On 2018-12-05 14:50, Roman Mamedov wrote:

Hello,

To migrate my FS to a different physical disk, I have added a new empty device
to the FS, then ran the remove operation on the original one.

Now my FS has only devid 2:

Label: 'p1'  uuid: d886c190-b383-45ba-9272-9f00c6a10c50
Total devices 1 FS bytes used 36.63GiB
devid    2 size 50.00GiB used 45.06GiB path /dev/mapper/vg-p1

And all the operations of btrfs-progs now fail to work in their default
invocation, such as:

# btrfs fi resize max .
Resize '.' of 'max'
ERROR: unable to resize '.': No such device

[768813.414821] BTRFS info (device dm-5): resizer unable to find device 1

Of course this works:

# btrfs fi resize 2:max .
Resize '.' of '2:max'

But this is inconvenient and seems to be a rather simple oversight. If what I
got is normal (the device staying as ID 2 after such an operation), then count
that as a suggestion that btrfs-progs should use the first existing devid,
rather than always looking for the hard-coded devid 1.



I've been meaning to try and write up a patch to special-case this for a 
while now, but have not gotten around to it yet.


FWIW, this is one of multiple reasons that it's highly recommended to 
use `btrfs replace` instead of adding a new device and deleting the old 
one when replacing a device.  Other benefits include:


* It doesn't have to run in the foreground (and doesn't by default).
* It usually takes less time.
* Replace operations can be queried while running to get a nice 
indication of the completion percentage.


The main disadvantages are that the new device has to be at least as large 
as the old one (though you can get around this to a limited degree by 
shrinking the old device first), and that the old and new devices need to be 
plugged in at the same time (add/delete doesn't require that, if you flip 
the order of the add and delete commands).
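
For reference, the recommended flow looks roughly like this (the device
names and mount point are examples; the devid for the final resize is
whatever `btrfs filesystem show` reports for the new device):

btrfs replace start /dev/old_disk /dev/new_disk /mnt
btrfs replace status /mnt            # reports a completion percentage while running
btrfs filesystem resize 2:max /mnt   # only needed if the new device is larger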


Re: experiences running btrfs on external USB disks?

2018-12-04 Thread Austin S. Hemmelgarn

On 2018-12-04 08:37, Graham Cobb wrote:

On 04/12/2018 12:38, Austin S. Hemmelgarn wrote:

In short, USB is _crap_ for fixed storage, don't use it like that, even
if you are using filesystems which don't appear to complain.


That's useful advice, thanks.

Do you (or anyone else) have any experience of using btrfs over iSCSI? I
was thinking about this for three different use cases:

1) Giving my workstation a data disk that is actually a partition on a
server -- keeping all the data on the big disks on the server and
reducing power consumption (just a small boot SSD in the workstation).

2) Splitting a btrfs RAID1 between a local disk and a remote iSCSI
mirror to provide  redundancy without putting more disks in the local
system. Of course, this would mean that one of the RAID1 copies would
have higher latency than the other.

3) Like case 1 but actually exposing an LVM logical volume from the
server using iSCSI, rather than a simple disk partition. I would then
put both encryption and RAID running on the server below that logical
volume.

NBD could also be an alternative to iSCSI in these cases as well.

Any thoughts?
I've not run it over iSCSI (I tend to avoid that overly-complicated 
mess), but I have done it over NBD and ATAoE, as well as some more 
exotic arrangements, and it's really not too bad.  The important part is 
making sure your block layer and all the stuff under it are reliable, 
and USB is not.
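
As a rough sketch of the NBD variant (the server name and export name are
examples, and nbd-client's argument syntax varies a bit between versions):

modprobe nbd
nbd-client -N export0 storage.example.com /dev/nbd0
mkfs.btrfs /dev/nbd0
mount /dev/nbd0 /mnt/data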


Re: experiences running btrfs on external USB disks?

2018-12-04 Thread Austin S. Hemmelgarn

On 2018-12-04 00:37, Tomasz Chmielewski wrote:

I'm trying to use btrfs on an external USB drive, without much success.

When the drive is connected for 2-3+ days, the filesystem gets remounted 
readonly, with BTRFS saying "IO failure":


[77760.444607] BTRFS error (device sdb1): bad tree block start, want 
378372096 have 0
[77760.550933] BTRFS error (device sdb1): bad tree block start, want 
378372096 have 0
[77760.550972] BTRFS: error (device sdb1) in __btrfs_free_extent:6804: 
errno=-5 IO failure

[77760.550979] BTRFS info (device sdb1): forced readonly
[77760.551003] BTRFS: error (device sdb1) in 
btrfs_run_delayed_refs:2935: errno=-5 IO failure

[77760.553223] BTRFS error (device sdb1): pending csums is 4096


Note that there are no other kernel messages (i.e. that would indicate a 
problem with disk, cable disconnection etc.).


The load on the drive itself can be quite heavy at times (i.e. 100% IO 
for 1-2 h and more) - can it contribute to the problem (i.e. btrfs 
thinks there is some timeout somewhere)?


Running 4.19.6 right now, but was experiencing the issue also with 4.18 
kernels.




# btrfs device stats /data
[/dev/sda1].write_io_errs    0
[/dev/sda1].read_io_errs 0
[/dev/sda1].flush_io_errs    0
[/dev/sda1].corruption_errs  0
[/dev/sda1].generation_errs  0
It looks to me like the typical USB issues that are present with almost 
all filesystems but only seem to be noticed by BTRFS because it does 
more rigorous checking of data.


In short, USB is _crap_ for fixed storage, don't use it like that, even 
if you are using filesystems which don't appear to complain.


Re: BTRFS on production: NVR 16+ IP Cameras

2018-11-16 Thread Austin S. Hemmelgarn

On 2018-11-15 13:39, Juan Alberto Cirez wrote:
Is BTRFS mature enough to be deployed on a production system to underpin 
the storage layer of a 16+ ipcameras-based NVR (or VMS if you prefer)?
For NVR, I'd say no.  BTRFS does pretty horribly with append-only 
workloads, even if they are WORM style.  It also does a really bad job 
with most relational database systems that you would likely use for 
indexing.


If you can suggest your reasoning for wanting to use BTRFS though, I can 
probably point you at alternatives that would work more reliably for 
your use case.




Re: [PATCH RFC] btrfs: harden against duplicate fsid

2018-11-13 Thread Austin S. Hemmelgarn

On 11/13/2018 10:31 AM, David Sterba wrote:

On Mon, Oct 01, 2018 at 09:31:04PM +0800, Anand Jain wrote:

+        /*
+         * we are going to replace the device path, make sure its the
+         * same device if the device mounted
+         */
+        if (device->bdev) {
+            struct block_device *path_bdev;
+
+            path_bdev = lookup_bdev(path);
+            if (IS_ERR(path_bdev)) {
+                mutex_unlock(&fs_devices->device_list_mutex);
+                return ERR_CAST(path_bdev);
+            }
+
+            if (device->bdev != path_bdev) {
+                bdput(path_bdev);
+                mutex_unlock(&fs_devices->device_list_mutex);
+                return ERR_PTR(-EEXIST);

It would be _really_ nice to have an informative error message printed
here.  Aside from the possibility of an admin accidentally making a
block-level copy of the volume, this code triggering could represent an
attempted attack against the system, so it's arguably something that
should be reported as happening.  Personally, I think a WARN_ON_ONCE for
this would make sense, ideally per-volume if possible.


   Ah. Will add a warning. Thanks, Anand


The requested error message is not in the patch you posted, or I have
missed it (https://patchwork.kernel.org/patch/10641041/).

Austin, is the following ok for you?

   "BTRFS: duplicate device fsid:devid for %pU:%llu old:%s new:%s\n"

   BTRFS: duplicate device fsid:devid 7c667b96-59eb-43ad-9ae9-c878f6ad51d8:2 
old:/dev/sda6 new:/dev/sdb6

As the UUID and paths are long, I tried to squeeze the rest so it's
still comprehensible, but this would be better confirmed. Thanks.


Looks perfectly fine to me.


Re: BTRFS did its job nicely (thanks!)

2018-11-05 Thread Austin S. Hemmelgarn

On 11/4/2018 11:44 AM, waxhead wrote:

Sterling Windmill wrote:

Out of curiosity, what led to you choosing RAID1 for data but RAID10
for metadata?

I've flip-flopped between these two modes myself after finding out
that BTRFS RAID10 doesn't work how I would've expected.

Wondering what made you choose your configuration.

Thanks!
Sure,


The "RAID"1 profile for data was chosen to maximize disk space 
utilization since I got a lot of mixed size devices.


The "RAID"10 profile for metadata was chosen simply because it *feels* a 
bit faster for some of my (previous) workload which was reading a lot of 
small files (which I guess was embedded in the metadata). While I never 
remembered that I got any measurable performance increase the system 
simply felt smoother (which is strange since "RAID"10 should hog more 
disks at once).


I would love to try "RAID"10 for both data and metadata, but I have to 
delete some files first (or add yet another drive).


Would you like to elaborate a bit more yourself about how BTRFS "RAID"10 
does not work as you expected?


As far as I know, BTRFS' version of "RAID"10 means it ensures 2 copies (1 
replica) are striped over as many disks as it can (as long as there is free 
space).


So if I am not terribly mistaken, a "RAID"10 with 20 devices will stripe 
over (20/2) x 2, and if you run out of space on 10 of the devices it will 
continue to stripe over (5/2) x 2. So your stripe width varies with the 
available space, essentially... I may be terribly wrong about this (until 
someone corrects me, that is...)
He's probably referring to the fact that, unlike classical RAID10 (which 
technically has roughly a 50% chance of surviving the failure of at least 
2 devices), BTRFS "RAID"10 is currently functionally certain not to 
survive more than one device failing.




Re: Understanding "btrfs filesystem usage"

2018-10-30 Thread Austin S. Hemmelgarn

On 10/30/2018 12:10 PM, Ulli Horlacher wrote:


On Mon 2018-10-29 (17:57), Remi Gauvin wrote:

On 2018-10-29 02:11 PM, Ulli Horlacher wrote:


I want to know how much free space is left and have problems
interpreting the output of:

btrfs filesystem usage
btrfs filesystem df
btrfs filesystem show


In my not-so-humble opinion, the filesystem usage command has the
easiest-to-understand output.  It lays out all the pertinent information.

You can clearly see 825GiB is allocated, with 494GiB used; therefore,
filesystem show is actually using the "Allocated" value as "Used".
Allocated can be thought of as "Reserved For".


And what is "Device unallocated"? Not reserved?



As the output of the Usage command and df command clearly show, you have
almost 400GiB space available.


This is the good part :-)



The disparity between 498GiB used and 823GiB allocated is pretty high.  This
is probably the result of using an SSD with an older kernel.  If your
kernel is not very recent (sorry, I forget where this was fixed,
somewhere around 4.14 or 4.15), then consider mounting with the nossd
option.


I am running kernel 4.4 (it is an Ubuntu 16.04 system),
but /local is on an SSD. Should I really use the nossd mount option?!

Probably, and you may even want to use it on newer (patched) kernels.

This requires some explanation though.

SSDs are write-limited media (write to them too much, and they stop 
working).  This is a pretty well-known fact, and while it is true, it's 
nowhere near as much of an issue on modern SSDs as people make it out to 
be (pretty much, if you've got an SSD made in the last 5 years, you 
almost certainly don't have to worry about this).  The `ssd` code in 
BTRFS behaves as if this is still an issue (and does so in a way that 
doesn't even solve it well).


Put simply, when BTRFS goes to look for space, it treats requests for 
space that ask for less than a certain size as if they are that minimum 
size, and only tries to look for smaller spots if it can't find one at 
least that minimum size.  This has a couple of advantages in terms of 
write performance, especially in the common case of a mostly empty 
filesystem.


For the default (`nossd`) case, that minimum size is 64kB.  So, in most 
cases, the potentially wasted space actually doesn't matter much (most 
writes are bigger than 64k) unless you're doing certain things.


For the old (`ssd`) case, that minimum size is 2MB.  Even with the 
common cases that would normally not have an issue with the 64k default, 
this ends up wasting a _huge_ amount of space.


For the new `ssd` behavior, the minimum is different for data and 
metadata (IIRC, metadata uses the 64k default, while data still uses the 
2M size).  This solves the biggest issues (which were seen with 
metadata), but doesn't completely remove the problem.


Expanding on this further, some unusual workloads actually benefit from 
the old `ssd` behavior, so on newer kernels `ssd_spread` gives that 
behavior.  However, many workloads actually do better with the `nossd` 
behavior (especially the pathological worst case stuff like databases 
and VM disk images), so if you have a recent SSD, you probably want to 
just use that.
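
Trying the suggestion out is cheap, since nossd can be applied with a
remount (the mount point here is the one from this thread; add the option
to /etc/fstab to make it permanent):

mount -o remount,nossd /local
grep /local /proc/mounts    # confirm the active options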




You can improve this by running a balance.

Something like:
btrfs balance start -dusage=55 <mountpoint>


I run balance via cron weekly (adapted
https://software.opensuse.org/package/btrfsmaintenance)






Re: CRC mismatch

2018-10-18 Thread Austin S. Hemmelgarn

On 18/10/2018 08.02, Anton Shepelev wrote:

I wrote:


What may be the reason for a CRC mismatch on a BTRFS file in
a virtual machine:

csum failed ino 175524 off 1876295680 csum 451760558
expected csum 1446289185

Shall I seek the culprit in the host machine or in the
guest one?  Supposing the host machine healthy, what
operations on the guest might have caused a CRC mismatch?


Thank you, Austin and Chris, for your replies.  While
describing the problem for the client, I tried again to copy
the corrupt file and this time it was copied without error,
which is of course scary because errors that miraculously
disappear may suddenly reappear in the same manner.

If the filesystem was running some profile that supports repairs (pretty 
much, anything except single or raid0 profiles), then BTRFS will have 
fixed that particular block for you automatically.


Of course, the other possibility is that it was a transient error in the 
block layer that caused it to return bogus data when the data that was 
on-disk was in fact correct.
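
If you want to check whether anything else on the filesystem is affected,
and let BTRFS repair whatever it can, something like the following should
do it (the mount point is an example):

btrfs scrub start /mnt
btrfs scrub status /mnt    # re-run until it reports finished
btrfs device stats /mnt    # per-device error counters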


Re: CRC mismatch

2018-10-17 Thread Austin S. Hemmelgarn

On 2018-10-16 16:27, Chris Murphy wrote:

On Tue, Oct 16, 2018 at 9:42 AM, Austin S. Hemmelgarn
 wrote:

On 2018-10-16 11:30, Anton Shepelev wrote:


Hello, all

What may be the reason for a CRC mismatch on a BTRFS file in
a virtual machine:

 csum failed ino 175524 off 1876295680 csum 451760558
 expected csum 1446289185

Shall I seek the culprit in the host machine or in the guest
one?  Supposing the host machine healthy, what operations on
the guest might have caused a CRC mismatch?


Possible causes include:

* On the guest side:
   - Unclean shutdown of the guest system (not likely even if this did
happen).
   - A kernel bug in the guest.
   - Something directly modifying the block device (also not very likely).

* On the host side:
   - Unclean shutdown of the host system without properly flushing data from
the guest.  Not likely unless you're using an actively unsafe caching mode
for the guest's storage back-end.
   - At-rest data corruption in the storage back-end.
   - A bug in the host-side storage stack.
   - A transient error in the host-side storage stack.
   - A bug in the hypervisor.
   - Something directly modifying the back-end storage.

Of these, the statistically most likely location for the issue is probably
the storage stack on the host.


Is there still that O_DIRECT related "bug" (or more of a limitation)
if the guest is using cache=none on the block device?
I had actually forgotten about this, and I'm not quite sure if it's 
fixed or not.


Anton what virtual machine tech are you using? qemu/kvm managed with
virt-manager? The configuration affects host behavior; but the
negative effect manifests inside the guest as corruption. If I
remember correctly.





Re: CRC mismatch

2018-10-16 Thread Austin S. Hemmelgarn

On 2018-10-16 11:30, Anton Shepelev wrote:

Hello, all

What may be the reason for a CRC mismatch on a BTRFS file in
a virtual machine:

csum failed ino 175524 off 1876295680 csum 451760558
expected csum 1446289185

Shall I seek the culprit in the host machine or in the guest
one?  Supposing the host machine healthy, what operations on
the guest might have caused a CRC mismatch?


Possible causes include:

* On the guest side:
  - Unclean shutdown of the guest system (not likely even if this did 
happen).

  - A kernel bug in the guest.
  - Something directly modifying the block device (also not very likely).

* On the host side:
  - Unclean shutdown of the host system without properly flushing data 
from the guest.  Not likely unless you're using an actively unsafe 
caching mode for the guest's storage back-end.

  - At-rest data corruption in the storage back-end.
  - A bug in the host-side storage stack.
  - A transient error in the host-side storage stack.
  - A bug in the hypervisor.
  - Something directly modifying the back-end storage.

Of these, the statistically most likely location for the issue is 
probably the storage stack on the host.


Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Austin S. Hemmelgarn

On 2018-10-15 10:42, Anton Shepelev wrote:

Hugo Mills to Anton Shepelev:


While trying to resolve free space problems, I found
that I cannot interpret the output of:


btrfs filesystem show


Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
Total devices 1 FS bytes used 34.06GiB
devid    1 size 40.00GiB used 37.82GiB path /dev/sda2

How come the total used value is less than the value
listed for the only device?


   "Used" on the device is the mount of space allocated.
"Used" on the FS is the total amount of actual data and
metadata in that allocation.

   You will also need to look at the output of "btrfs fi
df" to see the breakdown of the 37.82 GiB into data,
metadata and currently unused.

See
https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
 for the details


Thank you, Hugo, understood.  mount/amount is a very fitting
typo :-)

Does the standard `du' tool work correctly for btrfs?

For the default 'physical usage' mode, it functionally does not work 
correctly, because it does not know about reflinks.  The easiest way to 
see this is to create a couple of snapshots of a subvolume alongside the 
subvolume, and then run `du -s --totals` on those snapshots and the 
subvolume.  It will report the total space usage to be equal to the sum 
of the values reported for each snapshot and the subvolume, when it 
should instead only count the space usage for shared data once.


For the 'apparent usage' mode provided by the GNU implementation, it 
does work correctly.
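
A quick way to see the reflink effect described above (paths are
examples): plain du counts the shared data once per snapshot, while
`btrfs filesystem du` knows about shared extents and reports them
separately.

btrfs subvolume snapshot /data/subvol /data/snap1
btrfs subvolume snapshot /data/subvol /data/snap2
du -s --total /data/subvol /data/snap1 /data/snap2
btrfs filesystem du -s /data/subvol /data/snap1 /data/snap2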


Re: reproducible builds with btrfs seed feature

2018-10-15 Thread Austin S. Hemmelgarn

On 2018-10-13 18:28, Chris Murphy wrote:

Is it practical and desirable to make Btrfs based OS installation
images reproducible? Or is Btrfs simply too complex and
non-deterministic? [1]

The main three problems with Btrfs right now for reproducibility are:
a. many objects have uuids other than the volume uuid; and mkfs only
lets us set the volume uuid
b. atime, ctime, mtime, otime; and no way to make them all the same
c. non-deterministic allocation of file extents, compression, inode
assignment, logical and physical address allocation

I'm imagining reproducible image creation would be a mkfs feature that
builds on Btrfs seed and --rootdir concepts to constrain Btrfs
features to maybe make reproducible Btrfs volumes possible:

- No raid
- Either all objects needing uuids can have those uuids specified by
switch, or possibly a defined set of uuids expressly for this use
case, or possibly all of them can just be zeros (eek? not sure)
- A flag to set all times the same
- Possibly require that target block device is zero filled before
creation of the Btrfs
- Possibly disallow subvolumes and snapshots
- Require the resulting image is seed/ro and maybe also a new
compat_ro flag to enforce that such Btrfs file systems cannot be
modified after the fact.
- Enforce a consistent means of allocation and compression

The end result is creating two Btrfs volumes would yield image files
with matching hashes.
So in other words, you care about matching the block layout _exactly_. 
This is a great idea for paranoid people, but it's usually overkill. 
Realistically, almost nothing in userspace cares about the block layout; 
worrying about it just makes verifying the reproduced image a bit easier 
(there's no reason you can't verify all the relevant data without doing 
a checksum or HMAC of the image as a whole).


If I had to guess, the biggest challenge would be allocation. But it's
also possible that such an image may have problems with "sprouts". A
non-removable sprout seems fairly straightforward and safe; but if a
"reproducible build" type of seed is removed, it seems like removal
needs to be smart enough to refresh *all* uuids found in the sprout: a
hard break from the seed.

Competing file systems, ext4 with make_ext4 fork, and squashfs. At the
moment I'm thinking it might be easier to teach squashfs integrity
checking than to make Btrfs reproducible.  But then I also think
restricting Btrfs features, and applying some requirements to
constrain Btrfs to make it reproducible, really enhances the Btrfs
seed-sprout feature.

Any thoughts? Useful? Difficult to implement?

Squashfs might be a better fit for this use case *if* it can be taught
about integrity checking. It does per file checksums for the purpose
of deduplication but those checksums aren't retained for later
integrity checking.
I've seen projects with SquashFS that store integrity data separately 
but leverage other infrastructure.  Methods I've seen so far include:


* GPG-signed SquashFS images, usually with detached signatures
* SquashFS with PAR2 integrity checking data
* SquashFS on top of dm-verity
* SquashFS on top of dm-integrity

The first two need to be externally checked prior to mount, but doing so 
is not hard.  The fourth is tricky to set up right, but provides better 
integration with encrypted images.  The third does exactly what's needed 
though.  You just use the embedded data variant of dm-verity, bind the 
resultant image to a loop device, activate dm-verity on the loop device, 
and mount the resultant mapped device like any other SquashFS image.
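
As a minimal sketch of that third approach (file and device-mapper names
are examples, and this keeps the hash tree in a separate file rather than
the appended/embedded layout, which avoids having to compute a hash
offset; recent veritysetup versions handle regular files directly):

mksquashfs rootfs/ image.squashfs -comp xz
veritysetup format image.squashfs image.verity | tee verity.info
ROOTHASH=$(awk '/^Root hash:/ {print $3}' verity.info)
LOOPDEV=$(losetup -f --show image.squashfs)
veritysetup open "$LOOPDEV" verified-image image.verity "$ROOTHASH"
mount -o ro /dev/mapper/verified-image /mnt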


I've also seen some talk of using SquashFS with IMA and IMA appraisal, 
but I've not seen anybody actually _do_ that, and it wouldn't be on 
quite the level you seem to want (it verifies the files in the image, 
but not the image as a whole).


Re: BTRFS bad block management. Does it exist?

2018-10-15 Thread Austin S. Hemmelgarn

On 2018-10-14 07:08, waxhead wrote:

In case BTRFS fails to WRITE to a disk. What happens?
Does the bad area get mapped out somehow? Does it try again until it 
succeeds, or until it "times out" or reaches a threshold counter?
Does it eventually try to write to a different disk (in case of using 
the raid1/10 profiles)?


Building on Qu's answer (which is absolutely correct), BTRFS makes the 
perfectly reasonable assumption that you're not trying to use known bad 
hardware.  It's not alone in this respect either; pretty much every 
Linux filesystem makes the same assumption (and almost all 
non-Linux ones too).  The only exception is ext[234], but they only 
support bad-block lists statically (you can set the bad block list at 
mkfs time, but not afterwards, and they don't update it at runtime), and 
it's a holdover from earlier filesystems which originated at a time when 
storage was sufficiently expensive _and_ unreliable that you kept using 
disks until they were essentially completely dead.


The reality is that with modern storage hardware, if you have 
persistently bad sectors the device is either defective (and should be 
returned under warranty), or it's beyond its expected EOL (and should just 
be replaced).  Most people know about SSDs doing block remapping to 
avoid bad blocks, but hard drives do it too, and they're actually rather 
good at it.  In both cases, enough spare blocks are provided that the 
device can handle average rates of media errors through the entirety of 
its average life expectancy without running out of spare blocks.


On top of all of that though, it's fully possible to work around bad 
blocks in the block layer if you take the time to actually do it.  With 
a bit of reasonably simple math, you can easily set up an LVM volume 
that actively avoids all the bad blocks on a disk while still fully 
utilizing the rest of the volume.  Similarly, with a bit of work (and a 
partition table that supports _lots_ of partitions) you can work around 
bad blocks with an MD concatenated device.
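
As a rough illustration with made-up numbers: suppose badblocks(8)
reported problems in a region that falls entirely inside physical extent
50000 of a PV using the default 4MiB extent size.  You can simply allocate
the logical volume from the extents on either side of it:

pvcreate /dev/sdb1
vgcreate vg_spinner /dev/sdb1
lvcreate -n data -l 100000 vg_spinner /dev/sdb1:0-49999 /dev/sdb1:50001-100000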


Re: Monitoring btrfs with Prometheus (and soon OpenMonitoring)

2018-10-08 Thread Austin S. Hemmelgarn

On 2018-10-07 09:37, Holger Hoffstätte wrote:


The Prometheus statistics collection/aggregation/monitoring/alerting system
[1] is quite popular, easy to use and will probably be the basis for the
upcoming OpenMetrics "standard" [2].

Prometheus collects metrics by polling host-local "exporters" that respond
to http requests; many such exporters exist, from the generic node_exporter
for OS metrics to all sorts of application-/service-specific varieties.

Since btrfs already exposes quite a lot of monitorable and - more
importantly - actionable runtime information in sysfs it only makes sense
to expose these metrics for visualization & alerting. I noodled over the
idea some time ago but got sidetracked, besides not being thrilled at all
by the idea of doing this in golang (which I *really* dislike).

However, exporters can be written in any language as long as they speak
the standard response protocol, so an alternative would be to use one
of the other official exporter clients. These provide language-native
"mini-frameworks" where one only has to fill in the blanks (see [3]
for examples).

Since the issue just came up in the node_exporter bugtracker [3] I
figured I ask if anyone here is interested in helping build a proper
standalone btrfs_exporter in C++? :D

..just kidding, I'd probably use python (which I kind of don't really
know either :) and build on Hans' python-btrfs library for anything
not covered by sysfs.

Anybody interested in helping? Apparently there are also golang libs
for btrfs [5] but I don't know anything about them (if you do, please
comment on the bug), and the idea of adding even more stuff into the
monolithic, already creaky and somewhat bloated node_exporter is not
appealing to me.

Potential problems wrt. btrfs are access to root-only information,
like e.g. the btrfs device stats/errors in the aforementioned bug,
since exporters are really supposed to run unprivileged due to network
exposure. The S.M.A.R.T. exporter [6] solves this with dual-process
contortions; obviously it would be better if all relevant metrics were
accessible directly in sysfs and not require privileged access, but
forking a tiny privileged process every polling interval is probably
not that bad.

All ideas welcome!
You might be interested in what Netdata [1] is doing.  We've already got 
tracking of space allocations via the sysfs interface (fun fact: you 
actually don't have to be root on most systems to read that data), and we 
also ship some pre-defined alarms that will trigger when the device gets 
close to full at a low level (more specifically, if total chunk 
allocations exceed 90% of the total space of all the devices in the volume).
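
For anyone curious, the sysfs data in question is easy to poke at by hand
(the UUID here is just the example filesystem from earlier in this digest;
substitute your own):

FSID=d886c190-b383-45ba-9272-9f00c6a10c50
for kind in data metadata system; do
    used=$(cat /sys/fs/btrfs/$FSID/allocation/$kind/bytes_used)
    total=$(cat /sys/fs/btrfs/$FSID/allocation/$kind/total_bytes)
    echo "$kind: $used used out of $total allocated"
done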


Actual data collection is being done in C (Netdata already has a lot of 
infrastructure for parsing things out of /proc or /sys), and there has 
been some discussion in the past of adding collection of device error 
counters (I've been working on and off on it myself, but I still don't 
have a good enough understanding of the C code to get anything actually 
working yet).


[1] https://my-netdata.io/


Re: Understanding BTRFS RAID0 Performance

2018-10-08 Thread Austin S. Hemmelgarn

On 2018-10-05 20:34, Duncan wrote:

Wilson, Ellis posted on Fri, 05 Oct 2018 15:29:52 + as excerpted:


Is there any tuning in BTRFS that limits the number of outstanding reads
at a time to a small single-digit number, or something else that could
be behind small queue depths?  I can't otherwise imagine what the
difference would be on the read path between ext4 vs btrfs when both are
on mdraid.


It seems I forgot to directly answer that question in my first reply.
Thanks for restating it.

Btrfs doesn't really expose much performance tuning (yet?), at least
outside the code itself.  There are a few very limited knobs, but they're
just that, few and limited or broad-stroke.

There are mount options like ssd/nossd, ssd_spread/nossd_spread, the
space_cache set of options (see below), flushoncommit/noflushoncommit,
commit=, etc (see the btrfs (5) manpage), but nothing really to
influence stride length, etc, or to optimize chunk placement between ssd
and non-ssd devices, for instance.

And there's a few filesystem features, normally set at mkfs.btrfs time
(and thus covered in the mkfs.btrfs manpage) but some of which can be
tuned later, but generally, the defaults have changed over time to
reflect the best case, and the older variants are there primarily to
retain backward compatibility with old kernels and tools that didn't
handle the newer variants.

That said, as I think about it there are some tunables that may be worth
experimenting with.  Most or all of these are covered in the btrfs (5)
manpage.

* Given the large device numbers you mention and raid0, you're likely
dealing with multi-TB-scale filesystems.  At this level, the
space_cache=v2 mount option may be useful.  It's not the default yet as
btrfs check, etc, don't yet handle it, but given your raid0 choice you
may not be concerned about that.  Need only be given once after which v2
is "on" for the filesystem until turned off.

* Consider experimenting with the thread_pool=n mount option.  I've seen
very little discussion of this one, but given your interest in
parallelization, it could make a difference.
Probably not as much as you might think.  I'll explain a bit more 
further down where this is being mentioned again.


* Possibly the commit= (default 30) mount option.  In theory,
upping this may allow better write merging, tho your interest seems to be
more on the read side, and the commit time has consequences at crash time.
Based on my own experience, having a higher commit time doesn't impact 
read or write performance much or really help all that much with write 
merging.  All it really helps with is minimizing overhead, but it's not 
even all that great at doing that.


* The autodefrag mount option may be considered if you do a lot of
existing file updates, as is common with database or VM image files.  Due
to COW this triggers high fragmentation on btrfs, and autodefrag should
help control that.  Note that autodefrag effectively increases the
minimum extent size from 4 KiB to, IIRC, 16 MB, tho it may be less, and
doesn't operate at whole-file size, so larger repeatedly-modified files
will still have some fragmentation, just not as much.  Obviously, you
wouldn't see the read-time effects of this until the filesystem has aged
somewhat, so it may not show up on your benchmarks.

(Another option for such files is setting them nocow or using the
nodatacow mount option, but this turns off checksumming and if it's on,
compression for those files, and has a few other non-obvious caveats as
well, so isn't something I recommend.  Instead of using nocow, I'd
suggest putting such files on a dedicated traditional non-cow filesystem
such as ext4, and I consider nocow at best a workaround option for those
who prefer to use btrfs as a single big storage pool and thus don't want
to do the dedicated non-cow filesystem for some subset of their files.)

* Not really for reads but for btrfs and any cow-based filesystem, you
almost certainly want the (not btrfs specific) noatime mount option.
Actually...  This can help a bit for some workloads.  Just like the 
commit time, it comes down to a matter of overhead.  Essentially, if you 
read a file regularly, then with the default of relatime, you've got a 
guaranteed write requiring a commit of the metadata tree once every 24 
hours.  It's not much to worry about for just one file, but if you're 
reading a very large number of files all the time, it can really add up.
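
To make that concrete, the options discussed so far combine into a mount
invocation like the following (the device and mount point are
placeholders, and the values are illustrative rather than a universal
recommendation; the same option string goes in /etc/fstab for a permanent
setup):

mount -o noatime,space_cache=v2,commit=120 /dev/sdX /data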


* While it has serious filesystem integrity implications and thus can't
be responsibly recommended, there is the nobarrier mount option.  But if
you're already running raid0 on a large number of devices you're already
gambling with device stability, and this /might/ be an additional risk
you're willing to take, as it should increase performance.  But for
normal users it's simply not worth the risk, and if you do choose to use
it, it's at your own risk.
Agreed, if you're running RAID0 with this many drives, nobarrier may be 
worth it for a 

Re: [PATCH RFC] btrfs: harden against duplicate fsid

2018-10-01 Thread Austin S. Hemmelgarn

On 2018-10-01 04:56, Anand Jain wrote:

It's not impossible to imagine that a device or a btrfs image has
been copied just by using the dd or the cp command, in which case both
copies of the btrfs will have the same fsid. If the system has
automount enabled, the copied FS gets scanned.

We have a known bug in btrfs in that we let the device path be changed
after the device has been mounted. Using this loophole, the newly
copied device appears as if it were mounted immediately after it was
copied.

For example:

Initially.. /dev/mmcblk0p4 is mounted as /

lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0     179:0    0 29.2G  0 disk
|-mmcblk0p4 179:4    0    4G  0 part /
|-mmcblk0p2 179:2    0  500M  0 part /boot
|-mmcblk0p3 179:3    0  256M  0 part [SWAP]
`-mmcblk0p1 179:1    0  256M  0 part /boot/efi

btrfs fi show
Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
Total devices 1 FS bytes used 1.40GiB
devid    1 size 4.00GiB used 3.00GiB path /dev/mmcblk0p4

Copy mmcblk0 to sda
dd if=/dev/mmcblk0 of=/dev/sda

And immediately after the copy completes the change in the device
superblock is notified which the automount scans using
btrfs device scan and the new device sda becomes the mounted root
device.

lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda           8:0    1 14.9G  0 disk
|-sda4        8:4    1    4G  0 part /
|-sda2        8:2    1  500M  0 part
|-sda3        8:3    1  256M  0 part
`-sda1        8:1    1  256M  0 part
mmcblk0     179:0    0 29.2G  0 disk
|-mmcblk0p4 179:4    0    4G  0 part
|-mmcblk0p2 179:2    0  500M  0 part /boot
|-mmcblk0p3 179:3    0  256M  0 part [SWAP]
`-mmcblk0p1 179:1    0  256M  0 part /boot/efi

btrfs fi show /
  Label: none  uuid: 07892354-ddaa-4443-90ea-f76a06accaba
  Total devices 1 FS bytes used 1.40GiB
  devid    1 size 4.00GiB used 3.00GiB path /dev/sda4

The bug is quite nasty in that you can't unmount either /dev/sda4 or
/dev/mmcblk0p4. And the problem does not get solved until you take
sda out of the system to another system and change its fsid
using the 'btrfstune -u' command.

Signed-off-by: Anand Jain 
---

Hi,

There was a previous attempt to fix this bug, ref:
www.spinics.net/lists/linux-btrfs/msg37466.html

which broke the Ubuntu subvol mount at boot. The reason
for that is that Ubuntu changes the device path in the boot
process, and the earlier fix checked the device path
instead of the block_device as done here, so we failed the
subvol mount request and thus the bootup process.

I have tested this with Oracle Linux with btrfs as the boot device
with a subvol to be mounted at boot, and have also verified it
with the new test case btrfs/173.

It would be good if someone ran this through the Ubuntu boot test case.

  fs/btrfs/volumes.c | 23 +++
  1 file changed, 23 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index f4405e430da6..62173a3abcc4 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -850,6 +850,29 @@ static noinline struct btrfs_device *device_list_add(const char *path,
             return ERR_PTR(-EEXIST);
         }
  
+        /*
+         * we are going to replace the device path, make sure its the
+         * same device if the device mounted
+         */
+        if (device->bdev) {
+            struct block_device *path_bdev;
+
+            path_bdev = lookup_bdev(path);
+            if (IS_ERR(path_bdev)) {
+                mutex_unlock(&fs_devices->device_list_mutex);
+                return ERR_CAST(path_bdev);
+            }
+
+            if (device->bdev != path_bdev) {
+                bdput(path_bdev);
+                mutex_unlock(&fs_devices->device_list_mutex);
+                return ERR_PTR(-EEXIST);
It would be _really_ nice to have an informative error message printed 
here.  Aside from the possibility of an admin accidentally making a 
block-level copy of the volume, this code triggering could represent an 
attempted attack against the system, so it's arguably something that 
should be reported as happening.  Personally, I think a WARN_ON_ONCE for 
this would make sense, ideally per-volume if possible.

+            }
+            bdput(path_bdev);
+            pr_info("BTRFS: device fsid:devid %pU:%llu old path:%s new path:%s\n",
+                disk_super->fsid, devid,
+                rcu_str_deref(device->name), path);
+        }
+
         name = rcu_string_strdup(path, GFP_NOFS);
         if (!name) {
             mutex_unlock(&fs_devices->device_list_mutex);





Re: GRUB writing to grubenv outside of kernel fs code

2018-09-19 Thread Austin S. Hemmelgarn

On 2018-09-19 15:08, Goffredo Baroncelli wrote:

On 18/09/2018 19.15, Goffredo Baroncelli wrote:

b. The bootloader code, would have to have sophisticated enough Btrfs
knowledge to know if the grubenv has been reflinked or snapshot,
because even if +C, it may not be valid to overwrite, and COW must
still happen, and there's no way the code in GRUB can do full blow COW
and update a bunch of metadata.



And what if GRUB ignores the possibility of COWing and just overwrites the data?
Is it such a big problem that the data is changed in all the snapshots?
It would be interesting to know if the same problem happens for a swap file.


I had a look at Sandoval's patches implementing swap on BTRFS. This patch set
prevents the subvolume containing the swapfile from being snapshotted (and the file
from being balanced and so on...); what if we added the same constraint to the
grubenv file?
We would need to have a generalized mechanism of doing this then, 
because there's no way in hell a patch special-casing a single filename 
is going to make it into mainline.


Whatever mechanism is used, it should also:

* Force the file to not be inlined in metadata.
* Enforce the file having the NOCOW attribute being set.
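
For comparison, this is roughly what an admin can do by hand today (paths
are the usual defaults; note that NOCOW only sticks if it's set while the
file is empty, and a 1 KiB grubenv may still end up inline in metadata
depending on max_inline, which is exactly why the first point above
matters for a real in-kernel mechanism):

rm -f /boot/grub/grubenv
touch /boot/grub/grubenv
chattr +C /boot/grub/grubenv
grub-editenv /boot/grub/grubenv create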


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 15:00, Chris Murphy wrote:

On Tue, Sep 18, 2018 at 12:25 PM, Austin S. Hemmelgarn
 wrote:


It actually is independent of /boot already.  I've got it running just fine
on my laptop off of the EFI system partition (which is independent of my
/boot partition), and thus have no issues with handling of the grubenv file.
The problem is that all the big distros assume you want it in /boot, so they
have no option for putting it anywhere else.

Actually installing it elsewhere is not hard though, you just pass
`--boot-directory=/wherever` to the `grub-install` script and turn off your
distributions automatic reinstall mechanism so it doesn't get screwed up by
the package manager when the GRUB package gets updated. You can also make
`/boot/grub` a symbolic link pointing to the real GRUB directory, so that
you don't have to pass any extra options to tools like grub-reboot or
grub-set-default.


This is how Fedora builds their signed grubx64.efi to behave. But you
cannot ever run grub-install on a Secure Boot enabled computer, or you
now have to learn all about signing your own binaries. I don't even
like doing that, let alone saner users.

So for those distros that support Secure Boot, in practice you're
stuck with the behavior of their prebuilt GRUB binary that goes on the
ESP.
Agreed, but that avoids the issues we're talking about here completely 
because the grubenv file ends up on the ESP too.




Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 14:57, Chris Murphy wrote:

On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  wrote:

18.09.2018 08:37, Chris Murphy wrote:



The patches aren't upstream yet? Will they be?



I do not know. Personally, I think it is much easier to make the grub location
independent of /boot, allowing grub to be installed in a separate partition.
This automatically covers all other cases (like MD, LVM etc).


The only case where I'm aware of this happens is Fedora on UEFI where
they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
seen it elsewhere (except Fedora on UEFI).

I'm not sure this is much easier. Yet another volume that would be
persistently mounted? Where? A nested mount at /boot/grub? I'm not
liking that at all. Even Windows and macOS have saner and simpler to
understand booting methods than this.
On this front, maybe, but Windows' boot sequence is insane in its own 
way (fun fact: if you have the Windows 8/8.1/10 boot-loader set up to 
multi-boot and want it to boot to something other than the default, it 
has to essentially _reboot the machine_ to actually boot that 
alternative entry).


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 14:38, Andrei Borzenkov wrote:

18.09.2018 21:25, Austin S. Hemmelgarn wrote:

On 2018-09-18 14:16, Andrei Borzenkov wrote:

18.09.2018 08:37, Chris Murphy wrote:

On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov
 wrote:

18.09.2018 07:21, Chris Murphy wrote:

On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy
 wrote:

...


There are a couple of reserve locations in Btrfs at the start and I
think after the first superblock, for bootloader embedding. Possibly
one or both of those areas could be used for this so it's outside the
file system. But other implementations are going to run into this
problem too.



That's what SUSE grub2 version does - it includes patches to redirect
writes on btrfs to reserved area. I am not sure how it behaves in case
of multi-device btrfs though.


The patches aren't upstream yet? Will they be?



I do not know. Personally, I think it is much easier to make the grub location
independent of /boot, allowing grub to be installed in a separate partition.
This automatically covers all other cases (like MD, LVM etc).

It actually is independent of /boot already.  I've got it running just
fine on my laptop off of the EFI system partition (which is independent
of my /boot partition), and thus have no issues with handling of the
grubenv file.  The problem is that all the big distros assume you want
it in /boot, so they have no option for putting it anywhere else.



This requires more than just an explicit --boot-directory. With the current
monolithic configuration file listing all available kernels, this file
cannot be in the same location; it must be together with the kernels (think
about rollback to a snapshot with completely different content). Or some
different, more flexible configuration is needed.
Uh, no, it doesn't need to be with the kernels.  Fedora stores it on the 
ESP separate from the kernels (which are still on the boot partition) if 
you use Secure Boot, and I'm doing the same (without Secure Boot) 
without issue.  You do have to explicitly set the `root` variable 
correctly in the config to get it to work, though, and the default 
upstream 'easy configuration' arrangement does not do this consistently. 
 It's not too hard to hack in, and it's positively trivial if 
you just write your own configuration files by hand like I do (no, I'm 
not crazy; the default configuration generator just produces a 
brobdingnagian monstrosity of a config that has tons of stuff I don't 
need and makes invalid assumptions about how I want things invoked, and 
the config syntax is actually not that hard).


As it is now, grub silently assumes everything is under /boot. This turned
out to be oversimplified.
No, it assumes everything is under whatever you told GRUB to set the 
default value of the `prefix` variable to when you built the GRUB image, 
which is automatically set to the path you pass to `--boot-directory` 
when you use grub-install.  This persists until you explicitly set that 
variable to a different location, or change the `root` variable (but 
GRUB still uses `prefix` for module look-ups if you just change the 
`root` variable).


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 14:16, Andrei Borzenkov wrote:

18.09.2018 08:37, Chris Murphy wrote:

On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov  wrote:

18.09.2018 07:21, Chris Murphy wrote:

On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  wrote:

https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F

Does anyone know if this is still a problem on Btrfs if grubenv has
xattr +C set? In which case it should be possible to overwrite and
there's no csums that are invalidated.

I kinda wonder if in 2018 it's specious for, effectively out of tree
code, to be making modifications to the file system, outside of the
file system.


a. The bootloader code (pre-boot, not user space setup stuff) would
have to know how to read xattr and refuse to overwrite a grubenv
lacking xattr +C.
b. The bootloader code, would have to have sophisticated enough Btrfs
knowledge to know if the grubenv has been reflinked or snapshot,
because even if +C, it may not be valid to overwrite, and COW must
still happen, and there's no way the code in GRUB can do full blow COW
and update a bunch of metadata.

So answering my own question, this isn't workable. And it seems the
same problem for dm-thin.

There are a couple of reserve locations in Btrfs at the start and I
think after the first superblock, for bootloader embedding. Possibly
one or both of those areas could be used for this so it's outside the
file system. But other implementations are going to run into this
problem too.



That's what SUSE grub2 version does - it includes patches to redirect
writes on btrfs to reserved area. I am not sure how it behaves in case
of multi-device btrfs though.


The patches aren't upstream yet? Will they be?



I do not know. Personally, I think it is much easier to make the grub location
independent of /boot, allowing grub to be installed in a separate partition.
This automatically covers all other cases (like MD, LVM etc).
It actually is independent of /boot already.  I've got it running just 
fine on my laptop off of the EFI system partition (which is independent 
of my /boot partition), and thus have no issues with handling of the 
grubenv file.  The problem is that all the big distros assume you want 
it in /boot, so they have no option for putting it anywhere else.


Actually installing it elsewhere is not hard though, you just pass 
`--boot-directory=/wherever` to the `grub-install` script and turn off 
your distributions automatic reinstall mechanism so it doesn't get 
screwed up by the package manager when the GRUB package gets updated. 
You can also make `/boot/grub` a symbolic link pointing to the real GRUB 
directory, so that you don't have to pass any extra options to tools 
like grub-reboot or grub-set-default.
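
Concretely, the manual setup described above looks something like this
(the ESP mount point and target are examples; adjust them to your system,
and remember to disable the distribution's automatic grub-install hook):

grub-install --target=x86_64-efi --efi-directory=/boot/efi --boot-directory=/boot/efi/EFI/grub
ln -sfn /boot/efi/EFI/grub/grub /boot/grub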


Re: Transactional btrfs

2018-09-06 Thread Austin S. Hemmelgarn

On 2018-09-06 03:23, Nathan Dehnel wrote:

https://lwn.net/Articles/287289/

In 2008, HP released the source code for a filesystem called advfs so
that its features could be incorporated into linux filesystems. Advfs
had a feature where a group of file writes were an atomic transaction.

https://www.usenix.org/system/files/conference/fast15/fast15-paper-verma.pdf

These guys used advfs to add a "syncv" system call that makes writes
across multiple files atomic.

https://lwn.net/Articles/715918/

A patch was later submitted based on the previous paper in some way.

So I guess my question is, does btrfs support atomic writes across
multiple files? Or is anyone interested in such a feature?

I'm fairly certain that it does not currently, but in theory it would 
not be hard to add.


Realistically, the only cases I can think of where cross-file atomic 
_writes_ would be of any benefit are database systems.


However, if this were extended to include rename, unlink, touch, and a 
handful of other VFS operations, then I can easily think of a few dozen 
use cases.  Package managers in particular would likely be very 
interested in being able to atomically rename a group of files as a 
single transaction, as it would make their job _much_ easier.


Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-08-30 Thread Austin S. Hemmelgarn

On 2018-08-30 13:13, Axel Burri wrote:

On 29/08/2018 21.02, Austin S. Hemmelgarn wrote:

On 2018-08-29 13:24, Axel Burri wrote:

This patch allows to build distinct binaries for specific btrfs
subcommands, e.g. "btrfs-subvolume-show" which would be identical to
"btrfs subvolume show".


Motivation:

While btrfs-progs offers the all-inclusive "btrfs" command, it gets
pretty cumbersome to restrict privileges to the subcommands [1].
Common approaches are to either setuid root for "/sbin/btrfs" (which
is not recommended at all), or to write sudo rules for each
subcommand.

Separating the subcommands into distinct binaries makes it easy to set
elevated privileges using capabilities(7) or setuid. A typical use
case where this is needed is when it comes to automated scripts,
e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh.
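
To make the motivation concrete, this is roughly the kind of
per-subcommand privilege setup the split binaries allow (the binary name,
group, and capability set here are purely illustrative; the real
capabilities each subcommand needs are what the patch series encodes):

install -m 0710 -g btrfs btrfs-subvolume-show /usr/local/bin/
setcap cap_dac_read_search,cap_fowner,cap_sys_admin=ep /usr/local/bin/btrfs-subvolume-show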

Let me start by saying I think this is a great idea to have as an
option, and that the motivation is a particularly good one.

I've posted my opinions on your two open questions below, but there's
two other comments I'd like to make:

* Is there some particular reason that this only includes the commands
it does, and _hard codes_ which ones it works with?  if we just do
everything instead of only the stuff we think needs certain
capabilities, then we can auto-generate the list of commands to be
processed based on function names in the C files, and it will
automatically pick up any newly added commands.  At the very least, it
could still parse through the C files and look for tags in the comments
for the functions to indicate which ones need to be processed this way.
Either case will make it significantly easier to add new commands, and
would also better justify the overhead of shipping all the files
pre-generated (because there would be much more involved in
pre-generating them).


It includes the commands that are required by btrbk. It was quite
painful to figure out the required capabilities (reading kernel code and
some trial and error involved), and I did not get around to include
other commands yet.
Yeah, I can imagine that it was not an easy task.  I've actually been 
thinking of writing a script to scan the kernel sources and assemble a 
summary of the permissions checks performed by each system call and 
ioctl so that stuff like this is a bit easier, but that's unfortunately 
way beyond my abilities right now (parsing C and building call graphs is 
not easy no matter what language you're doing it with).


I like your idea of adding some tags in the C files, I'll try to
implement this, and we'll see what it gets to.
Something embedded in the comments is likely to be the easiest option in 
terms of making sure it doesn't break the regular build.  Just the 
tagging in general would be useful as documentation though.


It would be kind of neat to have the list of capabilities needed for 
each one auto-generated from what it calls, but that's getting into some 
particularly complex territory that would likely require call graphs to 
properly implement.



* While not essential, it would be really neat to have the `btrfs`
command detect if an associated binary exists for whatever command was
just invoked, and automatically exec that (possibly with some
verification) instead of calling the command directly so that desired
permissions are enforced.  This would mitigate the need for users to
remember different command names depending on execution context.


Hmm this sounds a bit too magic for me, and would probably be more
confusing than useful. It would mean that running "btrfs" as user would
work when splitted commands are available, and would not work if not.
It would also mean scripts would not have to add special handling for 
the case of running as a non-root user and seeing if the split commands 
actually exist or not (and, for that matter, would not have to directly 
depend on having the split commands at all), and that users would not 
need to worry about how to call BTRFS based on who they were running as. 
 Realistically, I'd expect the same error to show if the binary isn't 
available as if it's not executable, so that it just becomes a case of 
'if you see this error, re-run the same thing as root and it should work'.
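Purely as a sketch of the idea (in the real tool this would live inside the 
`btrfs` binary itself; the install paths, the `btrfs.real` name, and the 
two-word subcommand assumption are all made up for illustration):

  #!/bin/sh
  # If a split binary exists for the requested subcommand, exec it so that
  # its file capabilities or setuid bits take effect.
  split="/usr/bin/btrfs-$1-$2"   # e.g. "btrfs subvolume show" -> btrfs-subvolume-show
  if [ -x "$split" ]; then
      shift 2
      exec "$split" "$@"
  fi
  exec /usr/bin/btrfs.real "$@"  # otherwise fall back to the monolithic tool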





Description:

Patch 1 adds a template as well as a generator shell script for the
splitted subcommands.

Patch 2 adds the generated subcommand source files.

Patch 3-5 adds a "install-splitcmd-setcap" make target, with different
approaches (either hardcoded in Makefile, or more generically by
including "Makefile.install_setcap" generated by "splitcmd-gen.sh").


Open Questions:

1. "make install-splitcmd-setcap" installs the binaries with hardcoded
group "btrfs". This needs to be configurable (how?). Another approach
would be to not set the group at all, and leave this to the user or
distro packaging script.

Leave it to the user or distro.

Re: [RFC PATCH 0/6] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-08-29 Thread Austin S. Hemmelgarn

On 2018-08-29 13:24, Axel Burri wrote:

This patch allows to build distinct binaries for specific btrfs
subcommands, e.g. "btrfs-subvolume-show" which would be identical to
"btrfs subvolume show".


Motivation:

While btrfs-progs offer the all-inclusive "btrfs" command, it gets
pretty cumbersome to restrict privileges to the subcommands [1].
Common approaches are to either setuid root for "/sbin/btrfs" (which
is not recommended at all), or to write sudo rules for each
subcommand.

Separating the subcommands into distinct binaries makes it easy to set
elevated privileges using capabilities(7) or setuid. A typical use
case where this is needed is when it comes to automated scripts,
e.g. btrbk [2] [3] creating snapshots and send/receive them via ssh.
Let me start by saying I think this is a great idea to have as an 
option, and that the motivation is a particularly good one.


I've posted my opinions on your two open questions below, but there's 
two other comments I'd like to make:


* Is there some particular reason that this only includes the commands 
it does, and _hard codes_ which ones it works with?  if we just do 
everything instead of only the stuff we think needs certain 
capabilities, then we can auto-generate the list of commands to be 
processed based on function names in the C files, and it will 
automatically pick up any newly added commands.  At the very least, it 
could still parse through the C files and look for tags in the comments 
for the functions to indicate which ones need to be processed this way. 
Either case will make it significantly easier to add new commands, and 
would also better justify the overhead of shipping all the files 
pre-generated (because there would be much more involved in 
pre-generating them).


* While not essential, it would be really neat to have the `btrfs` 
command detect if an associated binary exists for whatever command was 
just invoked, and automatically exec that (possibly with some 
verification) instead of calling the command directly so that desired 
permissions are enforced.  This would mitigate the need for users to 
remember different command names depending on execution context.



Description:

Patch 1 adds a template as well as a generator shell script for the
splitted subcommands.

Patch 2 adds the generated subcommand source files.

Patch 3-5 adds a "install-splitcmd-setcap" make target, with different
approaches (either hardcoded in Makefile, or more generically by
including "Makefile.install_setcap" generated by "splitcmd-gen.sh").


Open Questions:

1. "make install-splitcmd-setcap" installs the binaries with hardcoded
group "btrfs". This needs to be configurable (how?). Another approach
would be to not set the group at all, and leave this to the user or
distro packaging script.
Leave it to the user or distro.  It's likely to end up standardized on 
the name 'btrfs', but it should be agnostic of that.
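For reference, the manual version of what the make target does would look 
roughly like this (the group name, binary, and exact capability set are 
assumptions; the capabilities each subcommand actually needs are whatever 
the generated Makefile records):

  groupadd -f btrfs
  chgrp btrfs /usr/bin/btrfs-subvolume-show
  chmod 750 /usr/bin/btrfs-subvolume-show
  setcap cap_dac_read_search,cap_fowner,cap_sys_admin=ep /usr/bin/btrfs-subvolume-show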


2. Instead of the "install-splitcmd-setcap" make target, we could
introduce a "configure --enable-splitted-subcommands" option, which
would simply add all splitcmd binaries to the "all" and "install"
targets without special treatment, and leave the setcap stuff to the
user or distro packaging script (at least in gentoo, this needs to be
specified using the "fcaps" eclass anyways [5]).
A bit of a nitpick, but 'split' is the proper past tense of the word 
'split', it's one of those exceptions that English has all over the 
place.  Even aside from that though, I think `separate` sounds more 
natural for the configure option, or better yet, just make it 
`--enable-fscaps` like most other packages do.


That aside, I think having a configure option is the best way to do 
this, it makes it very easy for distro build systems to handle it 
because this is what they're used to doing anyway.  It also makes it a 
bit easier on the user, because it just becomes `make` to build 
whichever version you want installed.


Re: [PATCH 0/4] Userspace support for FSID change

2018-08-29 Thread Austin S. Hemmelgarn

On 2018-08-29 08:33, Nikolay Borisov wrote:



On 29.08.2018 15:09, Qu Wenruo wrote:



On 2018/8/29 4:35 PM, Nikolay Borisov wrote:

Here is the userspace tooling support for utilising the new metadata_uuid field,
enabling the change of fsid without having to rewrite every metadata block. This
patchset consists of adding support for the new field to various tools and
files (Patch 1). The actual implementation of the new -m|-M options (which are
described in more detail in Patch 2). A new misc-tests testcase (Patch 3) which
exercises the new options and verifies certain invariants hold (these are also
described in Patch 2). Patch 4 is more or less a copy of the kernel counterpart
just reducing some duplication between btrfs_fs_info and btrfs_fs_devices
structures.


So to my understand, now we have another layer of UUID.

Before we have one fsid, both used in superblock and tree blocks.

Now we have 2 fsid, the one used in tree blocks are kept the same, but
changed its name to metadata_uuid in superblock.
And superblock::fsid will become a new field, and although they are the
same at mkfs time, they could change several times during its operation.

This indeed makes uuid change super fast, only needs to update all
superblocks of the fs, instead of all tree blocks.

However I have one nitpick of the design. Unlike XFS, btrfs supports
multiple devices.
If we have a raid10 fs with 4 devices, and it has already gone through
several UUID change (so its metadata uuid is already different from fsid).

And during another UUID change procedure, we lost power while only
updated 2 super blocks, what will happen for kernel device assembly?

(Although considering how fast the UUID change would happen, such case
should be super niche)


Then I guess you will be fucked. I'm all ears for suggestion how to
rectify this without skyrocketing the complexity. The current UUID
rewrite method sets a flag in the superblock that FSID change is in
progress and clears it once every metadata block has been rewritten. I
can piggyback on this mechanism but I'm not sure it provides 100%
guarantee. Because by the same token you can set this flag, start
writing the super blocks then lose power and then only some of the
superblocks could have this flag set so we are back at square 1.





The intended usecase of this feature is to give the sysadmin the ability to
create copies of filesystems, change their uuid quickly and mount them alongside
the original filesystem for, say, forensic purposes.

One thing which still hasn't been set in stone is whether the new options
will remain as -m|-M or whether they should subsume the current -u|-U - from
the point of view of users nothing should change.


Well, user would be surprised by how fast the new -m is, thus there is
still something changed :)

I prefer to subsume current -u/-U, and use the new one if the incompat
feature is already set. Or fall back to original behavior.

But I'm not a fan of using INCOMPAT flags as an indicator of changed
fsid/metadata uuid.
INCOMPAT feature should not change so easily nor acts as an indicator.

That's to say, the flag should only be set at mkfs time, and then never
change unlike the 2nd patch (I don't even like btrfstune to change
incompat flags).

E.g.
mkfs.btrfs -O metadata_uuid , then we could use the new way to
change fsid without touching metadata uuid.
Or we could only use the old method.


I disagree, I don't see any benefit in this but only added complexity.
Can you elaborate more ?
Same here, I see essentially zero benefit to this, and one _big_ 
drawback, namely that you can't convert an existing volume to use this 
approach if it's a feature that can only be set at mkfs time.


That one drawback means that this is effectively useless for all 
existing BTRFS volumes, which is a pretty big limitation.


I also do think an INCOMPAT feature bit is appropriate here.  Volumes 
with this feature will potentially be enumerated with the wrong UUID on 
older kernels, which is a pretty big behavioral issue (on the level of 
completely breaking boot on some systems, keep in mind that almost all 
major distros use volume UUID's to identify volumes in /etc/fstab).
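To make the intended workflow concrete, it would look something like this (a 
sketch only: -m is the proposed option under discussion in this series, and 
the device and image names are made up):

  # Clone the filesystem, give the copy a new fsid without rewriting every
  # metadata block, then mount it next to the original for inspection.
  dd if=/dev/sdb of=/var/tmp/copy.img bs=1M
  btrfstune -m /var/tmp/copy.img
  mount -o loop /var/tmp/copy.img /mnt/forensic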





Thanks,
Qu


So this is something which
I'd like to hear from the community. Of course the alternative of rewriting
the metadata blocks will be assigned new options - perhaps -m|M ?

I've tested this with multiple xfstest runs with the new tools installed as
well as running btrfs-progs test and have observed no regressions.

Nikolay Borisov (4):
   btrfs-progs: Add support for metadata_uuid field.
   btrfstune: Add support for changing the user uuid
   btrfs-progs: tests: Add tests for changing fsid feature
   btrfs-progs: Remove fsid/metdata_uuid fields from fs_info

  btrfstune.c| 174 -
  check/main.c   |   2 +-
  chunk-recover.c|  17 ++-
  cmds-filesystem.c  |  

Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)

2018-08-28 Thread Austin S. Hemmelgarn

On 2018-08-28 15:14, Menion wrote:

You are correct, indeed in order to cleanup you need

1) someone realize that snapshots have been created
2) apt-brtfs-snapshot is manually installed on the system
Your second requirement is only needed if you want the nice automated 
cleanup.  There's absolutely nothing preventing you from manually 
removing the snapshots.
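For example, mounting the top-level subvolume and deleting them directly 
should work (the device name and mount point here are assumptions; the 
snapshot names are the ones listed further down in this thread):

  mount -o subvolid=5 /dev/sdX /mnt/top
  btrfs subvolume delete \
      /mnt/top/@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55
  umount /mnt/top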


Assuming also that the snapshots created during do-release-upgrade are 
managed for auto cleanup


On Tuesday, August 28, 2018, Noah Massey <noah.mas...@gmail.com> wrote:


 On Tue, Aug 28, 2018 at 1:25 PM Menion <men...@gmail.com> wrote:
 >
 > Ok, I have removed the snapshot and the free expected space is
here, thank you!
 > As a side note: apt-btrfs-snapshot was not installed, but it is
 > present in Ubuntu repository and I have used it (and I like the idea
 > of automatic snapshot during upgrade)
 > This means that the do-release-upgrade does it's own job on BTRFS,
 > silently which I believe is not good from the usability perspective,

You are correct. DistUpgradeController.py from python3-distupgrade
imports 'apt_btrfs_snapshot', which I read as coming from
/usr/lib/python3/dist-packages/apt_btrfs_snapshot.py, supplied by
apt-btrfs-snapshot, but I missed the fact that python3-distupgrade
ships its own
/usr/lib/python3/dist-packages/DistUpgrade/apt_btrfs_snapshot.py

So now it looks like that cannot be easily disabled, and without the
apt-btrfs-snapshot package scheduling cleanups it's not ever
automatically removed?

 > just google it, there is no mention of this behaviour
 > On Tue, Aug 28, 2018 at 19:07 Austin S. Hemmelgarn
 > <ahferro...@gmail.com> wrote:
 > >
 > > On 2018-08-28 12:05, Noah Massey wrote:
 > > > On Tue, Aug 28, 2018 at 11:47 AM Austin S. Hemmelgarn
 > > > <ahferro...@gmail.com> wrote:
 > > >>
 > > >> On 2018-08-28 11:27, Noah Massey wrote:
 > > >>> On Tue, Aug 28, 2018 at 10:59 AM Menion <men...@gmail.com> wrote:
 > > >>>>
 > > >>>> [sudo] password for menion:
 > > >>>> ID      gen     top level       path
 > > >>>> --      ---     -       
 > > >>>> 257     600627  5               /@
 > > >>>> 258     600626  5               /@home
 > > >>>> 296     599489  5
 > > >>>>
/@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55
 > > >>>> 297     599489  5
 > > >>>>
/@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08
 > > >>>> 298     599489  5
 > > >>>>
/@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30
 > > >>>>
 > > >>>> So, there are snapshots, right? The time stamp is when I
have launched
 > > >>>> do-release-upgrade, but it didn't ask anything about
snapshot, neither
 > > >>>> I asked for it.
 > > >>>
 > > >>> This is an Ubuntu thing
 > > >>> `apt show apt-btrfs-snapshot`
 > > >>> which "will create a btrfs snapshot of the root filesystem
each time
 > > >>> that apt installs/removes/upgrades a software package."
 > > >> Not Ubuntu, Debian.  It's just that Ubuntu installs and
configures the
 > > >> package by default, while Debian does not.
 > > >
 > > > Ubuntu also maintains the package, and I did not find it in
Debian repositories.
 > > > I think it's also worth mentioning that these snapshots were
created
 > > > by the do-release-upgrade script using the package directly,
not as a
 > > > result of the apt configuration. Meaning if you do not want a
snapshot
 > > > taken prior to upgrade, you have to remove the apt-btrfs-snapshot
 > > > package prior to running the upgrade script. You cannot just
update
 > > > /etc/apt/apt.conf.d/80-btrfs-snapshot
 > > Hmm... I could have sworn that it was in the Debian repositories.
 > >
 > > That said, it's kind of stupid that the snapshot is not trivially
 > > optional for a release upgrade.  Yes, that's where it's
arguably the
 > > most important, but it's still kind of stupid to have to remove a
 > > package to get rid of that behavior and then reinstall it again
afterwards.





Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)

2018-08-28 Thread Austin S. Hemmelgarn

On 2018-08-28 12:05, Noah Massey wrote:

On Tue, Aug 28, 2018 at 11:47 AM Austin S. Hemmelgarn
 wrote:


On 2018-08-28 11:27, Noah Massey wrote:

On Tue, Aug 28, 2018 at 10:59 AM Menion  wrote:


[sudo] password for menion:
ID   gen     top level  path
--   ---     ---------  ----
257  600627  5          /@
258  600626  5          /@home
296  599489  5          /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55
297  599489  5          /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08
298  599489  5          /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30

So, there are snapshots, right? The time stamp is when I have launched
do-release-upgrade, but it didn't ask anything about snapshot, neither
I asked for it.


This is an Ubuntu thing
`apt show apt-btrfs-snapshot`
which "will create a btrfs snapshot of the root filesystem each time
that apt installs/removes/upgrades a software package."

Not Ubuntu, Debian.  It's just that Ubuntu installs and configures the
package by default, while Debian does not.


Ubuntu also maintains the package, and I did not find it in Debian repositories.
I think it's also worth mentioning that these snapshots were created
by the do-release-upgrade script using the package directly, not as a
result of the apt configuration. Meaning if you do not want a snapshot
taken prior to upgrade, you have to remove the apt-btrfs-snapshot
package prior to running the upgrade script. You cannot just update
/etc/apt/apt.conf.d/80-btrfs-snapshot

Hmm... I could have sworn that it was in the Debian repositories.

That said, it's kind of stupid that the snapshot is not trivially 
optional for a release upgrade.  Yes, that's where it's arguably the 
most important, but it's still kind of stupid to have to remove a 
package to get rid of that behavior and then reinstall it again afterwards.


Re: 14Gb of space lost after distro upgrade on BTFS root partition (long thread with logs)

2018-08-28 Thread Austin S. Hemmelgarn

On 2018-08-28 11:27, Noah Massey wrote:

On Tue, Aug 28, 2018 at 10:59 AM Menion  wrote:


[sudo] password for menion:
ID   gen     top level  path
--   ---     ---------  ----
257  600627  5          /@
258  600626  5          /@home
296  599489  5          /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:29:55
297  599489  5          /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:30:08
298  599489  5          /@apt-snapshot-release-upgrade-bionic-2018-08-27_15:33:30

So, there are snapshots, right? The time stamp is when I have launched
do-release-upgrade, but it didn't ask anything about snapshot, neither
I asked for it.


This is an Ubuntu thing
`apt show apt-btrfs-snapshot`
which "will create a btrfs snapshot of the root filesystem each time
that apt installs/removes/upgrades a software package."
Not Ubuntu, Debian.  It's just that Ubuntu installs and configures the 
package by default, while Debian does not.


This behavior in general is not specific to Debian either; a lot of 
distributions either are working on or already have this type of 
functionality, because it's the only sane and correct way to handle 
updates short of rebuilding the entire system from scratch.



During the do-release-upgrade I got some issues due to the (very) bad
behaviour of the script in remote terminal, then I have fixed
everything manually and now the filesystem is operational in bionic
version
If it is confirmed, how can I remove the unwanted snapshot, keeping
the current "visible" filesystem contents


By default, the package runs a weekly cron job to cleanup old
snapshots. (Defaults to 90d, but you can configure that in
APT::Snapshots::MaxAge) Alternatively, you can cleanup with the
command yourself. Run `sudo apt-btrfs-snapshot list`, and then `sudo
apt-btrfs-snapshot delete `


Re: corruption_errs

2018-08-28 Thread Austin S. Hemmelgarn

On 2018-08-27 18:53, John Petrini wrote:

Hi List,

I'm seeing corruption errors when running btrfs device stats but I'm
not sure what that means exactly. I've just completed a full scrub and
it reported no errors. I'm hoping someone here can enlighten me.
Thanks!


The first thing to understand here is that the error counters reported 
by `btrfs device stats` are cumulative.  In other words, they count 
errors since the last time they were reset (which means that if you've 
never run `btrfs device stats -z` on this filesystem, then they will 
count errors since the filesystem was created).  As a result, seeing a 
non-zero value there just means that errors of that type happened at 
some point in time since they were reset.


Building on this a bit further, corruption errors are checksum 
mismatches.  Each time a block is read and its checksum does not match 
the stored checksum for it, a corruption error is recorded.  The thing 
is though, if you are using a profile which can rebuild that block (dup, 
raid1, raid10, or one of the parity profiles), the error gets corrected 
automatically by the filesystem (it will attempt to rebuild that block, 
then write out the correct block).  If that fix succeeds, there will be 
no errors there anymore, but the record of the error stays around 
(because there _was_ an error).


Given this, my guess is that you _had_ checksum mismatches somewhere, 
but they were fixed before you ran scrub.
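For example, to double-check the volume and then clear the counters so that 
any future errors stand out (the mount point is an assumption):

  btrfs scrub start -Bd /mnt    # foreground scrub with per-device statistics
  btrfs device stats /mnt       # show the cumulative error counters
  btrfs device stats -z /mnt    # print the counters and reset them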


Re: BTRFS support per-subvolume compression, isn't it?

2018-08-28 Thread Austin S. Hemmelgarn

On 2018-08-27 17:05, Eugene Bright wrote:

Greetings!

BTRFS wiki says there is no per-subvolume compression option [1].

At the same time next command allow me to set properties per-subvolume:
 btrfs property set /volume compression zstd

Corresponding get command shows distinct properties for every subvolume.
Should wiki be updated?


The wiki should be updated, but it's not technically wrong.

What the wiki is talking about is per-subvolume mount options to control 
compression (so, mounting individual subvolumes from the same volume 
with different `compress=` or `compress-force=` mount options), which is 
not currently supported.


You are correct though that properties can be used to achieve a similar 
result (compressing differently for different subvolumes).
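For example (the paths and device are assumptions):

  # Per-subvolume compression via properties works today:
  btrfs property set /mnt/@home compression zstd
  btrfs property get /mnt/@home compression
  # What the wiki means is per-subvolume *mount options*, which are not
  # supported: a compress= option here applies to the whole volume.
  mount -o subvol=@home,compress=zstd /dev/sdX /home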


Re: Device Delete Stalls

2018-08-23 Thread Austin S. Hemmelgarn

On 2018-08-23 10:04, Stefan Malte Schumacher wrote:

Hallo,

I originally had RAID with six 4TB drives, which was more than 80
percent full. So now I bought
a 10TB drive, added it to the Array and gave the command to remove the
oldest drive in the array.

  btrfs device delete /dev/sda /mnt/btrfs-raid

I kept a terminal with "watch btrfs fi show" open and It showed that
the size of /dev/sda had been set to zero and that data was being
redistributed to the other drives. All seemed well, but now the
process stalls at 8GB being left on /dev/sda/. It also seems that the
size of the drive has been reset to the original value of 3,64TiB.

Label: none  uuid: 1609e4e1-4037-4d31-bf12-f84a691db5d8
 Total devices 7 FS bytes used 8.07TiB
 devid1 size 3.64TiB used 8.00GiB path /dev/sda
 devid2 size 3.64TiB used 2.73TiB path /dev/sdc
 devid3 size 3.64TiB used 2.73TiB path /dev/sdd
 devid4 size 3.64TiB used 2.73TiB path /dev/sde
 devid5 size 3.64TiB used 2.73TiB path /dev/sdf
 devid6 size 3.64TiB used 2.73TiB path /dev/sdg
 devid7 size 9.10TiB used 2.50TiB path /dev/sdb

I see no more btrfs worker processes and no more activity in iotop.
How do I proceed? I am using a current debian stretch which uses
Kernel 4.9.0-8 and btrfs-progs 4.7.3-1.

How should I proceed? I have a backup but would prefer an easier and
less time-consuming way out of this mess.


Not exactly what you asked for, but I do have some advice on how to 
avoid this situation in the future:


If at all possible, use `btrfs device replace` instead of an add/delete 
cycle.  The replace operation requires two things.  First, you have to 
be able to connect the new device to the system while all the old ones 
except the device you are removing are present.  Second, the new device 
has to be at least as big as the old one.  Assuming both conditions are 
met and you can use replace, it's generally much faster and is a lot 
more reliable than an add/delete cycle (especially when the array is 
near full).  This is because replace just copies the data that's on the 
old device directly (or rebuilds it directly if it's not present anymore 
or corrupted), whereas the add/delete method implicitly re-balances the 
entire array (which takes a long time and may fail if the array is 
mostly full).
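For reference, the replace-based workflow looks roughly like this (device 
names and the devid are made up; double-check them against `btrfs fi show` 
before running anything):

  # Replace old /dev/sdX with new /dev/sdY (which must not be in the array yet):
  btrfs replace start /dev/sdX /dev/sdY /mnt/btrfs-raid
  btrfs replace status /mnt/btrfs-raid
  # If the new device is larger, grow the filesystem on it afterwards,
  # using the devid the replaced device had:
  btrfs filesystem resize 1:max /mnt/btrfs-raid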



Now, as far as what's actually going on here, I'm unfortunately not 
quite sure, and therefore I'm really not the best person to be giving 
advice on how to fix it.  I will comment that having info on the 
allocations for all the devices (not just /dev/sda) would be useful in 
debugging, but even with that I don't know that I personally can help.


Re: lazytime mount option—no support in Btrfs

2018-08-22 Thread Austin S. Hemmelgarn

On 2018-08-22 11:01, David Sterba wrote:

On Wed, Aug 22, 2018 at 09:56:59AM -0400, Austin S. Hemmelgarn wrote:

On 2018-08-22 09:48, David Sterba wrote:

On Tue, Aug 21, 2018 at 01:01:00PM -0400, Austin S. Hemmelgarn wrote:

On 2018-08-21 12:05, David Sterba wrote:

On Tue, Aug 21, 2018 at 10:10:04AM -0400, Austin S. Hemmelgarn wrote:

On 2018-08-21 09:32, Janos Toth F. wrote:

so pretty much everyone who wants to avoid the overhead from them can just
use the `noatime` mount option.


It would be great if someone finally fixed this old bug then:
https://bugzilla.kernel.org/show_bug.cgi?id=61601
Until then, it seems practically impossible to use both noatime (this
can't be added as rootflag in the command line and won't apply if the
kernel already mounted the root as RW) and space-cache-v2 (has to be
added as a rootflag along with RW to take effect) for the root
filesystem (at least without an init*fs, which I never use, so can't
tell).


Last I knew, it was fixed.  Of course, it's been quite a while since I
last tried this, as I run locally patched kernels that have `noatime` as
the default instead of `relatime`.


I'm using VMs without initrd, tested the rootflags=noatime and it still
fails, the same way as in the bugreport.

As the 'noatime' mount option is part of the mount(2) API (passed as a
bit via mountflags), the remaining option in the filesystem is to
whitelist the generic options and ignore them. But this brings some
layering violation question.

On the other hand, this would become confusing as the user expectation
is to see the effects of 'noatime'.


Ideally there would be a way to get this to actually work properly.  I
think ext4 at least doesn't panic, though I'm not sure if it actually
works correctly.


No, ext4 also refuses to mount, the panic happens in VFS that tries
either the rootfstype= or all available filesystems.

[3.763602] EXT4-fs (sda): Unrecognized mount option "noatime" or missing value

[3.761315] BTRFS info (device sda): unrecognized mount option 'noatime'


Otherwise, the only option for people who want it set is to patch the
kernel to get noatime as the default (instead of relatime).  I would
look at pushing such a patch upstream myself actually, if it weren't for
the fact that I'm fairly certain that it would be immediately NACK'ed by
at least Linus, and probably a couple of other people too.


An acceptable solution could be to parse the rootflags and translate
them to the MNT_* values, ie. what the commandline tool mount does
before it calls the mount syscall.


That would be helpful, but at that point you might as well update the
CLI mount tool to just pass all the named options to the kernel and have
it do the parsing (I mean, keep the old interface too obviously, but
provide a new one and use that preferentially).


The initial mount is not done by the mount tool but internally by
kernel init sequence (files in init/):

mount_block_root
   do_mount_root
 ksys_mount

The mount options (as a string) is passed unchanged via variable
root_mount_data (== rootflags). So before this step, the options would
have to be filtered and all known generic options turned into bit flags.

What I'm saying is that if there's going to be parsing for it in the 
kernel anyway, why not expose that interface to userspace too so that 
the regular `mount` tool can take advantage of it as well.


Re: lazytime mount option—no support in Btrfs

2018-08-22 Thread Austin S. Hemmelgarn

On 2018-08-22 09:48, David Sterba wrote:

On Tue, Aug 21, 2018 at 01:01:00PM -0400, Austin S. Hemmelgarn wrote:

On 2018-08-21 12:05, David Sterba wrote:

On Tue, Aug 21, 2018 at 10:10:04AM -0400, Austin S. Hemmelgarn wrote:

On 2018-08-21 09:32, Janos Toth F. wrote:

so pretty much everyone who wants to avoid the overhead from them can just
use the `noatime` mount option.


It would be great if someone finally fixed this old bug then:
https://bugzilla.kernel.org/show_bug.cgi?id=61601
Until then, it seems practically impossible to use both noatime (this
can't be added as rootflag in the command line and won't apply if the
kernel already mounted the root as RW) and space-cache-v2 (has to be
added as a rootflag along with RW to take effect) for the root
filesystem (at least without an init*fs, which I never use, so can't
tell).


Last I knew, it was fixed.  Of course, it's been quite a while since I
last tried this, as I run locally patched kernels that have `noatime` as
the default instead of `relatime`.


I'm using VMs without initrd, tested the rootflags=noatime and it still
fails, the same way as in the bugreport.

As the 'noatime' mount option is part of the mount(2) API (passed as a
bit via mountflags), the remaining option in the filesystem is to
whitelist the generic options and ignore them. But this brings some
layering violation question.

On the other hand, this would become confusing as the user expectation
is to see the effects of 'noatime'.


Ideally there would be a way to get this to actually work properly.  I
think ext4 at least doesn't panic, though I'm not sure if it actually
works correctly.


No, ext4 also refuses to mount, the panic happens in VFS that tries
either the rootfstype= or all available filesystems.

[3.763602] EXT4-fs (sda): Unrecognized mount option "noatime" or missing value

[3.761315] BTRFS info (device sda): unrecognized mount option 'noatime'


Otherwise, the only option for people who want it set is to patch the
kernel to get noatime as the default (instead of relatime).  I would
look at pushing such a patch upstream myself actually, if it weren't for
the fact that I'm fairly certain that it would be immediately NACK'ed by
at least Linus, and probably a couple of other people too.


An acceptable solution could be to parse the rootflags and translate
them to the MNT_* values, ie. what the commandline tool mount does
before it calls the mount syscall.

That would be helpful, but at that point you might as well update the 
CLI mount tool to just pass all the named options to the kernel and have 
it do the parsing (I mean, keep the old interface too obviously, but 
provide a new one and use that preferentially).


I also like Duncan's suggestion to expose the default value for the 
atime options as a kconfig option (Chris Murphy emailed me directly 
about essentially the same thing).


Re: lazytime mount option—no support in Btrfs

2018-08-22 Thread Austin S. Hemmelgarn

On 2018-08-21 23:57, Duncan wrote:

Austin S. Hemmelgarn posted on Tue, 21 Aug 2018 13:01:00 -0400 as
excerpted:


Otherwise, the only option for people who want it set is to patch the
kernel to get noatime as the default (instead of relatime).  I would
look at pushing such a patch upstream myself actually, if it weren't for
the fact that I'm fairly certain that it would be immediately NACK'ed by
at least Linus, and probably a couple of other people too.


What about making default-noatime a kconfig option, presumably set to
default-relatime by default?  That seems to be the way many legacy-
incompatible changes work.  Then for most it's up to the distro, which in
fact it is already, only if the distro set noatime-default they'd at
least be using an upstream option instead of patching it themselves,
making it upstream code that could be accounted for instead of downstream
code that... who knows?
That's probably a lot more likely to make it upstream, but it's a bit 
beyond my skills when it comes to stuff like this.


Meanwhile, I'd be interested in seeing your local patch.  I'm local-
patching noatime-default here too, but not being a dev, I'm not entirely
sure I'm doing it "correctly", tho AFAICT it does seem to work.  FWIW,
here's what I'm doing (posting inline so may be white-space damaged, and
IIRC I just recently manually updated the line numbers so they don't
reflect the code at the 2014 date any more, but as I'm not sure of the
"correctness" it's not intended to be applied in any case):

--- fs/namespace.c.orig 2014-04-18 23:54:42.167666098 -0700
+++ fs/namespace.c  2014-04-19 00:19:08.622741946 -0700
@@ -2823,8 +2823,9 @@ long do_mount(const char *dev_name, cons
 		goto dput_out;
 
 	/* Default to relatime unless overriden */
-	if (!(flags & MS_NOATIME))
-		mnt_flags |= MNT_RELATIME;
+	/* JED: Make that noatime */
+	if (!(flags & MS_RELATIME))
+		mnt_flags |= MNT_NOATIME;
 
 	/* Separate the per-mountpoint flags */
 	if (flags & MS_NOSUID)
@@ -2837,6 +2837,8 @@ long do_mount(const char *dev_name, cons
 		mnt_flags |= MNT_NOATIME;
 	if (flags & MS_NODIRATIME)
 		mnt_flags |= MNT_NODIRATIME;
+	if (flags & MS_RELATIME)
+		mnt_flags |= MNT_RELATIME;
 	if (flags & MS_STRICTATIME)
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)

Sane, or am I "doing it wrong!"(TM), or perhaps doing it correctly, but
missing a chunk that should be applied elsewhere?
Mine only has the first part, not the second, which seems to cover 
making sure it's noatime by default.  I never use relatime though, so 
that may be broken with my patch because of me not having the second part.



Meanwhile, since broken rootflags requiring an initr* came up let me take
the opportunity to ask once again, does btrfs-raid1 root still require an
initr*?  It'd be /so/ nice to be able to supply the appropriate
rootflags=device=...,device=... and actually have it work so I didn't
need the initr* any longer!
Last I knew, specifying appropriate `device=` options in rootflags works 
correctly without an initrd.
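In other words, a kernel command line along these lines should be enough 
(device paths are made up):

  root=/dev/sda2 rootfstype=btrfs rootflags=device=/dev/sda2,device=/dev/sdb2 ro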




Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Austin S. Hemmelgarn

On 2018-08-21 12:05, David Sterba wrote:

On Tue, Aug 21, 2018 at 10:10:04AM -0400, Austin S. Hemmelgarn wrote:

On 2018-08-21 09:32, Janos Toth F. wrote:

so pretty much everyone who wants to avoid the overhead from them can just
use the `noatime` mount option.


It would be great if someone finally fixed this old bug then:
https://bugzilla.kernel.org/show_bug.cgi?id=61601
Until then, it seems practically impossible to use both noatime (this
can't be added as rootflag in the command line and won't apply if the
kernel already mounted the root as RW) and space-cache-v2 (has to be
added as a rootflag along with RW to take effect) for the root
filesystem (at least without an init*fs, which I never use, so can't
tell).


Last I knew, it was fixed.  Of course, it's been quite a while since I
last tried this, as I run locally patched kernels that have `noatime` as
the default instead of `relatime`.


I'm using VMs without initrd, tested the rootflags=noatime and it still
fails, the same way as in the bugreport.

As the 'noatime' mount option is part of the mount(2) API (passed as a
bit via mountflags), the remaining option in the filesystem is to
whitelist the generic options and ignore them. But this brings some
layering violation question.

On the other hand, this would be come confusing as the user expectation
is to see the effects of 'noatime'.

Ideally there would be a way to get this to actually work properly.  I 
think ext4 at least doesn't panic, though I'm not sure if it actually 
works correctly.


Otherwise, the only option for people who want it set is to patch the 
kernel to get noatime as the default (instead of relatime).  I would 
look at pushing such a patch upstream myself actually, if it weren't for 
the fact that I'm fairly certain that it would be immediately NACK'ed by 
at least Linus, and probably a couple of other people too.


Re: Are the btrfs mount options inconsistent?

2018-08-21 Thread Austin S. Hemmelgarn

On 2018-08-21 09:43, David Howells wrote:

Qu Wenruo  wrote:


But to be more clear, NOSSD shouldn't be a special case.
In fact currently NOSSD only affects whether we will output the message
"enabling ssd optimization", no real effect if I didn't miss anything.


That's not quite true.  In:

if (!btrfs_test_opt(fs_info, NOSSD) &&
!fs_info->fs_devices->rotating) {
btrfs_set_and_info(fs_info, SSD, "enabling ssd optimizations");
}

the call to btrfs_set_and_info() will turn on SSD.

What this seems to me is that, normally, SSD will be turned on automatically
unless at least one of the devices is a rotating medium - but this appears to
be explicitly suppressed by the NOSSD option.

That's my understanding too (though I may be wrong, I'm not an expert on C).

If this _isn't_ what's happening, then it needs to be changed so it is: 
that's what the documentation has pretty much always said, and is 
therefore how people expect it to work (also, it needs to work because 
there needs to be an option other than poking around at sysfs attributes 
to disable this on non-rotational media where it's not wanted).


Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Austin S. Hemmelgarn

On 2018-08-21 09:32, Janos Toth F. wrote:

so pretty much everyone who wants to avoid the overhead from them can just
use the `noatime` mount option.


It would be great if someone finally fixed this old bug then:
https://bugzilla.kernel.org/show_bug.cgi?id=61601
Until then, it seems practically impossible to use both noatime (this
can't be added as rootflag in the command line and won't apply if the
kernel already mounted the root as RW) and space-cache-v2 (has to be
added as a rootflag along with RW to take effect) for the root
filesystem (at least without an init*fs, which I never use, so can't
tell).

Last I knew, it was fixed.  Of course, it's been quite a while since I 
last tried this, as I run locally patched kernels that have `noatime` as 
the default instead of `relatime`.


Also, once you've got the space cache set up by mounting once writable 
with the appropriate flag and then waiting for it to initialize, you 
should not ever need to specify the `space_cache` option again.
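In other words, something like the following is enough; the free space cache 
persists on disk, and the atime behaviour can still be adjusted via a normal 
remount once the system is up (device and mount point are assumptions):

  mount -o space_cache=v2 /dev/sdX /mnt   # one-time: creates the free space tree
  umount /mnt                             # later mounts pick it up automatically
  mount -o remount,noatime /              # atime flags can be changed on remount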


Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Austin S. Hemmelgarn

On 2018-08-21 08:06, Adam Borowski wrote:

On Mon, Aug 20, 2018 at 08:16:16AM -0400, Austin S. Hemmelgarn wrote:

Also, slightly OT, but atimes are not where the real benefit is here for
most people.  No sane software other than mutt uses atimes (and mutt's use
of them is not sane, but that's a different argument)


Right.  There are two competing forks of mutt: neomutt and vanilla:
https://github.com/neomutt/neomutt/commit/816095bfdb72caafd8845e8fb28cbc8c6afc114f
https://gitlab.com/dops/mutt/commit/489a1c394c29e4b12b705b62da413f322406326f

So this has already been taken care of.


so pretty much everyone who wants to avoid the overhead from them can just
use the `noatime` mount option.


atime updates (including relatime) are bad not only for performance, they
also explode disk size used by snapshots (btrfs, LVM, ...) -- to the tune of
~5% per snapshot for some non-crafted loads.  And, are bad for media with
low write endurance (SD cards, as used by most SoCs).

Thus, atime needs to die.


The real benefit for most people is with mtimes, for which there is no
other way to limit the impact they have on performance.


With btrfs, any write already triggers metadata update (except nocow), thus
there's little benefit of lazytime for mtimes.
But does that actually propagate all the way up to the point of updating 
the inode itself?  If so, then yes, there is not really any point.  If 
not though, then there is still a benefit.




Re: lazytime mount option—no support in Btrfs

2018-08-20 Thread Austin S. Hemmelgarn

On 2018-08-19 06:25, Andrei Borzenkov wrote:



Sent from my iPhone


On 19 Aug 2018, at 11:37, Martin Steigerwald wrote:

waxhead - 18.08.18, 22:45:

Adam Hunt wrote:

Back in 2014 Ted Tso introduced the lazytime mount option for ext4
and shortly thereafter a more generic VFS implementation which was
then merged into mainline. His early patches included support for
Btrfs but those changes were removed prior to the feature being
merged. His
changelog includes the following note about the removal:

   - Per Christoph's suggestion, drop support for btrfs and xfs for now,
     issues with how btrfs and xfs handle dirty inode tracking.  We
     can add btrfs and xfs support back later or at the end of this
     series if we want to revisit this decision.

My reading of the current mainline shows that Btrfs still lacks any
support for lazytime. Has any thought been given to adding support
for lazytime to Btrfs?

[…]

Is there any news regarding this?


I´d like to know whether there is any news about this as well.

If I understand it correctly this could even help BTRFS performance a
lot cause it is COW´ing metadata.



I do not see how btrfs can support it exactly due to cow. Modified atime means 
checksum no more matches so you must update all related metadata. At which 
point you have kind of shadow in-memory metadata trees. And if this metadata is 
not written out, then some other metadata that refers to them becomes invalid.
I think you might be misunderstanding something here, either how 
lazytime actually works, or how BTRFS checksumming works.


Lazytime prevents timestamp updates from triggering writeback of a 
cached inode.  Other changes will trigger writeback, as will anything 
that evicts the inode from the cache, and an automatic writeback will be 
triggered if the timestamp changed more than 24 hours ago, but until any 
of those situations happens, no writeback will be triggered.


BTRFS checksumming only verifies checksums of blocks which are being 
read.  If the inode is in the cache (which it has to be for lazytime to 
have _any_ effect on it), the block containing it on disk does not need 
to be read, so no checksum verification happens.  Even if there was 
verification, we would not be verifying blocks that are in memory using 
the on-disk checksums (because that would break writeback caching, which 
we already do and already works correctly).


So, given all this, the only inconsistency on-disk for BTRFS with this 
would be identical to the inconsistency it causes for other filesystems, 
namely that mtimes and atimes may not be accurate.


Also, slightly OT, but atimes are not where the real benefit is here for 
most people.  No sane software other than mutt uses atimes (and mutt's 
use of them is not sane, but that's a different argument), so pretty 
much everyone who wants to avoid the overhead from them can just use the 
`noatime` mount option.  The real benefit for most people is with 
mtimes, for which there is no other way to limit the impact they have on 
performance.


I suspect any file system that keeps checksums of metadata will run into the 
same issue.

Nope, only if they verify checksums on stuff that's already cached _and_ 
they pull the checksums for verification from the block device and not 
the cache.


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Austin S. Hemmelgarn

On 2018-08-17 08:50, Roman Mamedov wrote:

On Fri, 17 Aug 2018 14:28:25 +0200
Martin Steigerwald  wrote:


First off, keep in mind that the SSD firmware doing compression only
really helps with wear-leveling.  Doing it in the filesystem will help
not only with that, but will also give you more space to work with.


While also reducing the ability of the SSD to wear-level. The more data
I fit on the SSD, the less it can wear-level. And the better I compress
that data, the less it can wear-level.


Do not consider SSD "compression" as a factor in any of your calculations or
planning. Modern controllers do not do it anymore, the last ones that did are
SandForce, and that's 2010 era stuff. You can check for yourself by comparing
write speeds of compressible vs incompressible data, it should be the same. At
most, the modern ones know to recognize a stream of binary zeroes and have a
special case for that.
All that testing write speeds for compressible versus incompressible 
data tells you is whether the SSD is doing real-time compression of data, 
not whether it is doing any compression at all.  Also, this test only works 
if you turn the write-cache on the device off.


Besides, you can't prove 100% for certain that any manufacturer who does 
not sell their controller chips isn't doing this, which means there are 
a few manufacturers that may still be doing it.
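If you do want to run that kind of comparison yourself, the rough procedure 
is something like this (the device name is made up, this writes to the raw 
device so only do it on a disposable drive, and pre-generating the random 
data keeps the RNG from becoming the bottleneck):

  dd if=/dev/urandom of=/tmp/rand.bin bs=1M count=1024   # incompressible data
  dd if=/dev/zero of=/tmp/zero.bin bs=1M count=1024      # highly compressible data
  hdparm -W0 /dev/sdX                                    # drive write cache off
  dd if=/tmp/rand.bin of=/dev/sdX bs=1M oflag=direct conv=fdatasync
  dd if=/tmp/zero.bin of=/dev/sdX bs=1M oflag=direct conv=fdatasync
  hdparm -W1 /dev/sdX                                    # turn it back on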


Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Austin S. Hemmelgarn

On 2018-08-17 08:28, Martin Steigerwald wrote:

Thanks for your detailed answer.

Austin S. Hemmelgarn - 17.08.18, 13:58:

On 2018-08-17 05:08, Martin Steigerwald wrote:

[…]

I have seen a discussion about the limitation in point 2. That
allowing to add a device and make it into RAID 1 again might be
dangerous, cause of system chunk and probably other reasons. I did
not completely read and understand it tough.

So I still don´t get it, cause:

Either it is a RAID 1, then, one disk may fail and I still have
*all*
data. Also for the system chunk, which according to btrfs fi df /
btrfs fi sh was indeed RAID 1. If so, then period. Then I don´t see
why it would need to disallow me to make it into an RAID 1 again
after one device has been lost.

Or it is no RAID 1 and then what is the point to begin with? As I
was
able to copy of all date of the degraded mount, I´d say it was a
RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just
does two copies regardless of how many drives you use.)


So, what's happening here is a bit complicated.  The issue is entirely
with older kernels that are missing a couple of specific patches, but
it appears that not all distributions have their kernels updated to
include those patches yet.

In short, when you have a volume consisting of _exactly_ two devices
using raid1 profiles that is missing one device, and you mount it
writable and degraded on such a kernel, newly created chunks will be
single-profile chunks instead of raid1 chunks with one half missing.
Any write has the potential to trigger allocation of a new chunk, and
more importantly any _read_ has the potential to trigger allocation of
a new chunk if you don't use the `noatime` mount option (because a
read will trigger an atime update, which results in a write).

When older kernels then go and try to mount that volume a second time,
they see that there are single-profile chunks (which can't tolerate
_any_ device failures), and refuse to mount at all (because they
can't guarantee that metadata is intact).  Newer kernels fix this
part by checking per-chunk if a chunk is degraded/complete/missing,
which avoids this because all the single chunks are on the remaining
device.


How new the kernel needs to be for that to happen?

Do I get this right that it would be the kernel used for recovery, i.e.
the one on the live distro that needs to be new enough? To one on this
laptop meanwhile is already 4.18.1.
Yes, the kernel used for recovery is the important one here.  I don't 
remember for certain when the patches went in, but I'm pretty sure it was 
no earlier than 4.14.  FWIW, I'm pretty sure SystemRescueCD has a 
new enough kernel, but they still (sadly) lack zstd support.


I used the latest GRML stable release 2017.05, which has a 4.9 kernel.
While I don't know exactly when the patches went in, I'm fairly certain 
that 4.9 never got them.



As far as avoiding this in the future:


I hope that with the new Samsung Pro 860 together with the existing
Crucial m500 I am spared from this for years to come. That Crucial SSD
according to SMART status about lifetime used has still quite some time
to go.
Yes, hopefully.  And the SMART status on that Crucial is probably right, 
they tend to do a very good job in my experience with accurately 
measuring life expectancy (that or they're just _really_ good at 
predicting failures, I've never had a Crucial SSD that did not indicate 
correctly in the SMART status that it would fail in the near future).



* If you're just pulling data off the device, mark the device
read-only in the _block layer_, not the filesystem, before you mount
it.  If you're using LVM, just mark the LV read-only using LVM
commands  This will make 100% certain that nothing gets written to
the device, and thus makes sure that you won't accidentally cause
issues like this.



* If you're going to convert to a single device,
just do it and don't stop it part way through.  In particular, make
sure that your system will not lose power.



* Otherwise, don't mount the volume unless you know you're going to
repair it.


Thanks for those. Good to keep in mind.
The last one is actually good advice in general, not just for BTRFS.  I 
can't count how many stories I've heard of people who tried to run half 
an array simply to avoid downtime, and ended up making things far worse 
than they were as a result.
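For the first bullet quoted above (marking the device read-only below the 
filesystem), the commands are short (device and LV names are made up):

  blockdev --setro /dev/sdX            # plain block device
  lvchange -pr vg0/data                # or, for an LVM logical volume
  mount -o ro,degraded /dev/sdX /mnt   # then mount and copy the data off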



For this laptop it was not all that important but I wonder about
BTRFS RAID 1 in enterprise environment, cause restoring from backup
adds a significantly higher downtime.

Anyway, creating a new filesystem may have been better here anyway,
cause it replaced an BTRFS that aged over several years with a new
one. Due to the increased capacity and due to me thinking that
Samsung 860 Pro compresses itself, I removed LZO compression. This
would also give larger extents on files that are not fragmented or
only slightly fragmented. I think that Intel SSD 320 did not
compress, but Crucial m500 mSATA SSD does

Re: Experiences on BTRFS Dual SSD RAID 1 with outage of one SSD

2018-08-17 Thread Austin S. Hemmelgarn

On 2018-08-17 05:08, Martin Steigerwald wrote:

Hi!

This happened about two weeks ago. I already dealt with it and all is
well.

Linux hung on suspend so I switched off this ThinkPad T520 forcefully.
After that it did not boot the operating system anymore. Intel SSD 320,
latest firmware, which should patch this bug, but apparently does not,
is only 8 MiB big. Those 8 MiB just contain zeros.

Access via GRML and "mount -fo degraded" worked. I initially was even
able to write onto this degraded filesystem. First I copied all data to
a backup drive.

I even started a balance to "single" so that it would work with one SSD.

But later I learned that secure erase may recover the Intel SSD 320 and
since I had no other SSD at hand, did that. And yes, it did. So I
canceled the balance.

I partitioned the Intel SSD 320 and put LVM on it, just as I had it. But
at that time I was not able to mount the degraded BTRFS on the other SSD
as writable anymore, not even with "-f" "I know what I am doing". Thus I
was not able to add a device to it and btrfs balance it to RAID 1. Even
"btrfs replace" was not working.

I thus formatted a new BTRFS RAID 1 and restored.

A week later I migrated the Intel SSD 320 to a Samsung 860 Pro. Again
via one full backup and restore cycle. However, this time I was able to
copy most of the data of the Intel SSD 320 with "mount -fo degraded" via
eSATA and thus the copy operation was way faster.

So conclusion:

1. Pro: BTRFS RAID 1 really protected my data against a complete SSD
outage.

Glad to hear I'm not the only one!


2. Con:  It does not allow me to add a device and balance to RAID 1 or
replace one device that is already missing at this time.
See below where you comment about this more, I've replied regarding it 
there.


3. I keep using BTRFS RAID 1 on two SSDs for often changed, critical
data.

4. And yes, I know it does not replace a backup. As it was holidays and
I was lazy backup was two weeks old already, so I was happy to have all
my data still on the other SSD.

5. The error messages in kernel when mounting without "-o degraded" are
less than helpful. They indicate a corrupted filesystem instead of just
telling that one device is missing and "-o degraded" would help here.
Agreed, the kernel error messages need significant improvement, not just 
for this case, but in general (I would _love_ to make sure that there 
are exactly zero exit paths for open_ctree that don't involve a proper 
error message being printed beyond the ubiquitous `open_ctree failed` 
message you get when it fails).



I have seen a discussion about the limitation in point 2. That allowing
to add a device and make it into RAID 1 again might be dangerous, cause
of system chunk and probably other reasons. I did not completely read
and understand it tough.

So I still don´t get it, cause:

Either it is a RAID 1, then, one disk may fail and I still have *all*
data. Also for the system chunk, which according to btrfs fi df / btrfs
fi sh was indeed RAID 1. If so, then period. Then I don´t see why it
would need to disallow me to make it into an RAID 1 again after one
device has been lost.

Or it is no RAID 1 and then what is the point to begin with? As I was
able to copy of all date of the degraded mount, I´d say it was a RAID 1.

(I know that BTRFS RAID 1 is not a regular RAID 1 anyway, but just does
two copies regardless of how many drives you use.)
So, what's happening here is a bit complicated.  The issue is entirely 
with older kernels that are missing a couple of specific patches, but it 
appears that not all distributions have their kernels updated to include 
those patches yet.


In short, when you have a volume consisting of _exactly_ two devices 
using raid1 profiles that is missing one device, and you mount it 
writable and degraded on such a kernel, newly created chunks will be 
single-profile chunks instead of raid1 chunks with one half missing. 
Any write has the potential to trigger allocation of a new chunk, and 
more importantly any _read_ has the potential to trigger allocation of a 
new chunk if you don't use the `noatime` mount option (because a read 
will trigger an atime update, which results in a write).


When older kernels then go and try to mount that volume a second time, 
they see that there are single-profile chunks (which can't tolerate 
_any_ device failures), and refuse to mount at all (because they can't 
guarantee that metadata is intact).  Newer kernels fix this part by 
checking per-chunk if a chunk is degraded/complete/missing, which avoids 
this because all the single chunks are on the remaining device.


As far as avoiding this in the future:

* If you're just pulling data off the device, mark the device read-only 
in the _block layer_, not the filesystem, before you mount it.  If 
you're using LVM, just mark the LV read-only using LVM commands  This 
will make 100% certain that nothing gets written to the device, and thus 
makes sure that you won't accidentally cause issues 

Re: How to ensure that a snapshot is not corrupted?

2018-08-15 Thread Austin S. Hemmelgarn

On 2018-08-10 06:07, Cerem Cem ASLAN wrote:

Original question is here: https://superuser.com/questions/1347843

How can we be sure that a readonly snapshot is not corrupted due to a disk failure?

Is the only way calculating the checksums one on another and store it
for further examination, or does BTRFS handle that on its own?

I've posted an answer for the linked question on SuperUser, under the 
assumption that it will be more visible to people simply searching for 
it there than it would be on the ML.


Here's the text of the answer though so people here can see it too:

There are two possible answers depending on what you mean by 'corrupted 
by a disk failure'.


### If you mean simple at-rest data corruption

BTRFS handles this itself, transparently to the user.  It checksums 
everything, including data in snapshots, internally and then verifies 
the checksums as it reads each block.  There are a couple of exceptions 
to this though:


* If the volume is mounted with the `nodatasum` or `nodatacow` options, 
you will have no checksumming of data blocks.  In most cases, you should 
not be mounting with these options, so this should not be an issue.
* Any files for which the `NOCOW` attribute is set (`C` in the output of 
the `lsattr` command) are also not checked.  You're not likely to have 
any truly important files with this attribute set (systemd journal files 
have it set, but that's about it unless you set it manually).  A quick 
way to check for both exceptions is sketched below.
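
If you want to check whether either exception applies, something like 
this works (the mount point is a placeholder):

  # Check the mount options for nodatasum/nodatacow:
  findmnt -no OPTIONS /mnt/data

  # List files with the NOCOW attribute (capital 'C' in the flags field):
  lsattr -R /mnt/data 2>/dev/null | awk '$1 ~ /C/'

  # Force verification of every checksum on the volume, snapshots included:
  btrfs scrub start -B /mnt/data
  btrfs scrub status /mnt/data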


### If you mean non-trivial destruction of data on the volume because of 
loss of too many devices


You can't protect against this except by having another copy of the data 
somewhere.  Pretty much, if you've lost more devices than however many 
the storage profiles for the volume can tolerate, your data is gone, and 
nothing is going to get it back for you short of restoring from a backup.
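
If you want that extra copy to be cheap to keep up to date, incremental 
send/receive of read-only snapshots is one common approach (all paths 
here are placeholders):

  # Initial full copy of a read-only snapshot to another filesystem:
  btrfs send /data/.snapshots/snap-1 | btrfs receive /backup

  # Later runs only send the difference relative to a common parent:
  btrfs send -p /data/.snapshots/snap-1 /data/.snapshots/snap-2 \
    | btrfs receive /backup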


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-13 Thread Austin S. Hemmelgarn

On 2018-08-12 03:04, Andrei Borzenkov wrote:

12.08.2018 06:16, Chris Murphy пишет:

On Fri, Aug 10, 2018 at 9:29 PM, Duncan <1i5t5.dun...@cox.net> wrote:

Chris Murphy posted on Fri, 10 Aug 2018 12:07:34 -0600 as excerpted:


But whether data is shared or exclusive seems potentially ephemeral, and
not something a sysadmin should even be able to anticipate let alone
individual users.


Define "user(s)".


The person who is saving their document on a network share, and
they've never heard of Btrfs.



Arguably, in the context of btrfs tool usage, "user" /is/ the admin,


I'm not talking about btrfs tools. I'm talking about rational,
predictable behavior of a shared folder.

If I try to drop a 1GiB file into my share and I'm denied, not enough
free space, and behind the scenes it's because of a quota limit, I
expect I can delete *any* file(s) amounting to create 1GiB free space
and then I'll be able to drop that file successfully without error.

But if I'm unwittingly deleting shared files, my quota usage won't go
down, and I still can't save my file. So now I somehow need a secret
incantation to discover only my exclusive files and delete enough of
them in order to save this 1GiB file. It's weird, it's unexpected, I
think it's a use case failure. Maybe Btrfs quotas isn't meant to work
with samba or NFS shares. *shrug*



That's how both NetApp and ZFS work as well. I doubt anyone can
seriously call NetApp "not meant to work with NFS or CIFS shares".

On NetApp space available to NFS/CIFS user is volume size minus space
frozen in snapshots. If file, captured in snapshot, is deleted in active
file system, it does not make a single byte available to external user.
That's what surprised most every first time NetApp users.

On ZFS snapshots are contained in dataset and you limit total dataset
space consumption including all snapshots. Thus end effect is the same -
deleting data that is itself captured in snapshot does not make a single
byte available. ZFS allows you to additionally restrict active file
system size ("referenced" quota in ZFS) - this more closely matches your
expectation - deleting file in active file system decreases its
"referenced" size thus allowing user to write more data (as long as user
does not exceed total dataset quota). This is different from btrfs
"exculsive" and "shared". This should not be hard to implement in btrfs,
as "referenced" simply means all data in current subvolume, be it
exclusive or shared.

IOW ZFS allows to place restriction on both how much data user can use
and how much data user is allowed additionally to protect (snapshot).
Except user-created snapshots are kind of irrelevant here.  If we're 
talking about NFS/CIFS/SMB, there is no way for the user to create a 
snapshot (at least, not in-band), so provided the admin is sensible and 
only uses the referenced quota for limiting space usage by users, things 
behave no differently on ZFS than they do on ext4 or XFS using user quotas.
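
For reference, the distinction on the ZFS side looks like this (the 
dataset name is a placeholder):

  # Limit only what the live filesystem references; deleting a file frees
  # space against this limit even if snapshots still hold the data:
  zfs set refquota=10G tank/home/alice

  # Limit the dataset including all of its snapshots; deleting a file
  # that a snapshot still references frees nothing against this limit:
  zfs set quota=15G tank/home/alice

  zfs get refquota,quota,used,referenced tank/home/alice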


Note also that a lot of storage appliances that use ZFS as the 
underlying storage don't expose any way for the admin to use anything 
other than the referenced quota (and usually space reservations).  They 
do this because it makes the system behave as pretty much everyone 
intuitively expects, and it ensures that users don't have to go to an 
admin to remedy their free space issues.






"Regular users" as you use the term, that is the non-admins who just need
to know how close they are to running out of their allotted storage
resources, shouldn't really need to care about btrfs tool usage in the
first place, and btrfs commands in general, including btrfs quota related
commands, really aren't targeted at them, and aren't designed to report
the type of information they are likely to find useful.  Other tools will
be more appropriate.


I'm not talking about any btrfs commands or even the term quota for
regular users. I'm talking about saving a file, being denied, and how
does the user figure out how to free up space?



Users need to be educated. Same as with NetApp and ZFS. There is no
magic, redirect-on-write filesystems work differently than traditional
and users need to adapt.

Of course devil is in details, and usability of btrfs quota is far lower
than NetApp/ZFS. In those space consumption information is first class
citizen integrated into the very basic tools, not something bolted on
later and mostly incomprehensible to end user.
Except that this _CAN_ be made to work and behave just like classic 
quotas.  Your example of ZFS above proves it (referenced quotas behave 
just like classic VFS quotas).  Yes, we need to educate users regarding 
qgroups, but we need a _WORKING_ alternative so they can do things like 
they always have, and like most stuff that uses ZFS as part of a 
pre-built system (FreeNAS for example) does.



Anyway, it's a hypothetical scenario. While I have Samba running on a
Btrfs volume with various shares as subvolumes, I don't have quotas
enabled.


Given 

Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-10 14:07, Chris Murphy wrote:

On Thu, Aug 9, 2018 at 5:35 PM, Qu Wenruo  wrote:



On 8/10/18 1:48 AM, Tomasz Pala wrote:

On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:


2) Different limitations on exclusive/shared bytes
Btrfs can set different limit on exclusive/shared bytes, further
complicating the problem.

3) Btrfs quota only accounts data/metadata used by the subvolume
It lacks all the shared trees (mentioned below), and in fact such
shared tree can be pretty large (especially for extent tree and csum
tree).


I'm not sure about the implications, but just to clarify some things:

when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplicating technique - these are
purely bonuses for system owner, so he could do larger resource overbooking.


In reality that's definitely not the case.

 From what I see, most users would care more about exclusively used space
(excl), other than the total space one subvolume is referring to (rfer).


I'm confused.

So what happens in the following case with quotas enabled on Btrfs:

1. Provision a user with a directory, pre-populated with files, using
snapshot. Let's say it's 1GiB of files.
2. Set a quota for this user's directory, 1GiB.

The way I'm reading the description of Btrfs quotas, the 1GiB quota
applies to exclusive used space. So for starters, they have 1GiB of
shared data that does not affect their 1GiB quota at all.

3. User creates 500MiB worth of new files, this is exclusive usage.
They are still within their quota limit.
4. The shared data becomes obsolete for all but this one user, and is deleted.

Suddenly, 1GiB of shared data for this user is no longer shared data,
it instantly becomes exclusive data and their quota is busted. Now
consider scaling this to 12TiB of storage, with hundreds of users, and
dozens of abruptly busted quotas following this same scenario on a
weekly basis.

I *might* buy off on the idea that an overlay2 based initial
provisioning would not affect quotas. But whether data is shared or
exclusive seems potentially ephemeral, and not something a sysadmin
should even be able to anticipate let alone individual users.

Going back to the example, I'd expect to give the user a 2GiB quota,
with 1GiB of initially provisioned data via snapshot, so right off the
bat they are at 50% usage of their quota. If they were to modify every
single provisioned file, they'd in effect go from 100% shared data to
100% exclusive data, but their quota usage would still be 50%. That's
completely sane and easily understandable by a regular user. The idea
that they'd start modifying shared files, and their quota usage climbs
is weird to me. The state of files being shared or exclusive is not
user domain terminology anyway.
And it's important to note that this is the _only_ way this can sanely 
work for actually partitioning resources, which is the primary classical 
use case for quotas.


Being able to see how much data is shared and exclusive in a subvolume 
is nice, but quota groups are the wrong name for it because the current 
implementation does not work at all like quotas and can trivially result 
in both users escaping quotas (multiple ways), and in quotas being 
overreached by very large amounts for potentially indefinite periods of 
time because of actions of individuals who _don't_ own the data the 
quota is for.
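
To make the provisioning scenario above concrete, the flow being 
discussed looks roughly like this with the current tools (the paths and 
the limit are placeholders):

  btrfs quota enable /srv/storage

  # Provision the user from a template via snapshot (~1GiB of shared data):
  btrfs subvolume snapshot /srv/storage/template /srv/storage/users/alice

  # Cap the new subvolume's qgroup at 1GiB (add -e to limit exclusive
  # rather than referenced bytes):
  btrfs qgroup limit 1G /srv/storage/users/alice

  # Inspect referenced/exclusive usage and the configured limits:
  btrfs qgroup show -re /srv/storage

The problem described above is that the exclusive number can jump 
overnight when other subvolumes drop their references, without the user 
doing anything at all.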





The most common case is, you do a snapshot, user would only care how
much new space can be written into the subvolume, other than the total
subvolume size.


I think that's expecting a lot of users.

I also wonder if it expects a lot from services like samba and NFS who
have to communicate all of this in some sane way to remote clients? My
expectation is that a remote client shows Free Space on a quota'd
system to be based on the unused amount of the quota. I also expect if
I delete a 1GiB file, that my quota consumption goes down. But you're
saying it would be unchanged if I delete a 1GiB shared file, and would
only go down if I delete a 1GiB exclusive file. Do samba and NFS know
about shared and exclusive files? If samba and NFS don't understand
this, then how is a user supposed to understand it?
It might be worth looking at how Samba and NFS work on top of ZFS on a 
platform like FreeNAS and trying to emulate that.


Behavior there is as-follows:

* The total size of the 'disk' reported over SMB (shown on Windows only 
if you map the share as a drive) is equal to the quota for the 
underlying dataset.
* The reported space used on the 'disk' reported over SMB is based on 
physical space usage after compression, with a few caveats relating to 
deduplication:
- Data which is shared across multiple datasets is accounted 
against _all_ datasets that reference it.
- Data which is shared only within a given dataset is accounted 
only once.

* Free space is reported simply as the total size minus the used space.
* Usage reported by 

Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-10 14:21, Tomasz Pala wrote:

On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote:


I.e.: every shared segment should be accounted within quota (at least once).

I think what you mean to say here is that every shared extent should be
accounted to quotas for every location it is reflinked from.  IOW, that
if an extent is shared between two subvolumes each with its own quota,
they should both have it accounted against their quota.


Yes.


Moreover - if there would be per-subvolume RAID levels someday, the data
should be accouted in relation to "default" (filesystem) RAID level,
i.e. having a RAID0 subvolume on RAID1 fs should account half of the
data, and twice the data in an opposite scenario (like "dup" profile on
single-drive filesystem).


This is irrelevant to your point here.  In fact, it goes against it: 
you're arguing for quotas to report data like `du`, but all of the 
chunk-profile stuff is invisible to `du` (and everything else in 
userspace that doesn't look through BTRFS ioctls).


My point is user-point, not some system tool like du. Consider this:
1. user wants higher (than default) protection of some data,
2. user wants more storage space with less protection.

Ad. 1 - requesting better redundancy is similar to cp --reflink=never
- there are functional differences, but the cost is similar: trading
   space for security,

Ad. 2 - many would like to have .cache, .ccache, tmp or some build
system directory with faster writes and no redundancy at all. This
requires per-file/directory data profile attrs though.

Since we agreed that transparent data compression is user's storage bonus,
gains from the reduced redundancy should also profit user.
Do you actually know of any services that do this though?  I mean, 
Amazon S3 and similar services have the option of reduced redundancy 
(and other alternate storage tiers), but they charge 
per-unit-data-per-unit-time with no hard limit on how much space they 
use, and charge different rates for different storage tiers.  In 
comparison, what you appear to be talking about is something more 
similar to Dropbox or Google Drive, where you pay up front for a fixed 
amount of storage for a fixed amount of time and can't use more than 
that, and all the services I know of like that offer exactly one option 
for storage redundancy.


That aside, you seem to be overthinking this.  No sane provider is going 
to give their users the ability to create subvolumes themselves (there's 
too much opportunity for a tiny bug in your software to cost you a _lot_ 
of lost revenue, because creating subvolumes can let you escape qgroups). 
That means in turn that what you're trying to argue for is no 
different from the provider just selling units of storage for different 
redundancy levels separately, and charging different rates for each of 
them.  In fact, that approach is better, because it works independent of 
the underlying storage technology (it will work with hardware RAID, 
LVM2, MD, ZFS, and even distributed storage platforms like Ceph and 
Gluster), _and_ it lets them charge differently than the trivial case of 
N copies costing N times as much as one copy (which is not quite 
accurate in terms of actual management costs).


Now, if BTRFS were to have the ability to set profiles per-file, then 
this might be useful, albeit with the option to tune how it gets accounted.


Disclaimer: all the above statements in relation to conception and
understanding of quotas, not to be confused with qgroups.





Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-09 13:48, Tomasz Pala wrote:

On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:


2) Different limitations on exclusive/shared bytes
Btrfs can set different limit on exclusive/shared bytes, further
complicating the problem.

3) Btrfs quota only accounts data/metadata used by the subvolume
It lacks all the shared trees (mentioned below), and in fact such
shared tree can be pretty large (especially for extent tree and csum
tree).


I'm not sure about the implications, but just to clarify some things:

when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplicating technique - these are
purely bonuses for system owner, so he could do larger resource overbooking.

So - the limit set on any user should enforce maximum and absolute space
he has allocated, including the shared stuff. I could even imagine that
creating a snapshot might immediately "eat" the available quota. In a
way, that quota returned matches (give or take) `du` reported usage,
unless "do not account reflinks withing single qgroup" was easy to implemet.

I.e.: every shared segment should be accounted within quota (at least once).
I think what you mean to say here is that every shared extent should be 
accounted to quotas for every location it is reflinked from.  IOW, that 
if an extent is shared between two subvolumes each with its own quota, 
they should both have it accounted against their quota.


And the numbers accounted should reflect the uncompressed sizes.
This is actually inconsistent with pretty much every other VFS-level 
quota system in existence.  Even ZFS does its accounting _after_ 
compression.  At this point, it's actually expected by most sysadmins 
that things behave that way.
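
A quick way to see this on the ZFS side (the dataset name is a 
placeholder): 'used' and 'referenced' are the post-compression numbers 
that quotas are checked against, while 'logicalused' is the uncompressed 
figure.

  zfs get used,referenced,logicalused,compressratio tank/home/alice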



Moreover - if there would be per-subvolume RAID levels someday, the data
should be accouted in relation to "default" (filesystem) RAID level,
i.e. having a RAID0 subvolume on RAID1 fs should account half of the
data, and twice the data in an opposite scenario (like "dup" profile on
single-drive filesystem).
This is irrelevant to your point here.  In fact, it goes against it: 
you're arguing for quotas to report data like `du`, but all of the 
chunk-profile stuff is invisible to `du` (and everything else in 
userspace that doesn't look through BTRFS ioctls).



In short: values representing quotas are user-oriented ("the numbers one
bought"), not storage-oriented ("the numbers they actually occupy").


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Austin S. Hemmelgarn

On 2018-08-09 19:35, Qu Wenruo wrote:



On 8/10/18 1:48 AM, Tomasz Pala wrote:

On Tue, Jul 31, 2018 at 22:32:07 +0800, Qu Wenruo wrote:


2) Different limitations on exclusive/shared bytes
Btrfs can set different limit on exclusive/shared bytes, further
complicating the problem.

3) Btrfs quota only accounts data/metadata used by the subvolume
It lacks all the shared trees (mentioned below), and in fact such
shared tree can be pretty large (especially for extent tree and csum
tree).


I'm not sure about the implications, but just to clarify some things:

when limiting somebody's data space we usually don't care about the
underlying "savings" coming from any deduplicating technique - these are
purely bonuses for system owner, so he could do larger resource overbooking.


In reality that's definitely not the case.

 From what I see, most users would care more about exclusively used space
(excl), other than the total space one subvolume is referring to (rfer).

The most common case is, you do a snapshot, user would only care how
much new space can be written into the subvolume, other than the total
subvolume size.
I would really love to know exactly who these users are, because it 
sounds to me like you've heard from exactly zero people who are 
currently using conventional quotas to impose actual resource limits on 
other filesystems (instead of just using them for accounting, which is a 
valid use case but not what they were originally designed for).




So - the limit set on any user should enforce maximum and absolute space
he has allocated, including the shared stuff. I could even imagine that
creating a snapshot might immediately "eat" the available quota. In a
way, that quota returned matches (give or take) `du` reported usage,
unless "do not account reflinks withing single qgroup" was easy to implemet.


In fact, that's the case. In current implementation, accounting on
extent is the easiest (if not the only) way to implement.



I.e.: every shared segment should be accounted within quota (at least once).


Already accounted, at least for rfer.



And the numbers accounted should reflect the uncompressed sizes.


No way for current extent based solution.

While this may be true, this would be a killer feature to have.





Moreover - if there would be per-subvolume RAID levels someday, the data
should be accouted in relation to "default" (filesystem) RAID level,
i.e. having a RAID0 subvolume on RAID1 fs should account half of the
data, and twice the data in an opposite scenario (like "dup" profile on
single-drive filesystem).


No possible again for current extent based solution.




In short: values representing quotas are user-oriented ("the numbers one
bought"), not storage-oriented ("the numbers they actually occupy").


Well, if something is not possible or brings so big performance impact,
there will be no argument on how it should work in the first place.

Thanks,
Qu





Re: BTRFS and databases

2018-08-02 Thread Austin S. Hemmelgarn

On 2018-08-02 06:56, Qu Wenruo wrote:



On 2018年08月02日 18:45, Andrei Borzenkov wrote:



Отправлено с iPhone


2 авг. 2018 г., в 10:02, Qu Wenruo  написал(а):




On 2018年08月01日 11:45, MegaBrutal wrote:
Hi all,

I know it's a decade-old question, but I'd like to hear your thoughts
of today. By now, I became a heavy BTRFS user. Almost everywhere I use
BTRFS, except in situations when it is obvious there is no benefit
(e.g. /var/log, /boot). At home, all my desktop, laptop and server
computers are mainly running on BTRFS with only a few file systems on
ext4. I even installed BTRFS in corporate productive systems (in those
cases, the systems were mainly on ext4; but there were some specific
file systems those exploited BTRFS features).

But there is still one question that I can't get over: if you store a
database (e.g. MySQL), would you prefer having a BTRFS volume mounted
with nodatacow, or would you just simply use ext4?

I know that with nodatacow, I take away most of the benefits of BTRFS
(those are actually hurting database performance – the exact CoW
nature that is elsewhere a blessing, with databases it's a drawback).
But are there any advantages of still sticking to BTRFS for a database
albeit CoW is disabled, or should I just return to the old and
reliable ext4 for those applications?


Since I'm not a expert in database, so I can totally be wrong, but what
about completely disabling database write-ahead-log (WAL), and let
btrfs' data CoW to handle data consistency completely?



This would make content of database after crash completely unpredictable, thus 
making it impossible to reliably roll back transaction.


Btrfs itself (with datacow) can ensure the fs is updated completely.

That's to say, even a crash happens, the content of the fs will be the
same state as previous btrfs transaction (btrfs sync).

Thus there is no need to rollback database transaction though.
(Unless database transaction is not sync to btrfs transaction)


Two issues with this statement:

1. Not all database software properly groups logically related 
operations that need to be atomic as a unit into transactions.
2. Even aside from point 1 and the possibility of database corruption, 
there are other legitimate reasons that you might need to roll-back a 
transaction (for example, the rather obvious case of a transaction that 
should not have happened in the first place).




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Austin S. Hemmelgarn

On 2018-07-20 14:41, Hugo Mills wrote:

On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote:

20.07.2018 20:16, Goffredo Baroncelli пишет:

[snip]

Limiting the number of disk per raid, in BTRFS would be quite simple to implement in the 
"chunk allocator"



You mean that currently RAID5 stripe size is equal to number of disks?
Well, I suppose nobody is using btrfs with disk pools of two or three
digits size.


But they are (even if not very many of them) -- we've seen at least
one person with something like 40 or 50 devices in the array. They'd
definitely got into /dev/sdac territory. I don't recall what RAID level
they were using. I think it was either RAID-1 or -10.

That's the largest I can recall seeing mention of, though.
I've talked to at least two people using it on 100+ disks in a SAN 
situation.  In both cases however, BTRFS itself was only seeing about 20 
devices and running in raid0 mode on them, with each of those being a 
RAID6 volume configured on the SAN node holding the disks for it.  From 
what I understood when talking to them, they actually got rather good 
performance in this setup, though maintenance was a bit of a pain.



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Austin S. Hemmelgarn

On 2018-07-20 13:13, Goffredo Baroncelli wrote:

On 07/19/2018 09:10 PM, Austin S. Hemmelgarn wrote:

On 2018-07-19 13:29, Goffredo Baroncelli wrote:

[...]


So until now you are repeating what I told: the only useful raid profile are
- striping
- mirroring
- striping+paring (even limiting the number of disk involved)
- striping+mirroring


No, not quite.  At least, not in the combinations you're saying make sense if 
you are using standard terminology.  RAID05 and RAID06 are not the same thing 
as 'striping+parity' as BTRFS implements that case, and can be significantly 
more optimized than the trivial implementation of just limiting the number of 
disks involved in each chunk (by, you know, actually striping just like what we 
currently call raid10 mode in BTRFS does).


Could you provide more information ?
Just parity by itself is functionally equivalent to a really stupid 
implementation of 2 or more copies of the data.  Setups with only one 
disk more than the number of parities in RAID5 and RAID6 are called 
degenerate for this very reason.  All sane RAID5/6 implementations do 
striping across multiple devices internally, and that's almost always 
what people mean when talking about striping plus parity.


What I'm referring to is different though.  Just like RAID10 used to be 
implemented as RAID0 on top of RAID1 (a stripe across mirrored pairs), 
RAID05 is RAID0 on top of RAID5.  That is, you're striping your data 
across multiple RAID5 arrays instead of using one big RAID5 array to 
store it all.  As I mentioned, this mitigates the scaling issues 
inherent in RAID5 when it comes to rebuilds (namely, the fact that 
device failure rates go up faster for larger arrays than rebuild times do).


Functionally, such a setup can be implemented in BTRFS by limiting 
RAID5/6 stripe width, but that will have all kinds of performance 
limitations compared to actually striping across all of the underlying 
RAID5 chunks.  In fact, it will have the exact same performance 
limitations you're calling out BTRFS single mode for below.
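
To make the layering concrete, this is roughly how a classic RAID05 is 
built outside of BTRFS (all device names are placeholders):

  # Lower layer: two independent RAID5 arrays.
  mdadm --create /dev/md10 --level=5 --raid-devices=4 /dev/sd[a-d]
  mdadm --create /dev/md11 --level=5 --raid-devices=4 /dev/sd[e-h]

  # Upper layer: RAID0 striped across the two RAID5 arrays.
  mdadm --create /dev/md20 --level=0 --raid-devices=2 /dev/md10 /dev/md11

A rebuild after a single disk failure then only involves the four disks 
in the affected RAID5 leg, which is the scaling benefit described above.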






RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might 
actually make sense in BTRFS to provide a backup means of rebuilding blocks 
that fail checksum validation if both copies fail.

If you need further redundancy, it is easy to implement a parity3 and parity4 
raid profile instead of stacking a raid6+raid1

I think you're misunderstanding what I mean here.

RAID15/16 consist of two layers:
* The top layer is regular RAID1, usually limited to two copies.
* The lower layer is RAID5 or RAID6.

This means that the lower layer can validate which of the two copies in the 
upper layer is correct when they don't agree.


This happens only because there is a redundancy greater than 1. Anyway BTRFS 
has the checksum, which helps a lot in this area
The checksum helps, but what do you do when all copies fail the 
checksum?  Or, worse yet, what do you do when both copies have the 
'right' checksum, but different data?  Yes, you could have one more 
copy, but that just reduces the chances of those cases happening, it 
doesn't eliminate them.


Note that I'm not necessarily saying it makes sense to have support for 
this in BTRFS, just that it's a real-world counter-example to your 
statement that only those combinations make sense.  In the case of 
BTRFS, these would make more sense than RAID51 and RAID61, but they 
still aren't particularly practical.  For classic RAID though, they're 
really important, because you don't have checksumming (unless you have 
T10 DIF capable hardware and a RAID implementation that understands how 
to work with it, but that's rare and expensive) and it makes it easier 
to resize an array than having three copies (you only need 2 new disks 
for RAID15 or RAID16 to increase the size of the array, but you need 3 
for 3-copy RAID1 or RAID10).



It doesn't really provide significantly better redundancy (they can technically 
sustain more disk failures without failing completely than simple two-copy 
RAID1 can, but just like BTRFS raid10, they can't reliably survive more than 
one (or two if you're using RAID6 as the lower layer) disk failure), so it does 
not do the same thing that higher-order parity does.




The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below).
[]


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.


The striping concept is based on the fact that if the "stripe size"

Re: Healthy amount of free space?

2018-07-20 Thread Austin S. Hemmelgarn

On 2018-07-20 01:01, Andrei Borzenkov wrote:

18.07.2018 16:30, Austin S. Hemmelgarn пишет:

On 2018-07-18 09:07, Chris Murphy wrote:

On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
 wrote:


If you're doing a training presentation, it may be worth mentioning that
preallocation with fallocate() does not behave the same on BTRFS as
it does
on other filesystems.  For example, the following sequence of commands:

  fallocate -l X ./tmp
  dd if=/dev/zero of=./tmp bs=1 count=X

Will always work on ext4, XFS, and most other filesystems, for any
value of
X between zero and just below the total amount of free space on the
filesystem.  On BTRFS though, it will reliably fail with ENOSPC for
values
of X that are greater than _half_ of the total amount of free space
on the
filesystem (actually, greater than just short of half).  In essence,
preallocating space does not prevent COW semantics for the first write
unless the file is marked NOCOW.


Is this a bug, or is it suboptimal behavior, or is it intentional?

It's been discussed before, though I can't find the email thread right
now.  Pretty much, this is _technically_ not incorrect behavior, as the
documentation for fallocate doesn't say that subsequent writes can't
fail due to lack of space.  I personally consider it a bug though
because it breaks from existing behavior in a way that is avoidable and
defies user expectations.

There are two issues here:

1. Regions preallocated with fallocate still do COW on the first write
to any given block in that region.  This can be handled by either
treating the first write to each block as NOCOW, or by allocating a bit


How is it possible? As long as fallocate actually allocates space, this
should be checksummed which means it is no more possible to overwrite
it. May be fallocate on btrfs could simply reserve space. Not sure
whether it complies with fallocate specification, but as long as
intention is to ensure write will not fail for the lack of space it
should be adequate (to the extent it can be ensured on btrfs of course).
Also hole in file returns zeros by definition which also matches
fallocate behavior.
Except it doesn't _have_ to be checksummed if there's no data there, and 
that will always be the case for a new allocation.   When I say it could 
be NOCOW, I'm talking specifically about the first write to each newly 
allocated block (that is, one either beyond the previous end of the 
file, or one in a region that used to be a hole).  This obviously won't 
work for places where there are already data.



of extra space and doing a rotating approach like this for writes:
     - Write goes into the extra space.
     - Once the write is done, convert the region covered by the write
   into a new block of extra space.
     - When the final block of the preallocated region is written,
   deallocate the extra space.
2. Preallocation does not completely account for necessary metadata
space that will be needed to store the data there.  This may not be
necessary if the first issue is addressed properly.


And then I wonder what happens with XFS COW:

   fallocate -l X ./tmp
   cp --reflink ./tmp ./tmp2
   dd if=/dev/zero of=./tmp bs=1 count=X

I'm not sure.  In this particular case, this will fail on BTRFS for any
X larger than just short of one third of the total free space.  I would
expect it to fail for any X larger than just short of half instead.

ZFS gets around this by not supporting fallocate (well, kind of, if
you're using glibc and call posix_fallocate, that _will_ work, but it
will take forever because it works by writing out each block of space
that's being allocated, which, ironically, means that that still suffers
from the same issue potentially that we have).


What happens on btrfs then? fallocate specifies that new space should be
initialized to zero, so something should still write those zeros?

For new regions (places that were holes previously, or were beyond the 
end of the file), we create an unwritten extent, which is a region 
that's 'allocated', but everything reads back as zero.  The problem is 
that we don't write into the blocks allocated for the unwritten extent 
at all, and only deallocate them once a write to another block finishes. 
 In essence, we're (either explicitly or implicitly) applying COW 
semantics to a region that should not be COW until after the first write 
to each block.
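
You can actually see those unwritten extents from userspace with 
something like this (the file name is a placeholder):

  fallocate -l 1G /mnt/data/prealloc
  sync
  # FIEMAP reports the preallocated range as an 'unwritten' extent: the
  # space is reserved and reads return zeros, but nothing has been
  # written into it yet.
  filefrag -v /mnt/data/prealloc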


For the case of calling fallocate on existing data, we don't really do 
anything (unless the flag telling fallocate to unshare the region is 
passed).  This is actually consistent with pretty much every other 
filesystem in existence, but that's because pretty much every other 
filesystem in existence implicitly provides the same guarantee that 
fallocate does for regions that already have data.  This case can in 
theory be handled by the same looping algorithm I described above 
without needing the base amount of space allocated, but I wouldn't 
consider it important

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Austin S. Hemmelgarn

On 2018-07-19 13:29, Goffredo Baroncelli wrote:

On 07/19/2018 01:43 PM, Austin S. Hemmelgarn wrote:

On 2018-07-18 15:42, Goffredo Baroncelli wrote:

On 07/18/2018 09:20 AM, Duncan wrote:

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:


On 07/17/2018 11:12 PM, Duncan wrote:

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:


[...]


When I say orthogonal, It means that these can be combined: i.e. you can
have - striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However you can't have mirroring+parity; this means that a notation
where both 'C' ( = number of copy) and 'P' ( = number of parities) is
too verbose.


Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
top of mirroring or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.

And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of) ???

Seriously, of course you can combine a lot of different profile; however the 
only ones that make sense are the ones above.

No, there are cases where other configurations make sense.

RAID05 and RAID06 are very widely used, especially on NAS systems where you 
have lots of disks.  The RAID5/6 lower layer mitigates the data loss risk of 
RAID0, and the RAID0 upper-layer mitigates the rebuild scalability issues of 
RAID5/6.  In fact, this is pretty much the standard recommended configuration 
for large ZFS arrays that want to use parity RAID.  This could be reasonably 
easily supported to a rudimentary degree in BTRFS by providing the ability to 
limit the stripe width for the parity profiles.

Some people use RAID50 or RAID60, although they are strictly speaking inferior 
in almost all respects to RAID05 and RAID06.

RAID01 is also used on occasion, it ends up having the same storage capacity as 
RAID10, but for some RAID implementations it has a different performance 
envelope and different rebuild characteristics.  Usually, when it is used 
though, it's software RAID0 on top of hardware RAID1.

RAID51 and RAID61 used to be used, but aren't much now.  They provided an easy 
way to have proper data verification without always having the rebuild overhead 
of RAID5/6 and without needing to do checksumming. They are pretty much useless 
for BTRFS, as it can already tell which copy is correct.


So until now you are repeating what I told: the only useful raid profile are
- striping
- mirroring
- striping+paring (even limiting the number of disk involved)
- striping+mirroring


No, not quite.  At least, not in the combinations you're saying make 
sense if you are using standard terminology.  RAID05 and RAID06 are not 
the same thing as 'striping+parity' as BTRFS implements that case, and 
can be significantly more optimized than the trivial implementation of 
just limiting the number of disks involved in each chunk (by, you know, 
actually striping just like what we currently call raid10 mode in BTRFS 
does).




RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they might 
actually make sense in BTRFS to provide a backup means of rebuilding blocks 
that fail checksum validation if both copies fail.

If you need further redundancy, it is easy to implement a parity3 and parity4 
raid profile instead of stacking a raid6+raid1

I think you're misunderstanding what I mean here.

RAID15/16 consist of two layers:
* The top layer is regular RAID1, usually limited to two copies.
* The lower layer is RAID5 or RAID6.

This means that the lower layer can validate which of the two copies in 
the upper layer is correct when they don't agree.  It doesn't really 
provide significantly better redundancy (they can technically sustain 
more disk failures without failing completely than simple two-copy RAID1 
can, but just like BTRFS raid10, they can't reliably survive more than 
one (or two if you're using RAID6 as the lower layer) disk failure), so 
it does not do the same thing that higher-order parity does.




The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below).
[]


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.


The striping concept is based on the fact that if the "stripe size" is small 
enough you have a speed benefit because the reads may be performed in par

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Austin S. Hemmelgarn

On 2018-07-19 03:27, Qu Wenruo wrote:



On 2018年07月14日 02:46, David Sterba wrote:

Hi,

I have some goodies that go into the RAID56 problem, although not
implementing all the remaining features, it can be useful independently.

This time my hackweek project

https://hackweek.suse.com/17/projects/do-something-about-btrfs-and-raid56

aimed to implement the fix for the write hole problem but I spent more
time with analysis and design of the solution and don't have a working
prototype for that yet.

This patchset brings a feature that will be used by the raid56 log, the
log has to be on the same redundancy level and thus we need a 3-copy
replication for raid6. As it was easy to extend to higher replication,
I've added a 4-copy replication, that would allow triple copy raid (that
does not have a standardized name).


So this special level will be used for RAID56 for now?
Or it will also be possible for metadata usage just like current RAID1?

If the latter, the metadata scrub problem will need to be considered more.

For more copies RAID1, it's will have higher possibility one or two
devices missing, and then being scrubbed.
For metadata scrub, inlined csum can't ensure it's the latest one.

So for such RAID1 scrub, we need to read out all copies and compare
their generation to find out the correct copy.
At least from the changeset, it doesn't look like it's addressed yet.

And this also reminds me that current scrub is not as flex as balance, I
really like we could filter block groups to scrub just like balance, and
do scrub in a block group basis, other than devid basis.
That's to say, for a block group scrub, we don't really care which
device we're scrubbing, we just need to ensure all device in this block
is storing correct data.

This would actually be rather useful for non-parity cases too.  Being 
able to scrub only metadata when the data chunks are using a profile 
that provides no rebuild support would be great for performance.


On the same note, it would be _really_ nice to be able to scrub a subset 
of the volume's directory tree, even if it were only per-subvolume.
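
For comparison, this is the asymmetry being described: balance already 
accepts fairly fine-grained filters, while scrub only takes a mount 
point or a single device (the paths and devices are placeholders):

  # Balance can be restricted by chunk type, profile, usage, devid, etc.:
  btrfs balance start -dusage=50 -mprofiles=raid1 /mnt/data

  # Scrub can only target the whole filesystem or one device at a time:
  btrfs scrub start /mnt/data
  btrfs scrub start /dev/sdb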



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Austin S. Hemmelgarn

On 2018-07-18 15:42, Goffredo Baroncelli wrote:

On 07/18/2018 09:20 AM, Duncan wrote:

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:


On 07/17/2018 11:12 PM, Duncan wrote:

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:


On 07/15/2018 04:37 PM, waxhead wrote:



Striping and mirroring/pairing are orthogonal properties; mirror and
parity are mutually exclusive.


I can't agree.  I don't know whether you meant that in the global
sense,
or purely in the btrfs context (which I suspect), but either way I
can't agree.

In the pure btrfs context, while striping and mirroring/pairing are
orthogonal today, Hugo's whole point was that btrfs is theoretically
flexible enough to allow both together and the feature may at some
point be added, so it makes sense to have a layout notation format
flexible enough to allow it as well.


When I say orthogonal, It means that these can be combined: i.e. you can
have - striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However you can't have mirroring+parity; this means that a notation
where both 'C' ( = number of copy) and 'P' ( = number of parities) is
too verbose.


Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
top of mirroring or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.

And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on top 
of) ???

Seriously, of course you can combine a lot of different profile; however the 
only ones that make sense are the ones above.

No, there are cases where other configurations make sense.

RAID05 and RAID06 are very widely used, especially on NAS systems where 
you have lots of disks.  The RAID5/6 lower layer mitigates the data loss 
risk of RAID0, and the RAID0 upper-layer mitigates the rebuild 
scalability issues of RAID5/6.  In fact, this is pretty much the 
standard recommended configuration for large ZFS arrays that want to use 
parity RAID.  This could be reasonably easily supported to a rudimentary 
degree in BTRFS by providing the ability to limit the stripe width for 
the parity profiles.


Some people use RAID50 or RAID60, although they are strictly speaking 
inferior in almost all respects to RAID05 and RAID06.


RAID01 is also used on occasion, it ends up having the same storage 
capacity as RAID10, but for some RAID implementations it has a different 
performance envelope and different rebuild characteristics.  Usually, 
when it is used though, it's software RAID0 on top of hardware RAID1.


RAID51 and RAID61 used to be used, but aren't much now.  They provided 
an easy way to have proper data verification without always having the 
rebuild overhead of RAID5/6 and without needing to do checksumming. 
They are pretty much useless for BTRFS, as it can already tell which 
copy is correct.


RAID15 and RAID16 are a similar case to RAID51 and RAID61, except they 
might actually make sense in BTRFS to provide a backup means of 
rebuilding blocks that fail checksum validation if both copies fail.


The fact that you can combine striping and mirroring (or pairing) makes sense 
because you could have a speed gain (see below).
[]


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there's only two copies, on multi-device
btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
alternate device pairs, it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.


The striping concept is based on the fact that if the "stripe size" is small 
enough you have a speed benefit because the reads may be performed in parallel from 
different disks.
That's not the only benefit of striping though.  The other big one is 
that you now have one volume that's the combined size of both of the 
original devices.  Striping is arguably better for this even if you're 
using a large stripe size because it better balances the wear across the 
devices than simple concatenation.



With a "stripe size" of 1GB, it is very unlikely that this would happens.
That's a pretty big assumption.  There are all kinds of access patterns 
that will still distribute the load reasonably evenly across the 
constituent devices, even if they don't parallelize things.


If, for example, all your files are 64k or less, and you only read whole 
files, there's no functional difference between RAID0 with 1GB blocks 
and RAID0 with 64k blocks.  Such a workload is not unusual on a very 
busy mail-server.


  

At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but 

Re: Healthy amount of free space?

2018-07-19 Thread Austin S. Hemmelgarn

On 2018-07-18 17:32, Chris Murphy wrote:

On Wed, Jul 18, 2018 at 12:01 PM, Austin S. Hemmelgarn
 wrote:

On 2018-07-18 13:40, Chris Murphy wrote:


On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy 
wrote:


I don't know for sure, but based on the addresses reported before and
after dd for the fallocated tmp file, it looks like Btrfs is not using
the originally fallocated addresses for dd. So maybe it is COWing into
new blocks, but is just as quickly deallocating the fallocated blocks
as it goes, and hence doesn't end up in enospc?



Previous thread is "Problem with file system" from August 2017. And
there's these reproduce steps from Austin which have fallocate coming
after the dd.

  truncate --size=4G ./test-fs
  mkfs.btrfs ./test-fs
  mkdir ./test
  mount -t auto ./test-fs ./test
  dd if=/dev/zero of=./test/test bs=65536 count=32768
  fallocate -l 2147483650 ./test/test && echo "Success!"


My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
fallocate in half.

[chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
[chris@f28s btrfs]$ sync
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1000m tmp


Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
it, this fails, but I kinda expect that because there's only 1.1G free
space. But maybe that's what you're saying is the bug, it shouldn't
fail?


Yes, you're right, I had things backwards (well, kind of, this does work on
ext4 and regular XFS, so it arguably should work here).


I guess I'm confused what it even means to fallocate over a file with
in-use blocks unless either -d or -p options are used. And from the
man page, I don't grok the distinction between -d and -p either. But
based on their descriptions I'd expect they both should work without
enospc.

Without any specific options, it forces allocation of any sparse regions 
in the file (that is, it gets rid of holes in the file).  On BTRFS, I 
believe the command also forcibly unshares all the extents in the file 
(for the system call, there's a special flag for doing this). 
Additionally, you can extend a file with fallocate this way by 
specifying a length longer than the current size of the file, which 
guarantees that writes into that region will succeed, unlike truncating 
the file to a larger size, which just creates a hole at the end of the 
file to bring it up to size.


As far as `-d` versus `-p`:  `-p` directly translates to the option for 
the system call that punches a hole.  It requires a length and possibly 
an offset, and will punch a hole at that exact location of that exact 
size.  `-d` is a special option that's only available for the command. 
It tells the `fallocate` command to search the file for zero-filled 
regions, and punch holes there.  Neither option should ever trigger an 
ENOSPC, except possibly if it has to split an extent for some reason and 
you are completely out of metadata space.
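
Put in terms of the command line (the file name, offset, and sizes are 
just examples):

  # Plain fallocate: allocate (or extend) so the space is reserved up front:
  fallocate -l 1G ./datafile

  # -p: punch a hole at an exact offset/length (the range becomes sparse):
  fallocate -p -o 4M -l 8M ./datafile

  # -d: scan the file and punch holes wherever whole blocks are all zeroes:
  fallocate -d ./datafile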



Re: Healthy amount of free space?

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 13:40, Chris Murphy wrote:

On Wed, Jul 18, 2018 at 11:14 AM, Chris Murphy  wrote:


I don't know for sure, but based on the addresses reported before and
after dd for the fallocated tmp file, it looks like Btrfs is not using
the originally fallocated addresses for dd. So maybe it is COWing into
new blocks, but is just as quickly deallocating the fallocated blocks
as it goes, and hence doesn't end up in enospc?


Previous thread is "Problem with file system" from August 2017. And
there's these reproduce steps from Austin which have fallocate coming
after the dd.

 truncate --size=4G ./test-fs
 mkfs.btrfs ./test-fs
 mkdir ./test
 mount -t auto ./test-fs ./test
 dd if=/dev/zero of=./test/test bs=65536 count=32768
 fallocate -l 2147483650 ./test/test && echo "Success!"


My test Btrfs is 2G not 4G, so I'm cutting the values of dd and
fallocate in half.

[chris@f28s btrfs]$ sudo dd if=/dev/zero of=tmp bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 7.13391 s, 147 MB/s
[chris@f28s btrfs]$ sync
[chris@f28s btrfs]$ df -h
FilesystemSize  Used Avail Use% Mounted on
/dev/mapper/vg-btrfstest  2.0G 1018M  1.1G  50% /mnt/btrfs
[chris@f28s btrfs]$ sudo fallocate -l 1000m tmp


Succeeds. If I do it with a 1200M file for dd and fallocate 1200M over
it, this fails, but I kinda expect that because there's only 1.1G free
space. But maybe that's what you're saying is the bug, it shouldn't
fail?
Yes, you're right, I had things backwards (well, kind of, this does work 
on ext4 and regular XFS, so it arguably should work here).



Re: Healthy amount of free space?

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 09:07, Chris Murphy wrote:

On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
 wrote:


If you're doing a training presentation, it may be worth mentioning that
preallocation with fallocate() does not behave the same on BTRFS as it does
on other filesystems.  For example, the following sequence of commands:

 fallocate -l X ./tmp
 dd if=/dev/zero of=./tmp bs=1 count=X

Will always work on ext4, XFS, and most other filesystems, for any value of
X between zero and just below the total amount of free space on the
filesystem.  On BTRFS though, it will reliably fail with ENOSPC for values
of X that are greater than _half_ of the total amount of free space on the
filesystem (actually, greater than just short of half).  In essence,
preallocating space does not prevent COW semantics for the first write
unless the file is marked NOCOW.


Is this a bug, or is it suboptimal behavior, or is it intentional?
It's been discussed before, though I can't find the email thread right 
now.  Pretty much, this is _technically_ not incorrect behavior, as the 
documentation for fallocate doesn't say that subsequent writes can't 
fail due to lack of space.  I personally consider it a bug though 
because it breaks from existing behavior in a way that is avoidable and 
defies user expectations.


There are two issues here:

1. Regions preallocated with fallocate still do COW on the first write 
to any given block in that region.  This can be handled by either 
treating the first write to each block as NOCOW, or by allocating a bit 
of extra space and doing a rotating approach like this for writes:

- Write goes into the extra space.
- Once the write is done, convert the region covered by the write
  into a new block of extra space.
- When the final block of the preallocated region is written,
  deallocate the extra space.
2. Preallocation does not completely account for necessary metadata 
space that will be needed to store the data there.  This may not be 
necessary if the first issue is addressed properly.


And then I wonder what happens with XFS COW:

  fallocate -l X ./tmp
  cp --reflink ./tmp ./tmp2
  dd if=/dev/zero of=./tmp bs=1 count=X
I'm not sure.  In this particular case, this will fail on BTRFS for any 
X larger than just short of one third of the total free space.  I would 
expect it to fail for any X larger than just short of half instead.


ZFS gets around this by not supporting fallocate (well, kind of, if 
you're using glibc and call posix_fallocate, that _will_ work, but it 
will take forever because it works by writing out each block of space 
that's being allocated, which, ironically, means that that still suffers 
from the same issue potentially that we have).



Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 03:20, Duncan wrote:

Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
excerpted:


On 07/17/2018 11:12 PM, Duncan wrote:

Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
excerpted:


On 07/15/2018 04:37 PM, waxhead wrote:



Striping and mirroring/pairing are orthogonal properties; mirror and
parity are mutually exclusive.


I can't agree.  I don't know whether you meant that in the global
sense,
or purely in the btrfs context (which I suspect), but either way I
can't agree.

In the pure btrfs context, while striping and mirroring/pairing are
orthogonal today, Hugo's whole point was that btrfs is theoretically
flexible enough to allow both together and the feature may at some
point be added, so it makes sense to have a layout notation format
flexible enough to allow it as well.


When I say orthogonal, It means that these can be combined: i.e. you can
have - striping (RAID0)
- parity  (?)
- striping + parity  (e.g. RAID5/6)
- mirroring  (RAID1)
- mirroring + striping  (RAID10)

However you can't have mirroring+parity; this means that a notation
where both 'C' ( = number of copy) and 'P' ( = number of parities) is
too verbose.


Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on
top of mirroring or mirroring on top of raid5/6, much as raid10 is
conceptually just raid0 on top of raid1, and raid01 is conceptually raid1
on top of raid0.

While it's not possible today on (pure) btrfs (it's possible today with
md/dm-raid or hardware-raid handling one layer), it's theoretically
possible both for btrfs and in general, and it could be added to btrfs in
the future, so a notation with the flexibility to allow parity and
mirroring together does make sense, and having just that sort of
flexibility is exactly why Hugo made the notation proposal he did.

Tho a sensible use-case for mirroring+parity is a different question.  I
can see a case being made for it if one layer is hardware/firmware raid,
but I'm not entirely sure what the use-case for pure-btrfs raid16 or 61
(or 15 or 51) might be, where pure mirroring or pure parity wouldn't
arguably be at least as good a match to the use-case.  Perhaps one of
the other experts in such things here might help with that.


Question #2: historically RAID10 requires 4 disks. However I am
wondering whether the stripe could be done on a different number of disks:
what about RAID1+striping on 3 (or 5) disks?  The key of striping is
that every 64k, the data are stored on a different disk


As someone else pointed out, md/lvm-raid10 already work like this.
What btrfs calls raid10 is somewhat different, but btrfs raid1 pretty
much works this way except with huge (gig size) chunks.


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there are only two copies, on a multi-device
btrfs raid1 with 4+ devices of equal size (so chunk allocations tend to
alternate device pairs), it's effectively striped at the macro level, with
the 1 GiB device-level chunks effectively being huge individual device
strips of 1 GiB.
Actually, it also behaves like LVM and MD RAID10 for any number of 
devices greater than 2, though the exact placement may diverge because 
of BTRFS's concept of different chunk types.  In LVM and MD RAID10, each 
block is stored as two copies, and what disks it ends up on is dependent 
on the block number modulo the number of disks (so, for 3 disks A, B, 
and C, block 0 is on A and B, block 1 is on C and A, and block 2 is on B 
and C, with subsequent blocks following the same pattern).  In an 
idealized model of BTRFS with only one chunk type, you get exactly the 
same behavior (because BTRFS allocates chunks based on disk utilization, 
and prefers lower numbered disks to higher ones in the event of a tie).
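

As a purely illustrative sketch of that modulo layout (not actual MD or 
btrfs code, just the placement rule described above applied to 3 disks and 
2 copies, as a small bash loop):

  disks=(A B C); n=${#disks[@]}
  for blk in 0 1 2 3 4 5; do
      first=$(( (blk * 2) % n ))
      second=$(( (blk * 2 + 1) % n ))
      echo "block $blk -> copies on ${disks[$first]} and ${disks[$second]}"
  done

Running that prints A+B, C+A, B+C, and then the pattern repeats, which 
matches the placement described above.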


At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-18 04:39, Duncan wrote:

Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:


As implemented in BTRFS, raid1 doesn't have striping.


The argument is that because there are only two copies, on a multi-device
btrfs raid1 with 4+ devices of equal size (so chunk allocations tend to
alternate device pairs), it's effectively striped at the macro level,
with the 1 GiB device-level chunks effectively being huge individual
device strips of 1 GiB.

At 1 GiB strip size it doesn't have the typical performance advantage of
striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
strips/chunks.


I forgot this bit...

Similarly, multi-device single is regarded by some to be conceptually
equivalent to raid0 with really huge GiB strips/chunks.

(As you may note, "the argument is" and "regarded by some" are distancing
phrases.  I've seen the argument made on-list, and while I understand it
and agree with it to some extent, I'm still a bit uncomfortable with it
and don't normally make it myself (this thread being a noted exception,
tho originally I simply repeated what someone else had already said
in-thread), because I too agree it's stretching things a bit.  But it does
appear to be a useful conceptual equivalency for some, and I do see the
similarity.
If the file is larger than the data chunk size, it _is_ striped, because 
it spans multiple chunks which are on separate devices.  Otherwise, it's 
more similar to what in GlusterFS is called a 'distributed volume'.  In 
such a Gluster volume, each file is entirely stored on one node (or you 
have a complete copy on N nodes where N is the number of replicas), with 
the selection of what node is used for the next file created being based 
on which node has the most free space.


That said, the main reason I explain single and raid1 the way I do is 
that I've found it's a much simpler way to explain generically how they 
work to people who already have storage background but may not care 
about the specifics.


Perhaps it's a case of coder's view (no code doing it that way, it's just
a coincidental oddity conditional on equal sizes), vs. sysadmin's view
(code or not, accidental or not, it's a reasonably accurate high-level
description of how it ends up working most of the time with equivalent
sized devices).)




Re: Healthy amount of free space?

2018-07-18 Thread Austin S. Hemmelgarn

On 2018-07-17 13:54, Martin Steigerwald wrote:

Nikolay Borisov - 17.07.18, 10:16:

On 17.07.2018 11:02, Martin Steigerwald wrote:

Nikolay Borisov - 17.07.18, 09:20:

On 16.07.2018 23:58, Wolf wrote:

Greetings,
I would like to ask what is a healthy amount of free space to
keep on each device for btrfs to be happy?

This is what my disk array currently looks like

 [root@dennas ~]# btrfs fi usage /raid
 
 Overall:

 Device size:  29.11TiB
 Device allocated: 21.26TiB
 Device unallocated:7.85TiB
 Device missing:  0.00B
 Used: 21.18TiB
 Free (estimated):  3.96TiB  (min: 3.96TiB)
 Data ratio:   2.00
 Metadata ratio:   2.00
 Global reserve:  512.00MiB  (used: 0.00B)


[…]


Btrfs does quite a good job of evenly using space on all devices.
Now, how low can I let that go? In other words, with how much
free/unallocated space remaining should I consider adding a new
disk?


Btrfs will start running into problems when you run out of
unallocated space. So the best advice is to monitor your unallocated
device space; once it gets really low (like 2-3 GB), I suggest
you run a balance, which will try to free up unallocated space by
rewriting data more compactly into sparsely populated block
groups. If after running a balance you haven't really freed any
space, then you should consider adding a new drive and running
another balance to even out the spread of data/metadata.


What are these issues exactly?


For example if you have plenty of data space but your metadata is full
then you will be getting ENOSPC.


Of that one I am aware.

This just did not happen so far.

I have not yet added it explicitly to the training slides, but I just made
myself a note to do that.

Anything else?


If you're doing a training presentation, it may be worth mentioning that 
preallocation with fallocate() does not behave the same on BTRFS as it 
does on other filesystems.  For example, the following sequence of commands:


fallocate -l X ./tmp
dd if=/dev/zero of=./tmp bs=1 count=X

Will always work on ext4, XFS, and most other filesystems, for any value 
of X between zero and just below the total amount of free space on the 
filesystem.  On BTRFS though, it will reliably fail with ENOSPC for 
values of X that are greater than _half_ of the total amount of free 
space on the filesystem (actually, greater than just short of half).  In 
essence, preallocating space does not prevent COW semantics for the 
first write unless the file is marked NOCOW.
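

If you want a self-contained reproduction for the slides, something like 
the following sketch should show the difference (sizes and paths are 
arbitrary; run it as root on a scratch system, and note that the exact 
threshold depends on metadata overhead):

  truncate -s 4G /tmp/btrfs-test.img
  dev=$(losetup --find --show /tmp/btrfs-test.img)
  mkfs.btrfs -q "$dev"
  mkdir -p /mnt/btrfs-test && mount "$dev" /mnt/btrfs-test
  fallocate -l 2500M /mnt/btrfs-test/tmp    # well over half the free space; succeeds
  dd if=/dev/zero of=/mnt/btrfs-test/tmp bs=1M count=2500 conv=notrunc
  umount /mnt/btrfs-test && losetup -d "$dev" && rm /tmp/btrfs-test.img

On ext4 or XFS the dd completes; on BTRFS it is expected to die with 
ENOSPC, because the overwrite of the preallocated extents is still COWed.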




Re: Healthy amount of free space?

2018-07-17 Thread Austin S. Hemmelgarn

On 2018-07-16 16:58, Wolf wrote:

Greetings,
I would like to ask what is a healthy amount of free space to keep on
each device for btrfs to be happy?

This is what my disk array currently looks like

 [root@dennas ~]# btrfs fi usage /raid
 Overall:
 Device size:  29.11TiB
 Device allocated: 21.26TiB
 Device unallocated:7.85TiB
 Device missing:  0.00B
 Used: 21.18TiB
 Free (estimated):  3.96TiB  (min: 3.96TiB)
 Data ratio:   2.00
 Metadata ratio:   2.00
 Global reserve:  512.00MiB  (used: 0.00B)

 Data,RAID1: Size:10.61TiB, Used:10.58TiB
/dev/mapper/data1   1.75TiB
/dev/mapper/data2   1.75TiB
/dev/mapper/data3 856.00GiB
/dev/mapper/data4 856.00GiB
/dev/mapper/data5   1.75TiB
/dev/mapper/data6   1.75TiB
/dev/mapper/data7   6.29TiB
/dev/mapper/data8   6.29TiB

 Metadata,RAID1: Size:15.00GiB, Used:13.00GiB
/dev/mapper/data1   2.00GiB
/dev/mapper/data2   3.00GiB
/dev/mapper/data3   1.00GiB
/dev/mapper/data4   1.00GiB
/dev/mapper/data5   3.00GiB
/dev/mapper/data6   1.00GiB
/dev/mapper/data7   9.00GiB
/dev/mapper/data8  10.00GiB
Slightly OT, but the distribution of metadata chunks across devices 
looks a bit sub-optimal here.  If you can tolerate the volume being 
somewhat slower for a while, I'd suggest balancing these (it should get 
you better performance long-term).
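

Something along these lines should do it (a hedged example; the usage 
filter is optional, it just keeps the balance from rewriting chunks that 
are already well packed):

  btrfs balance start -m /raid
  # or, to finish faster by only touching mostly-empty metadata chunks:
  btrfs balance start -musage=50 /raid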


 System,RAID1: Size:64.00MiB, Used:1.50MiB
/dev/mapper/data2  32.00MiB
/dev/mapper/data6  32.00MiB
/dev/mapper/data7  32.00MiB
/dev/mapper/data8  32.00MiB

 Unallocated:
/dev/mapper/data1  1004.52GiB
/dev/mapper/data2  1004.49GiB
/dev/mapper/data3  1006.01GiB
/dev/mapper/data4  1006.01GiB
/dev/mapper/data5  1004.52GiB
/dev/mapper/data6  1004.49GiB
/dev/mapper/data7  1005.00GiB
/dev/mapper/data8  1005.00GiB

Btrfs does quite a good job of evenly using space on all devices. Now, how
low can I let that go? In other words, with how much free/unallocated
space remaining should I consider adding a new disk?

Disclaimer: What I'm about to say is based on personal experience.  YMMV.

It depends on how you use the filesystem.

Realistically, there are a couple of things I consider when trying to 
decide on this myself:


* How quickly does the total usage increase on average, and how much can 
it be expected to increase in one day in the worst case scenario?  This 
isn't really BTRFS specific, but it's worth mentioning.  I usually don't 
let an array get close enough to full that it wouldn't be able to safely 
handle at least one day of the worst case increase and another 2 of 
average increases.  In BTRFS terms, the 'safely handle' part means you 
should be adding about 5GB for a multi-TB array like you have, or about 
1GB for a sub-TB array.


* What are the typical write patterns?  Do files get rewritten in-place, 
or are they only ever rewritten with a replace-by-rename? Are writes 
mostly random, or mostly sequential?  Are writes mostly small or mostly 
large?  The more towards the first possibility listed in each of those 
questions (in-place rewrites, random access, and small writes), the more 
free space you should keep on the volume.


* Does this volume see heavy usage of fallocate() either to preallocate 
space (note that this _DOES NOT WORK SANELY_ on BTRFS), or to punch 
holes or remove ranges from files?  If whatever software you're using 
does this a lot on this volume, you want even more free space.


* Do old files tend to get removed in large batches?  That is, possibly 
hundreds or thousands of files at a time.  If so, and you're running a 
reasonably recent (4.x series) kernel or regularly balance the volume to 
clean up empty chunks, you don't need quite as much free space.


* How quickly can you get a new device added, and is it critical that 
this volume always be writable?  Sounds stupid, but a lot of people 
don't consider this.  If you can trivially get a new device added 
immediately, you can generally let things go a bit further than you 
would normally; the same goes if the volume being read-only can be tolerated 
for a while without significant issues.


It's worth noting that I explicitly do not care about snapshot usage. 
It rarely has much impact on this other than changing how the total 
usage increases in a day.
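

If you want to automate keeping an eye on this, a minimal sketch of the 
kind of check I'm describing (the mount point, threshold, and output 
parsing are all assumptions you'd adjust for your setup and btrfs-progs 
version):

  #!/bin/sh
  MNT=/raid
  THRESHOLD=$((10 * 1024 * 1024 * 1024))    # warn below 10GiB unallocated
  unalloc=$(btrfs filesystem usage -b "$MNT" | awk '/Device unallocated:/ {print $3}')
  if [ "$unalloc" -lt "$THRESHOLD" ]; then
      echo "WARNING: only $unalloc bytes unallocated on $MNT; consider a balance or a new device"
  fi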


Evaluating all of this is of course something I can't really do for you. 
 If I had to guess, with no other information than the allocations 
shown, I'd say that you're probably generically fine until you get down 
to about 5GB more than twice the average 

Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-16 Thread Austin S. Hemmelgarn

On 2018-07-16 14:29, Goffredo Baroncelli wrote:

On 07/15/2018 04:37 PM, waxhead wrote:

David Sterba wrote:

An interesting question is the naming of the extended profiles. I picked
something that can be easily understood but it's not a final proposal.
Years ago, Hugo proposed a naming scheme that described the
non-standard raid varieties of the btrfs flavor:

https://marc.info/?l=linux-btrfs=136286324417767

Switching to this naming would be a good addition to the extended raid.


As just a humble BTRFS user I agree and really think it is about time to move 
far away from the RAID terminology. However adding some more descriptive 
profile names (or at least some aliases) would be much better for the commoners 
(such as myself).

For example:

Old format / New Format / My suggested alias
SINGLE  / 1C / SINGLE
DUP / 2CD    / DUP (or even MIRRORLOCAL1)
RAID0   / 1CmS   / STRIPE




RAID1   / 2C / MIRROR1
RAID1c3 / 3C / MIRROR2
RAID1c4 / 4C / MIRROR3
RAID10  / 2CmS   / STRIPE.MIRROR1


Striping and mirroring/pairing are orthogonal properties; mirror and parity are 
mutually exclusive. What about

RAID1 -> MIRROR1
RAID10 -> MIRROR1S
RAID1c3 -> MIRROR2
RAID1c3+striping -> MIRROR2S

and so on...


RAID5   / 1CmS1P / STRIPE.PARITY1
RAID6   / 1CmS2P / STRIPE.PARITY2


To me these should be called something like

RAID5 -> PARITY1S
RAID6 -> PARITY2S

The final S is due to the fact that RAID5/6 usually spread the data across all 
available disks

Question #1: for "parity" profiles, does it make sense to limit the maximum number 
of disks the data may be spread across?  If the answer is no, we could omit the last S. 
IMHO it should.
Currently, there is no ability to cap the number of disks that striping 
can happen across.  Ideally, that will change in the future, in which 
case not only the S will be needed, but also a number indicating how 
wide the stripe is.



Question #2: historically RAID10 is requires 4 disks. However I am guessing if 
the stripe could be done on a different number of disks: What about 
RAID1+Striping on 3 (or 5 disks) ? The key of striping is that every 64k, the 
data are stored on a different disk
This is what MD and LVM RAID10 do.  They work somewhat differently from 
what BTRFS calls raid10 (actually, what we currently call raid1 works 
almost identically to MD and LVM RAID10 when more than 3 disks are 
involved, except that the chunk size is 1G or larger).  Short of drastic 
internal changes to how that profile works, this isn't likely to happen.


In spite of both of these, there is practical need for indicating the 
stripe width.  Depending on the configuration of the underlying storage, 
it's fully possible (and sometimes even certain) that you will see 
chunks with differing stripe widths, so properly reporting the stripe 
width (in devices, not bytes) is useful for monitoring purposes.


Consider for example a 6-device array using what's currently called a 
raid10 profile where 2 of the disks are smaller than the other four.  On 
such an array, chunks will span all six disks (resulting in 2 copies 
striped across 3 disks each) until those two smaller disks are full, at 
which point new chunks will span only the remaining four disks 
(resulting in 2 copies striped across 2 disks each).



Re: unsolvable technical issues?

2018-07-03 Thread Austin S. Hemmelgarn

On 2018-07-03 03:35, Duncan wrote:

Austin S. Hemmelgarn posted on Mon, 02 Jul 2018 07:49:05 -0400 as
excerpted:


Notably, most Intel systems I've seen have the SATA controllers in the
chipset enumerate after the USB controllers, and the whole chipset
enumerates after add-in cards (so they almost always have this issue),
while most AMD systems I've seen demonstrate the exact opposite
behavior,
they enumerate the SATA controller from the chipset before the USB
controllers, and then enumerate the chipset before all the add-in cards
(so they almost never have this issue).


Thanks.  That's a difference I wasn't aware of, and would (because I tend
to favor amd) explain why I've never seen a change in enumeration order
unless I've done something like unplug my sata cables for maintenance and
forget which ones I had plugged in where -- random USB stuff left plugged
in doesn't seem to matter, even choosing different boot media from the
bios doesn't seem to matter by the time the kernel runs (I'm less sure
about grub).

Additionally though, if you in some way make sure SATA drivers are 
loaded before USB ones, you will also never see this issue because of 
USB devices (same goes for GRUB).  A lot of laptops that use connections 
other than USB for the keyboard and mouse behave like this if you use a 
properly stripped down initramfs because you won't have USB drivers in 
the initramfs (and therefore the SATA drivers always load first).



Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 13:34, Marc MERLIN wrote:

On Mon, Jul 02, 2018 at 12:59:02PM -0400, Austin S. Hemmelgarn wrote:

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?


Actually, because of the online resize ability in BTRFS, you don't
technically _need_ to use thin provisioning here.  It makes the maintenance
a bit easier, but it also adds a much more complicated layer of indirection
than just doing regular volumes.


You're right that I can use btrfs resize, but then I still need an LVM
device underneath, correct?
So, if I have 10 backup targets, I need 10 LVM LVs, I give them 10%
each of the full size available (as a guess), and then I'd have to
- btrfs resize down one that's bigger than I need
- LVM shrink the LV
- LVM grow the other LV
- LVM resize up the other btrfs

and I think LVM resize and btrfs resize are not linked so I have to do
them separately and hope to type the right numbers each time, correct?
(or is that easier now?)

I kind of liked the thin provisioning idea because it's hands off,
which is appealing. Any reason against it?
No, not currently, except that it adds a whole lot more stuff between 
BTRFS and whatever layer is below it.  That increase in what's being 
done adds some overhead (it's noticeable on 7200 RPM consumer SATA 
drives, but not on decent consumer SATA SSD's).
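

For completeness, the thin-provisioned layout you're describing would be 
set up roughly like this (VG name, pool size, and per-target virtual sizes 
are all made up for illustration):

  lvcreate --type thin-pool -L 9T -n backuppool vg0
  lvcreate -V 2T -T vg0/backuppool -n target1     # one thin LV per backup target
  mkfs.btrfs /dev/vg0/target1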


There used to be issues running BTRFS on top of LVM thin targets which 
had zero mode turned off, but AFAIK, all of those problems were fixed 
long ago (before 4.0).



You could (in theory) merge the LVM and software RAID5 layers, though that
may make handling of the RAID5 layer a bit complicated if you choose to use
thin provisioning (for some reason, LVM is unable to do on-line checks and
rebuilds of RAID arrays that are acting as thin pool data or metadata).
  
Does LVM do built-in raid5 now? Is it as good/trustworthy as mdadm
raid5?
Actually, it uses MD's RAID5 implementation as a back-end.  Same for 
RAID6, and optionally for RAID0, RAID1, and RAID10.



But yeah, if it's incompatible with thin provisioning, it's not that
useful.
It's technically not incompatible, just a bit of a pain.  Last time I 
tried to use it, you had to jump through hoops to repair a damaged RAID 
volume that was serving as an underlying volume in a thin pool, and it 
required keeping the thin pool offline for the entire duration of the 
rebuild.



Alternatively, you could increase your array size, remove the software RAID
layer, and switch to using BTRFS in raid10 mode so that you could eliminate
one of the layers, though that would probably reduce the effectiveness of
bcache (you might want to get a bigger cache device if you do this).


Sadly that won't work. I have more data than will fit on raid10

Thanks for your suggestions though.
Still need to read up on whether I should do thin provisioning, or not.
If you do go with thin provisioning, I would encourage you to make 
certain to call fstrim on the BTRFS volumes on a semi regular basis so 
that the thin pool doesn't get filled up with old unused blocks, 
preferably when you are 100% certain that there are no ongoing writes on 
them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit 
dangerous to do it while writes are happening).
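

As a hedged example, a root crontab entry along these lines would cover 
that (mount points are placeholders, and you'd pick an hour when the 
backups are definitely not running):

  0 4 * * * fstrim /backup1 && fstrim /backup2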



Re: So, does btrfs check lowmem take days? weeks?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 11:19, Marc MERLIN wrote:

Hi Qu,

thanks for the detailled and honest answer.
A few comments inline.

On Mon, Jul 02, 2018 at 10:42:40PM +0800, Qu Wenruo wrote:

For full, it depends. (but for most real world case, it's still flawed)
We have small and crafted images as test cases, which btrfs check can
repair without problem at all.
But such images are *SMALL*, and only have *ONE* type of corruption,
which can't represent real world cases at all.
  
right, they're just unittest images, I understand.



1) Too large fs (especially too many snapshots)
The use case (too many snapshots and shared extents, a lot of extents
get shared over 1000 times) is in fact a super large challenge for
lowmem mode check/repair.
It needs O(n^2) or even O(n^3) to check each backref, which hugely
slows the progress and makes it hard for us to locate the real bug.
  
So, the non lowmem version would work better, but it's a problem if it

doesn't fit in RAM.
I've always considered it a grave bug that btrfs check repair can use so
much kernel memory that it will crash the entire system. This should not
be possible.
While it won't help me here, can btrfs check be improved not to suck all
the kernel memory, and ideally even allow using swap space if the RAM is
not enough?

Is btrfs check regular mode still being maintained? I think it's still
better than lowmem, correct?


2) Corruption in extent tree and our objective is to mount RW
Extent tree is almost useless if we just want to read data.
But when we do any write, we need it, and if it goes wrong even a
tiny bit, your fs could be damaged really badly.

For other corruption, like some fs tree corruption, we could do
something to discard some corrupted files, but if it's extent tree,
we either mount RO and grab anything we have, or hope the
almost-never-working --init-extent-tree can work (that would mostly
be a miracle).
  
I understand that it's the weak point of btrfs, thanks for explaining.



1) Don't keep too many snapshots.
Really, this is the core.
For send/receive backup, IIRC it only needs the parent subvolume to
exist; there is no need to keep the whole history of all those
snapshots.


You are correct on history. The reason I keep history is because I may
want to recover a file from last week or 2 weeks ago after I finally
notice that it's gone.
I have terabytes of space on the backup server, so it's easier to keep
history there than on the client which may not have enough space to keep
a month's worth of history.
As you know, back when we did tape backups, we also kept history of at
least several weeks (usually several months, but that's too much for
btrfs snapshots).
Bit of a case-study here, but it may be of interest.  We do something 
kind of similar where I work for our internal file servers.  We've got 
daily snapshots of the whole server kept on the server itself for 7 days 
(we usually see less than 5% of the total amount of data in changes on 
weekdays, and essentially 0 on weekends, so the snapshots rarely take up 
more than about 25% of the size of the live data), and then we 
additionally do daily backups which we retain for 6 months.  I've 
written up a short (albeit rather system specific script) for recovering 
old versions of a file that first scans the snapshots, and then pulls it 
out of the backups if it's not there.  I've found this works remarkably 
well for our use case (almost all the data on the file server follows a 
WORM access pattern with most of the files being between 100kB and 100MB 
in size).
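

For what it's worth, the helper is nothing fancy; a stripped-down sketch 
of the idea looks roughly like this (paths and the backup fallback are 
placeholders, not our actual script):

  #!/bin/sh
  # usage: recover-old-version.sh path/relative/to/share
  FILE="$1"
  SNAPDIR=/srv/share/.snapshots          # assumed snapshot location
  for snap in $(ls -1r "$SNAPDIR"); do   # newest first, assuming date-based names
      if [ -e "$SNAPDIR/$snap/$FILE" ]; then
          echo "found in snapshot $snap: $SNAPDIR/$snap/$FILE"
          exit 0
      fi
  done
  echo "not in any snapshot; fall back to the backups (e.g. amrecover)" >&2
  exit 1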


We actually did try moving it all over to BTRFS for a while before we 
finally ended up with the setup we currently have, but aside from the 
whole issue with massive numbers of snapshots, we found that for us at 
least, Amanda actually outperforms BTRFS send/receive for everything 
except full backups and uses less storage space (though that last bit is 
largely because we use really aggressive compression).




Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-07-02 11:18, Marc MERLIN wrote:

Hi Qu,

I'll split this part into a new thread:


2) Don't keep unrelated snapshots in one btrfs.
I totally understand that maintaining different btrfs filesystems would hugely add
maintenance pressure, but as explained, all snapshots share one
fragile extent tree.


Yes, I understand that this is what I should do given what you
explained.
My main problem is knowing how to segment things so I don't end up with
filesystems that are full while others are almost empty :)

Am I supposed to put LVM thin volumes underneath so that I can share
the same single 10TB raid5?
Actually, because of the online resize ability in BTRFS, you don't 
technically _need_ to use thin provisioning here.  It makes the 
maintenance a bit easier, but it also adds a much more complicated layer 
of indirection than just doing regular volumes.
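

For reference, the manual shuffle with plain (non-thin) LVs would look 
something like this sketch (made-up VG/LV names and sizes; double-check 
the numbers against 'btrfs fi usage' before shrinking anything):

  btrfs filesystem resize -100G /backup1     # shrink the filesystem first
  lvreduce -L -100G vg0/backup1              # then the LV underneath it
  lvextend -L +100G vg0/backup2              # grow the other LV
  btrfs filesystem resize max /backup2       # and let btrfs claim the new space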


If I do this, I would have
software raid 5 < dmcrypt < bcache < lvm < btrfs
That's a lot of layers, and that's also starting to make me nervous :)

Is there any other way that does not involve me creating smaller block
devices for multiple btrfs filesystems and hope that they are the right
size because I won't be able to change it later?
You could (in theory) merge the LVM and software RAID5 layers, though 
that may make handling of the RAID5 layer a bit complicated if you 
choose to use thin provisioning (for some reason, LVM is unable to do 
on-line checks and rebuilds of RAID arrays that are acting as thin pool 
data or metadata).


Alternatively, you could increase your array size, remove the software 
RAID layer, and switch to using BTRFS in raid10 mode so that you could 
eliminate one of the layers, though that would probably reduce the 
effectiveness of bcache (you might want to get a bigger cache device if 
you do this).



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-06-30 02:33, Duncan wrote:

Austin S. Hemmelgarn posted on Fri, 29 Jun 2018 14:31:04 -0400 as
excerpted:


On 2018-06-29 13:58, james harvey wrote:

On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:

On 2018-06-29 11:15, james harvey wrote:


On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy

wrote:


And an open question I have about scrub is weather it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if
the copies are at least compared to each other?



Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to setup scrub on periodic basis
to me means it's the place to put it.


That said, it can't sanely fix things if there is a mismatch. At
least,
not unless BTRFS gets proper generational tracking to handle
temporarily missing devices.  As of right now, sanely fixing things
requires significant manual intervention, as you have to bypass the
device read selection algorithm to be able to look at the state of the
individual copies so that you can pick one to use and forcibly rewrite
the whole file by hand.


Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.



The thing is though, as things stand right now, you need to manually
edit the data on-disk directly or restore the file from a backup to fix
the file.  While it's technically true that you can manually repair this
type of thing, both of the cases for doing it without those patches I
mentioned, it's functionally impossible for a regular user to do it
without potentially losing some data.


[Usual backups rant, user vs. admin variant, nocow/tmpfs edition.
Regulars can skip as the rest is already predicted from past posts, for
them. =;^]

"Regular user"?

"Regular users" don't need to bother with this level of detail.  They
simply get their "admin" to do it, even if that "admin" is their kid, or
the kid from next door that's good with computers, or the geek squad (aka
nsa-agent-squad) guy/gal, doing it... or telling them to install "a real
OS", meaning whatever MS/Apple/Google something that they know how to
deal with.

If the "user" is dealing with setting nocow, choosing btrfs in the first
place, etc, then they're _not_ a "regular user" by definition, they're
already an admin.

I'd argue that that's not always true.  'Regular users' also blindly 
follow advice they find online about how to make their system run 
better, and quite often don't keep backups.


And as any admin learns rather quickly, the value of data is defined by
the number of backups it's worth having of that data.

Which means it's not a problem.  Either the data had a backup and it's
(reasonably) trivial to restore the data from that backup, or the data
was defined by lack of having that backup as of only trivial value, so
low as to not be worth the time/trouble/resources necessary to make that
backup in the first place.

Which of course means what was defined as of most value, either the data
of there was a backup, or the time/trouble/resources that would have gone
into creating it if not, is *always* saved.

(And of course the same goes for "I had a backup, but it's old", except
in this case it's the value of the data delta between the backup and
current.  As soon as it's worth more than the time/trouble/hassle of
updating the backup, it will by definition be updated.  Not having a
newer backup available thus simply means the value of the data that
changed between the last backup and current was simply not enough to
justify updating the backup, and again, what was of most value is
*always* saved, either the data, or the time that would have otherwise
gone into making the newer backup.)

Because while a "regular user" may not know it because it's not his /job/
to know it, if there's anything an admin knows *well* it's that the
working copy of data **WILL** be damaged.  It's not a matter of if, but
of when, and of whether it'll be a fat-finger mistake, or a hardware or
software failure, or wetware (theft, ransomware, etc), or wetware (flood,
fire and the water that put it out damage, etc), tho none of that
actually matters after all, because in the end, the only thing that
matters was how the value of t

Re: unsolvable technical issues?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-06-30 01:32, Andrei Borzenkov wrote:

30.06.2018 06:22, Duncan пишет:

Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:


On 2018-06-24 16:22, Goffredo Baroncelli wrote:

On 06/23/2018 07:11 AM, Duncan wrote:

waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:


According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
section 1.2

It claims that BTRFS still have significant technical issues that may
never be resolved.


I can speculate a bit.

1) When I see btrfs "technical issue that may never be resolved", the
#1 first thing I think of, that AFAIK there are _definitely_ no plans
to resolve, because it's very deeply woven into the btrfs core by now,
is...

[1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
Universally Unique quite literally, assuming they really *are*
unique, at least on that system[.]  Because
btrfs uses this supposedly unique ID to ID devices that belong to the
filesystem, it can get *very* mixed up, with results possibly
including dataloss, if it sees devices that don't actually belong to a
filesystem with the same UUID as a mounted filesystem.


As partial workaround you can disable udev btrfs rules and then do a
"btrfs dev scan" manually only for the device which you need.



You don't even need `btrfs dev scan` if you just specify the exact set
of devices in the mount options.  The `device=` mount option tells the
kernel to check that device during the mount process.


Not that lvm does any better in this regard[1], but has btrfs ever solved
the bug where only one device= in the kernel commandline's rootflags=
would take effect, effectively forcing initr* on people (like me) who
would otherwise not need them and prefer to do without them, if they're
using a multi-device btrfs as root?



This requires in-kernel device scanning; I doubt we will ever see it.


Not to mention the fact that as kernel people will tell you, device
enumeration isn't guaranteed to be in the same order every boot, so
device=/dev/* can't be relied upon and shouldn't be used -- but of course
device=LABEL= and device=UUID= and similar won't work without userspace,
basically udev (if they work at all, IDK if they actually do).

Tho in practice from what I've seen, device enumeration order tends to be
dependable /enough/ for at least those without enterprise-level numbers
of devices to enumerate.


Just boot with USB stick/eSATA drive plugged in, there are good chances
it changes device order.
It really depends on your particular hardware.  If your USB controllers 
are at lower PCI addresses than your primary disk controllers, then yes, 
this will cause an issue.  Same for whatever SATA controller your eSATA 
port is on (or stupid systems where the eSATA port is port 0 on the main 
SATA controller).


Notably, most Intel systems I've seen have the SATA controllers in the 
chipset enumerate after the USB controllers, and the whole chipset 
enumerates after add-in cards (so they almost always have this issue), 
while most AMD systems I've seen demonstrate the exact opposite 
behavior, they enumerate the SATA controller from the chipset before the 
USB controllers, and then enumerate the chipset before all the add-in 
cards (so they almost never have this issue).


That said, one of the constraints for them remaining consistent is that 
you don't change hardware.



  True, it /does/ change from time to time with a
new kernel, but anybody sane keeps a tested-dependable old kernel around
to boot to until they know the new one works as expected, and that sort
of change is seldom enough that users can boot to the old kernel and
adjust their settings for the new one as necessary when it does happen.
So as "don't do it that way because it's not reliable" as it might indeed
be in theory, in practice, just using an ordered /dev/* in kernel
commandlines does tend to "just work"... provided one is ready for the
occasion when that device parameter might need a bit of adjustment, of
course.


...


---
[1] LVM is userspace code on top of the kernelspace devicemapper, and
therefore requires an initr* if root is on lvm, regardless.  So btrfs
actually does a bit better here, only requiring it for multi-device btrfs.





Re: unsolvable technical issues?

2018-07-02 Thread Austin S. Hemmelgarn

On 2018-06-29 23:22, Duncan wrote:

Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
excerpted:


On 2018-06-24 16:22, Goffredo Baroncelli wrote:

On 06/23/2018 07:11 AM, Duncan wrote:

waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:


According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
section 1.2

It claims that BTRFS still have significant technical issues that may
never be resolved.


I can speculate a bit.

1) When I see btrfs "technical issue that may never be resolved", the
#1 first thing I think of, that AFAIK there are _definitely_ no plans
to resolve, because it's very deeply woven into the btrfs core by now,
is...

[1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
Universally Unique quite literally, assuming they really *are*
unique, at least on that system[.]  Because
btrfs uses this supposedly unique ID to ID devices that belong to the
filesystem, it can get *very* mixed up, with results possibly
including dataloss, if it sees devices that don't actually belong to a
filesystem with the same UUID as a mounted filesystem.


As partial workaround you can disable udev btrfs rules and then do a
"btrfs dev scan" manually only for the device which you need.



You don't even need `btrfs dev scan` if you just specify the exact set
of devices in the mount options.  The `device=` mount option tells the
kernel to check that device during the mount process.


Not that lvm does any better in this regard[1], but has btrfs ever solved
the bug where only one device= in the kernel commandline's rootflags=
would take effect, effectively forcing initr* on people (like me) who
would otherwise not need them and prefer to do without them, if they're
using a multi-device btrfs as root?

I haven't tested this recently myself, so I don't know.


Not to mention the fact that as kernel people will tell you, device
enumeration isn't guaranteed to be in the same order every boot, so
device=/dev/* can't be relied upon and shouldn't be used -- but of course
device=LABEL= and device=UUID= and similar won't work without userspace,
basically udev (if they work at all, IDK if they actually do).
They aren't guaranteed to be stable, but they functionally are provided 
you don't modify hardware in any way and your disks can't be enumerated 
asynchronously without some form of ordered identification (IOW, you're 
using just one SATA or SCSI controller for all your disks).


That said, the required component for the LABEL= and UUID= syntax is not 
udev, it's blkid.  blkid can use udev to avoid having to read 
everything, but it's not mandatory.
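

For example, resolving those names with blkid alone (no udev involved; 
the label and UUID here are just placeholders) is a matter of:

  blkid -L mylabel       # prints the device node carrying that filesystem label
  blkid -U 12345678-abcd-ef01-2345-6789abcdef01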


Tho in practice from what I've seen, device enumeration order tends to be
dependable /enough/ for at least those without enterprise-level numbers
of devices to enumerate.  True, it /does/ change from time to time with a
new kernel, but anybody sane keeps a tested-dependable old kernel around
to boot to until they know the new one works as expected, and that sort
of change is seldom enough that users can boot to the old kernel and
adjust their settings for the new one as necessary when it does happen.
So as "don't do it that way because it's not reliable" as it might indeed
be in theory, in practice, just using an ordered /dev/* in kernel
commandlines does tend to "just work"... provided one is ready for the
occasion when that device parameter might need a bit of adjustment, of
course.


Also, while LVM does have 'issues' with cloned PV's, it fails safe (by
refusing to work on VG's that have duplicate PV's), while BTRFS fails
very unsafely (by randomly corrupting data).


And IMO that "failing unsafe" is both serious and common enough that it
easily justifies adding the point to a list of this sort, thus my putting
it #1.
Agreed.  My point wasn't that BTRFS is doing things correctly, just that 
LVM is not a saint in this respect either (it's just more saintly than 
we are).



2) Subvolume and (more technically) reflink-aware defrag.

It was there for a couple kernel versions some time ago, but
"impossibly" slow, so it was disabled until such time as btrfs could
be made to scale rather better in this regard.



I still contend that the biggest issue WRT reflink-aware defrag was that
it was not optional.  The only way to get the old defrag behavior was to
boot a kernel that didn't have reflink-aware defrag support.  IOW,
_everyone_ had to deal with the performance issues, not just the people
who wanted to use reflink-aware defrag.


Absolutely.

Which of course suggests making it optional, with a suitable warning as
to the speed implications with lots of snapshots/reflinks, when it does
get enabled again (and as David mentions elsewhere, there's apparently
some work going into the idea once again, which potentially moves it from
the 3-5 year range, at best, back to a 1/2-2-year range, time will tell).


3) N-way-mirroring.


[...]
This is no

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 13:58, james harvey wrote:

On Fri, Jun 29, 2018 at 1:09 PM, Austin S. Hemmelgarn
 wrote:

On 2018-06-29 11:15, james harvey wrote:


On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy 
wrote:


And an open question I have about scrub is weather it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?



Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to setup scrub on periodic basis
to me means it's the place to put it.


That said, it can't sanely fix things if there is a mismatch. At least, not
unless BTRFS gets proper generational tracking to handle temporarily missing
devices.  As of right now, sanely fixing things requires significant manual
intervention, as you have to bypass the device read selection algorithm to
be able to look at the state of the individual copies so that you can pick
one to use and forcibly rewrite the whole file by hand.


Absolutely.  User would need to use manual intervention as you
describe, or restore the single file(s) from backup.  But, it's a good
opportunity to tell the user they had partial data corruption, even if
it can't be auto-fixed.  Otherwise they get intermittent data
corruption, depending on which copies are read.
The thing is though, as things stand right now, you need to manually 
edit the data on-disk directly or restore the file from a backup to fix 
the file.  While it's technically true that you can manually repair this 
type of thing, both of the cases for doing it without those patches I 
mentioned, it's functionally impossible for a regular user to do it 
without potentially losing some data.


Unless that changes, scrub telling you it's corrupt is not going to help 
much aside from making sure you don't make things worse by trying to use 
it.  Given this, it would make sense to have a (disabled by default) 
option to have scrub repair it by just using the newer or older copy of 
the data.  That would require classic RAID generational tracking though, 
which BTRFS doesn't have yet.



A while back, Anand Jain posted some patches that would let you select a
particular device to direct all reads to via a mount option, but I don't
think they ever got merged.  That would have made manual recovery in cases
like this exponentially easier (mount read-only with one device selected,
copy the file out somewhere, remount read-only with the other device, drop
caches, copy the file out again, compare and reconcile the two copies, then
remount the volume writable and write out the repaired file).




Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 11:15, james harvey wrote:

On Thu, Jun 28, 2018 at 6:27 PM, Chris Murphy  wrote:

And an open question I have about scrub is weather it only ever is
checking csums, meaning nodatacow files are never scrubbed, or if the
copies are at least compared to each other?


Scrub never looks at nodatacow files.  It does not compare the copies
to each other.

Qu submitted a patch to make check compare the copies:
https://patchwork.kernel.org/patch/10434509/

This hasn't been added to btrfs-progs git yet.

IMO, I think the offline check should look at nodatacow copies like
this, but I still think this also needs to be added to scrub.  In the
patch thread, I discuss my reasons why.  In brief: online scanning;
this goes along with user's expectation of scrub ensuring mirrored
data integrity; and recommendations to setup scrub on periodic basis
to me means it's the place to put it.
That said, it can't sanely fix things if there is a mismatch.  At least, 
not unless BTRFS gets proper generational tracking to handle temporarily 
missing devices.  As of right now, sanely fixing things requires 
significant manual intervention, as you have to bypass the device read 
selection algorithm to be able to look at the state of the individual 
copies so that you can pick one to use and forcibly rewrite the whole 
file by hand.


A while back, Anand Jain posted some patches that would let you select a 
particular device to direct all reads to via a mount option, but I don't 
think they ever got merged.  That would have made manual recovery in 
cases like this exponentially easier (mount read-only with one device 
selected, copy the file out somewhere, remount read-only with the other 
device, drop caches, copy the file out again, compare and reconcile the 
two copies, then remount the volume writable and write out the repaired 
file).
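

To make that procedure concrete, here is a sketch of what it would have 
looked like; 'FORCEDEV1'/'FORCEDEV2' stand in for the never-merged mount 
option from those patches and do not exist today, and the device and file 
paths are placeholders:

  mount -o ro,FORCEDEV1 /dev/sdb1 /mnt
  cp /mnt/images/vm.raw /tmp/copy-1
  umount /mnt
  echo 3 > /proc/sys/vm/drop_caches          # make sure nothing is served from cache
  mount -o ro,FORCEDEV2 /dev/sdb1 /mnt
  cp /mnt/images/vm.raw /tmp/copy-2
  umount /mnt
  cmp /tmp/copy-1 /tmp/copy-2 || echo "copies diverge; reconcile by hand, then rewrite the file"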



Re: btrfs suddenly think's it's raid6

2018-06-29 Thread Austin S. Hemmelgarn

On 2018-06-29 07:04, marble wrote:

Hello,
I have an external HDD. The HDD contains no partition.
I use the whole HDD as a LUKS container. Inside that LUKS is a btrfs.
It's used to store some media files.
The HDD was hooked up to a Raspberry Pi running up-to-date Arch Linux
to play music from the drive.

After disconnecting the drive from the Pi and connecting it to my laptop
again, I couldn't mount it anymore. If I read the dmesg right, it now
thinks that it's part of a raid6.

btrfs check --repair also didn't help.

```
marble@archlinux ~ % uname -a
Linux archlinux 4.17.2-1-ARCH #1 SMP PREEMPT Sat Jun 16 11:08:59 UTC
2018 x86_64 GNU/Linux

marble@archlinux ~ % btrfs --version
btrfs-progs v4.16.1

marble@archlinux ~ % sudo cryptsetup open /dev/sda black
[sudo] password for marble:
Enter passphrase for /dev/sda:

marble@archlinux ~ % mkdir /tmp/black
marble@archlinux ~ % sudo mount /dev/mapper/black /tmp/black
mount: /tmp/black: can't read superblock on /dev/mapper/black.

marble@archlinux ~ % sudo btrfs fi show
Label: 'black'  uuid: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
 Total devices 1 FS bytes used 143.38GiB
 devid    1 size 465.76GiB used 172.02GiB path /dev/mapper/black

marble@archlinux ~ % sudo btrfs check --repair /dev/mapper/black
enabling repair mode
Checking filesystem on /dev/mapper/black
UUID: 9fea91c7-7b0b-4ef9-a83b-e24ccf2586b5
Fixed 0 roots.
checking extents
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
owner ref check failed [1082114048 16384]
repair deleting extent record: key [1082114048,168,16384]
adding new tree backref on start 1082114048 len 16384 parent 0 root 5
Repaired extent references for 1082114048
ref mismatch on [59038556160 4096] extent item 1, found 0
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
incorrect local backref count on 59038556160 root 5 owner 334393 offset
0 found 0 wanted 1 back 0x56348aee5de0
backref disk bytenr does not match extent record, bytenr=59038556160,
ref bytenr=0
backpointer mismatch on [59038556160 4096]
owner ref check failed [59038556160 4096]
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
checksum verify failed on 1082114048 found 1A9EFC07 wanted 204A6979
checksum verify failed on 1082114048 found D7CA51C8 wanted E6334CB3
bytenr mismatch, want=1082114048, have=9385453728028469028
failed to repair damaged filesystem, aborting

marble@archlinux ~ % dmesg > /tmp/dmesg.log
```

Any clues?


It's not thinking it's a raid6 array.  All the messages before this one:

Btrfs loaded, crc32c=crc32c-intel

Are completely unrelated to BTRFS (because anything before that message 
happened before any BTRFS code ran).  The raid6 messages are actually 
from the startup code for the kernel's generic parity RAID implementation.


These:

BTRFS error (device dm-1): bad tree block start 9385453728028469028 
1082114048
BTRFS error (device dm-1): bad tree block start 2365503423870651471 
1082114048


Are the relevant error messages.  Unfortunately, I don't really know 
what's wrong in this case though.  Hopefully one of the developers will 
have some further insight.



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Austin S. Hemmelgarn

On 2018-06-28 07:46, Qu Wenruo wrote:



On 2018年06月28日 19:12, Austin S. Hemmelgarn wrote:

On 2018-06-28 05:15, Qu Wenruo wrote:



On 2018年06月28日 16:16, Andrei Borzenkov wrote:

On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo 
wrote:



On 2018年06月28日 11:14, r...@georgianit.com wrote:



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:



Please get yourself clear of what other raid1 is doing.


A drive failure, where the drive is still there when the computer
reboots, is a situation that *any* raid 1, (or for that matter,
raid 5, raid 6, anything but raid 0) will recover from perfectly
without raising a sweat. Some will rebuild the array automatically,


WOW, that's black magic, at least for RAID1.
The whole RAID1 has no idea of which copy is correct unlike btrfs who
has datasum.

Don't bother other things, just tell me how to determine which one is
correct?



When one drive fails, it is recorded in meta-data on remaining drives;
probably configuration generation number is increased. Next time drive
with older generation is not incorporated. Hardware controllers also
keep this information in NVRAM and so do not even depend on scanning
of other disks.


Yep, the only possible way to determine such case is from external info.

For device generation, it's possible to enhance btrfs, but at least we
could start from detect and refuse to RW mount to avoid possible further
corruption.
But anyway, if one really cares about such case, hardware RAID
controller seems to be the only solution as other software may have the
same problem.

LVM doesn't.  It detects that one of the devices was gone for some
period of time and marks the volume as degraded (and _might_, depending
on how you have things configured, automatically re-sync).  Not sure
about MD, but I am willing to bet it properly detects this type of
situation too.


And the hardware solution looks pretty interesting, is the write to
NVRAM 100% atomic? Even at power loss?

On a proper RAID controller, it's battery backed, and that battery
backing provides enough power to also make sure that the state change is
properly recorded in the event of power loss.


Well, that explains a lot of things.

So a hardware RAID controller can be considered to have a special battery-backed
(always atomic) journal device.
If we can't provide UPS for the whole system, a battery powered journal
device indeed makes sense.






The only possibility is that, the misbehaved device missed several
super
block update so we have a chance to detect it's out-of-date.
But that's not always working.



Why it should not work as long as any write to array is suspended
until superblock on remaining devices is updated?


What happens if there is no generation gap in device superblock?

If one device got some of its (nodatacow) data written to disk, while
the other device doesn't get data written, and neither of them reached
super block update, there is no difference in device superblock, thus no
way to detect which is correct.

Yes, but that should be a very small window (at least, once we finally
quit serializing writes across devices), and it's a problem on existing
RAID1 implementations too (and therefore isn't something we should be
using as an excuse for not doing this).





If you're talking about missing generation check for btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID1 (mdraid or LVM mirrored) can also be affected
(the brain-split case).



That's different. Yes, with software-based raid there is usually no
way to detect outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.


While for VDI case (or any VM image file format other than raw), older
valid data normally means corruption.
Unless they have their own write-ahead log.

Some file format may detect such problem by themselves if they have
internal checksum, but anyway, older data normally means corruption,
especially when partial new and partial old.

On the other hand, with data COW and csum, btrfs can ensure the whole
filesystem update is atomic (at least for single device).
So the title, especially the "major design flaw" part, couldn't be more wrong.

The title is excessive, but I'd agree it's a design flaw that BTRFS
doesn't at least notice that the generation ID's are different and
preferentially trust the device with the newer generation ID.


Well, a design flaw should be something that can't be easily fixed
without *huge* on-disk format or behavior change.
Flaw in btrfs' one-subvolume-per-tree metadata design or current extent
booking behavior could be called design flaw.
That would be a structural design flaw.  It's a result of how the 
software is structured.  There are other types of design flaws though.


While for things like this, just as the submitted RFC patch, less than
100 lines could change the behavior.
I would still consider this case a design flaw (a purely behavioral 

Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Austin S. Hemmelgarn

On 2018-06-28 05:15, Qu Wenruo wrote:



On 2018年06月28日 16:16, Andrei Borzenkov wrote:

On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:



On 2018年06月28日 11:14, r...@georgianit.com wrote:



On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:



Please get yourself clear of what other raid1 is doing.


A drive failure, where the drive is still there when the computer reboots, is a 
situation that *any* raid 1, (or for that matter, raid 5, raid 6, anything but 
raid 0) will recover from perfectly without raising a sweat. Some will rebuild 
the array automatically,


WOW, that's black magic, at least for RAID1.
The whole RAID1 has no idea of which copy is correct unlike btrfs who
has datasum.

Don't bother other things, just tell me how to determine which one is
correct?



When one drive fails, it is recorded in meta-data on remaining drives;
probably configuration generation number is increased. Next time drive
with older generation is not incorporated. Hardware controllers also
keep this information in NVRAM and so do not even depend on scanning
of other disks.


Yep, the only possible way to determine such case is from external info.

For device generation, it's possible to enhance btrfs, but at least we
could start from detect and refuse to RW mount to avoid possible further
corruption.
But anyway, if one really cares about such case, hardware RAID
controller seems to be the only solution as other software may have the
same problem.
LVM doesn't.  It detects that one of the devices was gone for some 
period of time and marks the volume as degraded (and _might_, depending 
on how you have things configured, automatically re-sync).  Not sure 
about MD, but I am willing to bet it properly detects this type of 
situation too.


And the hardware solution looks pretty interesting, is the write to
NVRAM 100% atomic? Even at power loss?
On a proper RAID controller, it's battery backed, and that battery 
backing provides enough power to also make sure that the state change is 
properly recorded in the event of power loss.





The only possibility is that the misbehaved device missed several superblock
updates, so we have a chance to detect it's out of date.
But that doesn't always work.



Why should it not work, as long as any write to the array is suspended
until the superblock on the remaining devices is updated?


What happens if there is no generation gap in device superblock?

If one device got some of its (nodatacow) data written to disk, while
the other device doesn't get data written, and neither of them reached
super block update, there is no difference in device superblock, thus no
way to detect which is correct.
Yes, but that should be a very small window (at least, once we finally 
quit serializing writes across devices), and it's a problem on existing 
RAID1 implementations too (and therefore isn't something we should be 
using as an excuse for not doing this).





If you're talking about missing generation check for btrfs, that's
valid, but it's far from a "major design flaw", as there are a lot of
cases where other RAID1 (mdraid or LVM mirrored) can also be affected
(the brain-split case).



That's different. Yes, with software-based raid there is usually no
way to detect outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.


While for VDI case (or any VM image file format other than raw), older
valid data normally means corruption.
Unless they have their own write-ahead log.

Some file format may detect such problem by themselves if they have
internal checksum, but anyway, older data normally means corruption,
especially when partial new and partial old.

On the other hand, with data COW and csum, btrfs can ensure the whole
filesystem update is atomic (at least for single device).
So the title, especially the "major design flaw" part, couldn't be more wrong.
The title is excessive, but I'd agree it's a design flaw that BTRFS 
doesn't at least notice that the generation ID's are different and 
preferentially trust the device with the newer generation ID. The only 
special handling I can see that would be needed is around volumes 
mounted with the `nodatacow` option, which may not see generation 
changes for a very long time otherwise.





others will automatically kick out the misbehaving drive.  *none* of them will 
take back the drive with old data and start commingling that data with the good 
copy.  This behaviour from BTRFS is completely abnormal, and defeats even the 
most basic expectations of RAID.


RAID1 can only tolerate 1 missing device, it has nothing to do with
error detection.
And it's impossible to detect such case without extra help.

Your expectation is completely wrong.



Well ... somehow it is my experience as well ... :)


Acceptable, but that doesn't really apply to software-based RAID1.

Thanks,
Qu





I'm not the one who has to clear his expectations here.


Re: btrfs raid10 performance

2018-06-26 Thread Austin S. Hemmelgarn

On 2018-06-25 21:05, Sterling Windmill wrote:

I am running a single btrfs RAID10 volume of eight LUKS devices, each
using a 2TB SATA hard drive as a backing store. The SATA drives are a
mixture of Seagate and Western Digital drives, some with RPMs ranging
from 5400 to 7200. Each seems to individually performance test where I
would expect for drives of this caliber. They are all attached to an
LSI PCIe SAS controller and configured in JBOD.

I have a relatively beefy quad core Xeon CPU that supports AES-NI and
don't think LUKS is my bottleneck.

Here's some info from the resulting filesystem:

   btrfs fi df /storage
   Data, RAID10: total=6.30TiB, used=6.29TiB
   System, RAID10: total=8.00MiB, used=560.00KiB
   Metadata, RAID10: total=9.00GiB, used=7.64GiB
   GlobalReserve, single: total=512.00MiB, used=0.00B

In general I see good performance, especially read performance which
is enough to regularly saturate my gigabit network when copying files
from this host via samba. Reads are definitely taking advantage of the
multiple copies of data available and spreading the load among all
drives.

Writes aren't quite as rosy, however.

When writing files using dd like in this example:

   dd if=/dev/zero of=tempfile bs=1M count=10240 conv=fdatasync,notrunc status=progress

And running a command like:

   iostat -m 1

to monitor disk I/O, writes seem to only focus on one of the eight
disks at a time, moving from one drive to the next. This results in a
sustained 55-90 MB/sec throughput depending on which disk is being
written to (remember, some have faster spindle speed than others).

Am I wrong to expect btrfs' RAID10 mode to write to multiple disks
simultaneously and to break larger writes into smaller stripes across
my four pairs of disks?

I had trouble identifying whether btrfs RAID10 is writing (64K?)
stripes or (1GB?) blocks to disk in this mode. The latter might make
more sense based upon what I'm seeing?

Anything else I should be trying to narrow down the bottleneck?
First, you're probably incorrect that the disk access is being 
parallelized.  Given that BTRFS still doesn't parallelize writes in 
raid1 mode, I very much doubt it does so in raid10 mode.  Parallelizing 
writes is a performance optimization that still hasn't really been 
tackled by anyone.  Realistically, BTRFS writes to exactly one disk at a 
time.  So, in a four-disk raid10 array, it first writes to disk 1, waits 
for that to finish, then writes to disk 2, waits for that to finish, 
then disk 3, waits, and then disk 4.  Overall, this makes writes rather slow.


As far as striping across multiple disks, yes, that does happen.  The 
specifics of this are a bit complicated though, and require explaining a 
bit about how BTRFS works in general.


BTRFS uses a two-stage allocator, first allocating 'large' regions of 
disk space to be used for a specific type of data called chunks, and 
then allocating blocks out of those regions to actually store the data. 
There are three chunk types, data (used for storing actual file 
contents), metadata (used for storing things like filenames, access 
times, directory structure, etc), and system (used to store the 
allocation information for all the other chunks in the filesystem). 
Data chunks are typically 1 GB in size, metadata are typically 256 MB in 
size, and system chunks are highly variable but don't really matter for 
this explanation.  The chunk level is where the actual replication and 
striping happen, and the chunk size represents what is exposed to the 
block allocator (so every 1 GB data chunk exposes 1 GB of space to the 
block allocator).
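
You can see this two-level structure on a live filesystem; as a quick 
sketch using the /storage mount point from your output (both commands are 
just read-only reporting):

   btrfs filesystem usage /storage   # space allocated to chunks, per type and profile
   btrfs device usage /storage       # the same chunk allocations, broken down per device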


Now, replicated (raid1 or dup profiles) chunks work just like you would 
expect, each of the two allocations for the chunk is 1 GB, and each byte 
is stored as-is in both.  Striped (raid0 or raid10 profiles) are 
somewhat more complicated, and I actually don't know exactly how they 
end up allocated at the lower level.  However, I do know how the 
striping works.  In short, you can treat each striped set (either a full 
raid0 chunk, or half a raid10 chunk) as being functionally identical in 
operation to a conventional RAID0 array, striping occurs at a small 
block granularity (I think it's equal to the block size, which would be 
4k in most cases), which unfortunately compounds the performance issues 
caused by BTRFS only writing to one disk at a time.


As far as improving the performance, I've got two suggestions for 
alternative storage arrangements:


* If you want to just stick with only BTRFS for storage, try just using 
raid1 mode.  It will give you the same theoretical total capacity as 
raid10 does and will slow down reads somewhat, but should speed up 
writes significantly (because you're only writing to two devices, not 
striping across two sets of four).


* If you're willing to try something a bit different, convert your 
storage array to two LVM or MD RAID0 volumes composed of four devices 
each, and then run BTRFS in raid1 mode on top of 
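
For the first suggestion, the conversion can be done online with a rebalance 
using convert filters; roughly (expect this to run for a long time on ~6.3TiB 
of data):

   btrfs balance start -dconvert=raid1 -mconvert=raid1 /storage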

Re: btrfs balance did not progress after 12H, hang on reboot, btrfs check --repair kills the system still

2018-06-25 Thread Austin S. Hemmelgarn

On 2018-06-25 12:07, Marc MERLIN wrote:

On Tue, Jun 19, 2018 at 12:58:44PM -0400, Austin S. Hemmelgarn wrote:

In your situation, I would run "btrfs pause ", wait to hear from
a btrfs developer, and not use the volume whatsoever in the meantime.

I would say this is probably good advice.  I don't really know what's going
on here myself actually, though it looks like the balance got stuck (the
output hasn't changed for over 36 hours, unless you've got an insanely slow
storage array, that's extremely unusual (it should only be moving at most
3GB of data per chunk)).


I didn't hear from any developer, so I had to continue.
- btrfs scrub cancel did not work (hang)
- at reboot mounting the filesystem hung, even with 4.17, which is
   disappointing (it should not hang)
- mount -o recovery still hung
- mount -o ro did not hang though
One tip here specifically, if you had to reboot during a balance and the 
FS hangs when it mounts, try mounting with `-o skip_balance`.  That 
should pause the balance instead of resuming it on mount, at which point 
you should also be able to cancel it without it hanging.
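
As a rough sketch, using the device and mount point from your earlier output:

   mount -o skip_balance /dev/mapper/dshelf2 /mnt/btrfs_pool2
   btrfs balance cancel /mnt/btrfs_pool2   # drop the paused balance for good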


Sigh, why is my FS corrupted again?
Anyway, back to
btrfs check --repair
and, it took all my 32GB of RAM on a system I can't add more RAM to, so
I'm hosed. I'll note in passing (and it's not ok at all) that check
--repair after a 20 to 30mn pause, takes all the kernel RAM more quickly
than the system can OOM or log anything, and just deadlocks it.
This is repeateable and totally not ok :(

I'm now left with btrfs-progs git master, and lowmem which finally does
a bit of repair.
So far:
gargamel:~# btrfs check --mode=lowmem --repair -p /dev/mapper/dshelf2
enabling repair mode
WARNING: low-memory mode repair support is only partial
Checking filesystem on /dev/mapper/dshelf2
UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
Fixed 0 roots.
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 21872, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Created new chunk [18457780224000 1073741824]
Delete backref in extent [84302495744 69632]
ERROR: extent[84302495744, 69632] referencer count mismatch (root: 22911, 
owner: 374857, offset: 3407872) wanted: 3, have: 4
Delete backref in extent [84302495744 69632]
ERROR: extent[125712527360, 12214272] referencer count mismatch (root: 21872, 
owner: 374857, offset: 114540544) wanted: 181, have: 240
Delete backref in extent [125712527360 12214272]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 21872, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125730848768, 5111808] referencer count mismatch (root: 22911, 
owner: 374857, offset: 126754816) wanted: 68, have: 115
Delete backref in extent [125730848768 5111808]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 21872, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[125736914944, 6037504] referencer count mismatch (root: 22911, 
owner: 374857, offset: 131866624) wanted: 115, have: 143
Delete backref in extent [125736914944 6037504]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 21872, 
owner: 374857, offset: 148234240) wanted: 302, have: 431
Delete backref in extent [129952120832 20242432]
ERROR: extent[129952120832, 20242432] referencer count mismatch (root: 22911, 
owner: 374857, offset: 148234240) wanted: 356, have: 433
Delete backref in extent [129952120832 20242432]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 21872, 
owner: 374857, offset: 180371456) wanted: 161, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[134925357056, 11829248] referencer count mismatch (root: 22911, 
owner: 374857, offset: 180371456) wanted: 162, have: 240
Delete backref in extent [134925357056 11829248]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 21872, 
owner: 374857, offset: 192200704) wanted: 170, have: 249
Delete backref in extent [147895111680 12345344]
ERROR: extent[147895111680, 12345344] referencer count mismatch (root: 22911, 
owner: 374857, offset: 192200704) wanted: 172, have: 251
Delete backref in extent [147895111680 12345344]
ERROR: extent[150850146304, 17522688] referencer count mismatch (root: 21872, 
owner: 374857, offset: 217653248) wanted: 348, have: 418
Delete backref in extent [150850146304 17522688]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 22911, 
owner: 374857, offset: 235175936) wanted: 555, have: 1449
Deleted root 2 item[156909494272, 178, 5476627808561673095]
ERROR: extent[156909494272, 55320576] referencer count mismatch (root: 21872, 
owner: 374857, offset: 235175936) wanted: 556, have: 1452
Deleted root 2 item[156909494272, 178, 7338474132555182983]

At the rate it's going, it'll probably take days though, it's already been 36H

Marc




Re: unsolvable technical issues?

2018-06-25 Thread Austin S. Hemmelgarn

On 2018-06-24 16:22, Goffredo Baroncelli wrote:

On 06/23/2018 07:11 AM, Duncan wrote:

waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:


According to this:

https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
section 1.2

It claims that BTRFS still have significant technical issues that may
never be resolved.
Could someone shed some light on exactly what these technical issues
might be?! What are BTRFS biggest technical problems?

If you forget about the "RAID"5/6 like features then the only annoyances
that I have with BTRFS so far is...

1. Lack of per subvolume "RAID" levels
2. Lack of not using the deviceid to re-discover and re-add dropped
devices

And that's about it really...


... And those both have solutions on the roadmap, with RFC patches
already posted for #2 (tho I'm not sure they use devid) altho
realistically they're likely to take years to appear and be tested to
stability.  Meanwhile...

While as the others have said you really need to go to the author to get
what was referred to, and I agree, I can speculate a bit.  While this
*is* speculation, admittedly somewhat uninformed as I don't claim to be a
dev, and I'd actually be interested in what others think so don't be
afraid to tell me I haven't a clue, as long as you say why... based on
several years reading the list now...

1) When I see btrfs "technical issue that may never be resolved", the #1
first thing I think of, that AFAIK there are _definitely_ no plans to
resolve, because it's very deeply woven into the btrfs core by now, is...

Filesystem UUID Identification.  Btrfs takes the UU bit of Universally
Unique quite literally, assuming they really *are* unique, at least on
that system, and uses them to identify the possibly multiple devices that
may be components of the filesystem, a problem most filesystems don't
have to deal with since they're single-device-only.  Because btrfs uses
this supposedly unique ID to ID devices that belong to the filesystem, it
can get *very* mixed up, with results possibly including dataloss, if it
sees devices that don't actually belong to a filesystem with the same UUID
as a mounted filesystem.


As a partial workaround you can disable the udev btrfs rules and then do a "btrfs dev 
scan" manually, only for the devices which you need. Then you can mount the filesystem. 
Unfortunately you cannot mount two filesystems with the same UUID. However I have to point 
out that LVM/dm might also have problems if you clone a PV.
You don't even need `btrfs dev scan` if you just specify the exact set 
of devices in the mount options.  The `device=` mount option tells the 
kernel to check that device during the mount process.
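
For example (the device names here are purely illustrative), for a two-device 
filesystem:

   mount -t btrfs -o device=/dev/sdb1,device=/dev/sdc1 /dev/sdb1 /mnt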


Also, while LVM does have 'issues' with cloned PV's, it fails safe (by 
refusing to work on VG's that have duplicate PV's), while BTRFS fails 
very unsafely (by randomly corrupting data).


[...]
der say 3-5 (or 5-7, or whatever)

years.  These could arguably include:

2) Subvolume and (more technically) reflink-aware defrag.

It was there for a couple kernel versions some time ago, but "impossibly"
slow, so it was disabled until such time as btrfs could be made to scale
rather better in this regard.


Did you try something like that with XFS+DM snapshots? No you can't, because 
defrag in XFS cannot traverse snapshots (and I have to suppose that defrag 
cannot be effective on a dm-snapshot at all).
What I am trying to point out is that even though btrfs is not the fastest 
filesystem (and for some workloads is VERY slow), when you compare it with a few 
snapshots present, LVM/dm is a lot slower.

IMHO most of the complaints which affect BTRFS are due to the fact that with 
BTRFS a user can quite easily exploit a lot of features and their 
combinations. When the slowness issue appears because some advanced feature 
combinations are used (i.e. multiple-disk profiles and (a lot of) snapshots), 
this is reported as a BTRFS failure. But in fact even LVM/dm is very slow when 
snapshots are used.
I still contend that the biggest issue WRT reflink-aware defrag was that 
it was not optional.  The only way to get the old defrag behavior was to 
boot a kernel that didn't have reflink-aware defrag support.  IOW, 
_everyone_ had to deal with the performance issues, not just the people 
who wanted to use reflink-aware defrag.





There's no hint yet as to when that might actually be, if it will _ever_
be, so this can arguably be validly added to the "may never be resolved"
list.

3) N-way-mirroring.


[...]
This is not an issue, but a not implemented feature

If you're looking at feature parity with competitors, it's an issue.


4) (Until relatively recently, and still in terms of scaling) Quotas.

Until relatively recently, quotas could arguably be added to the list.
They were rewritten multiple times, and until recently, appeared to be
effectively eternally broken.


Even though what you are reporting is correct, I have to point out that the quota in BTRFS 
is more 

Re: btrfs balance did not progress after 12H

2018-06-19 Thread Austin S. Hemmelgarn

On 2018-06-19 12:30, james harvey wrote:

On Tue, Jun 19, 2018 at 11:47 AM, Marc MERLIN  wrote:

On Mon, Jun 18, 2018 at 06:00:55AM -0700, Marc MERLIN wrote:

So, I ran this:
gargamel:/mnt/btrfs_pool2# btrfs balance start -dusage=60 -v .  &
[1] 24450
Dumping filters: flags 0x1, state 0x0, force is off
   DATA (flags 0x2): balancing, usage=60
gargamel:/mnt/btrfs_pool2# while :; do btrfs balance status .; sleep 60; done
0 out of about 0 chunks balanced (0 considered), -nan% left


This (0/0/0, -nan%) seems alarming.  I had this output once when the
system spontaneously rebooted during a balance.  I didn't have any bad
effects afterward.


Balance on '.' is running
0 out of about 73 chunks balanced (2 considered), 100% left
Balance on '.' is running

After about 20mn, it changed to this:
1 out of about 73 chunks balanced (6724 considered),  99% left


This seems alarming.  I wouldn't think # considered should ever exceed
# chunks.  Although, it does say "about", so maybe it can a little
bit, but I wouldn't expect it to exceed it by this much.
Actually, output like this is not unusual.  In the above line, the 1 is 
how many chunks have been actually processed, the 73 is how many the 
command expects to process (that is, the count of chunks that fit the 
filtering requirements, in this case, ones which are 60% or less full), 
and the 6724 is how many it has checked against the filtering 
requirements.  So, if you've got a very large number of chunks, and are 
selecting a small number with filters, then the considered value is 
likely to be significantly higher than the first two.



Balance on '.' is running

Now, 12H later, it's still there, only 1 out of 73.

gargamel:/mnt/btrfs_pool2# btrfs fi show .
Label: 'dshelf2'  uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d
 Total devices 1 FS bytes used 12.72TiB
  devid    1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2

gargamel:/mnt/btrfs_pool2# btrfs fi df .
Data, single: total=13.57TiB, used=12.60TiB
System, DUP: total=32.00MiB, used=1.55MiB
Metadata, DUP: total=121.50GiB, used=116.53GiB
GlobalReserve, single: total=512.00MiB, used=848.00KiB

kernel: 4.16.8

Is that expected? Should I be ready to wait days possibly for this
balance to finish?


It's now beeen 2 days, and it's still stuck at 1%
1 out of about 73 chunks balanced (6724 considered),  99% left


First, my disclaimer.  I'm not a btrfs developer, and although I've
ran balance many times, I haven't really studied its output beyond the
% left.  I don't know why it says "about", and I don't know if it
should ever be that far off.

In your situation, I would run "btrfs pause ", wait to hear from
a btrfs developer, and not use the volume whatsoever in the meantime.
I would say this is probably good advice.  I don't really know what's 
going on here myself actually, though it looks like the balance got 
stuck (the output hasn't changed for over 36 hours, unless you've got an 
insanely slow storage array, that's extremely unusual (it should only be 
moving at most 3GB of data per chunk)).


That said, I would question the value of repacking chunks that are 
already more than half full.  Anything above a 50% usage filter 
generally takes a long time, and has limited value in most cases (higher 
values are less likely to reduce the total number of allocated chunks). 
With `-dusage=50` or less, you're guaranteed to reduce the number of 
chunks if at least two match, and it isn't very time consuming for the 
allocator, all because you can pack at least two matching chunks into 
one 'new' chunk (new in quotes because it may re-pack them into existing 
slack space on the FS).  Additionally, `-dusage=50` is usually 
sufficient to mitigate the typical ENOSPC issues that regular balancing 
is supposed to help with.
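
In other words, something along these lines (same mount point as your earlier 
commands) is usually enough for routine maintenance:

   btrfs balance start -dusage=50 /mnt/btrfs_pool2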



Re: ntfs -> qemu -> raw-image -> btrfs -> iscsi

2018-06-15 Thread Austin S. Hemmelgarn

On 2018-06-15 13:40, Chris Murphy wrote:

On Fri, Jun 15, 2018 at 5:33 AM, ein  wrote:

Hello group,

does anyone have had any luck with hosting qemu kvm images resided on BTRFS 
filesystem while serving
the volume via iSCSI?

I encountered some unidentified problem and I am able to replicate it. Basically 
the NTFS filesystem inside the RAW image gets corrupted every time the Windows 
guest boots. What is weird is that changing
filesystem for ext4 or xfs solves the issue.

The problem replication looks as follows:
1) run chkdsk on the guest to make sure the filesystem structure is in good 
shape,
2) shut down the VM via libvirtd,
3) rsync changes between source and backup image,
4) generate SHA1 for backup and original and compare it,
5) try to run guest on the backup image, I was able to boot windows once for 
ten times, every time
after reboot NTFS' chkdsk finds problems with filesystem and the VM is unable 
to boot again.

What am I missing?

VM disk config:

   (libvirt <disk>/<driver> XML, with cache='none', was stripped by the mail archive)

cache=none uses O_DIRECT, and that's the source of the issue with VM
images on Btrfs. Details are in the list archive.

I'm not really sure what you want to use with Windows in this
particular case, probably not cache=unsafe though. I'd say give
writethrough a shot and see how it affects performance and fixes this
problem.

cache=writethrough is probably going to be the best option, unless you 
want to switch to cache=writeback and disable write caching in Windows 
(which from what I hear can actually give better performance than using 
cache=none).
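
If you want to test the effect outside of libvirt first, the equivalent on a 
plain QEMU command line is just the cache= property of the drive (the image 
path here is illustrative):

   qemu-system-x86_64 -enable-kvm -m 4096 \
       -drive file=/var/lib/libvirt/images/win.raw,format=raw,cache=writethrough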



Re: csum failed root raveled during balance

2018-05-29 Thread Austin S. Hemmelgarn

On 2018-05-29 10:02, ein wrote:

On 05/29/2018 02:12 PM, Austin S. Hemmelgarn wrote:

On 2018-05-28 13:10, ein wrote:

On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:

On 2018-05-23 06:09, ein wrote:

On 05/23/2018 11:09 AM, Duncan wrote:

ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:


IMHO the best course of action would be to disable checksumming for
your
vm files.


Do you mean '-o nodatasum' mount flag? Is it possible to disable
checksumming for singe file by setting some magical chattr? Google
thinks it's not possible to disable csums for a single file.


You can use nocow (-C), but of course that has other restrictions (like
setting it on the files when they're zero-length, easiest done for
existing data by setting it on the containing dir and copying files (no
reflink) in) as well as the nocow effects, and nocow becomes cow1
after a
snapshot (which locks the existing copy in place so changes written to a
block /must/ be written elsewhere, thus the cow1, aka cow the first time
written after the snapshot but retain the nocow for repeated writes
between snapshots).

But if you're disabling checksumming anyway, nocow's likely the way
to go.


Disabling checksumming only may be a way to go - we live without it
every day. But nocow @ VM files defeats whole purpose of using BTRFS for
me, even with huge performance penalty - backup reasons - I mean few
snapshots (20-30), send & receive.


Setting NOCOW on a file doesn't prevent it from being snapshotted, it
just prevents COW operations from happening under most normal
circumstances.  In essence, it prevents COW from happening except for
writing right after the snapshot.  More specifically, the first write to
a given block in a file set for NOCOW after taking a snapshot will
trigger a _single_ COW operation for _only_ that block (unless you have
autodefrag enabled too), after which that block will revert to not doing
COW operations on write.  This way, you still get consistent and working
snapshots, but you also don't take the performance hits from COW except
right after taking a snapshot.


Yeah, just after I've post it, I've found some Duncan post from 2015,
explaining it, thank you anyway.

Is there anything we can do better in random/write VM workload to speed
the BTRFS up and why?

My settings:


    
    
    
    [...]


/dev/mapper/raid10-images on /var/lib/libvirt type btrfs
(rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)

md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
    468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
    bitmap: 0/4 pages [0KB], 65536KB chunk

CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
kernel 4.15.0-0.bpo.2-amd64

As far as I understand compress and autodefrag are impacting negatively
for performance (latency), especially autodefrag. I think also that
nodatacow shall also speed things up and it's a must when using qemu and
BTRFS. Is it better to use virtio or virt-scsi with TRIM support?


FWIW, I've been doing just fine without nodatacow, but I also use raw images 
contained in sparse
files, and keep autodefrag off for the dedicated filesystem I put the images on.


So do I, RAW images created by qemu-img, but I am not sure if preallocation 
works as expected. The
size of disks in filesystem looks fine though.

Unless I'm mistaken, qemu-img will fully pre-allocate the images.

You can easily check though with `ls -ls`, which will show the amount of 
space taken up by the file on-disk (before compression or deduplication) 
on the left.  If that first column on the left doesn't match up with the 
apparent file size, then the file is sparse and not fully pre-allocated.
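
A quick sketch of the difference (the file names and sizes here are arbitrary):

   fallocate -l 10G prealloc.img   # fully allocated: first column of ls -ls roughly matches the size
   truncate -s 10G sparse.img      # sparse: first column stays near zero until data is written
   ls -ls prealloc.img sparse.img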


From a practical perspective, if you really want maximal performance, 
it's worth pre-allocating space, as that both avoids the non-determinism 
of allocating blocks on first-write, and avoids some degree of 
fragmentation.


If you would rather save the space and not pre-allocate, you can use 
truncate with the `--size` argument to quickly create an appropriately 
sized sparse virtual disk image file.


May I ask in what workloads? From my testing while having VM on BTRFS storage:
- file/web servers works perfect on BTRFS.
- Windows (2012/2016) file servers with AD, are perfect too, besides time 
required for Windows
Update, but this service is... let's say not fine enough.
- database (firebird) impact is huuuge, guest filesystem is Ext4, the database 
performs slower in
this conditions (4 SSDs in RAID10) than when it was on raid1 with 2 10krpm 
SASes. I am still
thinking how to benchmark it properly. A lot of iowait in host's kernel.
In my case, I've got a couple of different types of VM's, each with it's 
own type of workload:
- A total of 8 static VM's that are always running, each running a 
different distribution/version of Linux.  These see very little activity 
most of the time (I keep them around as reference systems so i 

Re: csum failed root raveled during balance

2018-05-29 Thread Austin S. Hemmelgarn

On 2018-05-28 13:10, ein wrote:

On 05/23/2018 01:03 PM, Austin S. Hemmelgarn wrote:

On 2018-05-23 06:09, ein wrote:

On 05/23/2018 11:09 AM, Duncan wrote:

ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:


IMHO the best course of action would be to disable checksumming for
your
vm files.


Do you mean '-o nodatasum' mount flag? Is it possible to disable
checksumming for singe file by setting some magical chattr? Google
thinks it's not possible to disable csums for a single file.


You can use nocow (-C), but of course that has other restrictions (like
setting it on the files when they're zero-length, easiest done for
existing data by setting it on the containing dir and copying files (no
reflink) in) as well as the nocow effects, and nocow becomes cow1
after a
snapshot (which locks the existing copy in place so changes written to a
block /must/ be written elsewhere, thus the cow1, aka cow the first time
written after the snapshot but retain the nocow for repeated writes
between snapshots).

But if you're disabling checksumming anyway, nocow's likely the way
to go.


Disabling checksumming only may be a way to go - we live without it
every day. But nocow @ VM files defeats whole purpose of using BTRFS for
me, even with huge performance penalty - backup reasons - I mean few
snapshots (20-30), send & receive.


Setting NOCOW on a file doesn't prevent it from being snapshotted, it
just prevents COW operations from happening under most normal
circumstances.  In essence, it prevents COW from happening except for
writing right after the snapshot.  More specifically, the first write to
a given block in a file set for NOCOW after taking a snapshot will
trigger a _single_ COW operation for _only_ that block (unless you have
autodefrag enabled too), after which that block will revert to not doing
COW operations on write.  This way, you still get consistent and working
snapshots, but you also don't take the performance hits from COW except
right after taking a snapshot.


Yeah, just after I've post it, I've found some Duncan post from 2015,
explaining it, thank you anyway.

Is there anything we can do better in random/write VM workload to speed
the BTRFS up and why?

My settings:

   (libvirt disk/driver XML settings were stripped by the mail archive)
   [...]


/dev/mapper/raid10-images on /var/lib/libvirt type btrfs
(rw,noatime,nodiratime,compress=lzo:3,ssd,space_cache,autodefrag,subvolid=5,subvol=/)

md1 : active raid10 sdc1[2] sdb2[1] sdd1[3] sda2[0]
   468596736 blocks super 1.2 512K chunks 2 near-copies [4/4] [UUUU]
   bitmap: 0/4 pages [0KB], 65536KB chunk

CPU: E3-1246 with: VT-x, VT-d, HT, EPT, TSX-NI, AES-NI on debian's
kernel 4.15.0-0.bpo.2-amd64

As far as I understand compress and autodefrag are impacting negatively
for performance (latency), especially autodefrag. I think also that
nodatacow shall also speed things up and it's a must when using qemu and
BTRFS. Is it better to use virtio or virt-scsi with TRIM support?

FWIW, I've been doing just fine without nodatacow, but I also use raw 
images contained in sparse files, and keep autodefrag off for the 
dedicated filesystem I put the images on.


Compression shouldn't have much in the way of negative impact unless 
you're also using transparent compression (or disk for file encryption) 
inside the VM (in fact, it may speed things up significantly depending 
on what filesystem is being used by the guest OS, the ext4 inode table 
in particular seems to compress very well).  If you are using 
`nodatacow` though, you can just turn compression off, as it's not going 
to be used anyway.  If you want to keep using compression, then I'd 
suggest using `compress-force` instead of `compress`, which makes BTRFS 
a bit more aggressive about trying to compress things, but makes the 
performance much more deterministic.  You may also want to look into 
using `zstd` instead of `lzo` for the compression, it gets better ratios 
most of the time, and usually performs better than `lzo` does.
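
For example, switching an existing mount over (this only affects data written 
from then on) would look roughly like:

   mount -o remount,compress-force=zstd /var/lib/libvirt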


Autodefrag should probably be off.  If you have nodatacow set (or just 
have all the files marked with the NOCOW attribute), then there's not 
really any point to having autodefrag on.  If like me you aren't turning 
off COW for data, it's still a good idea to have it off and just do 
batch defragmentation at a regularly scheduled time.


For the VM settings, everything looks fine to me (though if you have 
somewhat slow storage and aren't giving the VM's lots of memory to work 
with, doing write-through caching might be helpful).  I would probably 
be using virtio-scsi for the TRIM support, as with raw images you will 
get holes in the file where the TRIM command was issued, which can 
actually improve performance and does improve storage utilization 
(though doing batch trims instead of using the `discard` mount option is 
better for performance if you have Linux guests).
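
By 'batch trims' I mean something like the following inside the Linux guests 
(the systemd timer assumes the guest ships util-linux's fstrim.timer):

   fstrim -av                            # one-off: trim all mounted filesystems that support it
   systemctl enable --now fstrim.timer   # or just schedule it weekly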


You're using an MD RAID10 array.  This is generally the fastest option 
in terms of performance, but it also means you c

Re: csum failed root raveled during balance

2018-05-23 Thread Austin S. Hemmelgarn

On 2018-05-23 06:09, ein wrote:

On 05/23/2018 11:09 AM, Duncan wrote:

ein posted on Wed, 23 May 2018 10:03:52 +0200 as excerpted:


IMHO the best course of action would be to disable checksumming for your
vm files.


Do you mean '-o nodatasum' mount flag? Is it possible to disable
checksumming for singe file by setting some magical chattr? Google
thinks it's not possible to disable csums for a single file.


You can use nocow (-C), but of course that has other restrictions (like
setting it on the files when they're zero-length, easiest done for
existing data by setting it on the containing dir and copying files (no
reflink) in) as well as the nocow effects, and nocow becomes cow1 after a
snapshot (which locks the existing copy in place so changes written to a
block /must/ be written elsewhere, thus the cow1, aka cow the first time
written after the snapshot but retain the nocow for repeated writes
between snapshots).

But if you're disabling checksumming anyway, nocow's likely the way to go.


Disabling checksumming only may be a way to go - we live without it
every day. But nocow @ VM files defeats whole purpose of using BTRFS for
me, even with huge performance penalty - backup reasons - I mean few
snapshots (20-30), send & receive.

Setting NOCOW on a file doesn't prevent it from being snapshotted, it 
just prevents COW operations from happening under most normal 
circumstances.  In essence, it prevents COW from happening except for 
writing right after the snapshot.  More specifically, the first write to 
a given block in a file set for NOCOW after taking a snapshot will 
trigger a _single_ COW operation for _only_ that block (unless you have 
autodefrag enabled too), after which that block will revert to not doing 
COW operations on write.  This way, you still get consistent and working 
snapshots, but you also don't take the performance hits from COW except 
right after taking a snapshot.
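
For reference, the usual way to set this up for VM images is at the directory 
level, before the image files exist (the paths here are illustrative):

   mkdir /var/lib/libvirt/images-nocow
   chattr +C /var/lib/libvirt/images-nocow              # new files inherit NOCOW
   cp --reflink=never old.img /var/lib/libvirt/images-nocow/
   lsattr /var/lib/libvirt/images-nocow/old.img         # should show the 'C' attribute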




Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Austin S. Hemmelgarn

On 2018-05-21 13:43, David Sterba wrote:

On Fri, May 18, 2018 at 01:10:02PM -0400, Austin S. Hemmelgarn wrote:

On 2018-05-18 12:36, Niccolò Belli wrote:

On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote:

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Yeah I know of Josef's work, but 4 years had passed since then without
any news on this front.

What I would really like to know is why nobody resumed his work: is it
because it's impossible to implement snapshot-aware defrag without
excessive ram usage or is it simply because nobody is interested?

I think it's because nobody who is interested has both the time and the
coding skills to tackle it.

Personally though, I think the biggest issue with what was done was not
the memory consumption, but the fact that there was no switch to turn it
on or off.  Making defrag unconditionally snapshot aware removes one of
the easiest ways to forcibly unshare data without otherwise altering the
files (which, as stupid as it sounds, is actually really useful for some
storage setups), and also forces the people who have ridiculous numbers
of snapshots to deal with the memory usage or never defrag.


Good points. The logic of the sharing-aware is a technical detail,
what's being discussed is the usecase and I think this would be good to
clarify.

1) always -- the old (and now disabled) way, unconditionally (ie. no
option for the user), problems with memory consumption

2) more fine grained:

2.1) defragment only the non-shared extents, ie. no sharing awareness
  needed, shared extents will be silently skipped

2.2) defragment only within the given subvolume -- like 1) but by user's choice

The naive dedup, that Tomasz (CCed) mentions in another mail, would be
probably beyond the defrag purpose and would make things more
complicated.

I'd vote for keeping complexity of the ioctl interface and defrag
implementation low, so if it's simply saying "do forcible defrag" or
"skip shared", then it sounds ok.

If there's eg. "keep sharing only on these subvolumes", then it
would need to read the snapshot ids from ioctl structure, then enumerate
all extent owners and do some magic to unshare/defrag/share. That's a
quick idea, lots of details would need to be clarified.

From my perspective, I see two things to consider that are somewhat 
orthogonal to each other:


1. Whether to recurse into subvolumes or not (IIRC, we currently do not 
do so, because we see them like a mount point).
2. Whether to use the simple (not reflink-aware) defrag, the reflink 
aware one, or to base it on the extent/file type (use old simpler one 
for unshared extents, and new reflink aware one for shared extents).


This second set of options is what I'd like to see the most (possibly 
without the option to base it on file or extent sharing automatically), 
though the first one would be nice to have.


Better yet, having that second set of options and making the new 
reflink-aware defrag opt-in would allow people who really want it to use 
it, and those of us who don't need it for our storage setups to not need 
to worry about it.



Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Austin S. Hemmelgarn

On 2018-05-21 09:42, Timofey Titovets wrote:

пн, 21 мая 2018 г. в 16:16, Austin S. Hemmelgarn <ahferro...@gmail.com>:


On 2018-05-19 04:54, Niccolò Belli wrote:

On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:

With a bit of work, it's possible to handle things sanely.  You can
deduplicate data from snapshots, even if they are read-only (you need
to pass the `-A` option to duperemove and run it as root), so it's
perfectly reasonable to only defrag the main subvolume, and then
deduplicate the snapshots against that (so that they end up all being
reflinks to the main subvolume).  Of course, this won't work if you're
short on space, but if you're dealing with snapshots, you should have
enough space that this will work (because even without defrag, it's
fully possible for something to cause the snapshots to suddenly take
up a lot more space).


Been there, tried that. Unfortunately even if I skip the defrag a simple

duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs

is going to eat more space than it was previously available (probably
due to autodefrag?).

It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME
ioctl).  There's two things involved here:



* BTRFS has somewhat odd and inefficient handling of partial extents.
When part of an extent becomes unused (because of a CLONE ioctl, or an
EXTENT_SAME ioctl, or something similar), that part stays allocated
until the whole extent would be unused.
* You're using the default deduplication block size (128k), which is
larger than your filesystem block size (which is at most 64k, most
likely 16k, but might be 4k if it's an old filesystem), so deduplicating
can split extents.


That's a metadata node leaf != fs block size.
btrfs fs block size == machine page size currently.
You're right, I keep forgetting about that (probably because BTRFS is 
pretty much the only modern filesystem that doesn't let you change the 
block size).



Because of this, if a duplicate region happens to overlap the front of
an already shared extent, and the end of said shared extent isn't
aligned with the deduplication block size, the EXTENT_SAME call will
deduplicate the first part, creating a new shared extent, but not the
tail end of the existing shared region, and all of that original shared
region will stick around, taking up extra space that it wasn't before.



Additionally, if only part of an extent is duplicated, then that area of
the extent will stay allocated, because the rest of the extent is still
referenced (so you won't necessarily see any actual space savings).



You can mitigate this by telling duperemove to use the same block size
as your filesystem using the `-b` option.   Note that using a smaller
block size will also slow down the deduplication process and greatly
increase the size of the hash file.


duperemove -b controls how the data is hashed, nothing more or less, and it 
only supports 4KiB..1MiB.
And you can only deduplicate the data at the granularity you hashed it 
at.  In particular:


* The total size of a region being deduplicated has to be an exact 
multiple of the hash block size (what you pass to `-b`).  So for the 
default 128k size, you can only deduplicate regions that are multiples 
of 128k long (128k, 256k, 384k, 512k, etc).   This is a simple limit 
derived from how blocks are matched for deduplication.
* Because duperemove uses fixed hash blocks (as opposed to using a 
rolling hash window like many file synchronization tools do), the 
regions being deduplicated also have to be exactly aligned to the hash 
block size.  So, with the default 128k size, you can only deduplicate 
regions starting at 0k, 128k, 256k, 384k, 512k, etc, but not ones 
starting at, for example, 64k into the file.


And the block size for dedup will change the efficiency of deduplication,
while the count of hash-block pairs will change the hash file size and time
complexity.

Let's assume that: 'A' - 1KiB of data, 'AAAA' - 4KiB with a repeated pattern.

So, for example, you have 2 blocks of 2x4KiB each:
1: 'AAAABBBB'
2: 'BBBBAAAA'

With -b 8KiB the hash of the first block is not the same as the second.
But with -b 4KiB duperemove will see both 'AAAA' and 'BBBB'
And then those blocks will be deduped.
This supports what I'm saying though.  Your deduplication granularity is 
bounded by your hash granularity.  If in addition to the above you have 
a file that looks like:


AABBBBAA

It would not get deduplicated against the first two at either `-b 4k` or 
`-b 8k` despite the middle 4k of the file being an exact duplicate of 
the final 4k of the first file and first 4k of the second one.


If instead you have:

AABB

And the final 6k is a single on-disk extent, that extent will get split 
when you go to deduplicate against the first two files with a 4k block 
size because only the final 4k can be deduplicated, and the entire 6k 
original extent will stay completely allocated.


Even, duperemove have 2 modes of deduping:
1. By extents
2. By blocks
Ye

Re: Any chance to get snapshot-aware defragmentation?

2018-05-21 Thread Austin S. Hemmelgarn

On 2018-05-19 04:54, Niccolò Belli wrote:

On venerdì 18 maggio 2018 20:33:53 CEST, Austin S. Hemmelgarn wrote:
With a bit of work, it's possible to handle things sanely.  You can 
deduplicate data from snapshots, even if they are read-only (you need 
to pass the `-A` option to duperemove and run it as root), so it's 
perfectly reasonable to only defrag the main subvolume, and then 
deduplicate the snapshots against that (so that they end up all being 
reflinks to the main subvolume).  Of course, this won't work if you're 
short on space, but if you're dealing with snapshots, you should have 
enough space that this will work (because even without defrag, it's 
fully possible for something to cause the snapshots to suddenly take 
up a lot more space).


Been there, tried that. Unfortunately even if I skip the defrag a simple

duperemove -drhA --dedupe-options=noblock --hashfile=rootfs.hash rootfs

is going to eat more space than it was previously available (probably 
due to autodefrag?).
It's not autodefrag (that doesn't trigger on use of the EXTENT_SAME 
ioctl).  There's two things involved here:


* BTRFS has somewhat odd and inefficient handling of partial extents. 
When part of an extent becomes unused (because of a CLONE ioctl, or an 
EXTENT_SAME ioctl, or something similar), that part stays allocated 
until the whole extent would be unused.
* You're using the default deduplication block size (128k), which is 
larger than your filesystem block size (which is at most 64k, most 
likely 16k, but might be 4k if it's an old filesystem), so deduplicating 
can split extents.


Because of this, if a duplicate region happens to overlap the front of 
an already shared extent, and the end of said shared extent isn't 
aligned with the deduplication block size, the EXTENT_SAME call will 
deduplicate the first part, creating a new shared extent, but not the 
tail end of the existing shared region, and all of that original shared 
region will stick around, taking up extra space that it wasn't before.


Additionally, if only part of an extent is duplicated, then that area of 
the extent will stay allocated, because the rest of the extent is still 
referenced (so you won't necessarily see any actual space savings).


You can mitigate this by telling duperemove to use the same block size 
as your filesystem using the `-b` option.   Note that using a smaller 
block size will also slow down the deduplication process and greatly 
increase the size of the hash file.
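
i.e. something roughly like this, assuming a 4KiB filesystem block size (the 
hash file name is arbitrary):

   duperemove -dr -b 4096 --hashfile=rootfs.hash rootfs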



Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 13:18, Niccolò Belli wrote:

On venerdì 18 maggio 2018 19:10:02 CEST, Austin S. Hemmelgarn wrote:
and also forces the people who have ridiculous numbers of snapshots to 
deal with the memory usage or never defrag


Whoever has at least one snapshot is never going to defrag anyway, 
unless he is willing to double the used space.


With a bit of work, it's possible to handle things sanely.  You can 
deduplicate data from snapshots, even if they are read-only (you need to 
pass the `-A` option to duperemove and run it as root), so it's 
perfectly reasonable to only defrag the main subvolume, and then 
deduplicate the snapshots against that (so that they end up all being 
reflinks to the main subvolume).  Of course, this won't work if you're 
short on space, but if you're dealing with snapshots, you should have 
enough space that this will work (because even without defrag, it's 
fully possible for something to cause the snapshots to suddenly take up 
a lot more space).
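
A rough sketch of that workflow, with purely illustrative subvolume paths:

   btrfs filesystem defragment -r /mnt/@          # defrag only the writable subvolume
   duperemove -d -r -A --hashfile=dedupe.hash /mnt/@ /mnt/.snapshots/*   # re-share the snapshots against it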



Re: Any chance to get snapshot-aware defragmentation?

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 12:36, Niccolò Belli wrote:

On venerdì 18 maggio 2018 18:20:51 CEST, David Sterba wrote:

Josef started working on that in 2014 and did not finish it. The patches
can be still found in his tree. The problem is in excessive memory
consumption when there are many snapshots that need to be tracked during
the defragmentation, so there are measures to avoid OOM. There's
infrastructure ready for use (shrinkers), there are maybe some problems
but fundamentally it should work.

I'd like to get the snapshot-aware working again too, we'd need to find
a volunteer to resume the work on the patchset.


Yeah I know of Josef's work, but 4 years had passed since then without 
any news on this front.


What I would really like to know is why nobody resumed his work: is it 
because it's impossible to implement snapshot-aware defrag without 
excessive ram usage or is it simply because nobody is interested?
I think it's because nobody who is interested has both the time and the 
coding skills to tackle it.


Personally though, I think the biggest issue with what was done was not 
the memory consumption, but the fact that there was no switch to turn it 
on or off.  Making defrag unconditionally snapshot aware removes one of 
the easiest ways to forcibly unshare data without otherwise altering the 
files (which, as stupid as it sounds, is actually really useful for some 
storage setups), and also forces the people who have ridiculous numbers 
of snapshots to deal with the memory usage or never defrag.




Re: [PATCH v2 0/3] btrfs: add read mirror policy

2018-05-18 Thread Austin S. Hemmelgarn

On 2018-05-18 04:06, Anand Jain wrote:



Thanks Austin and Jeff for the suggestion.

I am not particularly a fan of a mount option either, mainly because
those options aren't persistent, and host-independent LUNs will
have a tough time keeping them synchronized manually.

Properties are better as it is persistent. And we can apply this
read_mirror_policy property on the fsid object.

But if we are talking about properties, then they can be stored
as extended attributes or as an on-disk key-value pair, and I doubt
an on-disk key-value pair will get a nod.
I can explore the extended attribute approach, but would appreciate more
comments.


Hmm, thinking a bit further, might it be easier to just keep this as a 
mount option, and add something that lets you embed default mount 
options in the volume in a free-form manner?  Then, you could set this 
persistently there, and could specify any others you want too.  Doing 
that would also give very well defined behavior for exactly when changes 
would apply (the next time you mount or remount the volume), though 
handling of whether or not an option came from there or was specified on 
the command-line might be a bit complicated.



On 05/17/2018 10:46 PM, Jeff Mahoney wrote:

On 5/17/18 8:25 AM, Austin S. Hemmelgarn wrote:

On 2018-05-16 22:32, Anand Jain wrote:



On 05/17/2018 06:35 AM, David Sterba wrote:

On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote:

Not yet ready for the integration. As I need to introduce
-o no_read_mirror_policy instead of -o read_mirror_policy=-


Mount option is mostly likely not the right interface for setting such
options, as usual.


   I am ok to make it ioctl for the final. What do you think?


   But to reproduce the bug posted in
 Btrfs: fix the corruption by reading stale btree blocks
   It needs to be a mount option, as randomly the pid can
   still pick the disk specified in the mount option.


Personally, I'd vote for filesystem property (thus handled through the
standard `btrfs property` command) that can be overridden by a mount
option.  With that approach, no new tool (or change to an existing tool)
would be needed, existing volumes could be converted to use it in a
backwards compatible manner (old kernels would just ignore the
property), and you could still have the behavior you want in tests (and
in theory it could easily be adapted to be a per-subvolume setting if we
ever get per-subvolume chunk profile support).


Properties are a combination of interfaces presented through a single
command.  Although the kernel API would allow a direct-to-property
interface via the btrfs.* extended attributes, those are currently
limited to a single inode.  The label property is set via ioctl and
stored in the superblock.  The read-only subvolume property is also set
by ioctl but stored in the root flags.

As it stands, every property is explicitly defined in the tools, so any
addition would require tools changes.  This is a bigger discussion,
though.  We *could* use the xattr interface to access per-root or
fs-global properties, but we'd need to define that interface.
btrfs_listxattr could get interesting, though I suppose we could
simplify it by only allowing the per-subvolume and fs-global operations
on root inodes.

-Jeff





Re: [PATCH v2 0/3] btrfs: add read mirror policy

2018-05-17 Thread Austin S. Hemmelgarn

On 2018-05-17 10:46, Jeff Mahoney wrote:

On 5/16/18 6:35 PM, David Sterba wrote:

On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote:

Not yet ready for the integration. As I need to introduce
-o no_read_mirror_policy instead of -o read_mirror_policy=-


A mount option is most likely not the right interface for setting such
options, as usual.



I've seen a few alternate suggestions in the thread.  I suppose the real
question is: what and where is the intended persistence for this choice?

A mount option gets it via fstab.  How would a user be expected to set
it consistently via ioctl on each mount?  Properties could work, but
there's more discussion needed there.  Personally, I like the property
idea since it could conceivably be used on a per-file basis.


For the specific proposed use case (the tests), it probably doesn't need 
to be persistent beyond mount options.


However, this also allows for a trivial configuration using a slow 
storage device to provide redundancy for a fast storage device of the 
same size, which is potentially very useful for some people.  In that 
case, I can see most people who would be using it wanting it to follow 
the filesystem regardless of what context it's being mounted in (for 
example, it shouldn't need an extra option if mounted from a recovery 
environment or if it's moved to another system).


Most of my reason for recommending properties is that filesystem-level 
properties appear to be the best thing BTRFS has for storing per-volume 
configuration that's supposed to stay with the volume, despite not 
really being used for that, even though there are quite a few mount 
options that are logical candidates for this type of thing (for example, 
the `ssd` options, `metadata_ratio`, and `max_inline` all make more 
logical sense as properties of the volume, not of the mount).



Re: [PATCH v2 0/3] btrfs: add read mirror policy

2018-05-17 Thread Austin S. Hemmelgarn

On 2018-05-16 22:32, Anand Jain wrote:



On 05/17/2018 06:35 AM, David Sterba wrote:

On Wed, May 16, 2018 at 06:03:56PM +0800, Anand Jain wrote:

Not yet ready for the integration. As I need to introduce
-o no_read_mirror_policy instead of -o read_mirror_policy=-


A mount option is most likely not the right interface for setting such
options, as usual.


  I am OK with making it an ioctl for the final version. What do you think?


  But to reproduce the bug posted in
    Btrfs: fix the corruption by reading stale btree blocks
  it needs to be a mount option, as the pid-based selection can otherwise
  still randomly pick the disk specified in the mount option.

Personally, I'd vote for a filesystem property (thus handled through the 
standard `btrfs property` command) that can be overridden by a mount 
option.  With that approach, no new tool (or change to an existing tool) 
would be needed, existing volumes could be converted to use it in a 
backwards compatible manner (old kernels would just ignore the 
property), and you could still have the behavior you want in tests (and 
in theory it could easily be adapted to be a per-subvolume setting if we 
ever get per-subvolume chunk profile support).


Of course, I'd actually like to see most of the mount options available 
as filesystem level properties with the option to override through mount 
options, but that's a lot more ambitious of an undertaking.



Re: [PATCH v2 3/3] btrfs: balance: add kernel log for end or paused

2018-05-16 Thread Austin S. Hemmelgarn

On 2018-05-16 09:23, Anand Jain wrote:



On 05/16/2018 07:25 PM, Austin S. Hemmelgarn wrote:

On 2018-05-15 22:51, Anand Jain wrote:

Add a kernel log when the balance ends, either for cancel or completed
or if it is paused.
---
v1->v2: Moved from 2/3 to 3/3

  fs/btrfs/volumes.c | 7 +++++++
  1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ce68c4f42f94..a4e243a29f5c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4053,6 +4053,13 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 	ret = __btrfs_balance(fs_info);
 
 	mutex_lock(&fs_info->balance_mutex);
+	if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req))
+		btrfs_info(fs_info, "balance: paused");
+	else if (ret == -ECANCELED && atomic_read(&fs_info->balance_cancel_req))
+		btrfs_info(fs_info, "balance: canceled");
+	else
+		btrfs_info(fs_info, "balance: ended with status: %d", ret);
+
 	clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
 
 	if (bargs) {

Is there some way that these messages could be extended to include 
info about which volume the balance in question was on?  Ideally, 
something that matches up with what's listed in the message from the 
previous patch.  There's nothing that prevents you from running 
balances on separate BTRFS volumes in parallel, so this message won't 
necessarily be for the most recent balance start message.


  Hm. That's not true, btrfs_info(fs_info, ...) adds the fsid.
  So it's already drilled down to the lowest granularity possible.


Ah, you're right, it does.  Sorry about the noise.



Re: [PATCH v2 3/3] btrfs: balance: add kernel log for end or paused

2018-05-16 Thread Austin S. Hemmelgarn

On 2018-05-15 22:51, Anand Jain wrote:

Add a kernel log when the balance ends, either for cancel or completed
or if it is paused.
---
v1->v2: Moved from 2/3 to 3/3

  fs/btrfs/volumes.c | 7 +++++++
  1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index ce68c4f42f94..a4e243a29f5c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4053,6 +4053,13 @@ int btrfs_balance(struct btrfs_fs_info *fs_info,
 	ret = __btrfs_balance(fs_info);
 
 	mutex_lock(&fs_info->balance_mutex);
+	if (ret == -ECANCELED && atomic_read(&fs_info->balance_pause_req))
+		btrfs_info(fs_info, "balance: paused");
+	else if (ret == -ECANCELED && atomic_read(&fs_info->balance_cancel_req))
+		btrfs_info(fs_info, "balance: canceled");
+	else
+		btrfs_info(fs_info, "balance: ended with status: %d", ret);
+
 	clear_bit(BTRFS_FS_BALANCE_RUNNING, &fs_info->flags);
 
 	if (bargs) {


Is there some way that these messages could be extended to include info 
about which volume the balance in question was on?  Ideally, something 
that matches up with what's listed in the message from the previous 
patch.  There's nothing that prevents you from running balances on 
separate BTRFS volumes in parallel, so this message won't necessarily be 
for the most recent balance start message.



Re: Btrfs installation advices

2018-05-14 Thread Austin S. Hemmelgarn

On 2018-05-12 21:58, faurepi...@gmail.com wrote:

Thank you two very much for your answers.

So if I sum up correctly, I could:

1- use Self-Encrypting Drive (SED), since my drive is a Samsung NVMe 960 
EVO, which is supposed to support SED according to 
http://www.samsung.com/semiconductor/minisite/ssd/support/faqs-nvmessd:

"*Do Samsung NVMe M.2 SSDs have hardware encryption?*
Samsung NVMe SSDs provide internal hardware encryption of all data 
stored on the SSD, including the operating system. Data is decrypted 
through a pre-boot authentication process.
Because all user data is encrypted, private information is protected 
against loss or theft.
Encryption is done by hardware, which provides a safer environment 
without sacrificing performance.


The encryption methods provided by each Samsung NVMe SSD are: AES 
(Advanced Encryption Standard, Class0 SED) TCG/OPAL, and eDrive


Please note that you cannot use more than one encryption method 
simultaneously.



*Do Samsung NVMe M.2 SSDs support TCG Opal?*
TCG Opal is supported by Samsung NVMe SSDs (960EVO / PRO and newer). It 
is an authentication method that employs the protocol specified by the 
Trusted Computing Group (TCG) meaning that you will need to install TCG 
software supplied by a TCG OPAL software development company.


User authentication is done by pre-boot authentication provided by the 
software. For more detailed information and instructions, please contact 
a TCG software company. In addition, TCG/opal can only be enabled / 
disabled by using special security software. "


For the moment, I don't know how to use that self-encryption from Linux. 
Could you please give me some tips or links about how you did it?


2- now that the full drive is self-encrypted, I can build manually the 
three partitions from a live system: boot with ext(2,3,4), swap with 
swap, and root with btrfs


3- and finally install Debian sid in the dedicated partitions.

Am I right? :)
Yes, that approach will work, assuming you trust Samsung (since they're 
the ones who wrote the code responsible for the encryption, and you 
can't inspect that code yourself).



Le 08/05/2018 à 13:32, Austin S. Hemmelgarn a écrit :

On 2018-05-08 03:50, Rolf Wald wrote:

Hello,

some hints inside

Am 08.05.2018 um 02:22 schrieb faurepi...@gmail.com:

Hi,

I'm curious about btrfs, and maybe considering it for my new laptop
installation (a Lenovo T470).
I was going to install my usual lvm+ext4+full disk encryption setup, 
but

thought I should maybe give a try to btrfs.


Is it possible to meet all these criteria?
- operating system: debian sid
- file system: btrfs
- disk encryption (or at least of sensitives partitions)
- hibernation feature (which implies a swap partition or file, and I've
read btrfs is not a big fan of the latter)


A swap partition is not possible inside or with btrfs alone.

You can choose the btrfs filesystem out of the box in the Debian 
installer, but that would mean full-disk encryption with LVM and btrfs. 
The extra LVM layer doesn't hurt, but you then have two layers that 
duplicate many functions, e.g. snapshotting and resizing.
Um, this isn't really as much of an issue as you might think.  LVM has 
near zero overhead unless you're actually doing any of that stuff (as 
long as the LV is just a simple linear mapping, it has less than 1% 
more overhead than just using partitions).  The only real caveat here 
is to make _ABSOLUTELY CERTAIN_ that you _DO NOT_ make LVM snapshots 
of _ANY_ BTRFS volumes.  Doing so is a recipe for disaster, and will 
likely eat at least your data, and possibly your children.


The bigger issue is that dm-crypt generally slows down device access, 
which BTRFS is very sensitive to.  Using BTRFS with FDE works, but 
it's slow, so I would only suggest doing it with an SSD (and if you're 
using an SSD, you may be better off getting a TCG Opal compliant 
self-encrypting drive and just using the self-encryption functionality 
instead of FDE).




If yes, how would you suggest I achieve it?


Yes, there is a solution, and it has worked for me for several years now.
You need to build three partitions, e.g. named boot, swap and root; 
choose the sizes to suit your needs. The boot partition remains 
unencrypted, but the other two partitions are encrypted with cryptsetup 
(LUKS) separately. Normally there are two passphrases to type in (and 
to remember), but there is a script shipped with cryptsetup 
(/lib/cryptsetup/scripts/decrypt_derived) which can take the key from 
the root partition to decrypt the swap partition as well. The 
filesystems on the partitions are then ext(2,3,4) for boot, swap for 
swap and btrfs for root.
This configuration is not reachable with a standard Debian 
installation; Debian always chooses LVM if you want full encryption. 
You have to do the first steps manually: make the partitions, run 
cryptsetup (LUKS) for the swap and root partitions, and open the 
encrypted partitions manually. After that you can install your OS. 
Do the manual steps from a working distro, e.g. a live system (disk 
or stick) with a recent kernel and recent btrfs-progs.

Re: Btrfs installation advices

2018-05-08 Thread Austin S. Hemmelgarn

On 2018-05-08 03:50, Rolf Wald wrote:

Hello,

some hints inside

Am 08.05.2018 um 02:22 schrieb faurepi...@gmail.com:

Hi,

I'm curious about btrfs, and maybe considering it for my new laptop
installation (a Lenovo T470).
I was going to install my usual lvm+ext4+full disk encryption setup, but
thought I should maybe give a try to btrfs.


Is it possible to meet all these criteria?
- operating system: debian sid
- file system: btrfs
- disk encryption (or at least of sensitives partitions)
- hibernation feature (which implies a swap partition or file, and I've
read btrfs is not a big fan of the latter)


A swap partition is not possible inside or with btrfs alone.

You can choose the btrfs filesystem out of the box in the Debian 
installer, but that would mean full-disk encryption with LVM and btrfs. 
The extra LVM layer doesn't hurt, but you then have two layers that 
duplicate many functions, e.g. snapshotting and resizing.
Um, this isn't really as much of an issue as you might think.  LVM has 
near zero overhead unless you're actually doing any of that stuff (as 
long as the LV is just a simple linear mapping, it has less than 1% more 
overhead than just using partitions).  The only real caveat here is to 
make _ABSOLUTELY CERTAIN_ that you _DO NOT_ make LVM snapshots of _ANY_ 
BTRFS volumes.  Doing so is a recipe for disaster, and will likely eat 
at least your data, and possibly your children.


The bigger issue is that dm-crypt generally slows down device access, 
which BTRFS is very sensitive to.  Using BTRFS with FDE works, but it's 
slow, so I would only suggest doing it with an SSD (and if you're using 
an SSD, you may be better off getting a TCG Opal compliant 
self-encrypting drive and just using the self-encryption functionality 
instead of FDE).




If yes, how would you suggest I achieve it?


Yes, there is a solution, and it has worked for me for several years now.
You need to build three partitions, e.g. named boot, swap and root; 
choose the sizes to suit your needs. The boot partition remains 
unencrypted, but the other two partitions are encrypted with cryptsetup 
(LUKS) separately. Normally there are two passphrases to type in (and 
to remember), but there is a script shipped with cryptsetup 
(/lib/cryptsetup/scripts/decrypt_derived) which can take the key from 
the root partition to decrypt the swap partition as well. The 
filesystems on the partitions are then ext(2,3,4) for boot, swap for 
swap and btrfs for root.
This configuration is not reachable with a standard Debian 
installation; Debian always chooses LVM if you want full encryption. 
You have to do the first steps manually: make the partitions, run 
cryptsetup (LUKS) for the swap and root partitions, and open the 
encrypted partitions manually. After that you can install your OS. 
Do the manual steps from a working distro, e.g. a live system (disk 
or stick) with a recent kernel and recent btrfs-progs (Debian sid is 
fine for this).
After installing the OS you have to make the changes needed for a 
successful (re)boot manually. Please read the advice you can find on 
the net; there are some nice articles.
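
As a rough illustration of the decrypt_derived setup described above, the 
/etc/crypttab entries for such a layout might look something like this (a 
sketch only; the device and mapping names are placeholders):

    cryptroot  /dev/nvme0n1p3  none       luks
    cryptswap  /dev/nvme0n1p2  cryptroot  luks,keyscript=decrypt_derived

The third field of the cryptswap entry names the already-opened root 
mapping that the decrypt_derived keyscript derives the swap key from; the 
derived key typically has to be added to the swap partition's LUKS header 
first (e.g. by piping the output of the decrypt_derived script into 
cryptsetup luksAddKey).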




Thanks for your kind help.






Re: RAID56 - 6 parity raid

2018-05-03 Thread Austin S. Hemmelgarn

On 2018-05-03 04:11, Andrei Borzenkov wrote:

On Wed, May 2, 2018 at 10:29 PM, Austin S. Hemmelgarn
<ahferro...@gmail.com> wrote:
...


Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which gives
you 40TB of usable space).  You're storing roughly 20TB of data on it, using
a 16kB block size, and it sees about 1GB of writes a day, with no partial
stripe writes.  You, for reasons of argument, want to scrub it every week,
because the data in question matters a lot to you.

With a decent CPU, lets say you can compute 1.5GB/s worth of checksums, and
can compute the parity at a rate of 1.25G/s (the ratio here is about the
average across the almost 50 systems I have quick access to check, including
a number of server and workstation systems less than a year old, though the
numbers themselves are artificially low to accentuate the point here).

At this rate, scrubbing by computing parity requires processing:

* Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 13333
seconds, or about 222 minutes, or about 3.7 hours.
* Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000
seconds, or 267 minutes, or roughly 4.4 hours.

So, over a week, you would be spending 8.1 hours processing data solely for
data integrity, or roughly 4.8214% of your time.

Now assume instead that you're doing checksummed parity:

* Scrubbing data is the same, 3.7 hours.
* Scrubbing parity turns into computing checksums for 4TB of data, which
would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.


Scrubbing must compute the parity and compare it with the stored value to
detect the write hole. Otherwise you end up with parity that has a good
checksum but does not match the rest of the data.
Yes, but that assumes we don't do anything to deal with the write hole, 
and it's been pretty much decided by the devs that they're going to try 
and close the write hole.
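
For anyone who wants to double-check the arithmetic quoted above, the 
figures can be reproduced with a short Python sketch that uses only the 
sizes and rates stated in the example (a week is taken as 168 hours):

#!/usr/bin/env python3
# Recompute the scrub-time figures from the example above (SI units).
TB = 10**12
GB = 10**9

data = 20 * TB                # data stored on the 6x8TB raid5 volume
parity = data / 5             # one device's worth of parity, i.e. ~4TB
csum_rate = 1.5 * GB          # checksum verification, bytes per second
parity_rate = 1.25 * GB       # parity computation, bytes per second

def hours(seconds):
    return seconds / 3600

# Scrub by recomputing parity: verify data checksums, then recompute parity.
t_data = data / csum_rate              # ~13333 s, ~3.7 h
t_parity = data / parity_rate          # ~16000 s, ~4.4 h
total = t_data + t_parity              # ~8.1 h per scrub
pct = 100 * total / (168 * 3600)       # ~4.85%; rounding the total to 8.1 h
                                       # first gives the 4.8214% in the mail
print(f"data checksums:   {hours(t_data):.1f} h")
print(f"parity recompute: {hours(t_parity):.1f} h")
print(f"total:            {hours(total):.1f} h, {pct:.2f}% of a week")

# Scrub with checksummed parity: the parity is just ~4TB more checksummed
# data; 4TB at the 1.25GB/s rate gives the ~3200 s (~0.9 h) figure quoted.
t_parity_csum = parity / parity_rate
print(f"with parity checksums: {hours(t_data + t_parity_csum):.1f} h")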




Re: RAID56 - 6 parity raid

2018-05-03 Thread Austin S. Hemmelgarn

On 2018-05-02 16:40, Goffredo Baroncelli wrote:

On 05/02/2018 09:29 PM, Austin S. Hemmelgarn wrote:

On 2018-05-02 13:25, Goffredo Baroncelli wrote:

On 05/02/2018 06:55 PM, waxhead wrote:


So again, which problem would be solved by having the parity checksummed? To 
the best of my knowledge, none. In any case the data is checksummed, so it is 
impossible to return corrupted data (modulo bugs :-) ).


I am not a BTRFS dev, but this should be quite easy to answer. Unless you 
checksum the parity there is no way to verify that the data (parity) you use 
to reconstruct other data is correct.


In any case you could catch that the computed data is wrong, because the data 
is always checksummed. And in any case you must check the data against its 
checksum.

My point is that storing the checksum is a cost that you pay *every time*. 
Every time you update a part of a stripe you need to update the parity, and 
then in turn the parity checksum. It is not a problem of space occupied nor a 
computational problem. It is a problem of write amplification...

The only gain is avoiding an attempt to use the parity when
a) you need it (i.e. when the data is missing and/or corrupted)
and b) it is corrupted.
But the likelihood of this case is very low. And you can catch it during the 
data checksum check (which has to be performed in any case!).

So on one side you have a *cost every time* (the write amplification); on the 
other side you have a gain (CPU time) *only in the case* where the parity is 
corrupted and you need it (e.g. scrub or corrupted data).

IMHO the costs are much higher than the gain, and the likelihood of the gain 
is much lower than the likelihood (=100%, i.e. always) of the cost.

You do realize that a write is already rewriting checksums elsewhere? It would 
be pretty trivial to make sure that the checksums for every part of a stripe 
end up in the same metadata block, at which point the only cost is computing 
the checksum (because when a checksum gets updated, the whole block it's in 
gets rewritten, period, because that's how CoW works).

Looking at this another way (all the math below uses SI units):


[...]
Good point: precomputing the checksum of the parity saves a lot of time for 
the scrub process. You can see this in a simpler way by saying that the parity 
calculation (which is dominated by the memory bandwidth) is O(n) (where n is 
the number of disks), while checking the parity against a checksum (again 
dominated by the memory bandwidth) is O(1). And when the data written is two 
orders of magnitude less than the data stored, the effort required to 
precompute the checksum is negligible.
Excellent point about the computational efficiency, I had not thought of 
framing things that way.


Anyway, my "rant" started when Duncan put the missing parity checksum and the 
write hole side by side. The first might be a performance problem; the write 
hole, instead, could lead to losing data. My intention was to highlight that 
the parity checksum is not related to the reliability and safety of raid5/6.
It may not be related to the safety, but it is arguably indirectly 
related to the reliability, dependent on your definition of reliability. 
 Spending less time verifying the parity means you're spending less 
time in an indeterminate state of usability, which arguably does improve 
the reliability of the system.  However, that does still have nothing to 
do with the write hole.




So, lets look at data usage:

1GB of data translates to 62500 16kB blocks of data, which equates to an 
additional 15625 blocks for parity.  Adding parity checksums adds a 25% 
overhead to checksums being written, but that actually doesn't translate to a 
huge increase in the number of _blocks_ of checksums written.  One 16k block 
can hold roughly 500 checksums, so it would take 125 blocks worth of checksums 
without parity, and 157 (technically 156.25, but you can't write a quarter 
block) with parity checksums. Thus, without parity checksums, writing 1GB of 
data involves writing 78250 blocks, while doing the same with parity checksums 
involves writing 78282 blocks, a net change of only 32 blocks, or **0.0409%**.
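
The block accounting above works out as follows (a quick Python sketch, 
assuming the SI units, the 15625 parity blocks, and the roughly 500 
checksums per 16kB metadata block used in the example):

#!/usr/bin/env python3
# Recompute the per-1GB block counts from the example above (SI units).
import math

data_blocks = 10**9 // 16000       # 62500 16kB data blocks per 1GB written
parity_blocks = 15625              # parity blocks, as in the example
csums_per_block = 500              # rough checksums per 16kB metadata block

csum_blocks_plain = math.ceil(data_blocks / csums_per_block)                     # 125
csum_blocks_parity = math.ceil((data_blocks + parity_blocks) / csums_per_block)  # 157

total_plain = data_blocks + parity_blocks + csum_blocks_plain    # 78250
total_parity = data_blocks + parity_blocks + csum_blocks_parity  # 78282

print(total_plain, total_parity,
      f"+{100 * (total_parity - total_plain) / total_plain:.4f}%")  # +0.0409%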


How would you store the checksum?
I asked that because I am not sure that we could use the "standard" btrfs 
metadata to store the checksum of the parity. Doing so you could face some 
pathological effects like:
- update a block(1) in a stripe(1)
- update the parity of stripe(1) containing block(1)
- update the checksum of parity stripe (1), which is contained in another 
stripe(2) [**]

- update the parity of stripe (2) containing the checksum of parity stripe(1)
- update the checksum of parity stripe (2), which is contained in another 
stripe(3)

and so on...

[**] Pay attention that the checksum and the parity *have* to be in different 
stripes, otherwise you have the chicken/egg problem: compute the parity, then 
update the checksum, then update the parity again because the 
