Re: What if TRIM issued a wipe on devices that don't TRIM?

2018-12-06 Thread Andrei Borzenkov
06.12.2018 16:04, Austin S. Hemmelgarn wrote:
> 
> * On SCSI devices, a discard operation translates to a SCSI UNMAP
> command.  As pointed out by Ronnie Sahlberg in his reply, this command
> is purely advisory, may not result in any actual state change on the
> target device, and is not guaranteed to wipe the data.  To actually wipe
> things, you have to explicitly write bogus data to the given regions
> (using either regular writes, or a WRITESAME command with the desired
> pattern), and _then_ call UNMAP on them.

The WRITE SAME command has an UNMAP bit, and depending on the device and
kernel version, the kernel may issue either UNMAP or WRITE SAME with the
UNMAP bit set when doing a discard.
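
For anyone who wants to check what their kernel will actually send for a
discard on a given SCSI disk, something like the following may help (a
sketch; the sysfs attribute comes from the sd driver, the exact set of
values may vary by kernel version, and /dev/sdX is a placeholder):

  # which command sd will use for discard: unmap, writesame_16,
  # writesame_10, writesame_zero, full, or disabled
  grep . /sys/class/scsi_disk/*/provisioning_mode

  # whether the block layer exposes discard at all for a given disk
  lsblk --discard /dev/sdX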



Re: Need help with potential ~45TB dataloss

2018-12-02 Thread Andrei Borzenkov
02.12.2018 23:14, Patrick Dijkgraaf wrote:
> I have some additional info.
> 
> I found the reason the FS got corrupted. It was a single failing drive,
> which caused the entire cabinet (containing 7 drives) to reset. So the
> FS suddenly lost 7 drives.
> 

This remains a mystery to me. btrfs is marketed as always consistent
on disk - you either have the previous full transaction or the current
full transaction. If the current transaction is interrupted, the promise
is that you are left with the previous valid, consistent transaction.

Obviously this is not what happens in practice, which nullifies the main
selling point of btrfs.

Unless this is expected behavior, it sounds like some barriers are
missing and summary data is updated before (and without waiting for)
subordinate data. And if it is expected behavior ...

> I have removed the failed drive, so the RAID is now degraded. I hope
> the data is still recoverable... ☹
> 



Re: Have 15GB missing in btrfs filesystem.

2018-10-27 Thread Andrei Borzenkov
27.10.2018 21:12, Remi Gauvin wrote:
> On 2018-10-27 01:42 PM, Marc MERLIN wrote:
> 
>>
>> I've been using btrfs for a long time now but I've never had a
>> filesystem where I had 15GB apparently unusable (7%) after a balance.
>>
> 
> The space isn't unusable.  It's just allocated.. (It's used in the sense
> that it's reserved for data chunks.).  Start writing data to the drive,
> and the data will fill that space before more gets allocated.. (Unless

No (at least, not necessarily).

On empty filesystem:

bor@10:~> df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1  1023M   17M  656M   3% /mnt
bor@10:~> sudo dd if=/dev/zero of=/mnt/foo bs=100M count=1
1+0 records in
1+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.260088 s, 403 MB/s
bor@10:~> sync
bor@10:~> df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1  1023M  117M  556M  18% /mnt
bor@10:~> sudo filefrag -v /mnt/foo
Filesystem type is: 9123683e
File size of /mnt/foo is 104857600 (25600 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..   14419:  36272.. 50691:  14420:
   1:14420..   25599: 125312..136491:  11180:  50692:
last,eof
/mnt/foo: 2 extents found
bor@10:~> sudo dd if=/dev/zero of=/mnt/foo bs=10M count=1 conv=notrunc
seek=2
1+0 records in
1+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.034 s, 350 MB/s
bor@10:~> sync
bor@10:~> df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1  1023M  127M  546M  19% /mnt
bor@10:~> sudo filefrag -v /mnt/foo
Filesystem type is: 9123683e
File size of /mnt/foo is 104857600 (25600 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..5119:  36272.. 41391:   5120:
   1: 5120..7679:  33696.. 36255:   2560:  41392:
   2: 7680..   14419:  43952.. 50691:   6740:  36256:
   3:14420..   25599: 125312..136491:  11180:  50692:
last,eof
/mnt/foo: 4 extents found
bor@10:~> sudo dd if=/dev/zero of=/mnt/foo bs=10M count=1 conv=notrunc
seek=7
1+0 records in
1+0 records out
10485760 bytes (10 MB, 10 MiB) copied, 0.0314211 s, 334 MB/s
bor@10:~> sync
bor@10:~> df -h /mnt
Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb1  1023M  137M  536M  21% /mnt
bor@10:~> sudo filefrag -v /mnt/foo
Filesystem type is: 9123683e
File size of /mnt/foo is 104857600 (25600 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..5119:  36272.. 41391:   5120:
   1: 5120..7679:  33696.. 36255:   2560:  41392:
   2: 7680..   14419:  43952.. 50691:   6740:  36256:
   3:14420..   17919: 125312..128811:   3500:  50692:
   4:17920..   20479: 136492..139051:   2560: 128812:
   5:20480..   25599: 131372..136491:   5120: 139052:
last,eof
/mnt/foo: 6 extents found
bor@10:~> ll -sh /mnt
total 100M
100M -rw-r--r-- 1 root root 100M Oct 27 23:30 foo
bor@10:~>

So you still have a single file with a size of 100M, but the space
consumed on the filesystem is 120M because the two initial large extents
remain allocated in full. Each 10M write gets a new extent allocated, but
the large extents are not split. If you look at the file details:

bor@10:~/python-btrfs/examples> sudo ./show_file.py /mnt/foo
filename /mnt/foo tree 5 inum 259
inode generation 239 transid 242 size 104857600 nbytes 104857600
block_group 0 mode 100644 nlink 1 uid 0 gid 0 rdev 0 flags 0x0(none)
inode ref list size 1
inode ref index 3 name utf-8 foo
extent data at 0 generation 239 ram_bytes 46563328 compression none type
regular disk_bytenr 148570112 disk_num_bytes 46563328 offset 0 num_bytes
20971520

This extent consumes about 44MB on disk, but only 20MB of it is part of the file.

extent data at 20971520 generation 241 ram_bytes 10485760 compression
none type regular disk_bytenr 138018816 disk_num_bytes 10485760 offset 0
num_bytes 10485760
extent data at 31457280 generation 239 ram_bytes 46563328 compression
none type regular disk_bytenr 148570112 disk_num_bytes 46563328 offset
31457280 num_bytes 15106048

And another 14MB here. So about 10MB allocated on disk is "lost".

extent data at 46563328 generation 239 ram_bytes 12500992 compression
none type regular disk_bytenr 195133440 disk_num_bytes 12500992 offset 0
num_bytes 12500992
extent data at 59064320 generation 239 ram_bytes 45793280 compression
none type regular disk_bytenr 513277952 disk_num_bytes 45793280 offset 0
num_bytes 14336000
extent data at 73400320 generation 242 ram_bytes 10485760 compression
none type regular disk_bytenr 559071232 disk_num_bytes 10485760 offset 0
num_bytes 10485760
extent data at 83886080 generation 239 ram_bytes 45793280 compression
none type regular disk_bytenr 513277952 disk_num_bytes 45793280 offset
24821760 num_bytes 20971520

Same here: about 10MB of the extent at 513277952 is lost.

There is no way to reclaim that space short of rewriting (defragmenting) the file.
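
For clarity, a quick check with the numbers above (counting each unique
disk_bytenr once) shows the bookkeeping: the file references exactly
100 MiB, while the extents backing it occupy 120 MiB on disk:

  echo $(( 104857600 / 1048576 )) MiB referenced
  echo $(( (46563328 + 10485760 + 12500992 + 45793280 + 10485760) / 1048576 )) MiB allocated

which prints "100 MiB referenced" and "120 MiB allocated" - matching the
df output above.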

Re: FS_IOC_FIEMAP fe_physical discrepancies on btrfs

2018-10-27 Thread Andrei Borzenkov
27.10.2018 18:45, Lennert Buytenhek wrote:
> Hello!
> 
> FS_IOC_FIEMAP on btrfs seems to be returning fe_physical values that
> don't always correspond to the actual on-disk data locations.  For some
> files the values match, but e.g. for this file:
> 
> # filefrag -v foo
> Filesystem type is: 9123683e
> File size of foo is 4096 (1 block of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..   0:5774454..   5774454:  1: last,eof
> foo: 1 extent found
> #
> 
> The file data is actually on disk not in block 5774454 (0x581c76), but
> in block 6038646 (0x5c2476), an offset of +0x40800.  Is this expected
> behavior?  Googling didn't turn up much, apologies if this is an FAQ. :(
> 
> (This is on 4.18.16-200.fc28.x86_64, the current Fedora 28 kernel.)
> 

My understanding is that it returns the logical block address in the
btrfs address space. btrfs can span multiple devices, so you will need to
convert the extent address to a (device, offset) pair if necessary.
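
If you do need the on-device location, the logical address can be
translated by hand; a sketch (the block number is taken from the filefrag
output above times the 4 KiB block size, /dev/sdX is a placeholder, and
the exact output format depends on the btrfs-progs version):

  btrfs-map-logical -l $(( 5774454 * 4096 )) /dev/sdX
  # prints something like: mirror 1 logical <bytes> physical <bytes> device /dev/sdX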


Re: Have 15GB missing in btrfs filesystem.

2018-10-23 Thread Andrei Borzenkov
24.10.2018 3:36, Marc MERLIN wrote:
> Normally when btrfs fi show will show lost space because 
> your trees aren't balanced.
> Balance usually reclaims that space, or most of it.
> In this case, not so much.
> 
> kernel 4.17.6:
> 
> saruman:/mnt/btrfs_pool1# btrfs fi show .
> Label: 'btrfs_pool1'  uuid: fda628bc-1ca4-49c5-91c2-4260fe967a23
>   Total devices 1 FS bytes used 186.89GiB
>   devid1 size 228.67GiB used 207.60GiB path /dev/mapper/pool1
> 
> Ok, I have 21GB between used by FS and used in block layer.
> 
> saruman:/mnt/btrfs_pool1# btrfs balance start -dusage=40 -v .
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=40
> Done, had to relocate 1 out of 210 chunks
> saruman:/mnt/btrfs_pool1# btrfs balance start -musage=60 -v .
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=60
>   SYSTEM (flags 0x2): balancing, usage=60
> Done, had to relocate 4 out of 209 chunks
> saruman:/mnt/btrfs_pool1# btrfs fi show .
> Label: 'btrfs_pool1'  uuid: fda628bc-1ca4-49c5-91c2-4260fe967a23
>   Total devices 1 FS bytes used 186.91GiB
>   devid1 size 228.67GiB used 205.60GiB path /dev/mapper/pool1
> 
> That didn't help much, delta is now 19GB
> 
> saruman:/mnt/btrfs_pool1# btrfs balance start -dusage=80 -v .
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=80
> Done, had to relocate 8 out of 207 chunks
> saruman:/mnt/btrfs_pool1# btrfs fi show .
> Label: 'btrfs_pool1'  uuid: fda628bc-1ca4-49c5-91c2-4260fe967a23
>   Total devices 1 FS bytes used 187.03GiB
>   devid1 size 228.67GiB used 201.54GiB path /dev/mapper/pool1
> 
> Ok, now delta is 14GB
> 
> saruman:/mnt/btrfs_pool1# btrfs balance start -musage=80 -v .
> Dumping filters: flags 0x6, state 0x0, force is off
>   METADATA (flags 0x2): balancing, usage=80
>   SYSTEM (flags 0x2): balancing, usage=80
> Done, had to relocate 5 out of 202 chunks
> saruman:/mnt/btrfs_pool1# btrfs fi show .
> Label: 'btrfs_pool1'  uuid: fda628bc-1ca4-49c5-91c2-4260fe967a23
>   Total devices 1 FS bytes used 188.24GiB
>   devid1 size 228.67GiB used 203.54GiB path /dev/mapper/pool1
> 
> and it's back to 15GB :-/
> 
> How can I get 188.24 and 203.54 to converge further? Where is all that
> space gone?
> 

Most likely this is due to partially used extents, which has been
explained more than once on this list. When a large extent is partially
overwritten, the extent is not physically split - it remains allocated in
full, but only part of it is referenced. Balance does not change this (at
least that is my understanding) - it moves extents, but here we have
internal fragmentation inside an extent. Defragmentation should rewrite
the files, but if you have snapshots, it is unclear whether there will be
any gain.

I wonder if there is any tool that can compute physical vs. logical
space consumption (i.e. how much of each extent is actually referenced).
It should be possible using python-btrfs, but probably time consuming, as
it needs to walk every extent.
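
A rough per-file sketch of that calculation, parsing the kind of extent
lines that python-btrfs' show_file.py prints (untested; it assumes one
"extent data ..." line per extent, and the file path is an example):

  sudo ./show_file.py /mnt/foo | awk '
    /^extent data/ {
        b = 0; d = 0
        for (i = 1; i < NF; i++) {
            if ($i == "disk_bytenr")    b = $(i + 1)
            if ($i == "disk_num_bytes") d = $(i + 1)
            if ($i == "num_bytes")      ref += $(i + 1)
        }
        # count each on-disk extent once, skip holes (disk_bytenr 0)
        if (b && !(b in seen)) { seen[b] = 1; alloc += d }
    }
    END {
        printf "referenced %.1f MiB, allocated %.1f MiB, waste %.1f MiB\n",
               ref / 1048576, alloc / 1048576, (alloc - ref) / 1048576
    }'

Doing this for a whole filesystem means repeating it for every file (or
walking the extent tree directly), which is exactly the time-consuming part.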


Re: Is it possible to construct snapshots manually?

2018-10-19 Thread Andrei Borzenkov
19.10.2018 12:41, Cerem Cem ASLAN wrote:
> By saying "manually", I mean copying files into a subvolume on a
> different mountpoint manually, then mark the target as if it is
> created by "btrfs send | btrfs receive".
> 
> Rationale:
> 
> When we delete all common snapshots from source, we have to send whole
> snapshot for next time. This ability will prevent sending everything
> from scratch unless it's necessary.
> (https://unix.stackexchange.com/q/462451/65781)
> 

You can set the received UUID, as that link suggests. How it can be done
has been posted on this list more than once.

> Possible workaround:
> 
> I've got an idea at the moment: What I want can be achieved by
> dropping usage of "btrfs send | btrfs receive" completely and using
> only rsync for transfers. After transfer, a snapshot can be created
> independently on the remote site. Only problem will be handling the
> renamed/relocated files, but `rsync --fuzzy` option might help here:
> https://serverfault.com/a/489293/261445
> 
> Anyway, it would be good to have a built in support for this.
> 

No. It is already possible to shoot oneself in the foot; there is no
need to make it easily done by accident or misunderstanding.


Re: Spurious mount point

2018-10-18 Thread Andrei Borzenkov
16.10.2018 0:33, Chris Murphy wrote:
> On Mon, Oct 15, 2018 at 3:26 PM, Anton Shepelev  wrote:
>> Chris Murphy to Anton Shepelev:
>>
 How can I track down the origin of this mount point:

 /dev/sda2 on /home/hana type btrfs 
 (rw,relatime,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot/hana)

 if it is not present in /etc/fstab?  I shouldn't like to
 find/grep thoughout the whole filesystem.
>>>
>>> Sounds like some service taking snapshots regularly is
>>> managing this.  Maybe this is Mint or Ubuntu and you're
>>> using Timeshift?
>>
>> It is SUSE Linux and (probably) its tool called `snapper',
>> but I have not found a clue in its documentation.
> 
> I wasn't aware that SUSE was now using the @ location for snapshots,

It does not "use @ for snapshots". It is using /@ as top level directory
for root layout. This was changed from using / directly quite some time
ago (2-3 years).

Snapshots are located in own subvolume which is either /.snapshots (if
your install is old) or /@/.snapshots and is always mounted as
/.snapshots in current root.

> or that it was using Btrfs for /home. For a while it's been XFS with a
> Btrfs sysroot.
> 

The default proposal is a separate home partition with XFS; it can of course
be changed during installation.

Note that after a default installation, /@/.snapshots/1/snapshot is the
default subvolume and *the* filesystem root, which gets mounted as "/". So
this could be the result of a bind-mount of /hana onto /home/hana (/hana
resolves to "default-subvolume"/hana => /@/.snapshots/1/snapshot/hana).


Re: CoW behavior when writing same content

2018-10-09 Thread Andrei Borzenkov
09.10.2018 18:52, Chris Murphy wrote:
> On Tue, Oct 9, 2018 at 8:48 AM, Gervais, Francois
>  wrote:
>> Hi,
>>
>> If I have a snapshot where I overwrite a big file but which only a
>> small portion of it is different, will the whole file be rewritten in
>> the snapshot? Or only the different part of the file?
> 

If you overwrite the whole file, the whole file will be overwritten.

> Depends on how the application modifies files. Many applications write
> out a whole new file with a pseudorandom filename, fsync, then rename.
> 
>>
>> Something like:
>>
>> $ dd if=/dev/urandom of=/big_file bs=1M count=1024
>> $ cp /big_file root/
>> $ btrfs sub snap root snapshot
>> $ cp /big_file snapshot/
>>

And which portion of these three files is different? They must be
identical. Not that it really matters, but that does not match your
question.

>> In this case is root/big_file and snapshot/big_file still share the same 
>> data?
> 
> You'll be left with three files. /big_file and root/big_file will
> share extents,

How come they share extents? This requires --reflink; is it the default now?

> and snapshot/big_file will have its own extents. You'd
> need to copy with --reflink for snapshot/big_file to have shared
> extents with /big_file - or deduplicate.
> 
This still overwrites the whole file, in the sense that the original
content of "snapshot/big_file" is lost. That the new content happens to be
identical, and that it will probably be reflinked, does not change the
fact that the original file is gone.
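
If the intent is for the copies to actually share extents, that has to be
requested explicitly, e.g. (paths as in the example above):

  cp --reflink=always /big_file snapshot/big_file
  filefrag -v /big_file snapshot/big_file   # shared extents show the same physical offsets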


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 22:11, Austin S. Hemmelgarn wrote:
> On 2018-09-18 14:38, Andrei Borzenkov wrote:
>> 18.09.2018 21:25, Austin S. Hemmelgarn wrote:
>>> On 2018-09-18 14:16, Andrei Borzenkov wrote:
>>>> 18.09.2018 08:37, Chris Murphy wrote:
>>>>> On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov
>>>>>  wrote:
>>>>>> 18.09.2018 07:21, Chris Murphy wrote:
>>>>>>> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy
>>>>>>>  wrote:
>> ...
>>>>>>>
>>>>>>> There are a couple of reserve locations in Btrfs at the start and I
>>>>>>> think after the first superblock, for bootloader embedding. Possibly
>>>>>>> one or both of those areas could be used for this so it's outside
>>>>>>> the
>>>>>>> file system. But other implementations are going to run into this
>>>>>>> problem too.
>>>>>>>
>>>>>>
>>>>>> That's what SUSE grub2 version does - it includes patches to redirect
>>>>>> writes on btrfs to reserved area. I am not sure how it behaves in
>>>>>> case
>>>>>> of multi-device btrfs though.
>>>>>
>>>>> The patches aren't upstream yet? Will they be?
>>>>>
>>>>
>>>> I do not know. Personally I think much easier is to make grub location
>>>> independent of /boot, allowing grub be installed in separate partition.
>>>> This automatically covers all other cases (like MD, LVM etc).
>>> It actually is independent of /boot already.  I've got it running just
>>> fine on my laptop off of the EFI system partition (which is independent
>>> of my /boot partition), and thus have no issues with handling of the
>>> grubenv file.  The problem is that all the big distros assume you want
>>> it in /boot, so they have no option for putting it anywhere else.
>>>
>>
>> This requires more than just explicit --boot-directory. With current
>> monolithic configuration file listing all available kernels this file
>> cannot be in the same location, it must be together with kernels (think
>> about rollback to snapshot with completely different content). Or some
>> different, more flexible configuration is needed.
> Uh, no, it doesn't need to be with the kernels.

It does not need to be *with* the kernels, but it must match the content
of /boot if you want to allow booting from multiple subvolumes (or even
partitions) using the same grub instance. The most obvious case is the
snapper rollback used by SUSE. You still have a single instance of the
bootloader, but multiple subvolumes with different kernels. So somehow the
bootloader must know which kernels to offer depending on which subvolume
you select.


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 21:57, Chris Murphy wrote:
> On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  
> wrote:
>> 18.09.2018 08:37, Chris Murphy wrote:
> 
>>> The patches aren't upstream yet? Will they be?
>>>
>>
>> I do not know. Personally I think much easier is to make grub location
>> independent of /boot, allowing grub be installed in separate partition.
>> This automatically covers all other cases (like MD, LVM etc).
> 
> The only case where I'm aware of this happens is Fedora on UEFI where
> they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
> upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
> seen it elsewhere (except Fedora on UEFI).
> 
> I'm not sure this is much easier. Yet another volume that would be
> persistently mounted? Where? A nested mount at /boot/grub? I'm not
> liking that at all. Even Windows and macOS have saner and simpler to
> understand booting methods than this.
> 
> 
That's exactly what Windows ended up with - a separate boot volume with
the bootloader-related files.




Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Andrei Borzenkov
18.09.2018 21:28, Gervais, Francois wrote:
>> No. It is already possible (by setting received UUID); it should not be
> made too open to easy abuse.
> 
> 
> Do you mean edit the UUID in the byte stream before btrfs receive?
> 
No, I mean setting the received UUID on the subvolume. Unfortunately, it
is possible. Fortunately, it is not trivially done.


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 21:25, Austin S. Hemmelgarn wrote:
> On 2018-09-18 14:16, Andrei Borzenkov wrote:
>> 18.09.2018 08:37, Chris Murphy wrote:
>>> On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov
>>>  wrote:
>>>> 18.09.2018 07:21, Chris Murphy wrote:
>>>>> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy
>>>>>  wrote:
...
>>>>>
>>>>> There are a couple of reserve locations in Btrfs at the start and I
>>>>> think after the first superblock, for bootloader embedding. Possibly
>>>>> one or both of those areas could be used for this so it's outside the
>>>>> file system. But other implementations are going to run into this
>>>>> problem too.
>>>>>
>>>>
>>>> That's what SUSE grub2 version does - it includes patches to redirect
>>>> writes on btrfs to reserved area. I am not sure how it behaves in case
>>>> of multi-device btrfs though.
>>>
>>> The patches aren't upstream yet? Will they be?
>>>
>>
>> I do not know. Personally I think much easier is to make grub location
>> independent of /boot, allowing grub be installed in separate partition.
>> This automatically covers all other cases (like MD, LVM etc).
> It actually is independent of /boot already.  I've got it running just
> fine on my laptop off of the EFI system partition (which is independent
> of my /boot partition), and thus have no issues with handling of the
> grubenv file.  The problem is that all the big distros assume you want
> it in /boot, so they have no option for putting it anywhere else.
> 

This requires more than just an explicit --boot-directory. With the
current monolithic configuration file listing all available kernels, this
file cannot live in one fixed location; it must be together with the
kernels (think about rolling back to a snapshot with completely different
content). Or some different, more flexible configuration is needed.

As it is now, grub silently assumes everything is under /boot. This has
turned out to be oversimplified.

> Actually installing it elsewhere is not hard though, you just pass
> `--boot-directory=/wherever` to the `grub-install` script and turn off
> your distributions automatic reinstall mechanism so it doesn't get
> screwed up by the package manager when the GRUB package gets updated.
> You can also make `/boot/grub` a symbolic link pointing to the real GRUB
> directory, so that you don't have to pass any extra options to tools
> like grub-reboot or grub-set-default.



Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 08:37, Chris Murphy wrote:
> On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov  
> wrote:
>> 18.09.2018 07:21, Chris Murphy wrote:
>>> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  
>>> wrote:
>>>> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F
>>>>
>>>> Does anyone know if this is still a problem on Btrfs if grubenv has
>>>> xattr +C set? In which case it should be possible to overwrite and
>>>> there's no csums that are invalidated.
>>>>
>>>> I kinda wonder if in 2018 it's specious for, effectively out of tree
>>>> code, to be making modifications to the file system, outside of the
>>>> file system.
>>>
>>> a. The bootloader code (pre-boot, not user space setup stuff) would
>>> have to know how to read xattr and refuse to overwrite a grubenv
>>> lacking xattr +C.
>>> b. The bootloader code, would have to have sophisticated enough Btrfs
>>> knowledge to know if the grubenv has been reflinked or snapshot,
>>> because even if +C, it may not be valid to overwrite, and COW must
>>> still happen, and there's no way the code in GRUB can do full blow COW
>>> and update a bunch of metadata.
>>>
>>> So answering my own question, this isn't workable. And it seems the
>>> same problem for dm-thin.
>>>
>>> There are a couple of reserve locations in Btrfs at the start and I
>>> think after the first superblock, for bootloader embedding. Possibly
>>> one or both of those areas could be used for this so it's outside the
>>> file system. But other implementations are going to run into this
>>> problem too.
>>>
>>
>> That's what SUSE grub2 version does - it includes patches to redirect
>> writes on btrfs to reserved area. I am not sure how it behaves in case
>> of multi-device btrfs though.
> 
> The patches aren't upstream yet? Will they be?
> 

I do not know. Personally, I think it is much easier to make the grub
location independent of /boot, allowing grub to be installed in a separate
partition. This automatically covers all other cases (MD, LVM, etc.).

> They redirect writes to grubenv specifically? Or do they use the
> reserved areas like a hidden and fixed location for what grubenv would
> contain?
> 
> I guess the user space grub-editenv could write to grubenv, which even
> if COW, GRUB can pick up that change. But GRUB itself writes its
> changes to a reserved area.
> 
> Hmmm. Complicated.
> 



Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Andrei Borzenkov
18.09.2018 20:56, Gervais, Francois wrote:
> 
> Hi,
> 
> I'm trying to apply a btrfs send diff (done through -p) to another subvolume 
> with the same content as the proper parent but with a different uuid.
> 
> I looked through btrfs receive and I get the feeling that this is not 
> possible right now.
> 
> I'm thinking of adding a -p option to btrfs receive which could override the 
> parent information from the stream.
> 
> Would that make sense?
> 
No. It is already possible (by setting received UUID); it should not be
made too open to easy abuse.


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-17 Thread Andrei Borzenkov
18.09.2018 07:21, Chris Murphy wrote:
> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  wrote:
>> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F
>>
>> Does anyone know if this is still a problem on Btrfs if grubenv has
>> xattr +C set? In which case it should be possible to overwrite and
>> there's no csums that are invalidated.
>>
>> I kinda wonder if in 2018 it's specious for, effectively out of tree
>> code, to be making modifications to the file system, outside of the
>> file system.
> 
> a. The bootloader code (pre-boot, not user space setup stuff) would
> have to know how to read xattr and refuse to overwrite a grubenv
> lacking xattr +C.
> b. The bootloader code, would have to have sophisticated enough Btrfs
> knowledge to know if the grubenv has been reflinked or snapshot,
> because even if +C, it may not be valid to overwrite, and COW must
> still happen, and there's no way the code in GRUB can do full blow COW
> and update a bunch of metadata.
> 
> So answering my own question, this isn't workable. And it seems the
> same problem for dm-thin.
> 
> There are a couple of reserve locations in Btrfs at the start and I
> think after the first superblock, for bootloader embedding. Possibly
> one or both of those areas could be used for this so it's outside the
> file system. But other implementations are going to run into this
> problem too.
> 

That's what the SUSE grub2 version does - it includes patches that
redirect these writes on btrfs to the reserved area. I am not sure how it
behaves in the case of multi-device btrfs, though.

> I'm not sure how else to describe state. If NVRAM is sufficiently wear
> resilient enough to have writes to it possibly every day, for every
> boot, to indicate boot success/fail.
> 



Re: lazytime mount option—no support in Btrfs

2018-08-19 Thread Andrei Borzenkov



Sent from my iPhone

> On 19 Aug 2018, at 11:37, Martin Steigerwald  wrote:
> 
> waxhead - 18.08.18, 22:45:
>> Adam Hunt wrote:
>>> Back in 2014 Ted Tso introduced the lazytime mount option for ext4
>>> and shortly thereafter a more generic VFS implementation which was
>>> then merged into mainline. His early patches included support for
>>> Btrfs but those changes were removed prior to the feature being
>>> merged. His> 
>>> changelog includes the following note about the removal:
>>>   - Per Christoph's suggestion, drop support for btrfs and xfs for
>>>   now,
>>> 
>>> issues with how btrfs and xfs handle dirty inode tracking.  We
>>> can add btrfs and xfs support back later or at the end of this
>>> series if we want to revisit this decision.
>>> 
>>> My reading of the current mainline shows that Btrfs still lacks any
>>> support for lazytime. Has any thought been given to adding support
>>> for lazytime to Btrfs?
> […]
>> Is there any new regarding this?
> 
> I'd like to know whether there is any news about this as well.
> 
> If I understand it correctly this could even help BTRFS performance a 
> lot cause it is COW'ing metadata.
> 

I do not see how btrfs can support it, precisely because of CoW. A modified atime
means the checksum no longer matches, so you must update all related metadata. At
that point you have a kind of shadow in-memory metadata tree. And if this metadata
is not written out, then other metadata that refers to it becomes invalid.

I suspect any file system that keeps checksums of metadata will run into the 
same issue.
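
Until/unless something like lazytime is supported, the usual way to avoid
this metadata churn on btrfs is simply not updating atime at all (relatime
is already the kernel default; the mount point is an example):

  mount -o remount,noatime /mountpoint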

Re: trouble mounting btrfs filesystem....

2018-08-14 Thread Andrei Borzenkov
14.08.2018 18:16, Hans van Kranenburg wrote:
> On 08/14/2018 03:00 PM, Dmitrii Tcvetkov wrote:
>>> Scott E. Blomquist writes:
>>>  > Hi All,
>>>  > 
>>>  > [...]
>>
>> I'm not a dev, just user.
>> btrfs-zero-log is for very specific case[1], not for transid errors.
>> Transid errors mean that some metadata writes are missing, if
>> they prevent you from mounting filesystem it's pretty much fatal. If
>> btrfs could recover metadata from good copy it'd have done that.
>>
>> "wanted 2159304 found 2159295" means that some metadata is stale by 
>> 9 commits. You could try to mount it with "ro,usebackuproot" mount
>> options as readonly mount is less strict. If that works you can try
>> "usebackuproot" without ro option. But 9 commits is probably too much
>> and there isn't enough data to rollback so far.
> 
> Keep in mind that a successful mount with usebackuproot does not mean
> you're looking at a consistent filesystem. After each transaction
> commit, disk space that is no longer referenced is immediately freed for
> reuse.
> 

With all respect - in this case btrfs should not even pretend it can do
"usebackuproot". What is this option for, then?

> So, even if you can mount with usebackuproot, you have to hope that none
> of the metadata blocks that were used back then have been overwritten
> already, even the ones in distant corners of trees. A full check / scrub
> / etc would be needed to find out.
> 



Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-12 Thread Andrei Borzenkov
12.08.2018 10:04, Andrei Borzenkov wrote:
> 
> On ZFS snapshots are contained in dataset and you limit total dataset
> space consumption including all snapshots. Thus end effect is the same -
> deleting data that is itself captured in snapshot does not make a single
> byte available. ZFS allows you to additionally restrict active file
> system size ("referenced" quota in ZFS) - this more closely matches your
> expectation - deleting file in active file system decreases its
> "referenced" size thus allowing user to write more data (as long as user
> does not exceed total dataset quota). This is different from btrfs
> "exculsive" and "shared". This should not be hard to implement in btrfs,
> as "referenced" simply means all data in current subvolume, be it
> exclusive or shared.
> 

Oops - actually, this is exactly what the "referenced" quota is. Limiting
the total of a subvolume plus its snapshots is more difficult, as there is
no inherent connection between the qgroups of the source and the snapshot,
nor any built-in way to include the snapshot's qgroup in some common total
qgroup when creating the snapshot.


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-12 Thread Andrei Borzenkov
12.08.2018 06:16, Chris Murphy wrote:
> On Fri, Aug 10, 2018 at 9:29 PM, Duncan <1i5t5.dun...@cox.net> wrote:
>> Chris Murphy posted on Fri, 10 Aug 2018 12:07:34 -0600 as excerpted:
>>
>>> But whether data is shared or exclusive seems potentially ephemeral, and
>>> not something a sysadmin should even be able to anticipate let alone
>>> individual users.
>>
>> Define "user(s)".
> 
> The person who is saving their document on a network share, and
> they've never heard of Btrfs.
> 
> 
>> Arguably, in the context of btrfs tool usage, "user" /is/ the admin,
> 
> I'm not talking about btrfs tools. I'm talking about rational,
> predictable behavior of a shared folder.
> 
> If I try to drop a 1GiB file into my share and I'm denied, not enough
> free space, and behind the scenes it's because of a quota limit, I
> expect I can delete *any* file(s) amounting to create 1GiB free space
> and then I'll be able to drop that file successfully without error.
> 
> But if I'm unwittingly deleting shared files, my quota usage won't go
> down, and I still can't save my file. So now I somehow need a secret
> incantation to discover only my exclusive files and delete enough of
> them in order to save this 1GiB file. It's weird, it's unexpected, I
> think it's a use case failure. Maybe Btrfs quotas isn't meant to work
> with samba or NFS shares. *shrug*
> 

That's how both NetApp and ZFS work as well. I doubt anyone can
seriously call NetApp "not meant to work with NFS or CIFS shares".

On NetApp, the space available to an NFS/CIFS user is the volume size
minus the space frozen in snapshots. If a file captured in a snapshot is
deleted in the active file system, it does not make a single byte
available to the external user. That is what surprises almost every
first-time NetApp user.

On ZFS, snapshots are contained in the dataset, and you limit total
dataset space consumption including all snapshots. Thus the end effect is
the same - deleting data that is itself captured in a snapshot does not
make a single byte available. ZFS additionally allows you to restrict the
active file system size (the "referenced" quota in ZFS) - this more
closely matches your expectation - deleting a file in the active file
system decreases its "referenced" size, thus allowing the user to write
more data (as long as the user does not exceed the total dataset quota).
This is different from btrfs "exclusive" and "shared". It should not be
hard to implement in btrfs, as "referenced" simply means all data in the
current subvolume, be it exclusive or shared.

IOW, ZFS allows placing restrictions both on how much data a user can use
and on how much data the user is additionally allowed to protect
(snapshot).
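
In ZFS terms, for reference (the dataset name is just an example):

  zfs set quota=15G tank/home/user      # caps the dataset including all its snapshots
  zfs set refquota=10G tank/home/user   # caps only the live ("referenced") data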

> 
> 
>>
>> "Regular users" as you use the term, that is the non-admins who just need
>> to know how close they are to running out of their allotted storage
>> resources, shouldn't really need to care about btrfs tool usage in the
>> first place, and btrfs commands in general, including btrfs quota related
>> commands, really aren't targeted at them, and aren't designed to report
>> the type of information they are likely to find useful.  Other tools will
>> be more appropriate.
> 
> I'm not talking about any btrfs commands or even the term quota for
> regular users. I'm talking about saving a file, being denied, and how
> does the user figure out how to free up space?
> 

Users need to be educated, same as with NetApp and ZFS. There is no
magic; redirect-on-write filesystems work differently from traditional
ones, and users need to adapt.

Of course the devil is in the details, and the usability of btrfs quota is
far lower than that of NetApp/ZFS. There, space consumption information is
a first-class citizen integrated into the very basic tools, not something
bolted on later and mostly incomprehensible to the end user.

> Anyway, it's a hypothetical scenario. While I have Samba running on a
> Btrfs volume with various shares as subvolumes, I don't have quotas
> enabled.
> 
> 
> 

Given all the performance issues with quota reported on this list, that is
probably just as well for you.


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-11 Thread Andrei Borzenkov
10.08.2018 12:33, Tomasz Pala wrote:
> 
>> For 4 disk with 1T free space each, if you're using RAID5 for data, then
>> you can write 3T data.
>> But if you're also using RAID10 for metadata, and you're using default
>> inline, we can use small files to fill the free space, resulting 2T
>> available space.
>>
>> So in this case how would you calculate the free space? 3T or 2T or
>> anything between them?
> 
> The answear is pretty simple: 3T. Rationale:
> - this is the space I do can put in a single data stream,
> - people are aware that there is metadata overhead with any object;
>   after all, metadata are also data,
> - while filling the fs with small files the free space available would
>   self-adjust after every single file put, so after uploading 1T of such
>   files the df should report 1.5T free. There would be nothing weird(er
>   that now) that 1T of data has actually eaten 1.5T of storage.
> 
> No crystal ball calculations, just KISS; since one _can_ put 3T file
> (non sparse, uncompressible, bulk written) on a filesystem, the free space is 
> 3T.
> 

As far as I can tell, that is exactly what "df" reports now. "btrfs fi
us" will tell you both the max (reported by "df") and the worst-case min.

Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-11 Thread Andrei Borzenkov
10.08.2018 21:21, Tomasz Pala wrote:
> On Fri, Aug 10, 2018 at 07:39:30 -0400, Austin S. Hemmelgarn wrote:
> 
>>> I.e.: every shared segment should be accounted within quota (at least once).
>> I think what you mean to say here is that every shared extent should be 
>> accounted to quotas for every location it is reflinked from.  IOW, that 
>> if an extent is shared between two subvolumes each with it's own quota, 
>> they should both have it accounted against their quota.
> 
> Yes.
> 

This is what "referenced" in quota group report is, is not it? What is
missing here?


Re: Report correct filesystem usage / limits on BTRFS subvolumes with quota

2018-08-10 Thread Andrei Borzenkov
10.08.2018 10:33, Tomasz Pala wrote:
> On Fri, Aug 10, 2018 at 07:03:18 +0300, Andrei Borzenkov wrote:
> 
>>> So - the limit set on any user
>>
>> Does btrfs support per-user quota at all? I am aware only of per-subvolume 
>> quotas.
> 
> Well, this is a kind of deceptive word usage in "post-truth" times.
> 
> In this case both "user" and "quota" are not valid...
> - by "user" I ment general word, not unix-user account; such user might
>   possess some container running full-blown guest OS,
> - by "quota" btrfs means - I guess, dataset-quotas?
> 
> 
> In fact: https://btrfs.wiki.kernel.org/index.php/Quota_support
> "Quota support in BTRFS is implemented at a subvolume level by the use of 
> quota groups or qgroup"
> 
> - what the hell is "quota group" and how it differs from qgroup? According to 
> btrfs-quota(8):
> 
> "The quota groups (qgroups) are managed by the subcommand btrfs qgroup(8)"
> 
> - they are the same... just completely different from traditional "quotas".
> 
> 
> My suggestion would be to completely remove the standalone "quota" word
> from btrfs documentation - there is no "quota", just "subvolume quota"
> or "qgroup" supported.
> 

Well, a qgroup allows you to limit the amount of data that can be stored
in a subvolume (or under a quota group in general), so it behaves like a
traditional quota to me.
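
For example (sizes are arbitrary):

  btrfs quota enable /mnt
  btrfs qgroup limit 10G /mnt/subvol      # limit on referenced data in the subvolume
  btrfs qgroup limit -e 10G /mnt/subvol   # or limit exclusive data instead
  btrfs qgroup show -re /mnt              # show usage against both kinds of limit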


Re: BTRFS and databases

2018-08-02 Thread Andrei Borzenkov



Sent from my iPhone

> On 2 Aug 2018, at 10:02, Qu Wenruo  wrote:
> 
> 
> 
>> On 2018-08-01 11:45, MegaBrutal wrote:
>> Hi all,
>> 
>> I know it's a decade-old question, but I'd like to hear your thoughts
>> of today. By now, I became a heavy BTRFS user. Almost everywhere I use
>> BTRFS, except in situations when it is obvious there is no benefit
>> (e.g. /var/log, /boot). At home, all my desktop, laptop and server
>> computers are mainly running on BTRFS with only a few file systems on
>> ext4. I even installed BTRFS in corporate productive systems (in those
>> cases, the systems were mainly on ext4; but there were some specific
>> file systems those exploited BTRFS features).
>> 
>> But there is still one question that I can't get over: if you store a
>> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
>> with nodatacow, or would you just simply use ext4?
>> 
>> I know that with nodatacow, I take away most of the benefits of BTRFS
>> (those are actually hurting database performance – the exact CoW
>> nature that is elsewhere a blessing, with databases it's a drawback).
>> But are there any advantages of still sticking to BTRFS for a database
>> albeit CoW is disabled, or should I just return to the old and
>> reliable ext4 for those applications?
> 
> Since I'm not a expert in database, so I can totally be wrong, but what
> about completely disabling database write-ahead-log (WAL), and let
> btrfs' data CoW to handle data consistency completely?
> 

This would make the content of the database after a crash completely unpredictable,
thus making it impossible to reliably roll back a transaction.


> If there is some concern about the commit interval, it could be tuned by
> commit= mount option.
> 
> It may either lead to super unexpected fast behavior, or some unknown
> disaster. (And for latter, we at least could get some interesting
> feedback and bugs to fix)
> 
> Thanks,
> Qu
> 
>> 
>> 
>> Kind regards,
>> MegaBrutal
>> 
> 


Re: BTRFS and databases

2018-08-02 Thread Andrei Borzenkov



Sent from my iPhone

> On 2 Aug 2018, at 12:16, Martin Steigerwald  wrote:
> 
> Hugo Mills - 01.08.18, 10:56:
>>> On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
>>> I know it's a decade-old question, but I'd like to hear your
>>> thoughts
>>> of today. By now, I became a heavy BTRFS user. Almost everywhere I
>>> use BTRFS, except in situations when it is obvious there is no
>>> benefit (e.g. /var/log, /boot). At home, all my desktop, laptop and
>>> server computers are mainly running on BTRFS with only a few file
>>> systems on ext4. I even installed BTRFS in corporate productive
>>> systems (in those cases, the systems were mainly on ext4; but there
>>> were some specific file systems those exploited BTRFS features).
>>> 
>>> But there is still one question that I can't get over: if you store
>>> a
>>> database (e.g. MySQL), would you prefer having a BTRFS volume
>>> mounted
>>> with nodatacow, or would you just simply use ext4?
>> 
>>   Personally, I'd start with btrfs with autodefrag. It has some
>> degree of I/O overhead, but if the database isn't performance-critical
>> and already near the limits of the hardware, it's unlikely to make
>> much difference. Autodefrag should keep the fragmentation down to a
>> minimum.
> 
> I read that autodefrag would only help with small databases.
> 

I wonder if anyone has actually

a) quantified the performance impact
b) analyzed the cause

I have worked with NetApp for a long time, and I can say from first-hand
experience that fragmentation had zero impact on OLTP workloads. It did affect
backup performance, as expected, but that could be fixed by periodic reallocation
(defragmentation).

And even that took quite some time to observe (years), on a pretty highly loaded
database with regular backup and replication snapshots.

If btrfs is so susceptible to fragmentation, what is the reason for it?



Re: File permissions lost during send/receive?

2018-07-24 Thread Andrei Borzenkov
24.07.2018 15:16, Marc Joliet wrote:
> Hi list,
> 
> (Preemptive note: this was with btrfs-progs 4.15.1, I have since upgraded to 
> 4.17.  My kernel version is 4.14.52-gentoo.)
> 
> I recently had to restore the root FS of my desktop from backup (extent tree 
> corruption; not sure how, possibly a loose SATA cable?).  Everything was 
> fine, 
> even if restoring was slower than expected.  However, I encountered two files 
> with permission problems, namely:
> 
> - /bin/ping, which caused running ping as a normal user to fail due to 
> missing 
> permissions, and
> 
> - /sbin/unix_chkpwd (part of PAM), which prevented me from unlocking the KDE 
> Plasma lock screen; I needed to log into a TTY and run "loginctl unlock-
> session".
> 
> Both were easily fixed by reinstalling the affected packages (iputils and 
> pam), but I wonder why this happened after restoring from backup.
> 
> I originally thought it was related to the SUID bit not being set, because of 
> the explanation in the ping(8) man page (section "SECURITY"), but cannot find 
> evidence of that -- that is, after reinstallation, "ls -lh" does not show the 
> sticky bit being set, or any other special permission bits, for that matter:
> 
> % ls -lh /bin/ping /sbin/unix_chkpwd 
> -rwx--x--x 1 root root 60K 22. Jul 14:47 /bin/ping*   
>   
>   
>  
> -rwx--x--x 1 root root 31K 23. Jul 00:21 /sbin/unix_chkpwd*
> 
> (Note: no ACLs are set, either.)
> 

What "getcap /bin/ping" says? You may need to install package providing
getcap (libcap-progs here on openSUSE).
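
Background, in case it helps: modern distributions often ship ping with a
file capability instead of the setuid bit, and that capability is stored
in the security.capability xattr, which the backup/restore may or may not
have preserved. A quick check, and the usual manual fix if it is indeed
missing (reinstalling the package works too, as you found):

  getcap /bin/ping
  getfattr -m security.capability -d /bin/ping   # raw xattr view (attr package)
  setcap cap_net_raw+ep /bin/ping                # typical capability for ping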

> I do remember the qcheck program (a Gentoo-specific program that checks the 
> integrity of installed packages) complaining about wrong file permissions, 
> but 
> I didn't save its output, and there's a chance it *might* have been because I 
> ran qcheck without root permissions :-/ .
> 
> I vaguely remember some patches and/or discussion regarding permission 
> transfer issues with send/receive on this ML, but didn't find anything after 
> searching through my Email archive, so I might be misremembering.
> 
> Does anybody have any idea what possibly went wrong, or any similar 
> experience 
> to speak of?
> 
> Greetings
> 






Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Andrei Borzenkov
20.07.2018 20:16, Goffredo Baroncelli wrote:
> On 07/20/2018 07:17 AM, Andrei Borzenkov wrote:
>> 18.07.2018 22:42, Goffredo Baroncelli wrote:
>>> On 07/18/2018 09:20 AM, Duncan wrote:
>>>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>>>> excerpted:
>>>>
>>>>> On 07/17/2018 11:12 PM, Duncan wrote:
>>>>>> Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
>>>>>> excerpted:
>>>>>>
>>>>>>> On 07/15/2018 04:37 PM, waxhead wrote:
>>>>>>
>>>>>>> Striping and mirroring/pairing are orthogonal properties; mirror and
>>>>>>> parity are mutually exclusive.
>>>>>>
>>>>>> I can't agree.  I don't know whether you meant that in the global
>>>>>> sense,
>>>>>> or purely in the btrfs context (which I suspect), but either way I
>>>>>> can't agree.
>>>>>>
>>>>>> In the pure btrfs context, while striping and mirroring/pairing are
>>>>>> orthogonal today, Hugo's whole point was that btrfs is theoretically
>>>>>> flexible enough to allow both together and the feature may at some
>>>>>> point be added, so it makes sense to have a layout notation format
>>>>>> flexible enough to allow it as well.
>>>>>
>>>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>>>> have - striping (RAID0)
>>>>> - parity  (?)
>>>>> - striping + parity  (e.g. RAID5/6)
>>>>> - mirroring  (RAID1)
>>>>> - mirroring + striping  (RAID10)
>>>>>
>>>>> However you can't have mirroring+parity; this means that a notation
>>>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>>>> too verbose.
>>>>
>>>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>>>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>>>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>>>> on top of raid0.  
>>> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on 
>>> top of) ???
>>>
>>> Seriously, of course you can combine a lot of different profile; however 
>>> the only ones that make sense are the ones above.
>>
>> RAID50 (striping across RAID5) is common.
> 
> Yeah someone else report that. But other than reducing the number of disk per 
> raid5 (increasing the ration number of disks/number of parity disks), which 
> other advantages has ? 

It allows distributing I/O across a virtually unlimited number of disks
while confining the failure domain to a manageable size.

> Limiting the number of disk per raid, in BTRFS would be quite simple to 
> implement in the "chunk allocator"
> 

You mean that currently the RAID5 stripe size is equal to the number of disks?
Well, I suppose nobody is using btrfs with disk pools of two- or three-digit
size.


Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-19 Thread Andrei Borzenkov
18.07.2018 22:42, Goffredo Baroncelli wrote:
> On 07/18/2018 09:20 AM, Duncan wrote:
>> Goffredo Baroncelli posted on Wed, 18 Jul 2018 07:59:52 +0200 as
>> excerpted:
>>
>>> On 07/17/2018 11:12 PM, Duncan wrote:
 Goffredo Baroncelli posted on Mon, 16 Jul 2018 20:29:46 +0200 as
 excerpted:

> On 07/15/2018 04:37 PM, waxhead wrote:

> Striping and mirroring/pairing are orthogonal properties; mirror and
> parity are mutually exclusive.

 I can't agree.  I don't know whether you meant that in the global
 sense,
 or purely in the btrfs context (which I suspect), but either way I
 can't agree.

 In the pure btrfs context, while striping and mirroring/pairing are
 orthogonal today, Hugo's whole point was that btrfs is theoretically
 flexible enough to allow both together and the feature may at some
 point be added, so it makes sense to have a layout notation format
 flexible enough to allow it as well.
>>>
>>> When I say orthogonal, It means that these can be combined: i.e. you can
>>> have - striping (RAID0)
>>> - parity  (?)
>>> - striping + parity  (e.g. RAID5/6)
>>> - mirroring  (RAID1)
>>> - mirroring + striping  (RAID10)
>>>
>>> However you can't have mirroring+parity; this means that a notation
>>> where both 'C' ( = number of copy) and 'P' ( = number of parities) is
>>> too verbose.
>>
>> Yes, you can have mirroring+parity, conceptually it's simply raid5/6 on 
>> top of mirroring or mirroring on top of raid5/6, much as raid10 is 
>> conceptually just raid0 on top of raid1, and raid01 is conceptually raid1 
>> on top of raid0.  
> And what about raid 615156156 (raid 6 on top of raid 1 on top of raid 5 on 
> top of) ???
> 
> Seriously, of course you can combine a lot of different profile; however the 
> only ones that make sense are the ones above.

RAID50 (striping across RAID5) is common.



Re: Healthy amount of free space?

2018-07-19 Thread Andrei Borzenkov
18.07.2018 16:30, Austin S. Hemmelgarn wrote:
> On 2018-07-18 09:07, Chris Murphy wrote:
>> On Wed, Jul 18, 2018 at 6:35 AM, Austin S. Hemmelgarn
>>  wrote:
>>
>>> If you're doing a training presentation, it may be worth mentioning that
>>> preallocation with fallocate() does not behave the same on BTRFS as
>>> it does
>>> on other filesystems.  For example, the following sequence of commands:
>>>
>>>  fallocate -l X ./tmp
>>>  dd if=/dev/zero of=./tmp bs=1 count=X
>>>
>>> Will always work on ext4, XFS, and most other filesystems, for any
>>> value of
>>> X between zero and just below the total amount of free space on the
>>> filesystem.  On BTRFS though, it will reliably fail with ENOSPC for
>>> values
>>> of X that are greater than _half_ of the total amount of free space
>>> on the
>>> filesystem (actually, greater than just short of half).  In essence,
>>> preallocating space does not prevent COW semantics for the first write
>>> unless the file is marked NOCOW.
>>
>> Is this a bug, or is it suboptimal behavior, or is it intentional?
> It's been discussed before, though I can't find the email thread right
> now.  Pretty much, this is _technically_ not incorrect behavior, as the
> documentation for fallocate doesn't say that subsequent writes can't
> fail due to lack of space.  I personally consider it a bug though
> because it breaks from existing behavior in a way that is avoidable and
> defies user expectations.
> 
> There are two issues here:
> 
> 1. Regions preallocated with fallocate still do COW on the first write
> to any given block in that region.  This can be handled by either
> treating the first write to each block as NOCOW, or by allocating a bit

How is that possible? As long as fallocate actually allocates space, that
space should be checksummed, which means it is no longer possible to
overwrite it in place. Maybe fallocate on btrfs could simply reserve space.
I am not sure whether that complies with the fallocate specification, but
as long as the intention is to ensure that a write will not fail for lack
of space, it should be adequate (to the extent this can be ensured on
btrfs, of course). Also, a hole in a file returns zeros by definition,
which also matches fallocate behavior.

> of extra space and doing a rotating approach like this for writes:
>     - Write goes into the extra space.
>     - Once the write is done, convert the region covered by the write
>   into a new block of extra space.
>     - When the final block of the preallocated region is written,
>   deallocate the extra space.
> 2. Preallocation does not completely account for necessary metadata
> space that will be needed to store the data there.  This may not be
> necessary if the first issue is addressed properly.
>>
>> And then I wonder what happens with XFS COW:
>>
>>   fallocate -l X ./tmp
>>   cp --reflink ./tmp ./tmp2
>>   dd if=/dev/zero of=./tmp bs=1 count=X
> I'm not sure.  In this particular case, this will fail on BTRFS for any
> X larger than just short of one third of the total free space.  I would
> expect it to fail for any X larger than just short of half instead.
> 
> ZFS gets around this by not supporting fallocate (well, kind of, if
> you're using glibc and call posix_fallocate, that _will_ work, but it
> will take forever because it works by writing out each block of space
> that's being allocated, which, ironically, means that that still suffers
> from the same issue potentially that we have).

What happens on btrfs then? fallocate specifies that the new space should be
initialized to zero, so something still has to write those zeros?


Re: btrfs check (not lowmem) and OOM-like hangs (4.17.6)

2018-07-18 Thread Andrei Borzenkov
18.07.2018 03:05, Qu Wenruo wrote:
> 
> 
>> On 2018-07-18 04:59, Marc MERLIN wrote:
>> Ok, I did more testing. Qu is right that btrfs check does not crash the 
>> kernel.
>> It just takes all the memory until linux hangs everywhere, and somehow (no 
>> idea why) 
>> the OOM killer never triggers.
> 
> No OOM triggers? That's a little strange.
> Maybe it's related to how kernel handles memory over-commit?
> 
> And for the hang, I think it's related to some memory allocation failure
> and error handler just didn't handle it well, so it's causing deadlock
> for certain page.
> 
> ENOMEM handling is pretty common but hardly verified, so it's not that
> strange, but we must locate the problem.
> 
>> Details below:
>>
>> On Tue, Jul 17, 2018 at 01:32:57PM -0700, Marc MERLIN wrote:
>>> Here is what I got when the system was not doing well (it took minutes to 
>>> run):
>>>
>>>  total   used   free sharedbuffers cached
>>> Mem:  32643788   32070952 572836  0 1021604378772
>>> -/+ buffers/cache:   275900205053768
>>> Swap: 15616764 973596   14643168
>>
>> ok, the reason it was not that close to 0 was due to /dev/shm it seems.
>> I cleared that, and now I can get it to go to near 0 again.
>> I'm wrong about the system being fully crashed, it's not, it's just very
>> close to being hung.
>> I can type killall -9 btrfs in the serial console and wait a few minutes.
>> The system eventually recovers, but it's impossible to fix anything via ssh 
>> apparently because networking does not get to run when I'm in this state.
>>
>> I'm not sure why my system reproduces this easy while Qu's system does not, 
>> but Qu was right that the kernel is not dead and that it's merely a problem 
>> of userspace
>> taking all the RAM and somehow not being killed by OOM
> 
> In my system, at least I'm not using btrfs as root fs, and for the
> memory eating program I normally ensure it's eating all the memory +
> swap, so OOM killer is always triggered, maybe that's the cause.
> 
> So in your case, maybe it's btrfs not really taking up all memory, thus
> OOM killer not triggered.
> 
>>
>> I checked the PID and don't see why it's not being killed:
>> gargamel:/proc/31006# grep . oom*
>> oom_adj:0
>> oom_score:221   << this increases a lot, but OOM never kills it
>> oom_score_adj:0
>>
>> I have these variables:
>> /proc/sys/vm/oom_dump_tasks:1
>> /proc/sys/vm/oom_kill_allocating_task:0
>> /proc/sys/vm/overcommit_kbytes:0
>> /proc/sys/vm/overcommit_memory:0
>> /proc/sys/vm/overcommit_ratio:50  << is this bad (seems default)
> 
> Any kernel dmesg about OOM killer triggered?
> 
>>
>> Here is my system when it virtually died:
>> ER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
>> root 31006 21.2 90.7 29639020 29623180 pts/19 D+ 13:49   1:35 ./btrfs 
>> check /dev/mapper/dshelf2
>>
>>  total   used   free sharedbuffers cached
>> Mem:  32643788   32180100 463688  0  44664 119508
>> -/+ buffers/cache:   32015928 627860
>> Swap: 15616764 443676   15173088
> 
> For swap, it looks like only some other program's memory is swapped out,
> not btrfs'.
> 
> And unfortunately, I'm not so familiar with OOM/MM code outside of
> filesystem.
> Any help from other experienced developers would definitely help to
> solve why memory of 'btrfs check' is not swapped out or why OOM killer
> is not triggered.
> 

Almost all used memory is marked as "active", and active pages are not
swapped out. A page is active if it was accessed recently. Is it possible
that the btrfs check logic does frequent scans across all of its
allocated memory?
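
In the meantime one can at least watch where the memory sits and, if
needed, nudge the OOM killer towards the offending process. A rough
sketch (the PID is the one from the ps output above):

# see how much anonymous memory sits on the active list vs. swap
grep -E 'Active\(anon\)|Inactive\(anon\)|SwapFree' /proc/meminfo

# make this particular 'btrfs check' the preferred OOM victim
# (oom_score_adj ranges from -1000 to 1000)
echo 1000 > /proc/31006/oom_score_adj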



> Thanks,
> Qu
> 
>>
>> MemTotal:   32643788 kB
>> MemFree:  463440 kB
>> MemAvailable:  44864 kB
>> Buffers:   44664 kB
>> Cached:   120360 kB
>> SwapCached:87064 kB
>> Active: 30381404 kB
>> Inactive: 585952 kB
>> Active(anon):   30334696 kB
>> Inactive(anon):   474624 kB
>> Active(file):  46708 kB
>> Inactive(file):   111328 kB
>> Unevictable:5616 kB
>> Mlocked:5616 kB
>> SwapTotal:  15616764 kB
>> SwapFree:   15173088 kB
>> Dirty:  1636 kB
>> Writeback: 4 kB
>> AnonPages:  30734240 kB
>> Mapped:67236 kB
>> Shmem:  3036 kB
>> Slab: 267884 kB
>> SReclaimable:  51528 kB
>> SUnreclaim:   216356 kB
>> KernelStack:   10144 kB
>> PageTables:69284 kB
>> NFS_Unstable:  0 kB
>> Bounce:0 kB
>> WritebackTmp:  0 kB
>> CommitLimit:31938656 kB
>> Committed_AS:   32865492 kB
>> VmallocTotal:   34359738367 kB
>> VmallocUsed:   0 kB
>> VmallocChunk:  0 kB
>> HardwareCorrupted: 0 kB
>> AnonHugePages: 0 kB
>> ShmemHugePages:0 kB
>> ShmemPmdMapped:0 kB
>> CmaTotal:  16384 kB
>> CmaFree:   0 kB
>> HugePages_Total:   0
>> HugePages_Free:

Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-05 Thread Andrei Borzenkov
03.07.2018 10:15, Duncan пишет:
> Andrei Borzenkov posted on Tue, 03 Jul 2018 07:25:14 +0300 as excerpted:
> 
>> 02.07.2018 21:35, Austin S. Hemmelgarn пишет:
>>> them (trimming blocks on BTRFS gets rid of old root trees, so it's a
>>> bit dangerous to do it while writes are happening).
>>
>> Could you please elaborate? Do you mean btrfs can trim data before new
>> writes are actually committed to disk?
> 
> No.
> 
> But normally old roots aren't rewritten for some time simply due to odds 
> (fuller filesystems will of course recycle them sooner), and the btrfs 
> mount option usebackuproot (formerly recovery, until the norecovery mount 
> option that parallels that of other filesystems was added and this option 
> was renamed to avoid confusion) can be used to try an older root if the 
> current root is too damaged to successfully mount.
>
> But other than simply by odds not using them again immediately, btrfs has
> no special protection for those old roots, and trim/discard will recover 
> them to hardware-unused as it does any other unused space, tho whether it 
> simply marks them for later processing or actually processes them 
> immediately is up to the individual implementation -- some do it 
> immediately, killing all chances at using the backup root because it's 
> already zeroed out, some don't.
> 

How is it relevant to "while writes are happening"? Will trimming old
trees immediately after writes have stopped be any different? Why?

> In the context of the discard mount option, that can mean there's never 
> any old roots available ever, as they've already been cleaned up by the 
> hardware due to the discard option telling the hardware to do it.
> 
> But even not using that mount option, and simply doing the trims 
> periodically, as done weekly by for instance the systemd fstrim timer and 
> service units, or done manually if you prefer, obviously potentially 
> wipes the old roots at that point.  If the system's effectively idle at 
> the time, not much risk as the current commit is likely to represent a 
> filesystem in full stasis, but if there's lots of writes going on at that 
> moment *AND* the system happens to crash at just the wrong time, before 
> additional commits have recreated at least a bit of root history, again, 
> you'll potentially be left without any old roots for the usebackuproot 
> mount option to try to fall back to, should it actually be necessary.
> 

Sorry? You are just saying that "previous state can be discarded before
new state is committed", just more verbosely.


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Andrei Borzenkov
02.07.2018 21:35, Austin S. Hemmelgarn пишет:
> them (trimming blocks on BTRFS gets rid of old root trees, so it's a bit
> dangerous to do it while writes are happening).

Could you please elaborate? Do you mean btrfs can trim data before new
writes are actually committed to disk?


Re: how to best segment a big block device in resizeable btrfs filesystems?

2018-07-02 Thread Andrei Borzenkov
03.07.2018 04:37, Qu Wenruo пишет:
> 
> BTW, IMHO the bcache is not really helping for backup system, which is
> more write oriented.
> 

There is a new writecache target which may help in this case.
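
If the backup target already sits on LVM, something along these lines
should put an SSD in front of it as a write cache (syntax as in
lvmcache(7); volume group and device names are made up):

# fast_ssd is a PV on the SSD, vg/backup is the slow LV under the backup fs
lvcreate -n cachevol -L 100G vg /dev/fast_ssd
lvconvert --type writecache --cachevol cachevol vg/backup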


Re: Incremental send/receive broken after snapshot restore

2018-06-30 Thread Andrei Borzenkov
01.07.2018 02:16, Marc MERLIN пишет:
> Sorry that I missed the beginning of this discussion, but I think this is
> what I documented here after hitting hte same problem:

This is similar, yes. IIRC you had a different starting point though. Here
it should have been possible to use only standard, documented tools,
without any low-level surgery, if done right.

> http://marc.merlins.org/perso/btrfs/post_2018-03-09_Btrfs-Tips_-Rescuing-A-Btrfs-Send-Receive-Relationship.html
> 

M-m-m ... the statement "because the source had a Parent UUID value too, I
was actually supposed to set Received UUID on the destination to it" is
entirely off the mark, nor does it even match the subsequent command. You
probably meant to say "because the source had a *Received* UUID value
too, I was actually supposed to set Received UUID on the destination to
it". That is correct. And that is what I meant above - received_uuid is a
misnomer; it is actually used as a common data set identifier. Two
subvolumes with the same received_uuid are presumed to have identical
content.

Which makes the very idea of being able to freely manipulate it rather
questionable.

P.S. Of course "parent" is also highly ambiguous in the btrfs world. We
really need to come up with acceptable terminology to disambiguate tree
parent, snapshot parent and replication parent. The latter would probably
be better called "base snapshot" (NetApp calls it "common snapshot"); an
error message like "Could not find base snapshot matching UUID
xxx" would be far less ambiguous.

> Marc
> 
> On Sun, Jul 01, 2018 at 01:03:37AM +0200, Hannes Schweizer wrote:
>> On Sat, Jun 30, 2018 at 10:02 PM Andrei Borzenkov  
>> wrote:
>>>
>>> 30.06.2018 21:49, Andrei Borzenkov пишет:
>>>> 30.06.2018 20:49, Hannes Schweizer пишет:
>>> ...
>>>>>
>>>>> I've tested a few restore methods beforehand, and simply creating a
>>>>> writeable clone from the restored snapshot does not work for me, eg:
>>>>> # create some source snapshots
>>>>> btrfs sub create test_root
>>>>> btrfs sub snap -r test_root test_snap1
>>>>> btrfs sub snap -r test_root test_snap2
>>>>>
>>>>> # send a full and incremental backup to external disk
>>>>> btrfs send test_snap2 | btrfs receive /run/media/schweizer/external
>>>>> btrfs sub snap -r test_root test_snap3
>>>>> btrfs send -c test_snap2 test_snap3 | btrfs receive
>>>>> /run/media/schweizer/external
>>>>>
>>>>> # simulate disappearing source
>>>>> btrfs sub del test_*
>>>>>
>>>>> # restore full snapshot from external disk
>>>>> btrfs send /run/media/schweizer/external/test_snap3 | btrfs receive .
>>>>>
>>>>> # create writeable clone
>>>>> btrfs sub snap test_snap3 test_root
>>>>>
>>>>> # try to continue with backup scheme from source to external
>>>>> btrfs sub snap -r test_root test_snap4
>>>>>
>>>>> # this fails!!
>>>>> btrfs send -c test_snap3 test_snap4 | btrfs receive
>>>>> /run/media/schweizer/external
>>>>> At subvol test_snap4
>>>>> ERROR: parent determination failed for 2047
>>>>> ERROR: empty stream is not considered valid
>>>>>
>>>>
>>>> Yes, that's expected. Incremental stream always needs valid parent -
>>>> this will be cloned on destination and incremental changes applied to
>>>> it. "-c" option is just additional sugar on top of it which might reduce
>>>> size of stream, but in this case (i.e. without "-p") it also attempts to
>>>> guess parent subvolume for test_snap4 and this fails because test_snap3
>>>> and test_snap4 do not have common parent so test_snap3 is rejected as
>>>> valid parent snapshot. You can restart incremental-forever chain by
>>>> using explicit "-p" instead:
>>>>
>>>> btrfs send -p test_snap3 test_snap4
>>>>
>>>> Subsequent snapshots (test_snap5 etc) will all have common parent with
>>>> immediate predecessor again so "-c" will work.
>>>>
>>>> Note that technically "btrfs send" with single "-c" option is entirely
>>>> equivalent to "btrfs -p". Using "-p" would have avoided this issue. :)
>>>> Although this implicit check for common parent may be considered a good
>>>> thing in this case.
>>>>
>>>> P.S. loo

Re: Incremental send/receive broken after snapshot restore

2018-06-30 Thread Andrei Borzenkov
30.06.2018 21:49, Andrei Borzenkov пишет:
> 30.06.2018 20:49, Hannes Schweizer пишет:
...
>>
>> I've tested a few restore methods beforehand, and simply creating a
>> writeable clone from the restored snapshot does not work for me, eg:
>> # create some source snapshots
>> btrfs sub create test_root
>> btrfs sub snap -r test_root test_snap1
>> btrfs sub snap -r test_root test_snap2
>>
>> # send a full and incremental backup to external disk
>> btrfs send test_snap2 | btrfs receive /run/media/schweizer/external
>> btrfs sub snap -r test_root test_snap3
>> btrfs send -c test_snap2 test_snap3 | btrfs receive
>> /run/media/schweizer/external
>>
>> # simulate disappearing source
>> btrfs sub del test_*
>>
>> # restore full snapshot from external disk
>> btrfs send /run/media/schweizer/external/test_snap3 | btrfs receive .
>>
>> # create writeable clone
>> btrfs sub snap test_snap3 test_root
>>
>> # try to continue with backup scheme from source to external
>> btrfs sub snap -r test_root test_snap4
>>
>> # this fails!!
>> btrfs send -c test_snap3 test_snap4 | btrfs receive
>> /run/media/schweizer/external
>> At subvol test_snap4
>> ERROR: parent determination failed for 2047
>> ERROR: empty stream is not considered valid
>>
> 
> Yes, that's expected. Incremental stream always needs valid parent -
> this will be cloned on destination and incremental changes applied to
> it. "-c" option is just additional sugar on top of it which might reduce
> size of stream, but in this case (i.e. without "-p") it also attempts to
> guess parent subvolume for test_snap4 and this fails because test_snap3
> and test_snap4 do not have common parent so test_snap3 is rejected as
> valid parent snapshot. You can restart incremental-forever chain by
> using explicit "-p" instead:
> 
> btrfs send -p test_snap3 test_snap4
> 
> Subsequent snapshots (test_snap5 etc) will all have common parent with
> immediate predecessor again so "-c" will work.
> 
> Note that technically "btrfs send" with single "-c" option is entirely
> equivalent to "btrfs -p". Using "-p" would have avoided this issue. :)
> Although this implicit check for common parent may be considered a good
> thing in this case.
> 
> P.S. looking at the above, it probably needs to be in manual page for
> btrfs-send. It took me quite some time to actually understand the
> meaning of "-p" and "-c" and behavior if they are present.
> 
...
>>
>> Is there some way to reset the received_uuid of the following snapshot
>> on online?
>> ID 258 gen 13742 top level 5 parent_uuid -
>>received_uuid 6c683d90-44f2-ad48-bb84-e9f241800179 uuid
>> 46db1185-3c3e-194e-8d19-7456e532b2f3 path diablo
>>
> 
> There is no "official" tool but this question came up quite often.
> Search this list, I believe recently one-liner using python-btrfs was
> posted. Note that also patch that removes received_uuid when "ro"
> propery is removed was suggested, hopefully it will be merged at some
> point. Still I personally consider ability to flip read-only property
> the very bad thing that should have never been exposed in the first place.
> 

Note that if you remove received_uuid (explicitly or - in the future -
implicitly) you will not be able to restart incremental send anymore.
Without received_uuid there will be no way to match the source test_snap3
with the destination test_snap3. So you *must* preserve it and start with
a writable clone.
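
In other words, keeping the names from this thread, the safe sequence
after the restore is roughly:

# the restored read-only snapshot keeps its received_uuid - leave it alone
btrfs sub snap test_snap3 test_root        # writable clone for live data
btrfs sub snap -r test_root test_snap4
# -p against the preserved read-only snapshot restarts the incremental chain
btrfs send -p test_snap3 test_snap4 | btrfs receive /run/media/schweizer/external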

received_uuid is a misnomer. I wish it were named "content_uuid" or
"snap_uuid" with the following semantics:

1. When a read-only snapshot of a writable volume is created, content_uuid
is initialized

2. A read-only snapshot of a read-only snapshot inherits content_uuid

3. The destination of "btrfs send" inherits content_uuid

4. A writable snapshot of a read-only snapshot clears content_uuid

5. Clearing the read-only property clears content_uuid

This would make it more straightforward to cascade and restart
replication by having a single subvolume property to match against.


Re: Incremental send/receive broken after snapshot restore

2018-06-30 Thread Andrei Borzenkov
30.06.2018 20:49, Hannes Schweizer пишет:
> On Sat, Jun 30, 2018 at 8:24 AM Andrei Borzenkov  wrote:
>>
>> Do not reply privately to mails on list.
>>
>> 29.06.2018 22:10, Hannes Schweizer пишет:
>>> On Fri, Jun 29, 2018 at 7:44 PM Andrei Borzenkov  
>>> wrote:
>>>>
>>>> 28.06.2018 23:09, Hannes Schweizer пишет:
>>>>> Hi,
>>>>>
>>>>> Here's my environment:
>>>>> Linux diablo 4.17.0-gentoo #5 SMP Mon Jun 25 00:26:55 CEST 2018 x86_64
>>>>> Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz GenuineIntel GNU/Linux
>>>>> btrfs-progs v4.17
>>>>>
>>>>> Label: 'online'  uuid: e4dc6617-b7ed-4dfb-84a6-26e3952c8390
>>>>> Total devices 2 FS bytes used 3.16TiB
>>>>> devid1 size 1.82TiB used 1.58TiB path /dev/mapper/online0
>>>>> devid2 size 1.82TiB used 1.58TiB path /dev/mapper/online1
>>>>> Data, RAID0: total=3.16TiB, used=3.15TiB
>>>>> System, RAID0: total=16.00MiB, used=240.00KiB
>>>>> Metadata, RAID0: total=7.00GiB, used=4.91GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>
>>>>> Label: 'offline'  uuid: 5b449116-93e5-473e-aaf5-bf3097b14f29
>>>>> Total devices 2 FS bytes used 3.52TiB
>>>>> devid1 size 5.46TiB used 3.53TiB path /dev/mapper/offline0
>>>>> devid2 size 5.46TiB used 3.53TiB path /dev/mapper/offline1
>>>>> Data, RAID1: total=3.52TiB, used=3.52TiB
>>>>> System, RAID1: total=8.00MiB, used=512.00KiB
>>>>> Metadata, RAID1: total=6.00GiB, used=5.11GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>
>>>>> Label: 'external'  uuid: 8bf13621-01f0-4f09-95c7-2c157d3087d0
>>>>> Total devices 1 FS bytes used 3.65TiB
>>>>> devid1 size 5.46TiB used 3.66TiB path
>>>>> /dev/mapper/luks-3c196e96-d46c-4a9c-9583-b79c707678fc
>>>>> Data, single: total=3.64TiB, used=3.64TiB
>>>>> System, DUP: total=32.00MiB, used=448.00KiB
>>>>> Metadata, DUP: total=11.00GiB, used=9.72GiB
>>>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>>>
>>>>>
>>>>> The following automatic backup scheme is in place:
>>>>> hourly:
>>>>> btrfs sub snap -r online/root online/root.
>>>>>
>>>>> daily:
>>>>> btrfs sub snap -r online/root online/root.
>>>>> btrfs send -c online/root.
>>>>> online/root. | btrfs receive offline
>>>>> btrfs sub del -c online/root.
>>>>>
>>>>> monthly:
>>>>> btrfs sub snap -r online/root online/root.
>>>>> btrfs send -c online/root.
>>>>> online/root. | btrfs receive external
>>>>> btrfs sub del -c online/root.
>>>>>
>>>>> Now here are the commands leading up to my problem:
>>>>> After the online filesystem suddenly went ro, and btrfs check showed
>>>>> massive problems, I decided to start the online array from scratch:
>>>>> 1: mkfs.btrfs -f -d raid0 -m raid0 -L "online" /dev/mapper/online0
>>>>> /dev/mapper/online1
>>>>>
>>>>> As you can see from the backup commands above, the snapshots of
>>>>> offline and external are not related, so in order to at least keep the
>>>>> extensive backlog of the external snapshot set (including all
>>>>> reflinks), I decided to restore the latest snapshot from external.
>>>>> 2: btrfs send external/root. | btrfs receive online
>>>>>
>>>>> I wanted to ensure I can restart the incremental backup flow from
>>>>> online to external, so I did this
>>>>> 3: mv online/root. online/root
>>>>> 4: btrfs sub snap -r online/root online/root.
>>>>> 5: btrfs property set online/root ro false
>>>>>
>>>>> Now, I naively expected a simple restart of my automatic backups for
>>>>> external should work.
>>>>> However after running
>>>>> 6: btrfs sub snap -r online/root online/root.
>>>>> 7: btrfs send -c online/root.
>>>>> online/root. | btrfs receive external
>>>>
>>>> You just recreated your "online" filesystem from scratch. Where
>>>> "old_external_reference" comes

Re: Incremental send/receive broken after snapshot restore

2018-06-30 Thread Andrei Borzenkov
Do not reply privately to mails on list.

29.06.2018 22:10, Hannes Schweizer пишет:
> On Fri, Jun 29, 2018 at 7:44 PM Andrei Borzenkov  wrote:
>>
>> 28.06.2018 23:09, Hannes Schweizer пишет:
>>> Hi,
>>>
>>> Here's my environment:
>>> Linux diablo 4.17.0-gentoo #5 SMP Mon Jun 25 00:26:55 CEST 2018 x86_64
>>> Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz GenuineIntel GNU/Linux
>>> btrfs-progs v4.17
>>>
>>> Label: 'online'  uuid: e4dc6617-b7ed-4dfb-84a6-26e3952c8390
>>> Total devices 2 FS bytes used 3.16TiB
>>> devid1 size 1.82TiB used 1.58TiB path /dev/mapper/online0
>>> devid2 size 1.82TiB used 1.58TiB path /dev/mapper/online1
>>> Data, RAID0: total=3.16TiB, used=3.15TiB
>>> System, RAID0: total=16.00MiB, used=240.00KiB
>>> Metadata, RAID0: total=7.00GiB, used=4.91GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> Label: 'offline'  uuid: 5b449116-93e5-473e-aaf5-bf3097b14f29
>>> Total devices 2 FS bytes used 3.52TiB
>>> devid1 size 5.46TiB used 3.53TiB path /dev/mapper/offline0
>>> devid2 size 5.46TiB used 3.53TiB path /dev/mapper/offline1
>>> Data, RAID1: total=3.52TiB, used=3.52TiB
>>> System, RAID1: total=8.00MiB, used=512.00KiB
>>> Metadata, RAID1: total=6.00GiB, used=5.11GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> Label: 'external'  uuid: 8bf13621-01f0-4f09-95c7-2c157d3087d0
>>> Total devices 1 FS bytes used 3.65TiB
>>> devid1 size 5.46TiB used 3.66TiB path
>>> /dev/mapper/luks-3c196e96-d46c-4a9c-9583-b79c707678fc
>>> Data, single: total=3.64TiB, used=3.64TiB
>>> System, DUP: total=32.00MiB, used=448.00KiB
>>> Metadata, DUP: total=11.00GiB, used=9.72GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>>
>>> The following automatic backup scheme is in place:
>>> hourly:
>>> btrfs sub snap -r online/root online/root.
>>>
>>> daily:
>>> btrfs sub snap -r online/root online/root.
>>> btrfs send -c online/root.
>>> online/root. | btrfs receive offline
>>> btrfs sub del -c online/root.
>>>
>>> monthly:
>>> btrfs sub snap -r online/root online/root.
>>> btrfs send -c online/root.
>>> online/root. | btrfs receive external
>>> btrfs sub del -c online/root.
>>>
>>> Now here are the commands leading up to my problem:
>>> After the online filesystem suddenly went ro, and btrfs check showed
>>> massive problems, I decided to start the online array from scratch:
>>> 1: mkfs.btrfs -f -d raid0 -m raid0 -L "online" /dev/mapper/online0
>>> /dev/mapper/online1
>>>
>>> As you can see from the backup commands above, the snapshots of
>>> offline and external are not related, so in order to at least keep the
>>> extensive backlog of the external snapshot set (including all
>>> reflinks), I decided to restore the latest snapshot from external.
>>> 2: btrfs send external/root. | btrfs receive online
>>>
>>> I wanted to ensure I can restart the incremental backup flow from
>>> online to external, so I did this
>>> 3: mv online/root. online/root
>>> 4: btrfs sub snap -r online/root online/root.
>>> 5: btrfs property set online/root ro false
>>>
>>> Now, I naively expected a simple restart of my automatic backups for
>>> external should work.
>>> However after running
>>> 6: btrfs sub snap -r online/root online/root.
>>> 7: btrfs send -c online/root.
>>> online/root. | btrfs receive external
>>
>> You just recreated your "online" filesystem from scratch. Where
>> "old_external_reference" comes from? You did not show steps used to
>> create it.
>>
>>> I see the following error:
>>> ERROR: unlink root/.ssh/agent-diablo-_dev_pts_3 failed. No such file
>>> or directory
>>>
>>> Which is unfortunate, but the second problem actually encouraged me to
>>> post this message.
>>> As planned, I had to start the offline array from scratch as well,
>>> because I no longer had any reference snapshot for incremental backups
>>> on other devices:
>>> 8: mkfs.btrfs -f -d raid1 -m raid1 -L "offline" /dev/mapper/offline0
>>> /dev/mapper/offline1
>>>
>>> However restarting the automatic daily backup flow bails out with a
>>> similar error, although no p

Re: unsolvable technical issues?

2018-06-29 Thread Andrei Borzenkov
30.06.2018 06:22, Duncan пишет:
> Austin S. Hemmelgarn posted on Mon, 25 Jun 2018 07:26:41 -0400 as
> excerpted:
> 
>> On 2018-06-24 16:22, Goffredo Baroncelli wrote:
>>> On 06/23/2018 07:11 AM, Duncan wrote:
 waxhead posted on Fri, 22 Jun 2018 01:13:31 +0200 as excerpted:

> According to this:
>
> https://stratis-storage.github.io/StratisSoftwareDesign.pdf Page 4 ,
> section 1.2
>
> It claims that BTRFS still have significant technical issues that may
> never be resolved.

 I can speculate a bit.

 1) When I see btrfs "technical issue that may never be resolved", the
 #1 first thing I think of, that AFAIK there are _definitely_ no plans
 to resolve, because it's very deeply woven into the btrfs core by now,
 is...

 [1)] Filesystem UUID Identification.  Btrfs takes the UU bit of
 Universally Unique quite literally, assuming they really *are*
 unique, at least on that system[.]  Because
 btrfs uses this supposedly unique ID to ID devices that belong to the
 filesystem, it can get *very* mixed up, with results possibly
 including dataloss, if it sees devices that don't actually belong to a
 filesystem with the same UUID as a mounted filesystem.
>>>
>>> As partial workaround you can disable udev btrfs rules and then do a
>>> "btrfs dev scan" manually only for the device which you need.
> 
>> You don't even need `btrfs dev scan` if you just specify the exact set
>> of devices in the mount options.  The `device=` mount option tells the
>> kernel to check that device during the mount process.
> 
> Not that lvm does any better in this regard[1], but has btrfs ever solved 
> the bug where only one device= in the kernel commandline's rootflags= 
> would take effect, effectively forcing initr* on people (like me) who 
> would otherwise not need them and prefer to do without them, if they're 
> using a multi-device btrfs as root?
> 

This requires in-kernel device scanning; I doubt we will ever see it.

> Not to mention the fact that as kernel people will tell you, device 
> enumeration isn't guaranteed to be in the same order every boot, so 
> device=/dev/* can't be relied upon and shouldn't be used -- but of course 
> device=LABEL= and device=UUID= and similar won't work without userspace, 
> basically udev (if they work at all, IDK if they actually do).
> 
> Tho in practice from what I've seen, device enumeration order tends to be 
> dependable /enough/ for at least those without enterprise-level numbers 
> of devices to enumerate.

Just boot with a USB stick or eSATA drive plugged in; there is a good
chance it changes the device order.
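
A somewhat more robust variant for a non-root filesystem is to spell out
stable paths, e.g. the /dev/disk/by-id/ links, in the device= options
(disk names below are made up). It does not help for rootflags= without
an initrd, since those links are created by udev:

mount -o device=/dev/disk/by-id/ata-DISK_A-part1,device=/dev/disk/by-id/ata-DISK_B-part1 \
  /dev/disk/by-id/ata-DISK_A-part1 /mnt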

>  True, it /does/ change from time to time with a 
> new kernel, but anybody sane keeps a tested-dependable old kernel around 
> to boot to until they know the new one works as expected, and that sort 
> of change is seldom enough that users can boot to the old kernel and 
> adjust their settings for the new one as necessary when it does happen.  
> So as "don't do it that way because it's not reliable" as it might indeed 
> be in theory, in practice, just using an ordered /dev/* in kernel 
> commandlines does tend to "just work"... provided one is ready for the 
> occasion when that device parameter might need a bit of adjustment, of 
> course.
> 
...
> 
> ---
> [1] LVM is userspace code on top of the kernelspace devicemapper, and 
> therefore requires an initr* if root is on lvm, regardless.  So btrfs 
> actually does a bit better here, only requiring it for multi-device btrfs.
> 



Re: Incremental send/receive broken after snapshot restore

2018-06-29 Thread Andrei Borzenkov
28.06.2018 23:09, Hannes Schweizer пишет:
> Hi,
> 
> Here's my environment:
> Linux diablo 4.17.0-gentoo #5 SMP Mon Jun 25 00:26:55 CEST 2018 x86_64
> Intel(R) Core(TM) i5 CPU 760 @ 2.80GHz GenuineIntel GNU/Linux
> btrfs-progs v4.17
> 
> Label: 'online'  uuid: e4dc6617-b7ed-4dfb-84a6-26e3952c8390
> Total devices 2 FS bytes used 3.16TiB
> devid1 size 1.82TiB used 1.58TiB path /dev/mapper/online0
> devid2 size 1.82TiB used 1.58TiB path /dev/mapper/online1
> Data, RAID0: total=3.16TiB, used=3.15TiB
> System, RAID0: total=16.00MiB, used=240.00KiB
> Metadata, RAID0: total=7.00GiB, used=4.91GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Label: 'offline'  uuid: 5b449116-93e5-473e-aaf5-bf3097b14f29
> Total devices 2 FS bytes used 3.52TiB
> devid1 size 5.46TiB used 3.53TiB path /dev/mapper/offline0
> devid2 size 5.46TiB used 3.53TiB path /dev/mapper/offline1
> Data, RAID1: total=3.52TiB, used=3.52TiB
> System, RAID1: total=8.00MiB, used=512.00KiB
> Metadata, RAID1: total=6.00GiB, used=5.11GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Label: 'external'  uuid: 8bf13621-01f0-4f09-95c7-2c157d3087d0
> Total devices 1 FS bytes used 3.65TiB
> devid1 size 5.46TiB used 3.66TiB path
> /dev/mapper/luks-3c196e96-d46c-4a9c-9583-b79c707678fc
> Data, single: total=3.64TiB, used=3.64TiB
> System, DUP: total=32.00MiB, used=448.00KiB
> Metadata, DUP: total=11.00GiB, used=9.72GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> The following automatic backup scheme is in place:
> hourly:
> btrfs sub snap -r online/root online/root.
> 
> daily:
> btrfs sub snap -r online/root online/root.
> btrfs send -c online/root.
> online/root. | btrfs receive offline
> btrfs sub del -c online/root.
> 
> monthly:
> btrfs sub snap -r online/root online/root.
> btrfs send -c online/root.
> online/root. | btrfs receive external
> btrfs sub del -c online/root.
> 
> Now here are the commands leading up to my problem:
> After the online filesystem suddenly went ro, and btrfs check showed
> massive problems, I decided to start the online array from scratch:
> 1: mkfs.btrfs -f -d raid0 -m raid0 -L "online" /dev/mapper/online0
> /dev/mapper/online1
> 
> As you can see from the backup commands above, the snapshots of
> offline and external are not related, so in order to at least keep the
> extensive backlog of the external snapshot set (including all
> reflinks), I decided to restore the latest snapshot from external.
> 2: btrfs send external/root. | btrfs receive online
> 
> I wanted to ensure I can restart the incremental backup flow from
> online to external, so I did this
> 3: mv online/root. online/root
> 4: btrfs sub snap -r online/root online/root.
> 5: btrfs property set online/root ro false
> 
> Now, I naively expected a simple restart of my automatic backups for
> external should work.
> However after running
> 6: btrfs sub snap -r online/root online/root.
> 7: btrfs send -c online/root.
> online/root. | btrfs receive external

You just recreated your "online" filesystem from scratch. Where does
"old_external_reference" come from? You did not show the steps used to
create it.

> I see the following error:
> ERROR: unlink root/.ssh/agent-diablo-_dev_pts_3 failed. No such file
> or directory
> 
> Which is unfortunate, but the second problem actually encouraged me to
> post this message.
> As planned, I had to start the offline array from scratch as well,
> because I no longer had any reference snapshot for incremental backups
> on other devices:
> 8: mkfs.btrfs -f -d raid1 -m raid1 -L "offline" /dev/mapper/offline0
> /dev/mapper/offline1
> 
> However restarting the automatic daily backup flow bails out with a
> similar error, although no potentially problematic previous
> incremental snapshots should be involved here!
> ERROR: unlink o925031-987-0/2139527549 failed. No such file or directory
> 

Again - before you can *re*start an incremental-forever sequence you need
an initial full copy. How exactly did you restart it if no snapshots exist
either on the source or on the destination?
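
I.e. to restart the chain after recreating the destination there has to
be one full send first; only then can -p be used again. Roughly (snapshot
names are placeholders):

# one-time full copy re-establishes a common base on the new "offline" fs
btrfs sub snap -r online/root online/root.base
btrfs send online/root.base | btrfs receive offline

# subsequent runs can go back to incremental mode
btrfs sub snap -r online/root online/root.next
btrfs send -p online/root.base online/root.next | btrfs receive offline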

> I'm a bit lost now. The only thing I could image which might be
> confusing for btrfs,
> is the residual "Received UUID" of online/root.
> after command 2.
> What's the recommended way to restore snapshots with send/receive
> without breaking subsequent incremental backups (including reflinks of
> existing backups)?
> 
> Any hints appreciated...
> 



Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
28.06.2018 12:15, Qu Wenruo пишет:
> 
> 
> On 2018年06月28日 16:16, Andrei Borzenkov wrote:
>> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>>>
>>>>
>>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>>
>>>>>
>>>>> Please get yourself clear of what other raid1 is doing.
>>>>
>>>> A drive failure, where the drive is still there when the computer reboots, 
>>>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
>>>> anything but raid 0) will recover from perfectly without raising a sweat. 
>>>> Some will rebuild the array automatically,
>>>
>>> WOW, that's black magic, at least for RAID1.
>>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>>> has datasum.
>>>
>>> Don't bother other things, just tell me how to determine which one is
>>> correct?
>>>
>>
>> When one drive fails, it is recorded in meta-data on remaining drives;
>> probably configuration generation number is increased. Next time drive
>> with older generation is not incorporated. Hardware controllers also
>> keep this information in NVRAM and so do not even depend on scanning
>> of other disks.
> 
> Yep, the only possible way to determine such case is from external info.
> 
> For device generation, it's possible to enhance btrfs, but at least we
> could start from detect and refuse to RW mount to avoid possible further
> corruption.
> But anyway, if one really cares about such case, hardware RAID
> controller seems to be the only solution as other software may have the
> same problem.
> 
> And the hardware solution looks pretty interesting, is the write to
> NVRAM 100% atomic? Even at power loss?
> 
>>
>>> The only possibility is that, the misbehaved device missed several super
>>> block update so we have a chance to detect it's out-of-date.
>>> But that's not always working.
>>>
>>
>> Why it should not work as long as any write to array is suspended
>> until superblock on remaining devices is updated?
> 
> What happens if there is no generation gap in device superblock?
> 

Well, you use "generation" in the strict btrfs sense; I use "generation"
generically. That is exactly what btrfs apparently lacks currently - some
monotonic counter that is used to record such events.

> If one device got some of its (nodatacow) data written to disk, while
> the other device doesn't get data written, and neither of them reached
> super block update, there is no difference in device superblock, thus no
> way to detect which is correct.
> 

Again, the very fact that a device failed should have triggered an update
of the superblock to record this information, which presumably should
increase some counter.

>>
>>> If you're talking about missing generation check for btrfs, that's
>>> valid, but it's far from a "major design flaw", as there are a lot of
>>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>>> (the brain-split case).
>>>
>>
>> That's different. Yes, with software-based raid there is usually no
>> way to detect outdated copy if no other copies are present. Having
>> older valid data is still very different from corrupting newer data.
> 
> While for VDI case (or any VM image file format other than raw), older
> valid data normally means corruption.
> Unless they have their own write-ahead log.
>> Some file format may detect such problem by themselves if they have
> internal checksum, but anyway, older data normally means corruption,
> especially when partial new and partial old.
>

Yes, that's true. But there is really nothing that can be done here,
even theoretically; it is hardly a reason not to do what looks possible.

> On the other hand, with data COW and csum, btrfs can ensure the whole
> filesystem update is atomic (at least for single device).
> So the title, especially the "major design flaw" can't be wrong any more.
> 
>>
>>>> others will automatically kick out the misbehaving drive.  *none* of them 
>>>> will take back the the drive with old data and start commingling that data 
>>>> with good copy.)\ This behaviour from BTRFS is completely abnormal.. and 
>>>> defeats even the most basic expectations of RAID.
>>>
>>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>>> error detection.
>>> And it's impossible to detect such case without extra help.
>>>
>>> Your expectation is completely wrong.
>>>
>>
>> Well ... somehow it is my experience as well ... :)
> 
> Acceptable, but not really apply to software based RAID1.
> 
> Thanks,
> Qu
> 
>>
>>>>
>>>> I'm not the one who has to clear his expectations here.
>>>>
>>>>
>>>
> 






Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
On Thu, Jun 28, 2018 at 11:16 AM, Andrei Borzenkov  wrote:
> On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>>
>>
>> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>>
>>>
>>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>>
>>>>
>>>> Please get yourself clear of what other raid1 is doing.
>>>
>>> A drive failure, where the drive is still there when the computer reboots, 
>>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
>>> anything but raid 0) will recover from perfectly without raising a sweat. 
>>> Some will rebuild the array automatically,
>>
>> WOW, that's black magic, at least for RAID1.
>> The whole RAID1 has no idea of which copy is correct unlike btrfs who
>> has datasum.
>>
>> Don't bother other things, just tell me how to determine which one is
>> correct?
>>
>
> When one drive fails, it is recorded in meta-data on remaining drives;
> probably configuration generation number is increased. Next time drive
> with older generation is not incorporated. Hardware controllers also
> keep this information in NVRAM and so do not even depend on scanning
> of other disks.
>
>> The only possibility is that, the misbehaved device missed several super
>> block update so we have a chance to detect it's out-of-date.
>> But that's not always working.
>>
>
> Why it should not work as long as any write to array is suspended
> until superblock on remaining devices is updated?
>
>> If you're talking about missing generation check for btrfs, that's
>> valid, but it's far from a "major design flaw", as there are a lot of
>> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
>> (the brain-split case).
>>
>
> That's different. Yes, with software-based raid there is usually no
> way to detect outdated copy if no other copies are present. Having
> older valid data is still very different from corrupting newer data.
>
>>> others will automatically kick out the misbehaving drive.  *none* of them 
>>> will take back the the drive with old data and start commingling that data 
>>> with good copy.)\ This behaviour from BTRFS is completely abnormal.. and 
>>> defeats even the most basic expectations of RAID.
>>
>> RAID1 can only tolerate 1 missing device, it has nothing to do with
>> error detection.
>> And it's impossible to detect such case without extra help.
>>
>> Your expectation is completely wrong.
>>
>
> Well ... somehow it is my experience as well ... :)

s/experience/expectation/

sorry.

>
>>>
>>> I'm not the one who has to clear his expectations here.
>>>
>>>
>>


Re: Major design flaw with BTRFS Raid, temporary device drop will corrupt nodatacow files

2018-06-28 Thread Andrei Borzenkov
On Thu, Jun 28, 2018 at 8:39 AM, Qu Wenruo  wrote:
>
>
> On 2018年06月28日 11:14, r...@georgianit.com wrote:
>>
>>
>> On Wed, Jun 27, 2018, at 10:55 PM, Qu Wenruo wrote:
>>
>>>
>>> Please get yourself clear of what other raid1 is doing.
>>
>> A drive failure, where the drive is still there when the computer reboots, 
>> is a situation that *any* raid 1, (or for that matter, raid 5, raid 6, 
>> anything but raid 0) will recover from perfectly without raising a sweat. 
>> Some will rebuild the array automatically,
>
> WOW, that's black magic, at least for RAID1.
> The whole RAID1 has no idea of which copy is correct unlike btrfs who
> has datasum.
>
> Don't bother other things, just tell me how to determine which one is
> correct?
>

When one drive fails, it is recorded in the metadata on the remaining
drives; probably a configuration generation number is increased. Next time
the drive with the older generation is not incorporated. Hardware
controllers also keep this information in NVRAM and so do not even depend
on scanning the other disks.
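
md, for example, exposes exactly this as an event counter in the member
superblocks and will not silently take a stale member back; compare
(device names are examples):

mdadm --examine /dev/sda1 /dev/sdb1 | grep -E 'Events|Update Time'
# a member that missed writes shows an older Events value and stays out
# of the array until it is re-added and resynced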

> The only possibility is that, the misbehaved device missed several super
> block update so we have a chance to detect it's out-of-date.
> But that's not always working.
>

Why should it not work, as long as any write to the array is suspended
until the superblock on the remaining devices is updated?

> If you're talking about missing generation check for btrfs, that's
> valid, but it's far from a "major design flaw", as there are a lot of
> cases where other RAID1 (mdraid or LVM mirrored) can also be affected
> (the brain-split case).
>

That's different. Yes, with software-based raid there is usually no
way to detect outdated copy if no other copies are present. Having
older valid data is still very different from corrupting newer data.

>> others will automatically kick out the misbehaving drive.  *none* of them 
>> will take back the the drive with old data and start commingling that data 
>> with good copy.)\ This behaviour from BTRFS is completely abnormal.. and 
>> defeats even the most basic expectations of RAID.
>
> RAID1 can only tolerate 1 missing device, it has nothing to do with
> error detection.
> And it's impossible to detect such case without extra help.
>
> Your expectation is completely wrong.
>

Well ... somehow it is my experience as well ... :)

>>
>> I'm not the one who has to clear his expectations here.
>>
>>
>


Re: in which directions does btrfs send -p | btrfs receive work

2018-06-07 Thread Andrei Borzenkov
07.06.2018 05:50, Christoph Anton Mitterer пишет:
> Hey.
> 
> Just wondered about the following:
> 
> When I have a btrfs which acts as a master and from which I make copies
>  of snapshots on it via send/receive (with using -p at send) to other
> btrfs which acts as copies like this:
> master +--> copy1
>        +--> copy2
>        \--> copy3
> and if now e.g. the device of master breaks, can I move *with
> incremential send -p / receive backups from one of the copies?
> 
> Which of the following two would work (or both?):
> 
> A) Redesignating a copy to be a new master, e.g.:
>    old-copy1/new-master +--> new-disk/new-copy1
>

What is old-copy1? You never told how it was created.


>                         +--> copy2
>                         \--> copy3
>    Obviously at least send/receiving to new-copy1 should work, but would
>    that work as well to copy2/copy3 (with -p), since they're based on
>    (and probably using UUIDs) from the snapshot on the old broken master?
> 
> B) Let a new device be the master and move on from that (kinda creating
>    a "send/receive cycle":
>    1st:
>    copy1 +--> new-disk/new-master
>
>    from then on (when new snapshots should be incrementally sent):
>    new-master +--> copy1
>               +--> copy2
>               \--> copy3

It is again not clear - do you want to overwrite the existing copy1, copy2,
copy3? Do these copyN have any relation to the copyN at the beginning of
your message? Or do you just want to start a new incremental replication
from scratch?

>    Again, not sure whether send/receiving to copy2/3 would work, since
>    they're based on snapshots/parents from the old broken master.
>    And I'm even more unsure, whether this back send/receiving,
>    from copy1->new-master->copy1 would work.
> 
> 
> Any expert having some definite idea? :-)
> 
> Thanks,
> Chris.
> 



Re: csum failed root raveled during balance

2018-05-26 Thread Andrei Borzenkov
23.05.2018 09:32, Nikolay Borisov пишет:
> 
> 
> On 22.05.2018 23:05, ein wrote:
>> Hello devs,
>>
>> I tested BTRFS in production for about a month:
>>
>> 21:08:17 up 34 days,  2:21,  3 users,  load average: 0.06, 0.02, 0.00
>>
>> Without power blackout, hardware failure, SSD's SMART is flawless etc.
>> The tests ended with:
>>
>> root@node0:~# dmesg | grep BTRFS | grep warn
>> 185:980:[2927472.393557] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 186:981:[2927472.394158] BTRFS warning (device dm-0): csum failed root
>> -9 ino 312 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 191:986:[2928224.169814] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 192:987:[2928224.171433] BTRFS warning (device dm-0): csum failed root
>> -9 ino 314 off 608284672 csum 0x7da1b152 expected csum 0x3163a9b7 mirror 1
>> 206:1001:[2928298.039516] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 207:1002:[2928298.043103] BTRFS warning (device dm-0): csum failed root
>> -9 ino 319 off 608284672 csum 0x7d03a376 expected csum 0x3163a9b7 mirror 1
>> 208:1004:[2932213.513424] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 4564959232 csum 0xc616afb4 expected csum 0x5425e489
>> mirror 1
>> 209:1005:[2932235.666368] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219962 off 16989835264 csum 0xd63ed5da expected csum 0x7429caa1
>> mirror 1
>> 210:1072:[2936767.229277] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>> 211:1073:[2936767.276229] BTRFS warning (device dm-0): csum failed root
>> 5 ino 219915 off 82318458880 csum 0x83614341 expected csum 0x0b8706f8
>> mirror 1
>>
>> Above has been revealed during below command and quite high IO usage by
>> few VMs (Linux on top Ext4 with firebird database, lots of random
>> read/writes, two others with Windows 2016 and Windows Update in the
>> background):
> 
> I believe you are hitting the issue described here:
> 
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg25656.html
> 
> Essentially the way qemu operates on vm images atop btrfs is prone to
> producing such errors. As a matter of fact, other filesystems also
> suffer from this(i.e pages modified while being written, however due to
> lack of CRC on the data they don't detect it). Can you confirm that
> those inodes (312/314/319/219962/219915) belong to vm images files?
> 
> IMHO the best course of action would be to disable checksumming for you
> vm files.
> 
> 
> For some background I suggest you read the following LWN articles:
> 
> https://lwn.net/Articles/486311/
> https://lwn.net/Articles/442355/
> 

Hmm ... according to these articles, "pages under writeback are marked
as not being writable; any process attempting to write to such a page
will block until the writeback completes". And it says this feature is
available since 3.0 and btrfs has it. So how come it still happens?
Were the stable pages patches removed since then?
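
For the record, the usual way to disable checksumming for VM images is
nodatacow (which implies no csum). It only takes effect for newly created
files, so it is typically set on the directory before the images are
created; paths below are examples:

chattr +C /var/lib/libvirt/images
# new images created under this directory inherit nodatacow / no checksums
qemu-img create -f raw /var/lib/libvirt/images/vm.img 50G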


Re: RAID56 - 6 parity raid

2018-05-03 Thread Andrei Borzenkov
On Wed, May 2, 2018 at 10:29 PM, Austin S. Hemmelgarn
 wrote:
...
>
> Assume you have a BTRFS raid5 volume consisting of 6 8TB disks (which gives
> you 40TB of usable space).  You're storing roughly 20TB of data on it, using
> a 16kB block size, and it sees about 1GB of writes a day, with no partial
> stripe writes.  You, for reasons of argument, want to scrub it every week,
> because the data in question matters a lot to you.
>
> With a decent CPU, lets say you can compute 1.5GB/s worth of checksums, and
> can compute the parity at a rate of 1.25G/s (the ratio here is about the
> average across the almost 50 systems I have quick access to check, including
> a number of server and workstation systems less than a year old, though the
> numbers themselves are artificially low to accentuate the point here).
>
> At this rate, scrubbing by computing parity requires processing:
>
> * Checksums for 20TB of data, at a rate of 1.5GB/s, which would take 13333
> seconds, or 222 minutes, or about 3.7 hours.
> * Parity for 20TB of data, at a rate of 1.25GB/s, which would take 16000
> seconds, or 267 minutes, or roughly 4.4 hours.
>
> So, over a week, you would be spending 8.1 hours processing data solely for
> data integrity, or roughly 4.8214% of your time.
>
> Now assume instead that you're doing checksummed parity:
>
> * Scrubbing data is the same, 3.7 hours.
> * Scrubbing parity turns into computing checksums for 4TB of data, which
> would take 3200 seconds, or 53 minutes, or roughly 0.88 hours.

Scrubbing must compute the parity and compare it with the stored value to
detect the write hole. Otherwise you end up with parity having a good
checksum but not matching the rest of the data.
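
Spelled out for plain RAID5 (single parity):

P = D1 xor D2 xor ... xor Dn-1

A good csum(P) only proves that P was intact when P and its checksum were
written. If the data stripes were rewritten afterwards and a crash
prevented the parity update, csum(P) still verifies while P no longer
equals the xor of the data - which is exactly the write hole. Only
recomputing the xor and comparing it with the stored P catches that.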


Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-10 Thread Andrei Borzenkov
09.03.2018 19:43, Austin S. Hemmelgarn пишет:
> 
> If the answer to either one or two is no but the answer to three is yes,
> pull out the failed disk, put in a new one, mount the volume degraded,
> and use `btrfs replace` as well (you will need to specify the device ID
> for the now missing failed disk, which you can find by calling `btrfs
> filesystem show` on the volume).

I do not see it, and I do not remember ever seeing the device ID of
missing devices.

10:/home/bor # blkid
/dev/sda1: UUID="ce0caa57-7140-4374-8534-3443d21f3edc" TYPE="swap"
PARTUUID="d2714b67-01"
/dev/sda2: UUID="cc072e56-f671-4388-a4a0-2ffee7c98fdb"
UUID_SUB="eaeb4c78-da94-43b3-acc7-c3e963f1108d" TYPE="btrfs"
PTTYPE="dos" PARTUUID="d2714b67-02"
/dev/sdb1: UUID="e4af8f3c-8307-4397-90e3-97b90989cf5d"
UUID_SUB="f421f1e7-2bb0-4a67-a18e-cfcbd63560a8" TYPE="btrfs"
PARTUUID="875525bf-01"
10:/home/bor # mount /dev/sdb1 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error.
10:/home/bor # mount -o degraded /dev/sdb1 /mnt
10:/home/bor # btrfs fi sh /mnt
Label: none  uuid: e4af8f3c-8307-4397-90e3-97b90989cf5d
Total devices 2 FS bytes used 256.00KiB
devid2 size 1023.00MiB used 212.50MiB path /dev/sdb1
*** Some devices missing

10:/home/bor # btrfs fi us /mnt
Overall:
    Device size:                   2.00GiB
    Device allocated:            425.00MiB
    Device unallocated:            1.58GiB
    Device missing:             1023.00MiB
    Used:                        512.00KiB
    Free (estimated):            912.62MiB      (min: 912.62MiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:               16.00MiB      (used: 0.00B)

Data,RAID1: Size:102.25MiB, Used:128.00KiB
   /dev/sdb1 102.25MiB
   missing   102.25MiB

Metadata,RAID1: Size:102.25MiB, Used:112.00KiB
   /dev/sdb1 102.25MiB
   missing   102.25MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/sdb1   8.00MiB
   missing 8.00MiB

Unallocated:
   /dev/sdb1 810.50MiB
   missing   810.50MiB
10:/home/bor # rpm -q btrfsprogs
btrfsprogs-4.15-2.1.x86_64
10:/home/bor # uname -a
Linux 10 4.15.7-1-default #1 SMP PREEMPT Wed Feb 28 12:40:23 UTC 2018
(a36e160) x86_64 x86_64 x86_64 GNU/Linux
10:/home/bor #



And "missing" is not the answer because I obviously may have more than
one missing device.


Re: Change of Ownership of the filesystem content when cloning a volume

2018-03-09 Thread Andrei Borzenkov
10.03.2018 02:13, Saravanan Shanmugham (sarvi) пишет:
> 
> Netapp’s storage system, has the concept of snapshot/clones.
> And when I create a clone from a snapshot, I can give/change ownership of 
> entire tree in the volume to a different userid.

You are probably mistaken. NetApp FlexClone (which you probably mean)
does not have any way to change the volume content. Of course you can then
mount this clone and do whatever you like from the host, but that is
completely unrelated to NetApp itself and can just as well be done with a
btrfs subvolume.
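
The btrfs equivalent of "clone it and hand it to a developer" is simply
(paths and user name made up):

# snapshot the nightly build subvolume, then give the clone away
btrfs subvolume snapshot /srv/builds/nightly /srv/work/alice
chown -R alice:alice /srv/work/alice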

> 
> Is something like that possible in BTRFS?
> 
> We are looking to use CopyOnWrite to snapshot nightly build workspace and 
> clone as developer workspaces to avoid building from scratch for developers,
> And move directly for incremental builds.
> For this we would like the clone workspace/volume to be instantly owned by 
> the developer cloning the workspace.
> 
> Thanks,
> Sarvi
> -
> Occam's Razor Rules
> 
> 
> 



Re: Inconsistence between sender and receiver

2018-03-08 Thread Andrei Borzenkov
09.03.2018 08:38, Liu Bo пишет:
> On Thu, Mar 08, 2018 at 09:15:50AM +0300, Andrei Borzenkov wrote:
>> 07.03.2018 21:49, Liu Bo пишет:
>>> Hi,
>>>
>>> In the following steps[1], if  on receiver side has got
>>> changed via 'btrfs property set', then after doing incremental
>>> updates, receiver gets a different snapshot from what sender has sent.
>>>
>>> The reason behind it is that there is no change about file 'foo' in
>>> the send stream, such that receiver simply creates a snapshot of
>>>  on its side with nothing to apply from the send stream.
>>>
>>> A possible way to avoid this is to check rtransid and ctranid of
>>>  on receiver side, but I'm not very sure whether the current
>>> behavior is made deliberately, does anyone have an idea? 
>>>
>>> Thanks,
>>>
>>> -liubo
>>>
>>> [1]:
>>> $ btrfs sub create /mnt/send/sub
>>> $ touch /mnt/send/sub/foo
>>> $ btrfs sub snap -r /mnt/send/sub /mnt/send/parent
>>>
>>> # send parent out
>>> $ btrfs send /mnt/send/parent | btrfs receive /mnt/recv/
>>>
>>> # change parent and file under it
>>> $ btrfs property set -t subvol /mnt/recv/parent ro false
>>
>> Is removing the ability to modify read-only property an option? What are
>> use cases for it? What can it do that "btrfs sub snap read-only
>> writable" cannot?
>>
> 
> Tbh, I don't know any usecase for that, I just wanted to toggle off
> the ro bit in order to change something inside .
> 
>>> $ truncate -s 4096 /mnt/recv/parent/foo
>>>
>>> $ btrfs sub snap -r /mnt/send/sub /mnt/send/update
>>> $ btrfs send -p /mnt/send/parent /mnt/send/update | btrfs receive /mnt/recv
>>>
>>
>> This should fail right away because /mnt/send/parent is not read-only.
>> If it does not, this is really a bug.
>>
> 
> It's not '/mnt/send/parent', but '/mnt/recv/parent' got changed.
> 

Ah, sorry.

What happened to this patch which clears received_uuid when ro is
flipped off?

https://patchwork.kernel.org/patch/9986521/


Re: How to change/fix 'Received UUID'

2018-03-08 Thread Andrei Borzenkov
08.03.2018 19:02, Marc MERLIN пишет:
> On Thu, Mar 08, 2018 at 09:34:45AM +0300, Andrei Borzenkov wrote:
>> 08.03.2018 09:06, Marc MERLIN пишет:
>>> On Tue, Mar 06, 2018 at 12:02:47PM -0800, Marc MERLIN wrote:
>>>>> https://github.com/knorrie/python-btrfs/commit/1ace623f95300ecf581b1182780fd6432a46b24d
>>>>
>>>> Well, I had never heard about it until now, thank you.
>>>>
>>>> I'll see if I can make it work when I get a bit of time.
>>>
>>> Sorry, I missed the fact that there was no code to write at all.
>>> gargamel:/var/local/src/python-btrfs/examples# ./set_received_uuid.py 
>>> 2afc7a5e-107f-d54b-8929-197b80b70828 31337 1234.5678 
>>> /mnt/btrfs_bigbackup/DS1/Video_ro.20180220_21:03:41
>>> Current subvolume information:
>>>   subvol_id: 94887
>>>   received_uuid: ----
>>>   stime: 0.0 (1970-01-01T00:00:00)
>>>   stransid: 0  
>>>   rtime: 0.0 (1970-01-01T00:00:00)
>>>   rtransid: 0  
>>>
>>> Setting received subvolume...
>>>
>>> Resulting subvolume information:
>>>   subvol_id: 94887
>>>   received_uuid: 2afc7a5e-107f-d54b-8929-197b80b70828
>>>   stime: 1234.5678 (1970-01-01T00:20:34.567800)
>>>   stransid: 31337
>>>   rtime: 1520488877.415709329 (2018-03-08T06:01:17.415709)
>>>   rtransid: 255755
>>>
>>> gargamel:/var/local/src/python-btrfs/examples# btrfs property set -ts 
>>> /mnt/btrfs_bigbackup/DS1/Video_ro.20180220_21:03:41 ro true
>>>
>>>
>>> ABORT: btrfs send -p /mnt/btrfs_pool1/Video_ro.20180205_21:05:15 
>>> Video_ro.20180307_22:03:03 |  btrfs receive /mnt/btrfs_bigbackup/DS1//. 
>>> failed
>>> At subvol Video_ro.20180307_22:03:03
>>> At snapshot Video_ro.20180307_22:03:03
>>> ERROR: cannot find parent subvolume
>>>
>>> gargamel:/mnt/btrfs_pool1# btrfs subvolume show 
>>> /mnt/btrfs_pool1/Video_ro.20180220_21\:03\:41/
>>> Video_ro.20180220_21:03:41
>>
>> Not sure I understand how this subvolume is related. You send
>> differences between Video_ro.20180205_21:05:15 and
>> Video_ro.20180307_22:03:03, so you need to have (replica of)
>> Video_ro.20180205_21:05:15 on destination. How exactly
>> Video_ro.20180220_21:03:41 comes in picture here?
>  
> Sorry, I pasted the wrong thing.
> ABORT: btrfs send -p /mnt/btrfs_pool1/Video_ro.20180220_21:03:41 
> Video_ro.20180308_07:50:06 |  btrfs receive /mnt/btrfs_bigbackup/DS1//. failed
> At subvol Video_ro.20180308_07:50:06
> At snapshot Video_ro.20180308_07:50:06
> ERROR: cannot find parent subvolume
> 
> Same problem basically, just copied the wrong attempt, sorry about that.
> 
> Do I need to make sure of more than
> DS1/Video_ro.20180220_21:03:41
> Received UUID:  2afc7a5e-107f-d54b-8929-197b80b70828
> 
> be equal to
> Name:   Video_ro.20180220_21:03:41
> UUID:   2afc7a5e-107f-d54b-8929-197b80b70828
> 

Yes. Your source has a Received UUID. In this case btrfs send will
transmit the received UUID instead of the subvolume UUID as the reference
to the base snapshot. You need to either clear the received UUID on the
source or set the received UUID on the destination to the received UUID
of the source (not to the subvolume UUID of the source).
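
For the second option, a minimal sketch using the set_received_uuid.py
example from python-btrfs (the stransid/stime arguments are arbitrary
placeholders, just as in your transcript):

SRC=/mnt/btrfs_pool1/Video_ro.20180220_21:03:41
DST=/mnt/btrfs_bigbackup/DS1/Video_ro.20180220_21:03:41
# take the *Received UUID* of the source, not its own subvolume UUID
RUUID=$(btrfs subvolume show "$SRC" | awk '/Received UUID/ {print $3}')
./set_received_uuid.py "$RUUID" 31337 1234.5678 "$DST"
# the receive parent must be read-only again before the incremental receive
btrfs property set -ts "$DST" ro true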

> Thanks,
> Marc
> 
> 
>>> Name:   Video_ro.20180220_21:03:41
>>> UUID:   2afc7a5e-107f-d54b-8929-197b80b70828
>>> Parent UUID:e5ec5c1e-6b49-084e-8820-5a8cfaa1b089
>>> Received UUID:  0e220a4f-6426-4745-8399-0da0084f8b23
I mean this:

>>> Creation time:  2018-02-20 21:03:42 -0800
>>> Subvolume ID:   11228
>>> Generation: 4174
>>> Gen at creation:4150
>>> Parent ID:  5
>>> Top level ID:   5
>>> Flags:  readonly
>>> Snapshot(s):
>>> Video_rw.20180220_21:03:41
>>> Video
>>>
>>>
>>> Wasn't I supposed to set 2afc7a5e-107f-d54b-8929-197b80b70828 onto the 
>>> destination?
>>>
>>> Doesn't that look ok now? Is there something else I'm missing?
>>> gargamel:/mnt/btrfs_pool1# btrfs subvolume show 
>>> /mnt/btrfs_bigbackup/DS1/Video_ro.20180220_21:03:41
>>> DS1/Video_ro.20180220_21:03:41
>>> Name:   Video_ro.20180220_21:03:

Re: How to change/fix 'Received UUID'

2018-03-07 Thread Andrei Borzenkov
08.03.2018 09:06, Marc MERLIN пишет:
> On Tue, Mar 06, 2018 at 12:02:47PM -0800, Marc MERLIN wrote:
>>> https://github.com/knorrie/python-btrfs/commit/1ace623f95300ecf581b1182780fd6432a46b24d
>>
>> Well, I had never heard about it until now, thank you.
>>
>> I'll see if I can make it work when I get a bit of time.
> 
> Sorry, I missed the fact that there was no code to write at all.
> gargamel:/var/local/src/python-btrfs/examples# ./set_received_uuid.py 
> 2afc7a5e-107f-d54b-8929-197b80b70828 31337 1234.5678 
> /mnt/btrfs_bigbackup/DS1/Video_ro.20180220_21:03:41
> Current subvolume information:
>   subvol_id: 94887
>   received_uuid: ----
>   stime: 0.0 (1970-01-01T00:00:00)
>   stransid: 0  
>   rtime: 0.0 (1970-01-01T00:00:00)
>   rtransid: 0  
> 
> Setting received subvolume...
> 
> Resulting subvolume information:
>   subvol_id: 94887
>   received_uuid: 2afc7a5e-107f-d54b-8929-197b80b70828
>   stime: 1234.5678 (1970-01-01T00:20:34.567800)
>   stransid: 31337
>   rtime: 1520488877.415709329 (2018-03-08T06:01:17.415709)
>   rtransid: 255755
> 
> gargamel:/var/local/src/python-btrfs/examples# btrfs property set -ts 
> /mnt/btrfs_bigbackup/DS1/Video_ro.20180220_21:03:41 ro true
> 
> 
> ABORT: btrfs send -p /mnt/btrfs_pool1/Video_ro.20180205_21:05:15 
> Video_ro.20180307_22:03:03 |  btrfs receive /mnt/btrfs_bigbackup/DS1//. failed
> At subvol Video_ro.20180307_22:03:03
> At snapshot Video_ro.20180307_22:03:03
> ERROR: cannot find parent subvolume
> 
> gargamel:/mnt/btrfs_pool1# btrfs subvolume show 
> /mnt/btrfs_pool1/Video_ro.20180220_21\:03\:41/
> Video_ro.20180220_21:03:41

Not sure I understand how this subvolume is related. You send the
differences between Video_ro.20180205_21:05:15 and
Video_ro.20180307_22:03:03, so you need to have (a replica of)
Video_ro.20180205_21:05:15 on the destination. How exactly does
Video_ro.20180220_21:03:41 come into the picture here?

> Name:   Video_ro.20180220_21:03:41
> UUID:   2afc7a5e-107f-d54b-8929-197b80b70828
> Parent UUID:e5ec5c1e-6b49-084e-8820-5a8cfaa1b089
> Received UUID:  0e220a4f-6426-4745-8399-0da0084f8b23> 
> Creation time:  2018-02-20 21:03:42 -0800
> Subvolume ID:   11228
> Generation: 4174
> Gen at creation:4150
> Parent ID:  5
> Top level ID:   5
> Flags:  readonly
> Snapshot(s):
> Video_rw.20180220_21:03:41
> Video
> 
> 
> Wasn't I supposed to set 2afc7a5e-107f-d54b-8929-197b80b70828 onto the 
> destination?
> 
> Doesn't that look ok now? Is there something else I'm missing?
> gargamel:/mnt/btrfs_pool1# btrfs subvolume show 
> /mnt/btrfs_bigbackup/DS1/Video_ro.20180220_21:03:41
> DS1/Video_ro.20180220_21:03:41
> Name:   Video_ro.20180220_21:03:41
> UUID:   cb4f343c-5e79-7f49-adf0-7ce0b29f23b3
> Parent UUID:0e220a4f-6426-4745-8399-0da0084f8b23
> Received UUID:  2afc7a5e-107f-d54b-8929-197b80b70828
> Creation time:  2018-02-20 21:13:36 -0800
> Subvolume ID:   94887
> Generation: 250689
> Gen at creation:250689
> Parent ID:  89160
> Top level ID:   89160
> Flags:  readonly
> Snapshot(s):
> 
> Thanks,
> Marc
> 



Re: Inconsistence between sender and receiver

2018-03-07 Thread Andrei Borzenkov
07.03.2018 21:49, Liu Bo пишет:
> Hi,
> 
> In the following steps[1], if  on receiver side has got
> changed via 'btrfs property set', then after doing incremental
> updates, receiver gets a different snapshot from what sender has sent.
> 
> The reason behind it is that there is no change about file 'foo' in
> the send stream, such that receiver simply creates a snapshot of
>  on its side with nothing to apply from the send stream.
> 
> A possible way to avoid this is to check rtransid and ctranid of
>  on receiver side, but I'm not very sure whether the current
> behavior is made deliberately, does anyone have an idea? 
> 
> Thanks,
> 
> -liubo
> 
> [1]:
> $ btrfs sub create /mnt/send/sub
> $ touch /mnt/send/sub/foo
> $ btrfs sub snap -r /mnt/send/sub /mnt/send/parent
> 
> # send parent out
> $ btrfs send /mnt/send/parent | btrfs receive /mnt/recv/
> 
> # change parent and file under it
> $ btrfs property set -t subvol /mnt/recv/parent ro false

Is removing the ability to modify read-only property an option? What are
use cases for it? What can it do that "btrfs sub snap read-only
writable" cannot?

> $ truncate -s 4096 /mnt/recv/parent/foo
> 
> $ btrfs sub snap -r /mnt/send/sub /mnt/send/update
> $ btrfs send -p /mnt/send/parent /mnt/send/update | btrfs receive /mnt/recv
> 

This should fail right away because /mnt/send/parent is not read-only.
If it does not, this is really a bug.

Of course one may go one step further and set /mnt/send/parent read-only
again, then we get this issue.
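
For illustration, the two behaviours can be checked on a scratch
filesystem like this (a sketch; exact error messages may differ between
progs versions):

# while the parent is writable, the incremental send is expected to be refused
btrfs property set -t subvol /mnt/send/parent ro false
btrfs send -p /mnt/send/parent /mnt/send/update > /dev/null

# once it is read-only again, send proceeds and simply assumes both sides still
# hold identical copies of 'parent' - which is exactly the inconsistency above
btrfs property set -t subvol /mnt/send/parent ro true
btrfs send -p /mnt/send/parent /mnt/send/update | btrfs receive /mnt/recv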

> $ ls -l /mnt/send/update
> total 0
> -rw-r--r-- 1 root root 0 Mar  6 11:13 foo
> 
> $ ls -l /mnt/recv/update
> total 0
> -rw-r--r-- 1 root root 4096 Mar  6 11:14 foo
> 
> 



Re: How to change/fix 'Received UUID'

2018-03-05 Thread Andrei Borzenkov
05.03.2018 19:16, Marc MERLIN пишет:
> Howdy,
> 
> I did a bunch of copies and moving around subvolumes between disks and
> at some point, I did a snapshot dir1/Win_ro.20180205_21:18:31 
> dir2/Win_ro.20180205_21:18:31
> 
> As a result, I lost the ro flag, and apparently 'Received UUID' which is
> now preventing me from restarting the btrfs send/receive.
> 
> I changed the snapshot back to 'ro' but that's not enough:
> 
> Source:
> Name:   Win_ro.20180205_21:18:31
> UUID:   23ccf2bd-f494-e348-b34e-1f28486b2540
> Parent UUID:-
> Received UUID:  3cc327e1-358f-284e-92e2-4e4fde92b16f
> Creation time:  2018-02-15 20:14:42 -0800
> Subvolume ID:   964
> Generation: 4062
> Gen at creation:459
> Parent ID:  5
> Top level ID:   5
> Flags:  readonly
> 
> Dest:
> Name:   Win_ro.20180205_21:18:31
> UUID:   a1e8777c-c52b-af4e-9ce2-45ca4d4d2df8
> Parent UUID:-
> Received UUID:  -
> Creation time:  2018-02-17 22:20:25 -0800
> Subvolume ID:   94826
> Generation: 250714
> Gen at creation:250540
> Parent ID:  89160
> Top level ID:   89160
> Flags:  readonly
> 
> If I absolutely know that the data is the same on both sides, how do I
> either
> 1) force back in a 'Received UUID' value on the destination

I suppose the simplest way is to write a small program that does it using
BTRFS_IOC_SET_RECEIVED_SUBVOL.

> 2) force a btrfs receive to work despite the lack of matching 'Received
> UUID' 
> 
> Yes, I could discard and start over, but my 2nd such subvolume is 8TB,
> so I'd really rather not :)
> 
> Any ideas?
> 
> Thanks,
> Marc
> 



Re: btrfs space used issue

2018-03-01 Thread Andrei Borzenkov
On Thu, Mar 1, 2018 at 12:26 PM, vinayak hegde  wrote:
> No, there is no opened file which is deleted, I did umount and mounted
> again and reboot also.
>
> I think I am hitting the below issue, lot of random writes were
> happening and the file is not fully written and its sparse file.
> Let me try with disabling COW.
>
>
> file offset 0   offset 302g
> [-prealloced 302g extent--]
>
> (man it's impressive I got all that lined up right)
>
> On disk you have 2 things. First your file which has file extents which says
>
> inode 256, file offset 0, size 302g, offset0, disk bytenr 123, disklen 302g
>
> and then in the extent tree, who keeps track of actual allocated space has 
> this
>
> extent bytenr 123, len 302g, refs 1
>
> Now say you boot up your virt image and it writes 1 4k block to offset
> 0. Now you have this
>
> [4k][302g-4k--]
>
> And for your inode you now have this
>
> inode 256, file offset 0, size 4k, offset 0, diskebytenr (123+302g),
> disklen 4k inode 256, file offset 4k, size 302g-4k, offset 4k,
> diskbytenr 123, disklen 302g
>
> and in your extent tree you have
>
> extent bytenr 123, len 302g, refs 1
> extent bytenr whatever, len 4k, refs 1
>
> See that? Your file is still the same size, it is still 302g. If you
> cp'ed it right now it would copy 302g of information. But what you
> have actually allocated on disk? Well that's now 302g + 4k. Now lets
> say your virt thing decides to write to the middle, lets say at offset
> 12k, now you have thisinode 256, file offset 0, size 4k, offset 0,
> diskebytenr (123+302g), disklen 4k
>
> inode 256, file offset 4k, size 3k, offset 4k, diskbytenr 123, disklen 302g
>
> inode 256, file offset 12k, size 4k, offset 0, diskebytenr whatever,
> disklen 4k inode 256, file offset 16k, size 302g - 16k, offset 16k,
> diskbytenr 123, disklen 302g
>
> and in the extent tree you have this
>
> extent bytenr 123, len 302g, refs 2
> extent bytenr whatever, len 4k, refs 1
> extent bytenr notimportant, len 4k, refs 1
>
> See that refs 2 change? We split the original extent, so we have 2
> file extents pointing to the same physical extents, so we bumped the
> ref count. This will happen over and over again until we have
> completely overwritten the original extent, at which point your space
> usage will go back down to ~302g.

Sure, I just mentioned the same thing in another thread. But you said you
performed a full defragmentation, and I would expect it to "fix" this
condition by relocating the data and freeing the original big extent. If
this did not happen, I wonder what the conditions are under which
defragment decides to (not) move data.
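
One way to see whether defragment actually relocated anything is to
compare the physical extents before and after (a sketch; the file path is
a placeholder):

filefrag -v /path/to/big-file            # note the physical_offset ranges
btrfs filesystem defragment /path/to/big-file
sync
filefrag -v /path/to/big-file            # relocated data shows up at new physical offsets
btrfs filesystem df /mountpoint          # data 'used' should drop once the old extent is freed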


Re: btrfs space used issue

2018-02-28 Thread Andrei Borzenkov
On Wed, Feb 28, 2018 at 9:01 AM, vinayak hegde  wrote:
> I ran full defragement and balance both, but didnt help.

Showing the same information immediately after full defragment would be helpful.

> My created and accounting usage files are matching the du -sh output.
> But I am not getting why btrfs internals use so much extra space.
> My worry is, will get no space error earlier than I expect.
> Is it expected with btrfs internal that it will use so much extra space?
>

Did you try to reboot? A deleted but still-open file could well cause this effect.
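
For reference, a quicker check than rebooting is to look for
deleted-but-still-open files on that mount point (a sketch; adjust the path):

# unlinked files that are still held open by some process keep their space allocated
lsof +L1 /dc/fileunifier.datacache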

> Vinayak
>
>
>
>
> On Tue, Feb 27, 2018 at 7:24 PM, Austin S. Hemmelgarn
>  wrote:
>> On 2018-02-27 08:09, vinayak hegde wrote:
>>>
>>> I am using btrfs, But I am seeing du -sh and df -h showing huge size
>>> difference on ssd.
>>>
>>> mount:
>>> /dev/drbd1 on /dc/fileunifier.datacache type btrfs
>>>
>>> (rw,noatime,nodiratime,flushoncommit,discard,nospace_cache,recovery,commit=5,subvolid=5,subvol=/)
>>>
>>>
>>> du -sh /dc/fileunifier.datacache/ -  331G
>>>
>>> df -h
>>> /dev/drbd1  746G  346G  398G  47% /dc/fileunifier.datacache
>>>
>>> btrfs fi usage /dc/fileunifier.datacache/
>>> Overall:
>>>  Device size: 745.19GiB
>>>  Device allocated: 368.06GiB
>>>  Device unallocated: 377.13GiB
>>>  Device missing: 0.00B
>>>  Used: 346.73GiB
>>>  Free (estimated): 396.36GiB(min: 207.80GiB)
>>>  Data ratio:  1.00
>>>  Metadata ratio:  2.00
>>>  Global reserve: 176.00MiB(used: 0.00B)
>>>
>>> Data,single: Size:365.00GiB, Used:345.76GiB
>>> /dev/drbd1 365.00GiB
>>>
>>> Metadata,DUP: Size:1.50GiB, Used:493.23MiB
>>> /dev/drbd1   3.00GiB
>>>
>>> System,DUP: Size:32.00MiB, Used:80.00KiB
>>> /dev/drbd1  64.00MiB
>>>
>>> Unallocated:
>>> /dev/drbd1 377.13GiB
>>>
>>>
>>> Even if we consider 6G metadata its 331+6 = 337.
>>> where is 9GB used?
>>>
>>> Please explain.
>>
>> First, you're counting the metadata wrong.  The value shown per-device by
>> `btrfs filesystem usage` already accounts for replication (so it's only 3 GB
>> of metadata allocated, not 6 GB).  Neither `df` nor `du` looks at the chunk
>> level allocations though.
>>
>> Now, with that out of the way, the discrepancy almost certainly comes form
>> differences in how `df` and `du` calculate space usage.  In particular, `df`
>> calls statvfs and looks at the f_blocks and f_bfree values to compute space
>> usage, while `du` walks the filesystem tree calling stat on everything and
>> looking at st_blksize and st_blocks (or instead at st_size if you pass in
>> `--apparent-size` as an option).  This leads to a couple of differences in
>> what they will count:
>>
>> 1. `du` may or may not properly count hardlinks, sparse files, and
>> transparently compressed data, dependent on whether or not you use
>> `--apparent-sizes` (by default, it does properly count all of those), while
>> `df` will always account for those properly.
>> 2. `du` does not properly account for reflinked blocks (from deduplication,
>> snapshots, or use of the CLONE ioctl), and will count each reflink of every
>> block as part of the total size, while `df` will always count each block
>> exactly once no matter how many reflinks it has.
>> 3. `du` does not account for all of the BTRFS metadata allocations,
>> functionally ignoring space allocated for anything but inline data, while
>> `df` accounts for all BTRFS metadata properly.
>> 4. `du` will recurse into other filesystems if you don't pass the `-x`
>> option to it, while `df` will only report for each filesystem separately.
>> 5. `du` will only count data usage under the given mount point, and won't
>> account for data on other subvolumes that may be mounted elsewhere (and if
>> you pass in `-x` won't count data on other subvolumes located under the
>> given path either), while `df` will count all the data in all subvolumes.
>> 6. There are a couple of other differences too, but they're rather complex
>> and dependent on the internals of BTRFS.
>>
>> In your case, I think the issue is probably one of the various things under
>> item 6.  Items 1, 2 and 4 will cause `du` to report more space usage than
>> `df`, item 3 is irrelevant because `du` shows less space than the total data
>> chunk usage reported by `btrfs filesystem usage`, and item 5 is irrelevant
>> because you're mounting the root subvolume and not using the `-x` option on
>> `du` (and therefore there can't be other subvolumes you're missing).
>>
>> Try running a full defrag of the given mount point.  If what I think is
>> causing this is in fact the issue, that should bring the numbers back
>> in-line with each other.

Re: Btrfs occupies more space than du reports...

2018-02-28 Thread Andrei Borzenkov
On Wed, Feb 28, 2018 at 2:26 PM, Shyam Prasad N  wrote:
> Hi,
>
> Thanks for the reply.
>
>> * `df` calls `statvfs` to get it's data, which tries to count physical
>> allocation accounting for replication profiles.  In other words, data in
>> chunks with the dup, raid1, and raid10 profiles gets counted twice, data in
>> raid5 and raid6 chunks gets counted with a bit of extra space for the
>> parity, etc.
>
> We have data not using raid (single), metadata using dup, we've not
> used compression, subvols have not been created yet (other than the
> default subvol), there are no other mount points within the tree.
> Taking into account all that you're saying, the numbers don't make
> sense to me. "btrfs fi usage" tells that the data "used" is much more
> than what it should be. I agree more with what du is saying the disk
> usage is.
> I tried an experiment. Filled up the available space (as per what
> btrfs believes is available) with huge files. As soon as the usage
> reached 100%, further writes started to return ENOSPC. This is what
> I'm scared is what is going to happen when these filesystems
> eventually fill up. This would normally be the expected behaviour, but
> in many of these servers, the actual data that is being used is much
> lesser (60-70 GBs in some cases).
> To me, it looks like a btrfs internal refcounting has gone wrong.
> Maybe it's thinking that some data blocks (which are actually free)
> are in use?

One reason could be overwrites inside extents. What happens is that
btrfs does not (always) physically split an extent when it is partially
overwritten, so some space remains free but unavailable.

Filesystem 1K-blocks  Used Available Use% Mounted on
/dev/sdb18387584 16704   7531456   1% /mnt
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=100
100+0 records in
100+0 records out
104857600 bytes (105 MB, 100 MiB) copied, 0.580041 s, 181 MB/s
localhost:~ # sync
localhost:~ # df -k /mnt
Filesystem 1K-blocks   Used Available Use% Mounted on
/dev/sdb18387584 119552   7428864   2% /mnt
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=1 conv=notrunc seek=25
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00781892 s, 134 MB/s
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=1 conv=notrunc seek=50
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00780386 s, 134 MB/s
localhost:~ # dd if=/dev/urandom of=/mnt/file bs=1M count=1 conv=notrunc seek=75
1+0 records in
1+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00761908 s, 138 MB/s
localhost:~ # sync
localhost:~ # df -k /mnt
Filesystem 1K-blocks   Used Available Use% Mounted on
/dev/sdb18387584 122624   7425792   2% /mnt

So 3M is lost. And if you write 50M in the middle you will get 50M "lost" space.

I do not know how btrfs decides when to split an extent.

Defragmenting the file should free those partially overwritten extents again:

btrfs fi defrag -r /mnt

> Or some other refcounting issue?
> We've tried "btrfs check" as well as "btrfs scrub", so far. Both have
> not reported any errors.
>
> Regards,
> Shyam
>
> On Fri, Feb 23, 2018 at 6:53 PM, Austin S. Hemmelgarn
>  wrote:
>> On 2018-02-23 06:21, Shyam Prasad N wrote:
>>>
>>> Hi,
>>>
>>> Can someone explain me why there is a difference in the number of
>>> blocks reported by df and du commands below?
>>>
>>> =
>>> # df -h /dc
>>> Filesystem  Size  Used Avail Use% Mounted on
>>> /dev/drbd1  746G  519G  225G  70% /dc
>>>
>>> # btrfs filesystem df -h /dc/
>>> Data, single: total=518.01GiB, used=516.58GiB
>>> System, DUP: total=8.00MiB, used=80.00KiB
>>> Metadata, DUP: total=2.00GiB, used=1019.72MiB
>>> GlobalReserve, single: total=352.00MiB, used=0.00B
>>>
>>> # du -sh /dc
>>> 467G/dc
>>> =
>>>
>>> df shows 519G is used. While recursive check using du shows only 467G.
>>> The filesystem doesn't contain any snapshots/extra subvolumes.
>>> Neither does it contain any mounted filesystem under /dc.
>>> I also considered that it could be a void left behind by one of the
>>> open FDs held by a process. So I rebooted the system. Still no
>>> changes.
>>>
>>> The situation is even worse on a few other systems with similar
>>> configuration.
>>>
>>
>> At least part of this is a difference in how each tool computes space usage.
>>
>> * `df` calls `statvfs` to get it's data, which tries to count physical
>> allocation accounting for replication profiles.  In other words, data in
>> chunks with the dup, raid1, and raid10 profiles gets counted twice, data in
>> raid5 and raid6 chunks gets counted with a bit of extra space for the
>> parity, etc.
>>
>> * `btrfs fi df` looks directly at the filesystem itself and counts how much
>> space is available to each chunk type in the `total` values and how much
>> space is used in each chunk type in the `used` values, after replication.
>> If you add together the data used value 

Re: incremental send/receive ERROR: cannot find parent subvolume

2018-02-27 Thread Andrei Borzenkov
27.02.2018 01:54, Emil.s пишет:
> Hello,
> 
> I'm trying to restore a subvolume from a backup, but I'm failing when
> I try to setup the replication chain again.
> 
> Previously I had disk A and B, where I was sending snapshots from A to
> B using "send -c /disk_a/1 /disk_a/2 | receive /disk_b" and so on.
> Now disk A failed, so I got a new disk: C.
> 
> First I sent the last backup from B to C by running "send disk_b/2 |
> receive /disk_c/".
> 
> Then my plan was to use disk B as the new main disk, so on disk B, I
> created a new snapshot by running "snapshot -r disk_b/2
> disk_b/my_volume".
> Now I want to send "disk_b/my_volume" to disk C, so that I can set
> disk_b/2 to RW and start using it again, but the last send fails with
> "ERROR: cannot find parent subvolume".
> 
> Disk B:
> root@tillberga: /backup #> btrfs subvol sh snapshots/user_data_2018-02-17/
> snapshots/user_data_2018-02-17
> Name: user_data_2018-02-17
> UUID: 93433261-d954-9f4b-8319-d450ae079a9a
> Parent UUID: 51180286-c202-c94c-b8f9-2ecc8d2b5b7c
> Received UUID: 014fc004-ae04-0148-9525-1bf556fd4d10
> Flags: readonly
> Snapshot(s):
> user_data_test
> 
> Disk C:
> root@hedemora: /btrfs #> btrfs subvol sh user_data_2018-02-17/
> user_data_2018-02-17
> Name: user_data_2018-02-17
> UUID: 35361429-a42c-594b-b390-51ffb9725324
> Parent UUID: -
> Received UUID: 93433261-d954-9f4b-8319-d450ae079a9a
> Flags: readonly
> Snapshot(s):
> 
> Disk B has UUID 93433261-d954-9f4b-8319-d450ae079a9a, and disk C has
> "Received UUID: 93433261-d954-9f4b-8319-d450ae079a9a". I think that's
> alright?
> 
> The new snapshot on disk B looks as following:
> root@tillberga: /backup #> btrfs subvol sh user_data_test
> user_data_test
> Name: user_data_test
> UUID: bf88000c-e3a6-434b-8f69-f5ce2174227e
> Parent UUID: 93433261-d954-9f4b-8319-d450ae079a9a
> Received UUID: 014fc004-ae04-0148-9525-1bf556fd4d10
> Flags: readonly
> Snapshot(s):
> 
> But when I'm trying to send it, I'm getting the following:
> root@tillberga: /backup #> btrfs send -v -c
> snapshots/user_data_2018-02-17 user_data_test | ssh user@hedemora
> "sudo btrfs receive -v /btrfs/"
> At subvol user_data_test
> BTRFS_IOC_SEND returned 0
> joining genl thread
> receiving snapshot user_data_test
> uuid=014fc004-ae04-0148-9525-1bf556fd4d10, ctransid=52373
> parent_uuid=014fc004-ae04-0148-9525-1bf556fd4d10,
> parent_ctransid=52373
> At snapshot user_data_test
> ERROR: cannot find parent subvolume
> 
> Note that the receiver says
> "parent_uuid=014fc004-ae04-0148-9525-1bf556fd4d10". Not really sure
> where that comes from, but disk B has the same, so maybe that's the
> UUID of the original snapshot on disk A?
> 


If "Received UUID" is set on a subvolume, "btrfs send" will forward it
and set on destination unchanged, instead of computing "correct"
Received UUID. This has been discussed several times.

> Is it possible to continue to send incremental snapshots between these
> two file systems, or must I do a full sync?

You would need to create a writable clone on B first and start with this
clone; a new writable clone won't have the (misleading) Received UUID, so
you start clean.


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Andrei Borzenkov
11.02.2018 04:02, Hans van Kranenburg пишет:
...
> 
>> - /dev/sda6 / btrfs
>> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
>> 0 0
> 
> Note that changes on atime cause writes to metadata, which means cowing
> metadata blocks and unsharing them from a previous snapshot, only when
> using the filesystem, not even when changing things (!).

With relatime, atime is updated only once after a file has changed. So
your description is not entirely accurate, and things should not be that
dramatic unless files are continuously being changed.




Re: IO Error (.snapshots is not a btrfs subvolume)

2018-02-07 Thread Andrei Borzenkov
08.02.2018 06:03, Chris Murphy пишет:
> On Wed, Feb 7, 2018 at 6:26 PM, Nick Gilmour  wrote:
>> Hi all,
>>
>> I have successfully restored a snapshot of root but now when I try to

How exactly was it done?

>> make a new snapshot I get this error:
>> IO Error (.snapshots is not a btrfs subvolume).
>> My snapshots were within @ which I renamed to @_old.
>> What can I do now? How can I move the snapshots from @_old/ into @ and
>> be able to make snapshots again?
>>
>> This is an excerpt of my subvolumes list:
>>
>> # btrfs subvolume list /
>> ID 257 gen 175397 top level 5 path @_old
>> ID 258 gen 175392 top level 5 path @pkg
>> ID 260 gen 175447 top level 5 path @tmp
>> ID 262 gen 19 top level 257 path @_old/var/lib/machines
>> ID 268 gen 175441 top level 5 path @test
>> ID 291 gen 175394 top level 257 path @_old/.snapshots
>> ID 292 gen 1705 top level 291 path @_old/.snapshots/1/snapshot
>> ...
>>
>> ID 3538 gen 175398 top level 291 path @_old/.snapshots/1594/snapshot
>> ID 3540 gen 175447 top level 5 path @
>>
> 
> 
> This is a snapper behavior. It creates .snapshots as a subvolume and
> then puts snapshots into that subvolume. If you snapshot a subvolume
> that contains another subvolume, the nested subvolume is not snapshot,
> instead a plain directory placeholder is created instead. So your
> restored snapshot contains a .snapshot directory rather than a
> .snapshot subvolume. Possibly if you delete the directory and create a
> new subvolume .snapshot, the problem will be fixed.
> 

No, you should create a subvolume @/.snapshots and mount it as /.snapshots
(and have it in /etc/fstab). Snapshots should always be available in the
running system under a fixed path, and this is only possible when the
subvolume is mounted; otherwise, after a rollback, /.snapshots is lost
just like it happened now.
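
Roughly, the steps look like this (a sketch; /dev/sdXn and the @ prefix
are placeholders for your actual device and top-level layout):

mount -o subvolid=5 /dev/sdXn /mnt        # mount the real top level to reach the @ layout
btrfs subvolume create /mnt/@/.snapshots
umount /mnt
mkdir -p /.snapshots                      # mount point, if it does not exist already
echo '/dev/sdXn  /.snapshots  btrfs  subvol=@/.snapshots  0 0' >> /etc/fstab
mount /.snapshots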

The exact subvolume name probably does not matter that much, but it is
better to stick with what the installer does by default. It may matter
for grub2 snapshot handling.

Also, openSUSE expects the actual root to be a subvolume under /.snapshots
that is a valid snapper snapshot (i.e. it has valid metadata). Again, not
having this may confuse snapper.

It may be possible to move @_old/.snapshots into @/.snapshots, although
this breaks the parent-child relationships, so those old snapshots cannot
be cleaned up without removing the old root completely.

> I can't tell you how this will confuse snapper though, or how to
> unconfuse it. It pretty much expects to be in control of all
> snapshots, creation, deletion, and rollbacks. So if you do it manually
> for whatever reason, I think it can confuse snapper.
> 
> 

There was a blog post recently outlining how to restore an openSUSE root.
You may want to search the opensuse or opensuse-factory mailing lists. Ah, found it:

https://rootco.de/2018-01-19-opensuse-btrfs-subvolumes/


Re: degraded permanent mount option

2018-01-29 Thread Andrei Borzenkov
29.01.2018 14:24, Adam Borowski пишет:
...
> 
> So any event (the user's request) has already happened.  A rc system, of
> which systemd is one, knows whether we reached the "want root filesystem" or
> "want secondary filesystems" stage.  Once you're there, you can issue the
> mount() call and let the kernel do the work.
> 
>> It is a btrfs choice to not expose compound device as separate one (like
>> every other device manager does)
> 
> Btrfs is not a device manager, it's a filesystem.
> 
>> it is a btrfs drawback that doesn't provice anything else except for this
>> IOCTL with it's logic
> 
> How can it provide you with something it doesn't yet have?  If you want the
> information, call mount().  And as others in this thread have mentioned,
> what, pray tell, would you want to know "would a mount succeed?" for if you
> don't want to mount?
> 
>> it is a btrfs drawback that there is nothing to push assembling into "OK,
>> going degraded" state
> 
> The way to do so is to timeout, then retry with -o degraded.
> 

That's a possible way to solve it. It likely requires support from
mount.btrfs (or btrfs.ko) to return a proper indication that the
filesystem is incomplete, so the caller can decide whether to retry or to
try a degraded mount.

Or maybe mount.btrfs should implement this logic internally. That would
really be the simplest way to make it acceptable to the other side, by
not needing them to accept anything :)
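
In shell terms, the policy being discussed would look roughly like this
(a hypothetical sketch, not an existing mount.btrfs feature; the device
and timeout are placeholders):

DEV=/dev/sdXn
mounted=0
for i in $(seq 30); do
    btrfs device scan >/dev/null 2>&1       # register whatever member devices exist by now
    if mount "$DEV" /mnt 2>/dev/null; then
        mounted=1; break                    # all devices present, normal mount worked
    fi
    sleep 2
done
if [ "$mounted" -eq 0 ]; then
    echo "timed out waiting for all devices, trying degraded mount" >&2
    mount -o degraded "$DEV" /mnt
fi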


Re: degraded permanent mount option

2018-01-28 Thread Andrei Borzenkov
28.01.2018 18:57, Duncan пишет:
> Andrei Borzenkov posted on Sun, 28 Jan 2018 11:06:06 +0300 as excerpted:
> 
>> 27.01.2018 18:22, Duncan пишет:
>>> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
>>>
>>>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>>>
>>>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>>>> process stop on initramfs.
>>>>>>>
>>>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>>>
>>>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>>>> everything is fine from their point of view.
>>>>
>>>> It's quite obvious who's the culprit: every single remaining rc system
>>>> manages to mount degraded btrfs without problems.  They just don't try
>>>> to outsmart the kernel.
>>>
>>> No kidding.
>>>
>>> All systemd has to do is leave the mount alone that the kernel has
>>> already done,
>>
>> Are you sure you really understand the problem? No mount happens because
>> systemd waits for indication that it can mount and it never gets this
>> indication.
> 
> As Tomaz indicates, I'm talking about manual mounting (after the initr* 
> drops to a maintenance prompt if it's root being mounted, or on manual 
> mount later if it's an optional mount) here.  The kernel accepts the 
> degraded mount and it's mounted for a fraction of a second, but systemd 
> actually undoes the successful work of the kernel to mount it, so by the 
> time the prompt returns and a user can check, the filesystem is unmounted 
> again, with the only indication that it was mounted at all being the log.
> 

This has been fixed in current systemd (actually, for quite some time
now). If you still observe it with a more or less recent systemd, report
a bug.

> He says that's because the kernel still says it's not ready, but that's 
> for /normal/ mounting.  The kernel accepted the degraded mount and 
> actually mounted the filesystem, but systemd undoes that.
> 



Re: degraded permanent mount option

2018-01-28 Thread Andrei Borzenkov
27.01.2018 18:22, Duncan пишет:
> Adam Borowski posted on Sat, 27 Jan 2018 14:26:41 +0100 as excerpted:
> 
>> On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
>>> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
>>>
>>>>> I just tested to boot with a single drive (raid1 degraded), even
>>>>> with degraded option in fstab and grub, unable to boot !  The boot
>>>>> process stop on initramfs.
>>>>>
>>>>> Is there a solution to boot with systemd and degraded array ?
>>>>
>>>> No. It is finger pointing. Both btrfs and systemd developers say
>>>> everything is fine from their point of view.
>>
>> It's quite obvious who's the culprit: every single remaining rc system
>> manages to mount degraded btrfs without problems.  They just don't try
>> to outsmart the kernel.
> 
> No kidding.
> 
> All systemd has to do is leave the mount alone that the kernel has 
> already done,

Are you sure you really understand the problem? No mount happens because
systemd waits for indication that it can mount and it never gets this
indication.


Re: degraded permanent mount option

2018-01-27 Thread Andrei Borzenkov
27.01.2018 13:08, Christophe Yayon пишет:
> I just tested to boot with a single drive (raid1 degraded), even with 
> degraded option in fstab and grub, unable to boot ! The boot process stop on 
> initramfs.
> 
> Is there a solution to boot with systemd and degraded array ?

No. It is finger pointing. Both btrfs and systemd developers say
everything is fine from their point of view.

> 
> Thanks 
> 
> --
> Christophe Yayon
> 
>> On 27 Jan 2018, at 07:48, Christophe Yayon <cya...@nbux.org> wrote:
>>
>> I think you are right, i do not see any systemd message when degraded option 
>> is missing and have to remount manually with degraded.
>>
>> It seems it is better to use mdadm for raid and btrfs over it as i 
>> understand. Even in recent kernel ?
>> I hav me to do some bench and compare...
>>
>> Thanks
>>
>> --
>> Christophe Yayon
>>
>>> On 27 Jan 2018, at 07:43, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>>>
>>> 27.01.2018 09:40, Christophe Yayon пишет:
>>>> Hi, 
>>>>
>>>> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
>>>> In fstab root is mounted via UUID. As far as I know the UUID is the same
>>>> for all devices in raid array.
>>>> The system boot with no problem with degraded and only 1/2 root device.
>>>
>>> Then your initramfs does not use systemd.
>>>
>>>> --
>>>> Christophe Yayon
>>>> cyayon-l...@nbux.org
>>>>
>>>>
>>>>
>>>>>> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>>>>>> 26.01.2018 17:47, Christophe Yayon пишет:
>>>>>> Hi Austin,
>>>>>>
>>>>>> Thanks for your answer. It was my opinion too as the "degraded"
>>>>>> seems to be flagged as "Mostly OK" on btrfs wiki status page. I am
>>>>>> running Archlinux with recent kernel on all my servers (because of
>>>>>> use of btrfs as my main filesystem, i need a recent kernel).> >
>>>>>> Your idea to add a separate entry in grub.cfg with
>>>>>> rootflags=degraded is attractive, i will do this...> >
>>>>>> Just a last question, i thank that it was necessary to add
>>>>>> "degraded" option in grub.cfg AND fstab to allow boot in degraded
>>>>>> mode. I am not sure that only grub.cfg is sufficient...> > Yesterday, i 
>>>>>> have done some test and boot a a system with only 1 of
>>>>>> 2 drive in my root raid1 array. No problem with systemd,>
>>>>> Are you using systemd in your initramfs (whatever
>>>>> implementation you are> using)? I just tested with dracut using systemd 
>>>>> dracut module and it
>>>>> does not work - it hangs forever waiting for device. Of course,
>>>>> there is> no way to abort it and go into command line ...
>>>>>
>>>>> Oh, wait - what device names are you using? I'm using mount by
>>>>> UUID and> this is where the problem starts - /dev/disk/by-uuid/xxx will
>>>>> not appear> unless all devices have been seen once ...
>>>>>
>>>>> ... and it still does not work even if I change it to root=/dev/sda1
>>>>> explicitly because sda1 will *not* be announced as "present" to
>>>>> systemd> until all devices have been seen once ...
>>>>>
>>>>> So no, it does not work with systemd *in initramfs*. Absolutely.
>>>>
>>>>
>>>
>>
> 



Re: degraded permanent mount option

2018-01-26 Thread Andrei Borzenkov
27.01.2018 09:40, Christophe Yayon пишет:
> Hi, 
> 
> I am using archlinux with kernel 4.14, there is btrfs module in initrd.
> In fstab root is mounted via UUID. As far as I know the UUID is the same
> for all devices in raid array.
> The system boot with no problem with degraded and only 1/2 root device.

Then your initramfs does not use systemd.

> --
>   Christophe Yayon
>   cyayon-l...@nbux.org
> 
> 
> 
> On Sat, Jan 27, 2018, at 06:50, Andrei Borzenkov wrote:
>> 26.01.2018 17:47, Christophe Yayon пишет:
>>> Hi Austin,
>>>
>>> Thanks for your answer. It was my opinion too as the "degraded"
>>> seems to be flagged as "Mostly OK" on btrfs wiki status page. I am
>>> running Archlinux with recent kernel on all my servers (because of
>>> use of btrfs as my main filesystem, i need a recent kernel).> >
>>> Your idea to add a separate entry in grub.cfg with
>>> rootflags=degraded is attractive, i will do this...> >
>>> Just a last question, i thank that it was necessary to add
>>> "degraded" option in grub.cfg AND fstab to allow boot in degraded
>>> mode. I am not sure that only grub.cfg is sufficient...> > Yesterday, i 
>>> have done some test and boot a a system with only 1 of
>>> 2 drive in my root raid1 array. No problem with systemd,>
>> Are you using systemd in your initramfs (whatever
>> implementation you are> using)? I just tested with dracut using systemd 
>> dracut module and it
>> does not work - it hangs forever waiting for device. Of course,
>> there is> no way to abort it and go into command line ...
>>
>> Oh, wait - what device names are you using? I'm using mount by
>> UUID and> this is where the problem starts - /dev/disk/by-uuid/xxx will
>> not appear> unless all devices have been seen once ...
>>
>> ... and it still does not work even if I change it to root=/dev/sda1
>> explicitly because sda1 will *not* be announced as "present" to
>> systemd> until all devices have been seen once ...
>>
>> So no, it does not work with systemd *in initramfs*. Absolutely.
> 
> 



Re: degraded permanent mount option

2018-01-26 Thread Andrei Borzenkov
26.01.2018 17:47, Christophe Yayon пишет:
> Hi Austin,
> 
> Thanks for your answer. It was my opinion too as the "degraded" seems to be 
> flagged as "Mostly OK" on btrfs wiki status page. I am running Archlinux with 
> recent kernel on all my servers (because of use of btrfs as my main 
> filesystem, i need a recent kernel).
> 
> Your idea to add a separate entry in grub.cfg with rootflags=degraded is 
> attractive, i will do this...
> 
> Just a last question, i thank that it was necessary to add "degraded" option 
> in grub.cfg AND fstab to allow boot in degraded mode. I am not sure that only 
> grub.cfg is sufficient... 
> Yesterday, i have done some test and boot a a system with only 1 of 2 drive 
> in my root raid1 array. No problem with systemd,

Are you using systemd in your initramfs (whatever implementation you are
using)? I just tested with dracut using systemd dracut module and it
does not work - it hangs forever waiting for device. Of course, there is
no way to abort it and go into command line ...

Oh, wait - what device names are you using? I'm using mount by UUID and
this is where the problem starts - /dev/disk/by-uuid/xxx will not appear
unless all devices have been seen once ...

... and it still does not work even if I change it to root=/dev/sda1
explicitly because sda1 will *not* be announced as "present" to systemd
until all devices have been seen once ...

So no, it does not work with systemd *in initramfs*. Absolutely.
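
For reference, the readiness test that gates all of this can be
reproduced by hand (a sketch; udev's btrfs builtin performs the
equivalent check and systemd keys off its result):

btrfs device scan                 # register the devices that are actually present
btrfs device ready /dev/sda1      # exits non-zero until every member of the fs has been seen
echo $?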


Re: Recommendations for balancing as part of regular maintenance?

2018-01-16 Thread Andrei Borzenkov
On Tue, Jan 16, 2018 at 9:45 AM, Chris Murphy  wrote:
...
>>
>> Unless some better fix is in the works, this _should_ be a systemd unit or
>> something. Until then, please put it in FAQ.
>
> At least openSUSE has a systemd unit for a long time now, but last
> time I checked (a bit over a year ago) it's disabled by default. Why?
>

It is now enabled by default on Tumbleweed and hence likely on SLE/Leap 15.
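
(For anyone doing this by hand instead: the unit essentially runs a
series of usage-filtered balances, roughly like the sketch below; the
thresholds are examples, not the exact distro defaults.)

btrfs balance start -dusage=5 /
btrfs balance start -dusage=25 /
btrfs balance start -dusage=50 /
btrfs balance start -musage=5 /
btrfs balance start -musage=25 /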

> And insofar as I'm aware, openSUSE users aren't having big problems
> related to lack of balancing, they have problems due to the lack of
> balancing combined with schizo snapper defaults, which are these days
> masked somewhat by turning on quotas so snapper can be more accurate
> about cleaning up.
>

Not only that, but the snapshot policy was also made less aggressive -
now (in Tumbleweed/Leap 42.3) periodic snapshots are turned off by
default; only configuration changes via YaST and package updates via
zypper trigger snapshot creation.


Re: how to make a cache directory nodatacow while also excluded from snapshots?

2018-01-15 Thread Andrei Borzenkov
16.01.2018 00:56, Dave пишет:
> I want to exclude my ~/.cache directory from snapshots. The obvious
> way to do this is to mount a btrfs subvolume at that location.
> 
> However, I also want the ~/.cache directory to be nodatacow. Since the
> parent volume is COW, I believe it isn't possible to mount the
> subvolume with different mount options.
> 
> What's the solution for achieving both of these goals?
> 
> I tried this without success:
> 
> chattr +C ~/.cache
> 
> Since ~/.cache is a btrfs subvolume, apparently that doesn't work.
> 
> lsattr ~/.cache
> 
> returns nothing.

Try creating a file under ~/.cache and check its attributes.
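
Something like this (a sketch; +C on a directory or subvolume only takes
effect for files created after the flag was set):

chattr +C ~/.cache                # mark the directory NOCOW
touch ~/.cache/testfile           # newly created files inherit the flag
lsattr ~/.cache/testfile          # should show 'C' (No_COW)
lsattr -d ~/.cache                # -d shows the directory itself, not its (empty) contents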


Re: Unexpected raid1 behaviour

2017-12-21 Thread Andrei Borzenkov
On Wed, Dec 20, 2017 at 11:07 PM, Chris Murphy  wrote:
>
> YaST doesn't have Btrfs raid1 or raid10 options; and also won't do
> encrypted root with Btrfs either because YaST enforces LVM to do LUKS
> encryption for some weird reason; and it also enforces NOT putting
> Btrfs on LVM.
>

That's incorrect: btrfs on LVM is the default on some SLES flavors and
one of the three standard proposals (where you do not need to go into
expert mode) - normal partitions, LVM, encrypted LVM - even on openSUSE.


Re: Unexpected raid1 behaviour

2017-12-20 Thread Andrei Borzenkov
19.12.2017 22:47, Chris Murphy пишет:
> 
>>
>> BTW, doesn't SuSE use btrfs by default? Would you expect everyone using
>> this distro to research every component used?
> 
> As far as I'm aware, only Btrfs single device stuff is "supported".
> The multiple device stuff is definitely not supported on openSUSE, but
> I have no idea to what degree they support it with enterprise license,
> no doubt that support must come with caveats.
> 

I was rather surprised to see RAID1 and RAID10 listed as supported in the
SLES 12.x release notes, especially as there is no support for
multi-device btrfs in YaST and hence no way to even install onto such a
filesystem.


Re: Unexpected raid1 behaviour

2017-12-19 Thread Andrei Borzenkov
On Tue, Dec 19, 2017 at 1:28 AM, Chris Murphy  wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain  wrote:
>
>>  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
>>  caused by [1], which we should revert back, since..
>>- balance (to raid1 chunk) may fail if FS is near full
>>- recovery (to raid1 chunk) will take more writes as compared
>>  to recovery under degraded raid1 chunks
>
>
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'   The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.
>

The alternative would be to introduce a new "resilver" operation that
would allocate a second copy for every degraded chunk. And it could even
be started automatically when enough redundancy is present again.

> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.
>

ZFS manages to avoid full sync in this case quite efficiently.


Re: btrfs subvolume list for not mounted filesystem?

2017-12-18 Thread Andrei Borzenkov
18.12.2017 19:49, Ulli Horlacher пишет:
> I want to mount an alternative subvolume of a btrfs filesystem.
> I can list the subvolumes when the filesystem is mounted, but how do I
> know them, when the filesystem is not mounted? Is there a query command?
> 
> root@xerus:~# mount | grep /test
> /dev/sdd4 on /test type btrfs 
> (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
> 
> root@xerus:~# btrfs subvolume list /test
> ID 258 gen 156 top level 5 path tux/zz
> ID 259 gen 156 top level 5 path tux/z1
> ID 260 gen 156 top level 5 path tmp/zz
> ID 261 gen 19 top level 260 path tmp/zz/z1
> ID 269 gen 156 top level 5 path tux/test
> ID 271 gen 129 top level 269 path tux/test/.snapshot/2017-12-02_1341.test
> 
> root@xerus:~# umount /test
> 
> root@xerus:~# btrfs subvolume list /dev/sdd4
> ERROR: not a btrfs filesystem: /dev/sdd4
> ERROR: can't access '/dev/sdd4'
> 
> 
> 

btrfs-debug-tree -r /dev/sdd4

(or in more recent btrfs-progs, "btrfs inspect-internal dump-tree") will
show the subvolume IDs (as ROOT_ITEM items); I'm not sure if there is a
simple way to resolve subvolume path names on an unmounted device.
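
For example (a sketch; tree 1 is the root tree, and its
ROOT_REF/ROOT_BACKREF items at least carry the subvolume names, though
full paths still have to be assembled by hand; output format differs
between progs versions):

btrfs inspect-internal dump-tree -t 1 /dev/sdd4 | grep -A1 ROOT_REF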


How exclusive in parent qgroup is computed? (was: Re: exclusive subvolume space missing)

2017-12-05 Thread Andrei Borzenkov
02.12.2017 03:27, Qu Wenruo пишет:
> 
> That's the difference between how sub show and quota works.
> 
> For quota, it's per-root owner check.
> Means even a file extent is shared between different inodes, if all
> inodes are inside the same subvolume, it's counted as exclusive.
> And if any of the file extent belongs to other subvolume, then it's
> counted as shared.
> 

Could you also explain how parent qgroup computes exclusive space? I.e.

10:~ # mkfs -t btrfs -f /dev/sdb1
btrfs-progs v4.13.3
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM /dev/sdb1 (1023.00MiB) ...
Label:  (null)
UUID:   b9b0643f-a248-4667-9e69-acf5baaef05b
Node size:  16384
Sector size:4096
Filesystem size:1023.00MiB
Block group profiles:
  Data: single8.00MiB
  Metadata: DUP  51.12MiB
  System:   DUP   8.00MiB
SSD detected:   no
Incompat features:  extref, skinny-metadata
Number of devices:  1
Devices:
   IDSIZE  PATH
1  1023.00MiB  /dev/sdb1

10:~ # mount -t btrfs /dev/sdb1 /mnt
10:~ # cd /mnt
10:/mnt # btrfs quota enable .
10:/mnt # btrfs su cre sub1
Create subvolume './sub1'
10:/mnt # dd if=/dev/urandom of=sub1/file1 bs=1K count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.00833739 s, 126 MB/s
10:/mnt # dd if=/dev/urandom of=sub1/file2 bs=1K count=1024
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0179272 s, 58.5 MB/s
10:/mnt # btrfs subvolume snapshot sub1 sub2
Create a snapshot of 'sub1' in './sub2'
10:/mnt # dd if=/dev/urandom of=sub2/file2 bs=1K count=1024 conv=notrunc
1024+0 records in
1024+0 records out
1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0348762 s, 30.1 MB/s
10:/mnt # btrfs qgroup show --sync -p .
qgroupid rfer excl parent
   --
0/5  16.00KiB 16.00KiB ---
0/256 2.02MiB  1.02MiB ---
0/257 2.02MiB  1.02MiB ---

So far so good. This is expected, each subvolume has 1MiB shared and
1MiB exclusive.

10:/mnt # btrfs qgroup create 22/7 /mnt
10:/mnt # btrfs qgroup assign --rescan 0/256 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup assign --rescan 0/257 22/7 /mnt
Quota data changed, rescan scheduled
10:/mnt # btrfs quota rescan -s /mnt
no rescan operation in progress
10:/mnt # btrfs qgroup show --sync -p .
qgroupid rfer excl parent
   --
0/5  16.00KiB 16.00KiB ---
0/256 2.02MiB  1.02MiB 22/7
0/257 2.02MiB  1.02MiB 22/7
22/7  3.03MiB  3.03MiB ---
10:/mnt #

Oops. The total for 22/7 is correct (1MiB shared + 2 * 1MiB exclusive),
but why is all the data treated as exclusive here? It does not match your
explanation ...





Re: btrfs-transacti hammering the system

2017-12-02 Thread Andrei Borzenkov
01.12.2017 21:04, Austin S. Hemmelgarn пишет:
> On 2017-12-01 12:13, Andrei Borzenkov wrote:
>> 01.12.2017 20:06, Hans van Kranenburg пишет:
>>>
>>> Additional tips (forgot to ask for your /proc/mounts before):
>>> * Use the noatime mount option, so that only accessing files does not
>>> lead to changes in metadata,
>>
>> Is not 'lazytime" default today?

Sorry, it was relatime that is today's default, I mixed them up.

> It gives you correct atime + no extra
>> metadata update cause by update of atime only.
> Unless things have changed since the last time this came up, BTRFS does
> not support the 'lazytime' mount option (but it doesn't complain about
> it either).
> 

Actually, since v2.27 "lazytime" is interpreted by the mount command
itself and converted into the MS_LAZYTIME flag, so it should be available
for any FS.

bor@10:~> sudo mkfs -t ext4 /dev/sdb1
mke2fs 1.43.7 (16-Oct-2017)
...

bor@10:~> sudo mount -t ext4 -o lazytime /dev/sdb1 /mnt
bor@10:~> tail /proc/self/mountinfo
...
224 66 8:17 / /mnt rw,relatime shared:152 - ext4 /dev/sdb1
rw,lazytime,data=ordered
bor@10:~> sudo umount /dev/sdb1
bor@10:~> sudo mkfs -t btrfs -f /dev/sdb1
btrfs-progs v4.13.3
...

bor@10:~> sudo mount -t btrfs -o lazytime /dev/sdb1 /mnt
bor@10:~> tail /proc/self/mountinfo
...
224 66 0:88 / /mnt rw,relatime shared:152 - btrfs /dev/sdb1
rw,lazytime,space_cache,subvolid=5,subvol=/
bor@10:~>


> Also, lazytime is independent from noatime, and using both can have
> benefits (lazytime will still have to write out the inode for every file
> read on the system every 24 hours, but with noatime it only has to write
> out the inode for files that have changed).
> 

OK, that's true.


Re: btrfs-transacti hammering the system

2017-12-01 Thread Andrei Borzenkov
01.12.2017 20:06, Hans van Kranenburg пишет:
> 
> Additional tips (forgot to ask for your /proc/mounts before):
> * Use the noatime mount option, so that only accessing files does not
> lead to changes in metadata,

Isn't 'lazytime' the default today? It gives you correct atime plus no
extra metadata updates caused by atime-only changes.


Re: Read before you deploy btrfs + zstd

2017-11-29 Thread Andrei Borzenkov
29.11.2017 16:24, Austin S. Hemmelgarn пишет:
> On 2017-11-28 18:49, David Sterba wrote:
>> On Tue, Nov 28, 2017 at 09:31:57PM +, Nick Terrell wrote:
>>>
 On Nov 21, 2017, at 8:22 AM, David Sterba  wrote:

 On Wed, Nov 15, 2017 at 08:09:15PM +, Nick Terrell wrote:
> On 11/15/17, 6:41 AM, "David Sterba"  wrote:
>> The branch is now in a state that can be tested. Turns out the memory
>> requirements are too much for grub, so the boot fails with "not
>> enough
>> memory". The calculated value
>>
>> ZSTD_BTRFS_MAX_INPUT: 131072
>> ZSTD_DStreamWorkspaceBound with ZSTD_BTRFS_MAX_INPUT: 549424
>>
>> This is not something I could fix easily, we'd probalby need a tuned
>> version of ZSTD for grub constraints. Adding Nick to CC.
>
> If I understand the grub code correctly, we only need to read, and
> we have
> the entire input and output buffer in one segment. In that case you
> can use
> ZSTD_initDCtx(), and ZSTD_decompressDCtx().
> ZSTD_DCtxWorkspaceBound() is
> only 155984. See decompress_single() in
> https://patchwork.kernel.org/patch/9997909/ for an example.

 Does not help, still ENOMEM.
>>>
>>> It looks like XZ had the same issue, and they make the decompression
>>> context a static object (grep for GRUB_EMBED_DECOMPRESSOR). We could
>>> potentially do the same and statically allocate the workspace:
>>>
>>> ```
>>> /* Could also be size_t */
>>> #define BTRFS_ZSTD_WORKSPACE_SIZE_U64 (155984 / sizeof(uint64_t))
>>> static uint64_t workspace[BTRFS_ZSTD_WORKSPACE_SIZE_U64];
>>>
>>> /* ... */
>>>
>>> assert(sizeof(workspace) >= ZSTD_DCtxWorkspaceBound());
>>> ```
>>
>> Interesting, thanks for the tip, I'll try it next.
>>
>> I've meanwhile tried to tweak the numbers, the maximum block for zstd,
>> that squeezed the DCtx somewhere under 48k, with block size 8k. Still
>> enomem.
>>
>> I've tried to add some debugging prints to see what numbers get actually
>> passed to the allocator, but did not see anything printed.  I'm sure
>> there is a more intelligent way to test the grub changes.  So far each
>> test loop takes quite some time, as I build the rpm package, test it in
>> a VM and have to recreate the environmet each time.
> On the note of testing, have you tried writing up a module to just test
> the decompressor?  If so, you could probably use the 'emu' platform to
> save the need to handle the RPM package and the VM until you get the
> decompressor working by itself, at which point the FUSE modules used to
> test the GRUB filesystem modules may be of some use (or you might be
> able to just use them directly).

There is also grub-fstest, which directly calls the filesystem drivers; usage
is something like "grub-fstest /dev/sdb1 cat /foo". Replace /dev/sdb1
with any btrfs image. Since this runs in user space, it is easy to single-step
if needed.
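
For illustration, a minimal user-space test loop (paths and sizes are
assumptions; it presumes a kernel with zstd support and a grub build that
carries the zstd patch):

truncate -s 512M /tmp/zstd.img
mkfs.btrfs -f /tmp/zstd.img
mount -o loop,compress=zstd /tmp/zstd.img /mnt
dd if=/dev/urandom of=/tmp/ref bs=1M count=32
cp /tmp/ref /mnt/ref && umount /mnt
grub-fstest /tmp/zstd.img ls /
grub-fstest /tmp/zstd.img cmp /ref /tmp/ref   # compares the copy inside the image with the local file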


Re: bug? fstrim only trims unallocated space, not unused in bg's

2017-11-18 Thread Andrei Borzenkov
19.11.2017 09:17, Chris Murphy пишет:
> fstrim should trim free space, but it only trims unallocated. This is
> with kernel 4.14.0 and the entire 4.13 series. I'm pretty sure it
> behaved this way with 4.12 also.
> 

Well, I was told it should also trim free space ...

https://www.spinics.net/lists/linux-btrfs/msg61819.html
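
If the goal is simply to get more of the free space trimmed while this
behavior stands, one workaround sketch (the usage filter values are arbitrary)
is to compact mostly-empty chunks so their space goes back to unallocated,
which fstrim does trim:

btrfs balance start -dusage=10 -musage=10 /
fstrim -v /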

> 
> [root@f27h ~]# fstrim -v /
> /: 39 GiB (41841328128 bytes) trimmed
> [root@f27h ~]# btrfs fi us /
> Overall:
> Device size:  70.00GiB
> Device allocated:  31.03GiB
> Device unallocated:  38.97GiB
> Device missing: 0.00B
> Used:  22.02GiB
> Free (estimated):  47.72GiB(min: 47.72GiB)
> Data ratio:  1.00
> Metadata ratio:  1.00
> Global reserve:  65.97MiB(used: 0.00B)
> 
> Data,single: Size:30.00GiB, Used:21.25GiB
>/dev/nvme0n1p8  30.00GiB
> 
> Metadata,single: Size:1.00GiB, Used:791.58MiB
>/dev/nvme0n1p8   1.00GiB
> 
> System,single: Size:32.00MiB, Used:16.00KiB
>/dev/nvme0n1p8  32.00MiB
> 
> Unallocated:
>/dev/nvme0n1p8  38.97GiB
> 
> 



Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)

2017-11-16 Thread Andrei Borzenkov
16.11.2017 19:13, Kai Krakow пишет:
...
> > BTW: From user API perspective, btrfs snapshots do not guarantee
> perfect granular consistent backups.

Is it documented somewhere? I have been relying on crash-consistent,
write-order-preserving snapshots in NetApp for as long as I can remember.
And I was sure btrfs offers the same, as it is something that follows
naturally from the redirect-on-write idea.

> A user-level file transaction may
> still end up only partially in the snapshot. If you are running
> transaction sensitive applications, those usually do provide some means
> of preparing a freeze and a thaw of transactions.
> 

Is snapshot creation synchronous, so that one knows when it is safe to thaw?

> I think the user transactions API which could've been used for this
> will even be removed during the next kernel cycles. I remember
> reiserfs4 tried to deploy something similar. But there's no consistent
> layer in the VFS for subscribing applications to filesystem snapshots
> so they could prepare and notify the kernel when they are ready.
> 

I do not see what the VFS has to do with it. NetApp works by simply
preserving the previous consistency point instead of throwing it away,
i.e. a snapshot is always the last committed image on stable storage.
Would something like this be possible at the btrfs level by duplicating
the current on-disk root (sorry if I use the wrong term)?
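
For comparison, a minimal sketch of an application-consistent snapshot done
purely from user space (the service name and paths are assumptions):

systemctl stop myapp                                   # hypothetical service: quiesce writers ("freeze")
btrfs filesystem sync /srv/app                         # commit the current transaction
btrfs subvolume snapshot -r /srv/app /srv/app-snap-$(date +%F)
systemctl start myapp                                  # "thaw"

Anything short of such a quiesce only gives crash consistency, which is
exactly the distinction discussed above.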

...


Re: how to repair or access broken btrfs?

2017-11-14 Thread Andrei Borzenkov
14.11.2017 12:56, Stefan Priebe - Profihost AG пишет:
> Hello,
> 
> after a controller firmware bug / failure i've a broken btrfs.
> 
> # parent transid verify failed on 181846016 wanted 143404 found 143399
> 
> running repair, fsck or zero-log always results in the same failure message:
> extent-tree.c:2725: alloc_reserved_tree_block: BUG_ON `ret` triggered,
> value -1
> .. stack trace ..
> 
> Is there an chance to get at least a single file out of the broken fs?
> 

Did you try "btrfs restore"?
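
A rough sketch of how that usually looks (the device and target paths are
assumptions; the target must be a different, healthy filesystem with enough
space):

btrfs restore -l /dev/sdb1                    # list tree roots; one of them can be passed back via -t
btrfs restore -v -i /dev/sdb1 /mnt/recovery   # copy out whatever is still readable, ignoring per-file errors

There is also a --path-regex option for pulling out individual files, although
its regex format takes some getting used to.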


Re: updatedb does not index /home when /home is Btrfs

2017-11-05 Thread Andrei Borzenkov
04.11.2017 21:55, Chris Murphy пишет:
> On Sat, Nov 4, 2017 at 12:27 PM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
>> 04.11.2017 10:05, Adam Borowski пишет:
>>> On Sat, Nov 04, 2017 at 09:26:36AM +0300, Andrei Borzenkov wrote:
>>>> 04.11.2017 07:49, Adam Borowski пишет:
>>>>> On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
>>>>>> Ancient bug, still seems to be a bug.
>>>>>> https://bugzilla.redhat.com/show_bug.cgi?id=906591
>>>>>>
>>>>>> The issue is that updatedb by default will not index bind mounts, but
>>>>>> by default on Fedora and probably other distros, put /home on a
>>>>>> subvolume and then mount that subvolume which is in effect a bind
>>>>>> mount.
>>>>>>
>>>>>> There's a lot of early discussion in 2013 about it, but then it's
>>>>>> dropped off the radar as nobody has any ideas how to fix this in
>>>>>> mlocate.
>>>>>
>>>>> I don't see how this would be a bug in btrfs.  The same happens if you
>>>>> bind-mount /home (or individual homes), which is a valid and non-rare 
>>>>> setup.
>>>>
>>>> It is the problem *on* btrfs because - as opposed to normal bind mount -
>>>> those mount points do *not* refer to the same content.
>>>
>>> Neither do they refer to in a "normal" bind mount.
>>>
>>>> As was commented in mentioned bug report:
>>>>
>>>> mount -o subvol=root /dev/sdb1 /root
>>>> mount -o subvol=foo /dev/sdb1 /root/foo
>>>> mount -o subvol bar /dev/sdb1 /bar/bar
>>>>
>>>> Both /root/foo and /root/bar, will be skipped even though they are not
>>>> accessible via any other path (on mounted filesystem)
>>>
>>> losetup -D
>>> truncate -s 4G junk
>>> losetup -f junk
>>> mkfs.ext4 /dev/loop0
>>> mkdir -p foo bar
>>> mount /dev/loop0 foo
>>> mkdir foo/bar
>>> touch foo/fileA foo/bar/fileB
>>> mount --bind foo/bar bar
>>> umount foo
>>>
>>
>> Indeed. I can build the same configuration on non-btrfs and updatedb
>> would skip non-overlapping mounts just as it would on btrfs. It is just
>> that it is rather more involved on other filesystems (and as you
>> mentioned this requires top-level to be mounted at some point), while on
>> btrfs it is much easier to get (and is default on number of distributions).
>>
>> So yes, it really appears that updatedb check for duplicated mounts is
>> wrong in general and needs rethinking.
> 
> Yes, even if it's not a Btrfs bug, I think it's useful to get a
> different set of eyes on this than just the mlocate folks. Maybe it
> should get posted to fs-devel?
> 

Looking at the mlocate history, the initial bind-mount detection was extremely
simplistic but actually correct, and would still work even with btrfs -
just look in /etc/mtab for a mount with the "bind" option where what != where.
This covers any sort of bind mount.

Later /etc/mtab disappeared and the code was rewritten to use mountinfo.
Intentionally or not, this rewrite only works for bind mounts inside the
same filesystem subtree, i.e. it also won't catch cross-filesystem bind
mounts. The failure on btrfs is a side effect of this assumption.

So it can actually be considered a regression in the mlocate code.

I suppose the mlocate folks first need a clear answer about what they want to
test here; then it makes sense to discuss how to do it.
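
For reference, a quick sketch of what mountinfo actually exposes: per proc(5)
the 4th field is the root of the mount inside its own filesystem, and it
differs from "/" for both bind mounts and btrfs subvolume mounts, which is why
a mountinfo-only check cannot tell the two apart:

awk '$4 != "/" { print $4, "mounted on", $5 }' /proc/self/mountinfo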


Re: updatedb does not index /home when /home is Btrfs

2017-11-04 Thread Andrei Borzenkov
04.11.2017 10:05, Adam Borowski пишет:
> On Sat, Nov 04, 2017 at 09:26:36AM +0300, Andrei Borzenkov wrote:
>> 04.11.2017 07:49, Adam Borowski пишет:
>>> On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
>>>> Ancient bug, still seems to be a bug.
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=906591
>>>>
>>>> The issue is that updatedb by default will not index bind mounts, but
>>>> by default on Fedora and probably other distros, put /home on a
>>>> subvolume and then mount that subvolume which is in effect a bind
>>>> mount.
>>>>
>>>> There's a lot of early discussion in 2013 about it, but then it's
>>>> dropped off the radar as nobody has any ideas how to fix this in
>>>> mlocate.
>>>
>>> I don't see how this would be a bug in btrfs.  The same happens if you
>>> bind-mount /home (or individual homes), which is a valid and non-rare setup.
>>
>> It is the problem *on* btrfs because - as opposed to normal bind mount -
>> those mount points do *not* refer to the same content.
> 
> Neither do they refer to in a "normal" bind mount.
> 
>> As was commented in mentioned bug report:
>>
>> mount -o subvol=root /dev/sdb1 /root
>> mount -o subvol=foo /dev/sdb1 /root/foo
>> mount -o subvol bar /dev/sdb1 /bar/bar
>>
>> Both /root/foo and /root/bar, will be skipped even though they are not
>> accessible via any other path (on mounted filesystem)
> 
> losetup -D
> truncate -s 4G junk
> losetup -f junk
> mkfs.ext4 /dev/loop0
> mkdir -p foo bar
> mount /dev/loop0 foo
> mkdir foo/bar
> touch foo/fileA foo/bar/fileB
> mount --bind foo/bar bar
> umount foo
> 

Indeed. I can build the same configuration on non-btrfs, and updatedb
would skip non-overlapping mounts just as it does on btrfs. It is just
that it is rather more involved on other filesystems (and, as you
mentioned, it requires the top level to be mounted at some point), while on
btrfs it is much easier to get (and is the default on a number of
distributions).

So yes, it really appears that the updatedb check for duplicated mounts is
wrong in general and needs rethinking.


Re: updatedb does not index /home when /home is Btrfs

2017-11-04 Thread Andrei Borzenkov
04.11.2017 07:49, Adam Borowski пишет:
> On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
>> Ancient bug, still seems to be a bug.
>> https://bugzilla.redhat.com/show_bug.cgi?id=906591
>>
>> The issue is that updatedb by default will not index bind mounts, but
>> by default on Fedora and probably other distros, put /home on a
>> subvolume and then mount that subvolume which is in effect a bind
>> mount.
>>
>> There's a lot of early discussion in 2013 about it, but then it's
>> dropped off the radar as nobody has any ideas how to fix this in
>> mlocate.
> 
> I don't see how this would be a bug in btrfs.  The same happens if you
> bind-mount /home (or individual homes), which is a valid and non-rare setup.
> 

It is the problem *on* btrfs because - as opposed to a normal bind mount -
those mount points do *not* refer to the same content. As was commented
in the mentioned bug report:

mount -o subvol=root /dev/sdb1 /root
mount -o subvol=foo /dev/sdb1 /root/foo
mount -o subvol=bar /dev/sdb1 /root/bar

Both /root/foo and /root/bar will be skipped even though they are not
accessible via any other path (on the mounted filesystem)


191 25 0:54 /root /home/bor/tmp/root rw,relatime shared:131 - btrfs
/dev/loop0 rw,space_cache,subvolid=258,subvol=/root
285 191 0:54 /foo /home/bor/tmp/root/foo rw,relatime shared:239 - btrfs
/dev/loop0 rw,space_cache,subvolid=256,subvol=/foo
325 191 0:54 /bar /home/bor/tmp/root/bar rw,relatime shared:279 - btrfs
/dev/loop0 rw,space_cache,subvolid=257,subvol=/bar

bor@bor-Latitude-E5450:~/tmp$ sudo updatedb --debug-pruning -l 0 -o
../db -U root
...
Matching bind_mount_paths:
 => adding `/home/bor/tmp/root/foo'
 => adding `/home/bor/tmp/root/bar'
...done


It is a problem *of* btrfs because it does not offer any easy way to
distinguish between a subvolume mount and a bind mount. If you are aware of
one, please comment on the mentioned bug report.

And note that updatedb can be run as non-root as well, so it probably
cannot use btrfs-specific ioctls to extract the information.
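
One non-root heuristic that comes to mind (just a sketch, not something
mlocate does today): on btrfs the top directory of every subvolume has inode
number 256, so a mount point whose root directory has inode 256 on a btrfs
filesystem is a subvolume mount (or a bind mount of a subvolume top), while a
bind mount of an ordinary directory is not:

stat -f -c %T /home   # filesystem type as seen by statfs; "btrfs" here
stat -c %i /home      # 256 if /home is the top directory of a subvolume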


Re: Several questions regarding btrfs

2017-11-02 Thread Andrei Borzenkov
02.11.2017 20:13, Austin S. Hemmelgarn пишет:
>>
>> 2. I want to limit access to sftp, so there will be no custom commands
>> to execute...
> A custom version of the 'quota' command would be easy to add in there.
> In fact, this is really the only option right now, since setting up sudo
> (or doas, or whatever other privilege escalation tool) to allow users to
> check usage requires full access to the 'btrfs' command, which in turn
> opens you up to people escaping their quotas.

It should be possible to allow only "btrfs qgroup show", at least in sudo.
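
For illustration, a one-line sudoers sketch (the group name and binary path
are assumptions, adjust them to the system); since "btrfs qgroup show" takes a
path argument anyway, the trailing wildcard keeps the rule usable without
opening up other subcommands:

%sftpusers ALL = (root) NOPASSWD: /usr/bin/btrfs qgroup show *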



Re: Several questions regarding btrfs

2017-11-01 Thread Andrei Borzenkov
01.11.2017 15:01, Austin S. Hemmelgarn пишет:
...
> The default subvolume is what gets mounted if you don't specify a
> subvolume to mount.  On a newly created filesystem, it's subvolume ID 5,
> which is the top-level of the filesystem itself.  Debian does not
> specify a subvo9lume in /etc/fstab during the installation, so setting
> the default subvolume will control what gets mounted.  If you were to
> add a 'subvolume=' or 'subvolid=' mount option to /etc/fstab for that
> filesystem, that would override the default subvolume.
> 
> The reason I say to set the default subvolume instead of editing
> /etc/fstab is a pretty simple one though.  If you edit /etc/fstab and
> don't set the default subvolume, you will need to mess around with the
> bootloader configuration (and possibly rebuild the initramfs) to make
> the system bootable again, whereas by setting the default subvolume, the
> system will just boot as-is without needing any other configuration
> changes.

That breaks as soon as you have nested subvolumes that are not
explicitly mounted, because nested subvolumes are not included in the new
snapshot and so appear lost.
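
A quick sketch of the effect (paths and subvolume names are assumptions):

btrfs subvolume create /mnt/top/root
btrfs subvolume create /mnt/top/root/var
btrfs subvolume snapshot /mnt/top/root /mnt/top/root-snap
ls /mnt/top/root-snap/var    # empty directory: the nested subvolume was not snapshotted

So once set-default points at the snapshot, anything that lived in a nested
subvolume has to be mounted explicitly or it simply is not there.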


Re: Several questions regarding btrfs

2017-10-31 Thread Andrei Borzenkov
31.10.2017 20:45, Austin S. Hemmelgarn пишет:
> On 2017-10-31 12:23, ST wrote:
>> Hello,
>>
>> I've recently learned about btrfs and consider to utilize for my needs.
>> I have several questions in this regard:
>>
>> I manage a dedicated server remotely and have some sort of script that
>> installs an OS from several images. There I can define partitions and
>> their FSs.
>>
>> 1. By default the script provides a small separate partition for /boot
>> with ext3. Does it have any advantages or can I simply have /boot
>> within / all on btrfs? (Note: the OS is Debian9)
> It depends on the boot loader.  I think Debian 9's version of GRUB has
> no issue with BTRFS, but see the response below to your question on
> subvolumes for the one caveat.
>>
>> 2. as for the / I get ca. following written to /etc/fstab:
>> UUID=blah_blah /dev/sda3 / btrfs ...
>> So top-level volume is populated after initial installation with the
>> main filesystem dir-structure (/bin /usr /home, etc..). As per btrfs
>> wiki I would like top-level volume to have only subvolumes (at least,
>> the one mounted as /) and snapshots. I can make a snapshot of the
>> top-level volume with / structure, but how can get rid of all the
>> directories within top-lvl volume and keep only the subvolume
>> containing / (and later snapshots), unmount it and then mount the
>> snapshot that I took? rm -rf / - is not a good idea...
> There are three approaches to doing this, from a live environment, from
> single user mode running with init=/bin/bash, or from systemd emergency
> mode.  Doing it from a live environment is much safer overall, even if
> it does take a bit longer.  I'm listing the last two methods here only
> for completeness, and I very much suggest that you use the first (do it
> from a live environment).
> 
> Regardless of which method you use, if you don't have a separate boot
> partition, you will have to create a symlink called /boot outside the
> subvolume, pointing at the boot directory inside the subvolume, or
> change the boot loader to look at the new location for /boot.
> 
> From a live environment, it's pretty simple overall, though it's much
> easier if your live environment matches your distribution:
> 1. Create the snapshot of the root, naming it what you want the
> subvolume to be called (I usually just call it root, SUSE and Ubuntu
> call it @, others may have different conventions).
> 2. Delete everything except the snapshot you just created.  The safest
> way to do this is to explicitly list each individual top-level directory
> to delete.
> 3. Use `btrfs subvolume list` to figure out the subvolume ID for the
> subvolume you just created, and then set that as the default subvolume
> with `btrfs subvolume set-default /path SUBVOLID`.  Once you do this,
> you will need to specify subvolid=5 in the mount options to get the real
> top-level subvolume.

Note that current grub2 works with absolute paths (relative to the
filesystem root). It means that if a) /boot/grub is on btrfs and b) it
is part of the snapshot that becomes the new root, $prefix (which points to
/boot/grub) in the first-stage grub2 image will be wrong. So to be on the
safe side you would want to reinstall grub2 after this change.
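
A minimal sketch of that reinstall, assuming a BIOS system booting from
/dev/sda (Debian names the tools grub-install and update-grub; other
distributions use grub2-install and grub2-mkconfig):

grub-install /dev/sda   # re-embeds the core image so $prefix points into the new root subvolume
update-grub             # regenerates grub.cfg with the new paths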


Re: SLES 11 SP4: can't mount btrfs

2017-10-26 Thread Andrei Borzenkov
26.10.2017 15:18, Lentes, Bernd пишет:
> 
>> -Original Message-
>> From: linux-btrfs-ow...@vger.kernel.org 
>> [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Lentes, Bernd
>> Sent: Tuesday, October 24, 2017 6:44 PM
>> To: Btrfs ML 
>> Subject: RE: SLES 11 SP4: can't mount btrfs
>>
>>
>>>
>>> A short-term alternative, if you've got a full backup of what SLES
>>> mounts as /, is to run a regular install, boot the system, and then
>>> extract the backup on top of /.  It's not perfect, but it should work
>>> well enough.
>>
>> That's what I'm currently trying. I will keep you informed.
>>
> 
> I was able to restore the root fs. I formatted the / partition with Btrfs 
> again and could restore the files from a backup.
> Everything seems to be there, I can mount the Btrfs manually.
> But booting does not work. My Btrfs resides on a logical volume. I changed 
> /boot/grub/menu.lst and /etc/fstab to point to the lv. Before it was 
> pointing to a UUID.
> But booting my SLES complains that it does not find the root fs. screenshot: 
> https://hmgubox.helmholtz-muenchen.de/f/2d6b374e8a8b40569d4f/?dl=1
> 
> I can manually mount from the booted SLES. So everything (Btrfs, lvm) seems 
> to be available. I added in menu.lst and fstab the path to the device node 
> (/dev/vg1/lv_root), which works on other systems the same way, the only 
> difference is there I have ext3.
> But SLES finds from where I don't know a UUID (see screenshot). This UUID is 
> commented out in fstab and replaced by /dev/vg1/lv_root. Using 
> /dev/vg1/lv_root I can manually mount my Btrfs without any problem.
> 
> Where does my SLES find that UUID ? It is not available unter 
> /dev/disk/by-uuid. Can I change that value ?
> 

The root device information is stored in the initrd, so you need to rebuild
it. Just run mkinitrd after you boot the system.
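
A sketch of the two usual ways to do that on SLES 11 (device and mount paths
follow the thread and are otherwise assumptions):

# from the (somehow) booted system:
mkinitrd
# or from rescue media, chroot into the installed system first:
mount /dev/vg1/lv_root /mnt
mount --bind /dev /mnt/dev && mount --bind /proc /mnt/proc && mount --bind /sys /mnt/sys
chroot /mnt mkinitrd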


Re: SLES 11 SP4: can't mount btrfs

2017-10-24 Thread Andrei Borzenkov
On Tue, Oct 24, 2017 at 2:53 PM, Austin S. Hemmelgarn
 wrote:
>
> SLES (and OpenSUSE in general) does do something special though, they use
> subvolumes and qgroups to replicate multiple independent partitions (which
> is a serious pain in the arse), and they have snapshotting with snapper by
> default as well.  On OpenSUSE at least you can dispense with all that crap
> by telling the installer to not enable snapshot support, not sure about SLES
> though.

SUSE is using so many subvolumes because
a) it wants to use a snapshot of the operating system to enable rollback
b) the data that needs to be part of the snapshot includes the RPM database
c) the RPM database is located under /var

So they were forced to make /var part of the root subvolume and explicitly
exclude everything below /var by making those directories separate subvolumes.

Fortunately this is going to change now, with both RH and SUSE moving the RPM
database under /usr, which leaves you basically with / and /var as the
default subvolumes.


Re: SLES 11 SP4: can't mount btrfs

2017-10-19 Thread Andrei Borzenkov
19.10.2017 23:04, Chris Murphy пишет:
> Btrfs
> is not just supported by SUSE, it's the default file system.
> 

It is the default choice for root starting with SLES 12, not in SLES 11. But
yes, it should still be supported.

I would not hold my breath though. As far as I can tell, transid errors are
usually fatal, and if this is the root filesystem, it may be easier and faster
to just reinstall.


Re: [PATCH v4] btrfs: Remove received_uuid during received snapshot ro->rw switch

2017-10-07 Thread Andrei Borzenkov
07.10.2017 00:27, Hans van Kranenburg пишет:
> On 10/06/2017 10:07 PM, Andrei Borzenkov wrote:
>>
>> What is reason behind allowing change from ro to rw in the first place?
>> What is the use case?
> 
> I think this is a case of "well, nobody actually has been thinking of
> the use cases ever, we just did something yolo"
> 
> Btrfs does not make a difference between snapshots and clones. Other
> systems like netapp and zfs do. Btrfs cloud also do that, and just not
> expose the ro/rw flag to the outside.
> 

The current pure user-level implementation of btrfs receive requires the
ability to switch from rw to ro, so it is not possible to completely
hide it. This is different from both NetApp and ZFS. On NetApp the
destination volume/qtree is always read-only for client access; ZFS
explicitly disallows any access to the destination until the transfer is
complete.

It was already mentioned that in btrfs the destination may be changed, before
the subvolume is switched to ro, without anyone noticing it. Ideally btrfs
receive needs exclusive access to the subvolume, with some sort of automatic
cleanup if receive fails for any reason. That would ensure an atomic (from the
end user's PoV) transfer.

> Personally, I would like btrfs to go into that direction, because it
> just makes things more clear. This is a snapshot, you cannot touch it.
> If you want to make changes, you have to make a rw clone of the snapshot.
>
> The nice thing for btrfs is that you can remove the snapshot after you
> made the rw clone, which you cannot do on a NetApp filer. :o)
> 
>>> Even if it wouldn't make sense for some reason, it's a nice thought
>>> experiment. :)
> 
> There we go :)
> 


Re: [PATCH v4] btrfs: Remove received_uuid during received snapshot ro->rw switch

2017-10-06 Thread Andrei Borzenkov
06.10.2017 20:49, Hans van Kranenburg пишет:
> On 10/06/2017 07:24 PM, David Sterba wrote:
>> On Thu, Oct 05, 2017 at 05:03:47PM +0800, Anand Jain wrote:
>>> On 10/05/2017 04:22 PM, Nikolay Borisov wrote:
 Currently when a read-only snapshot is received and subsequently its ro 
 property
 is set to false i.e. switched to rw-mode the received_uuid of that subvol 
 remains
 intact. However, once the received volume is switched to RW mode we cannot
 guaranteee that it contains the same data, so it makes sense to remove the
 received uuid. The presence of the received_uuid can also cause problems 
 when
 the volume is being send.
> 
> Are the 'can cause problems when being send' explained somewhere?
> 

If received_uuid is present, btrfs send will use it instead of the subvolume
uuid. It means btrfs receive may find the wrong subvolume as the base of the
differential stream. An example that was demonstrated earlier:

1. A -> B on remote system S. B now has received_uuid == A
2. A -> C on local system. C now has received_uuid == A
3. C is made read-write and changed.
4. Create snapshot D from C and do "btrfs send -p C D" to system S. Now
btrfs receive on S will get A's uuid as the base and will find B. So any
changes between B and C are silently lost.
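
A rough shell sketch of those four steps (paths and the host name S are
assumptions; A is a read-only snapshot, and the ro toggles are spelled out
because send refuses non-read-only parents):

btrfs send /pool/A | ssh S btrfs receive /backup    # 1: B on S, received_uuid == uuid(A)
btrfs send /pool/A | btrfs receive /local           # 2: C = /local/A, received_uuid == uuid(A)
btrfs property set /local/A ro false                # 3: make C writable ...
echo changed > /local/A/file                        #    ... and change it
btrfs property set /local/A ro true
btrfs subvolume snapshot -r /local/A /local/D       # 4: snapshot D of C
btrfs send -p /local/A /local/D | ssh S btrfs receive /backup
# the stream names the parent by C's received_uuid, receive on S resolves that to B,
# and the change made in step 3 silently never reaches S
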

>>>
>>> Wonder if this [1] approach was considered
>>> [1]
>>>   - set a flag on the subvolume to indicate its dirtied so that 
>>> received_uuid can be kept forever just in case if user needs it for some 
>>> reference at a later point of time.
>>
>> Yeah, we need to be careful here. There are more items related to the
>> recived subvolume, besides received_uuid there's rtransid and rtime so
>> they might need to be cleared as well.
>>
>> I don't remember all the details how the send/receive and uuids
>> interact. Switching from ro->rw needs to affect the 'received' status,
>> but I don't know how. The problem is that some information is being lost
>> although it may be quite important to the user/administrator. In such
>> cases it would be convenient to request a confirmation via a --force
>> flag or something like that.
> 
> On IRC I think we generally recommends users to never do this, and as a
> best practice always clone the snapshot to a rw subvolume in a different
> location if someone wants to proceed working with the contents and
> changing them as opposed to messing with the ro/rw attributes.
> 
> So, what about option [2]:
> 
> [2] if a subvolume has a received_uuid, then just do not allow changing
> it to rw.
> 

What is the reason behind allowing the change from ro to rw in the first
place? What is the use case?

> Even if it wouldn't make sense for some reason, it's a nice thought
> experiment. :)
> 



Re: [PATCH v2] btrfs-progs: subvol: change subvol set-default to also accept subvol path

2017-10-02 Thread Andrei Borzenkov
On Mon, Oct 2, 2017 at 11:19 AM, Misono, Tomohiro
 wrote:
> This patch changes "subvol set-default" to also accept the subvolume path
> for convenience.
>
> This is the one of the issue on github:
> https://github.com/kdave/btrfs-progs/issues/35
>
> If there are two args, they are assumed as subvol id and path to the fs
> (the same as current behavior), and if there is only one arg, it is assumed
> as the path to the subvolume. Therefore there is no ambiguity between subvol
> id and subvol name, which is mentioned in the above issue page.
>
> Only the absolute path to the subvolume is allowed, for the safety when
> multiple filesystems are used.
>
> subvol id is resolved by get_subvol_info() which is used by "subvol show".
>
> change to v2:
> restrict the path to only allow absolute path.

This is an absolutely arbitrary restriction. Why can we do "btrfs
subvolume create ./relative/path" but cannot do "btrfs subvolume
set-default ./relative/path"?

> documents are updated accordingly.
>
> Signed-off-by: Tomohiro Misono 
> ---
>  Documentation/btrfs-subvolume.asciidoc | 11 +++
>  cmds-subvolume.c   | 53 
> +++---
>  2 files changed, 48 insertions(+), 16 deletions(-)
>
> diff --git a/Documentation/btrfs-subvolume.asciidoc 
> b/Documentation/btrfs-subvolume.asciidoc
> index 5cfe885..7973567 100644
> --- a/Documentation/btrfs-subvolume.asciidoc
> +++ b/Documentation/btrfs-subvolume.asciidoc
> @@ -142,12 +142,13 @@ you can add \'\+' or \'-' in front of each items, \'+' 
> means ascending,
>  for --sort you can combine some items together by \',', just like
>  --sort=+ogen,-gen,path,rootid.
>
> -*set-default* <subvolid> <path>::
> -Set the subvolume of the filesystem <path> which is mounted as
> -default.
> +*set-default* [<subvolume>|<subvolid> <path>]::
> +Set the subvolume of the filesystem which is mounted as default.
>  +
> -The subvolume is identified by <subvolid>, which is returned by the *subvolume list*
> -command.
> +Only absolute path is allowed to specify the subvolume.
> +Alternatively, the pair of <subvolid> and <path> can be used. In that
> +case, the subvolume is identified by <subvolid>, which is returned by the
> +*subvolume list* command. The filesystem is specified by <path>.
>
>  *show* ::
>  Show information of a given subvolume in the .
> diff --git a/cmds-subvolume.c b/cmds-subvolume.c
> index 666f6e0..dda9e73 100644
> --- a/cmds-subvolume.c
> +++ b/cmds-subvolume.c
> @@ -803,28 +803,59 @@ out:
>  }
>
>  static const char * const cmd_subvol_set_default_usage[] = {
> -   "btrfs subvolume set-default  ",
> -   "Set the default subvolume of a filesystem",
> +   "btrfs subvolume set-default [| ]",
> +   "Set the default subvolume of the filesystem mounted as default.",
> +   "The subvolume can be specified by its absolute path,",
> +   "or the pair of subvolume id and mount point to the filesystem.",
> NULL
>  };
>
>  static int cmd_subvol_set_default(int argc, char **argv)
>  {
> -   int ret=0, fd, e;
> -   u64 objectid;
> -   char*path;
> -   char*subvolid;
> -   DIR *dirstream = NULL;
> +   int ret = 0;
> +   int fd, e;
> +   u64 objectid;
> +   char *path;
> +   char *subvolid;
> +   DIR *dirstream = NULL;
>
> clean_args_no_options(argc, argv, cmd_subvol_set_default_usage);
>
> -   if (check_argc_exact(argc - optind, 2))
> +   if (check_argc_min(argc - optind, 1) ||
> +   check_argc_max(argc - optind, 2))
> usage(cmd_subvol_set_default_usage);
>
> -   subvolid = argv[optind];
> -   path = argv[optind + 1];
> +   if (argc - optind == 1) {
> +   /* path to the subvolume is specified */
> +   struct root_info ri;
> +   char *fullpath;
>
> -   objectid = arg_strtou64(subvolid);
> +   path = argv[optind];
> +   if (path[0] != '/') {
> +   error("only absolute path is allowed");
> +   return 1;
> +   }
> +
> +   fullpath = realpath(path, NULL);
> +   if (!fullpath) {
> +   error("cannot find real path for '%s': %s",
> +   path, strerror(errno));
> +   return 1;
> +   }
> +
> +   ret = get_subvol_info(fullpath, &ri);
> +   free(fullpath);
> +
> +   if (ret)
> +   return 1;
> +
> +   objectid = ri.root_id;
> +   } else {
> +   /* subvol id and path to the filesystem are specified */
> +   subvolid = argv[optind];
> +   path = argv[optind + 1];
> +   objectid = arg_strtou64(subvolid);
> +   }
>
> fd = btrfs_open_dir(path, &dirstream, 1);
> if (fd < 0)
> --
> 2.9.5
>

Re: What means "top level" in "btrfs subvolume list" ?

2017-10-01 Thread Andrei Borzenkov
30.09.2017 14:57, Goffredo Baroncelli пишет:
> (please ignore my previous email, because I wrote somewhere "top id" instead 
> of "top level")
> Hi All,
> 
> I am trying to figure out which means "top level" in the output of "btrfs sub 
> list"
> 
> 

Digging in the git history - "top level" originally meant "the subvolume where
the path starts". Apparently the original code allowed for "detached" roots
not present in the tree of roots (deleted subvolumes?).

> ghigo@venice:~$ sudo btrfs sub list /
> [sudo] password for ghigo: 
> ID 257 gen 548185 top level 5 path debian
> ID 289 gen 418851 top level 257 path var/lib/machines
> ID 299 gen 537230 top level 5 path boot
> ID 532 gen 502364 top level 257 path tmp/test1
> ID 533 gen 502277 top level 532 path tmp/test1/test2
> ID 534 gen 502407 top level 257 path tmp/test1.snap
> ID 535 gen 502363 top level 532 path tmp/test1/test3
> ID 536 gen 502407 top level 257 path tmp/test1.snap2
> ID 537 gen 537132 top level 5 path test
> ID 538 gen 537130 top level 537 path test/sub1
> 
> The root filesystem is mounted as
> 
> mount -o subvol=debian 
> 
> 
> I think that "btrfs sub list" is buggy, and it outputs a "top level ID" equal 
> to the "parent ID" (on the basis of the code).

Yes, apparently this was (unintentionally?) changed by commit
4f5ebb3ef55396ef976d3245e2cdf9860680df74, which also apparently changed the
semantics of the -o option - before this commit resolve_root() would produce
the subvolume path relative to the given top_id; after this commit the path is
always relative to the filesystem root.

Moreover, this fix by itself is buggy. It sets "top level" to the immediate
parent, but option -o will set up a filter by top_id, which means that only
immediate subvolumes will be listed. To illustrate:

bor@10:~> /usr/sbin/btrfs sub cre tsub
Create subvolume './tsub'
bor@10:~> /usr/sbin/btrfs sub cre tsub/sub1
Create subvolume 'tsub/sub1'
bor@10:~> /usr/sbin/btrfs sub cre tsub/sub2
Create subvolume 'tsub/sub2'
bor@10:~> sudo btrfs sub li -p -o tsub
ID 346 gen 1960 parent 345 top level 345 path @/home/bor/tsub/sub1
ID 347 gen 1961 parent 345 top level 345 path @/home/bor/tsub/sub2
bor@10:~> /usr/sbin/btrfs sub cre tsub/sub1/subsub1
Create subvolume 'tsub/sub1/subsub1'
bor@10:~> /usr/sbin/btrfs sub cre tsub/sub2/subsub2
Create subvolume 'tsub/sub2/subsub2'
bor@10:~> sudo btrfs sub li -p -o tsub
ID 346 gen 1965 parent 345 top level 345 path @/home/bor/tsub/sub1
ID 347 gen 1966 parent 345 top level 345 path @/home/bor/tsub/sub2

it misses nested subvolumes.

> But I am still asking which would be the RIGHT "top level id". My Hypothesis, 
> it should be the ID of the root subvolume ( or 5 if it is not mounted).

As a reminder - "top level" was intended to mean "the subvolume to which the
shown path is relative".

Given that the code now will fail (return -ENOENT) for a detached root,
the only possible output can be 5, except for the controversial case of
"-o". Going back to the original behavior is probably going to break some
scripts now.

But again, at this point we may have scripts that rely on the current "top
level" semantics. The change has been there for what ... three and a half years.

But documenting it in the manual page would be good.

> So the output should be
> 
>  
> ghigo@venice:~$ sudo btrfs sub list /
> [sudo] password for ghigo: 
> ID 257 gen 548185 top level 5 path debian
> ID 289 gen 418851 top level 257 path var/lib/machines
> ID 299 gen 537230 top level 5 path boot
> ID 532 gen 502364 top level 257 path tmp/test1
> ID 533 gen 502277 top level 257 path tmp/test1/test2
> ID 534 gen 502407 top level 257 path tmp/test1.snap
> ID 535 gen 502363 top level 257 path tmp/test1/test3
> ID 536 gen 502407 top level 257 path tmp/test1.snap2
> ID 537 gen 537132 top level 5 path test
> ID 538 gen 537130 top level 5 path test/sub1
> 
> 
> The subvolumes in the file system (mounted from /debian) are
> 
> / ID = 5
> /debian   ID = 257
> /debian/var/lib/machines  ID = 289
> /boot ID = 299
> /debian/tmp/test1 ID = 532
> /debian/tmp/test1/test2   ID = 533
> /debian/tmp/test1.snapID = 534
> /debian/tmp/test1/test3   ID = 535
> /debian/tmp/test1.snap2   ID = 536
> /test ID = 537
> /test/sub1ID = 538
> 
> BR
> G.Baroncelli
> 



Re: What means "top level" in "btrfs subvolume list" ?

2017-09-30 Thread Andrei Borzenkov
30.09.2017 17:53, Peter Grandi пишет:
>> I am trying to figure out which means "top level" in the
>> output of "btrfs sub list"
> 
> The terminology (and sometimes the detailed behaviour) of Btrfs
> is not extremely consistent, I guess because of permissive
> editorship of the design, in a "let 1000 flowers bloom" sort
> of fashion so that does not matter a lot.
> 
>> [ ... ] outputs a "top level ID" equal to the "parent ID" (on
>> the basis of the code).
> 
> You could have used option '-p' and it would have printed out
> both "top level ID" and "parent ID" for extra enlightenment.
> 

How does it explain what "top level" is?

>> But I am still asking which would be the RIGHT "top level id".
> 
> But perhaps one of them is now irrelevant, because 'man btrfs
> subvolume says:
> 
>   "If -p is given, then parent  is added to the output
>   between ID and top level. The parent’s ID may be used at mount
>   time via the subvolrootid= option."
> 
> and 'man 5 btrfs' says:
> 
>   "subvolrootid=objectid
> (irrelevant since: 3.2, formally deprecated since: 3.10)
> A workaround option from times (pre 3.2) when it was not
> possible to mount a subvolume that did not reside directly
> under the toplevel subvolume."
> 

This still does not explain what "top level" in "btrfs sub list" means.

>> My Hypothesis, it should be the ID of the root subvolume ( or
>> 5 if it is not mounted). [ ... ]
> 
> Well, a POSIX filesystem typically has a root directory, and it
> can be mounted as the system root or any other point. A Btrfs
> filesystem has multiple root directories, that are mounted by
> default "somewhere" (a design decision that I think was unwise,
> but "whatever").
> 
> The subvolume containing the mountpoint directory of another
> subvolume's root directory is is no way or sense its "parent",
> as there is no derivation relationship; root directories are
> independent of each other and their mountpoint is (or should be)
> a runtime entity.
> 

With all respect, that is not how it looks. Each subvolume has a very
precise relationship to its containing subvolume; you can only traverse
subvolumes via this very explicit relationship. The fact that subvolumes
can also be mounted individually at the VFS level is rather irrelevant to
the filesystem structure.

Whether "parent" is the correct name for the containing subvolume is of course
a matter of opinion, but for me it fits - subvolumes do form a tree and
have a very well defined parent/child relationship.

> If there is a "parent" relationship that maybe be between
> snapshot and origin subvolume (ignoring 'send'/'receive'...),

Yes, having the same name for entirely different types of hierarchical
relationships is unfortunate. But it still does not explain what "top
level" means :)


Re: Wrong device?

2017-09-27 Thread Andrei Borzenkov
26.09.2017 10:31, Lukas Pirl пишет:
> On 09/25/2017 06:11 PM, linux-bt...@oh3mqu.pp.hyper.fi wrote as excerpted:
>> After a long googling (about more complex situations) I suddenly
>> noticed "device sdb" WTF???  Filesystem is mounted from /dev/md3 (sdb
>> is part of that mdraid) so btrfs should not even know anything about
>> that /dev/sdb.
> 
> I would be interested in explanations regarding this too. It happened
> to me as well, that I was confused by /dev/sd* device paths being
> printed by btrfs in the logs, even though it runs on /dev/md-*
> (/dev/mapper/*) devices exclusively.
> 

Could be related:

https://forums.opensuse.org/showthread.php/526696-Upgrade-destroys-dmraid?p=2838207#post2838207






Re: AW: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Andrei Borzenkov
24.09.2017 16:53, Fuhrmann, Carsten пишет:
> Hello,
> 
> 1)
> I used direct write (no page cache) but I didn't disable the Disk cache of 
> the HDD/SSD itself. In all tests I wrote 1GB and looked for the runtime of 
> that write process.

So "latency" on your diagram means total time to write 1GiB file? That
is highly unusual meaning for "latency" which normally means time to
perform single IO. If so, you should better rename Y-axis to something
like "total run time".


Re: difference between -c and -p for send-receive?

2017-09-20 Thread Andrei Borzenkov
20.09.2017 22:05, Antoine Belvire пишет:
> Hello,
> 
>> All snapshots listed in -c options and snapshot that we want to
>> transfer must have the same parent uuid, unless -p is explicitly
>> provided.
> 
> It's rather the same mount point than the same parent uuid, like cp
> --reflink, isn't it?

Sorry, I do not understand this sentence. Could you rephrase?

> 
> ~# btrfs subvolume create /test2/
> Create subvolume '//test2'
> ~# btrfs subvolume create /test2/foo
> Create subvolume '/test2/foo'
> ~# cd /test2   
> ~# btrfs subvolume snapshot -r . .1
> Create a readonly snapshot of '.' in './.1'
> ~#
> ~# # a: 40 MiB in /test2/
> ~# dd if=/dev/urandom of=a bs=4k count=10k
> 10240+0 records in
> 10240+0 records out
> 41943040 bytes (42 MB, 40 MiB) copied, 0.198961 s, 211 MB/s
> ~#
> ~# # b: 80 MiB in /test2/foo
> ~# dd if=/dev/urandom of=foo/b bs=4k count=20k
> 20480+0 records in
> 20480+0 records out
> 83886080 bytes (84 MB, 80 MiB) copied, 0.393823 s, 213 MB/s
> ~#
> ~# # copy-clone /test2/foo/b to /test2/b
> ~# cp --reflink foo/b .
> ~#
> ~# btrfs subvolume -s . .2
> Create a readonly snapshot of '.' in './.2'
> ~#
> ~# # Sending .2 with only .1 as parent (.1 already sent)
> ~# btrfs send -p .1 .2 | wc -c
> At subvol .2
> 125909258 # 120 Mio = 'a' + 'b'
> ~#
> ~# # Sending .2 with .1 and foo as clone sources (.1 and foo already
> ~# # sent), .1 is automatically picked as parent
> ~# btrfs property set foo ro true
> ~# btrfs send -c .1 -c foo .2 | wc -c
> At subvol .2
> 41970349 # 40 Mio, only 'a'
> ~#
> 
> UUIDs on the sending side:
> 
> ~# btrfs subvolume list -uq / | grep test2
> ID 6141 gen 454658 top level 6049 parent_uuid - uuid
> bbf936dd-ca84-f749-9b9b-09f7081879a2 path test2
> ID 6142 gen 454658 top level 6141 parent_uuid - uuid
> 54a7cdea-6198-424a-9349-8116172d0c17 path test2/foo

Yes, sorry, I misread the code. We need at least one snapshot from the -c
options that has the same parent uuid (i.e. is a snapshot of the same
subvolume) as the snapshot we want to transfer, not all of them. In your
case

btrfs send -c .1 -c foo .2

it will select .1 as the base snapshot and additionally try to clone from
foo if possible. In the wiki example the only two snapshots are from
completely different subvolumes, which will fail. In the discussion that led
to this mail the snapshots probably did not have any parent uuid at all.

> ID 6143 gen 454655 top level 6141 parent_uuid
> bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid
> 28f1d7db-7341-f545-a2ac-d8819d22a5b5 path test2/.1
> ID 6144 gen 454658 top level 6141 parent_uuid
> bbf936dd-ca84-f749-9b9b-09f7081879a2 uuid
> db9ad2b1-aee1-544c-b368-d698b4a05119 path test2/.2
> ~#
> 
> On the receiving side, .1 is used as parent:
> 
> ~# btrfs subvolume list -uq /var/run/media/antoine/backups/ | grep dest
> ID 298 gen 443 top level 5 parent_uuid - uuid
> 7695cba7-dfbf-2f44-bd79-18c9820fdb2f path dest/.1
> ID 299 gen 443 top level 5 parent_uuid - uuid
> c32c06ec-0a17-cf42-9b04-3804ad72f836 path dest/foo
> ID 300 gen 446 top level 5 parent_uuid
> 7695cba7-dfbf-2f44-bd79-18c9820fdb2f uuid
> 552b1d51-38bf-d546-a47f-4bc667ec4128 path dest/.2
> ~#
> 
> Regards,
> 
> -- 
> Antoine



Re: difference between -c and -p for send-receive?

2017-09-19 Thread Andrei Borzenkov
19.09.2017 03:41, Dave пишет:
> new subject for new question
> 
> On Mon, Sep 18, 2017 at 1:37 PM, Andrei Borzenkov <arvidj...@gmail.com> wrote:
> 
>>>> What scenarios can lead to "ERROR: parent determination failed"?
>>>
>>> The man page for btrfs-send is reasonably clear on the requirements
>>> btrfs imposes. If you want to use incremental sends (i.e. the -c or -p
>>> options) then the specified snapshots must exist on both the source and
>>> destination. If you don't have a suitable existing snapshot then don't
>>> use -c or -p and just do a full send.
>>>
>>
>> Well, I do not immediately see why -c must imply incremental send. We
>> want to reduce amount of data that is transferred, so reuse data from
>> existing snapshots, but it is really orthogonal to whether we send full
>> subvolume or just changes since another snapshot.
>>
> 
> Starting months ago when I began using btrfs serious, I have been
> reading, rereading and trying to understand this:
> 
> FAQ - btrfs Wiki
> https://btrfs.wiki.kernel.org/index.php/FAQ#What_is_the_difference_between_-c_and_-p_in_send.3F
> 

This wiki entry is wrong (and as far as I can trust git, it has
always been wrong).

First, "btrfs send -c" does not start with a blank subvolume; it starts
with a "best parent" which is determined automatically. Actually, if you
look at the help output in the very first version of the send command:

"By default, this will send the whole subvolume. To do",
"an incremental send, one or multiple '-i '",
"arguments have to be specified. A 'clone source' is",
"a subvolume that is known to exist on the receiving",
"side in exactly the same state as on the sending side.\n",
"Normally, a good snapshot parent is searched automatically",
"in the list of 'clone sources'. To override this, use",
"'-p ' to manually specify a snapshot parent.",

it explains far better what -c and -p do (ignore -i, this is an error that
was fixed later; it means -c).

Second, the example in the wiki simply does not work. All snapshots listed in
the -c options and the snapshot that we want to transfer must have the same
parent uuid, unless -p is explicitly provided. The example shows snapshots of
two different subvolumes. I could not make it work even if A and B
themselves are cloned from a common subvolume.



Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-19 Thread Andrei Borzenkov
18.09.2017 09:10, Dave пишет:
> I use snap-sync to create and send snapshots.
> 
> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external 
> drive
> https://github.com/wesbarnett/snap-sync
> 

Are you trying to back up the top-level subvolume? I just reproduced this
behavior with this tool. The problem is, snapshots of the top-level
subvolume do not have a parent UUID (I am not even sure a UUID exists at
all, TBH). If you mount any other subvolume, it will work. On openSUSE the
root is always mounted as a subvolume (actually, the very first snapshot),
which explains why I did not see it before.

I.e.

mkfs -t btrfs /dev/sdb1
mount /dev/sdb1 /test
snapper -c test create-config /test

attempt to "snap-sync -c test" will fail second time. But

btrfs sub create /test/@
umount /test
mount -o subvol=@ /dev/sdb1 /test
snapper -c test create-config /test
...

will work.

As I told you in the first reply, showing the output of "btrfs su li -qu
/path/to/src" would have explained your problem much earlier.

Actually, if snap-sync used "btrfs send -p" instead of "btrfs send -c" it
would work as well, as then no parent search would be needed (and, as I
mentioned in another mail, both commands are functionally equivalent).
But this becomes really off-topic on this list. As already suggested,
open an issue for snap-sync.


Re: Storage and snapshots as historical yearly

2017-09-19 Thread Andrei Borzenkov
19.09.2017 14:49, Senén Vidal Blanco пишет:
> Perfect!! Just what I was looking for.
> Sorry for the delay, because before doing so, I preferred to test to see if 
> it 
> actually worked.
> 
> I have a doubt. The system works perfectly, but at the time of deleting the 
> writing disk and merging the data on the read-only disk I fail to understand 
> the process.
> 
> I have tried to remove the seed bit on disk A and delete the write B as you 
> mention, and so move the data to A, but tells me that disk B does not exist.
> These are the orders I have made:
> 
> md127-> A
> md126-> B
> 
> btrfstune -S 0 /dev /md127
> mount /dev/md127 /mnt (I mount this disk since the md126 gives error)
> btrfs device delete /dev/md126 /mnt
> ERROR: error removing device '/dev/md126': No such file or directory
> 
> Another thing I've tried is to remove disk B without removing the seed bit, 
> but it gives me the error:
> 
> ERROR: error removing device '/dev/md126': unable to remove the only 
> writeable 
> device.
> 
> Any ideas about it?

Yes, sorry about that. Clearing the seed flag on a device invalidates the
filesystem. What you can do is rotate the devices, i.e. remove
/dev/md126, set the seed flag on md127 and add md126 back.

I actually tested it and it works for me.

> Thank you very much for the reply.
> Greetings.
> 
> El martes, 12 de septiembre de 2017 6:34:15 (CEST) Andrei Borzenkov escribió:
>> 11.09.2017 21:17, Senén Vidal Blanco пишет:
>>> I am trying to implement a system that stores the data in a unit (A) with
>>> BTRFS format that is untouchable and that future files and folders created
>>> or modified are stored in another physical unit (B) with BTRFS format.
>>> Each year the new files will be moved to store A and start over.
>>>
>>> The idea is that a duplicate of disk A can be made to keep it in a safe
>>> place and that the files stored there can not be modified until the
>>> mixture of (A) and (B) is made.
>>
>> This can probably be achieved using seed device. Mark original device as
>> seed and all changes will go to another writable device, similar to
>> overlay; then remove seed bit from original device, "btrfs device remove
>> writable" device and it should relocate its content back. Rinse and repeat.
> 






Re: difference between -c and -p for send-receive?

2017-09-19 Thread Andrei Borzenkov
On Tue, Sep 19, 2017 at 1:24 PM, Graham Cobb  wrote:
> On 19/09/17 01:41, Dave wrote:
>> Would it be correct to say the following?
>
> Like Duncan, I am just a user, and I haven't checked the code. I
> recommend Duncan's explanation, but in case you are looking for
> something simpler, how about thinking with the following analogy...
>
> Think of -p as like doing an incremental backup: it tells send to just
> send the instructions for the changes to get from the "parent" subvolume
> to the current subvolume. Without -p it is like a full backup:
> everything in the current subvolume is sent.
>
> -c is different:

It is not really different - it is extra. You have -p, and optionally
-c, which modifies its behavior.

> it says "and by the way, these files also already exist
> on the destination so they might be useful to skip actually sending some
> of the file contents". Imagine that whenever a file content is about to
> be sent (whether incremental or full), btrfs-send checks to see if the
> data is in one of the -c subvolumes and, if it is, it sends "get the
> data by reflinking to this file over here" instead of sending the data
> itself. -c is really just an optimisation to save sending data if you
> know the data is already available somewhere else on the destination.
>
> Be aware that this is really just an analogy (like "hard linking" is an
> analogy for reflinking using the clone range ioctl). Duncan's email
> provides more real details.
>
> In particular, this analogy doesn't explain the original questioner's
> problem. In the analogy, -c might work without the files actually being
> present on the source (as long as they are on the destination). But, in
> reality, because the underlying mechanism is extent range cloning, the
> files have to be present on **both** the source and the destination in
> order for btrfs-send to work out what commands to send.
>

Yes. The decision whether to send full data or a reflink is taken on the
source, so the data must be present on the source.

> By the way, like Duncan, I was surprised that the man page suggests that
> -c without -p causes one of the clones to be treated as a parent. I have
> not checked the code to see if that is actually how it works.
>

It is. As implemented, -c *requires* a parent snapshot, either given
explicitly via the -p option or determined implicitly. What it does:

a) checks that the snapshot to transfer and all snapshots given as
arguments to -c have the same parent uuid;
b) selects the "best match" by comparing how close the snapshots from the -c
options are to the parent. As far as I can tell it chooses the oldest
snapshot (with the minimal difference to the parent) as the base (the
implicit -p).

Which implies that "btrfs send -c foo bar" is entirely equivalent to
"btrfs send -p foo bar".

Which still does not explain why the script fails. As mentioned,
snapshots created by snapper should have the same parent uuid, which
leaves only the possibility of a non-existent subvolume, but then the script
should have failed much earlier.


Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-18 Thread Andrei Borzenkov
18.09.2017 11:45, Graham Cobb пишет:
> On 18/09/17 07:10, Dave wrote:
>> For my understanding, what are the restrictions on deleting snapshots?
>>
>> What scenarios can lead to "ERROR: parent determination failed"?
> 
> The man page for btrfs-send is reasonably clear on the requirements
> btrfs imposes. If you want to use incremental sends (i.e. the -c or -p
> options) then the specified snapshots must exist on both the source and
> destination. If you don't have a suitable existing snapshot then don't
> use -c or -p and just do a full send.
> 

Well, I do not immediately see why -c must imply an incremental send. We
want to reduce the amount of data that is transferred, so we reuse data from
existing snapshots, but that is really orthogonal to whether we send the full
subvolume or just the changes since another snapshot.

>> I use snap-sync to create and send snapshots.
>>
>> GitHub - wesbarnett/snap-sync: Use snapper snapshots to backup to external 
>> drive
>> https://github.com/wesbarnett/snap-sync
> 
> I am not familiar with this tool. Your question should be sent to the
> author of the tool, if that is what is deciding what -p and -c options
> are being used.
> 

I am not sure how it could come to this error. I looked at a more or less
default installation of openSUSE here, and all snapper snapshots have as
parent UUID the subvolume that is mounted as root (by default only one
configuration, for the root subvolume, exists). So it is not possible to
remove this subvolume, unless a rollback to another snapshot was
performed.


Re: how to run balance successfully (No space left on device)?

2017-09-18 Thread Andrei Borzenkov
On Mon, Sep 18, 2017 at 11:20 AM, Tomasz Chmielewski  wrote:
>>> # df -h /var/lib/lxd
>>>
>>> FWIW, standard (aka util-linux) df is effectively useless in a situation
>>> such as this, as it really doesn't give you the information you need (it
>>> can say you have lots of space available, but if btrfs has all of it
>>> allocated into chunks, even if the chunks have space in them still, there
>>> can be problems).
>
>
> I see here on RAID-1, "df -h" it shows pretty much the same amount of free
> space as "btrfs fi show":
>
> - "df -h" shows 105G free
> - "btrfs fi show" says: Free (estimated):104.28GiB  (min:
> 104.28GiB)
>

I think both use the same algorithm to compute free space (df in the
end just shows what the kernel returns). The problem is that this
algorithm itself is just an approximation in the general case. For a uniform
RAID1 profile it should be correct, though.


Re: ERROR: parent determination failed (btrfs send-receive)

2017-09-17 Thread Andrei Borzenkov
18.09.2017 05:31, Dave пишет:
> Sometimes when using btrfs send-receive, I get errors like this:
> 
> ERROR: parent determination failed for 
> 
> When this happens, btrfs send-receive backups fail. And all subsequent
> backups fail too.
> 
> The issue seems to stem from the fact that an automated cleanup
> process removes certain earlier subvolumes. (I'm using Snapper.)
> 
> I'd like to understand exactly what is happening so that my backups do
> not unexpectedly fail.
> 

You are trying to send incremental changes, but you deleted the subvolume to
compute the changes against. It is hard to tell more without seeing the
subvolume list with uuid/parent uuid.
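
For illustration, the kind of output that would help here (paths are
assumptions):

btrfs subvolume list -qu /mnt/src                     # uuid and parent uuid of every subvolume
btrfs subvolume show /mnt/src/.snapshots/42/snapshot  # UUID, Parent UUID and Received UUID of one snapshot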

> In my scenario, no parent subvolumes have been deleted from the
> target. Some subvolumes have been deleted from the source, but why
> does that matter? I am able to take a valid snapshot at this time and
> every snapshot ever taken continues to reside at the target backup
> destination (seemingly meaning that a parent subvolume can be found at
> the target).
> 
> This issue seems to make btrfs send-receive a very fragile backup
> solution. 

btrfs send/receive is not a backup solution - it is a low-level tool that
does exactly what it is told to do. You may create a backup solution that
uses btrfs send/receive to transfer the data stream, but then do not
blame the tool for incorrect usage.

To give better advice on how to fix your situation you need to describe
your backup solution - how exactly you select/create snapshots.

> I hope, instead, there is some knowledge I'm missing, that
> when learned, will make this a robust backup solution.
> 
> Thanks
