Re: moving disks to new case

2021-04-15 Thread Hugo Mills
On Thu, Apr 15, 2021 at 02:36:37PM -0500, Charles Zeitler wrote:
> disks are raid5, i don't know which are /dev/sdb /dev/sdc etc.
> is this going to be an issue?

   Nope.

   Hugo.










   Oh, all right, I'll explain...

   The superblock on each device contains the UUID of the filesystem
and a device ID (you can see both of these in the output of btrfs fi
show). When btrfs dev scan is run (for example, by udev as it
enumerates the devices), it attempts to read the superblock on each
device. Any superblocks that are found are read, and the UUID and
devid of that device node are passed to the kernel. The kernel holds a
lookup table of information for every device known of every
filesystem, and uses that table to work out which device nodes it
needs to use for any given FS.
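
   The upshot is that the device names don't matter: after the move,
just rescan and mount by any member device (or by the filesystem
UUID). A minimal sketch, with illustrative device names:

   # let the kernel (re)discover every btrfs member device
   btrfs device scan
   # mount via any one member; the kernel assembles the rest
   mount /dev/sdc /mnt
   # or mount by the filesystem UUID, which survives renames
   mount UUID=<fs uuid> /mnt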

   Hugo.

-- 
Hugo Mills | Guards! Help! We're being rescued!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   The Stainless Steel Rat Forever


Re: nfs subvolume access?

2021-03-10 Thread Hugo Mills
On Wed, Mar 10, 2021 at 08:46:20AM +0100, Ulli Horlacher wrote:
> When I try to access a btrfs filesystem via nfs, I get the error:
> 
> root@tsmsrvi:~# mount tsmsrvj:/data/fex /nfs/tsmsrvj/fex
> root@tsmsrvi:~# time find /nfs/tsmsrvj/fex | wc -l
> find: File system loop detected; '/nfs/tsmsrvj/fex/spool' is part of the same 
> file system loop as '/nfs/tsmsrvj/fex'.
> 1
> root@tsmsrvi:~# 
> 
> 
> 
> On tsmsrvj I have in /etc/exports:
> 
> /data/fex   tsmsrvi(rw,async,no_subtree_check,no_root_squash)
> 
> This is a btrfs subvolume with snapshots:
> 
> root@tsmsrvj:~# btrfs subvolume list /data
> ID 257 gen 35 top level 5 path fex
> ID 270 gen 36 top level 257 path fex/spool
> ID 271 gen 21 top level 270 path fex/spool/.snapshot/2021-03-07_1453.test
> ID 272 gen 23 top level 270 path fex/spool/.snapshot/2021-03-07_1531.test
> ID 273 gen 25 top level 270 path fex/spool/.snapshot/2021-03-07_1532.test
> ID 274 gen 27 top level 270 path fex/spool/.snapshot/2021-03-07_1718.test
> 
> root@tsmsrvj:~# find /data/fex | wc -l
> 489887
> root@tsmsrvj:~# 
> 
> What must I add to /etc/exports to enable subvolume access for the nfs
> client?
> 
> tsmsrvi and tsmsrvj (nfs client and server) both run Ubuntu 20.04 with
> btrfs-progs v5.4.1 

   I can't remember if this is why, but I've had to put a distinct
fsid field in each separate subvolume being exported:

/srv/nfs/home -rw,async,fsid=0x1730,no_subtree_check,no_root_squash

   It doesn't matter what value you use, as long as each one's
different.
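
   Applied to the layout in your question, that would be one entry per
exported subvolume, each with its own fsid (the values here are
arbitrary, just distinct):

/data/fex   tsmsrvi(rw,async,fsid=0x101,no_subtree_check,no_root_squash)
/data/fex/spool tsmsrvi(rw,async,fsid=0x102,no_subtree_check,no_root_squash)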

   Hugo.

-- 
Hugo Mills | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


Re: Adding Device Fails - Why?

2021-03-01 Thread Hugo Mills
On Mon, Mar 01, 2021 at 12:19:12PM +0100, Christian Völker wrote:
> I am using BTRFS on a Debian10 system. I am trying to extend my existing
> filesystem with another device but adding it fails for no reason.
> 
> This is my setup of existing btrfs:
> 
>  2x DRBD Devices (Network RAID1)
>  top of each a luks encrypted device (crypt_drbd1 and crypt_drbd3):
> 
> vdb 254:16   0  1,1T  0 disk
> └─drbd1 147:1    0  1,1T  0 disk
>   └─crypt_drbd1 253:3    0  1,1T  0 crypt
> vdc 254:32   0  900G  0 disk
> └─drbd2 147:2    0  900G  0 disk
>   └─crypt2  253:4    0  900G  0 crypt
> vdd 254:48   0  800G  0 disk
> └─drbd3 147:3    0  800G  0 disk
>   └─crypt_drbd3 253:5    0  800G  0 crypt /var/lib/backuppc
> 
> 
> 
> I have now a third drbd device (drbd2) which I encrypted, too (crypt2). And
> tried to add to existing fi.
> Here further system information:
> 
> Linux backuppc41 5.10.0-3-amd64 #1 SMP Debian 5.10.13-1 (2021-02-06) x86_64
> GNU/Linux
> btrfs-progs v5.10.1
> 
> root@backuppc41:~# btrfs fi sh
> Label: 'backcuppc'  uuid: 73b98c7b-832a-437a-a15b-6cb00734e5db
>     Total devices 2 FS bytes used 1.83TiB
>     devid    3 size 799.96GiB used 789.96GiB path dm-5
>     devid    4 size 1.07TiB used 1.06TiB path dm-3
> 
> 
> I can create an additional btrfs filesystem with mkfs.btrfs on the new
> device without any issues:
> 
> root@backuppc41:~# btrfs fi sh
> Label: 'backcuppc'  uuid: 73b98c7b-832a-437a-a15b-6cb00734e5db
>     Total devices 2 FS bytes used 1.83TiB
>     devid    3 size 799.96GiB used 789.96GiB path dm-5
>     devid    4 size 1.07TiB used 1.06TiB path dm-3
> 
> Label: none  uuid: b111a08e-2969-457a-b9f1-551ff65451d1
>     Total devices 1 FS bytes used 128.00KiB
>     devid    1 size 899.96GiB used 2.02GiB path /dev/mapper/crypt2
> 
> But I can not add this device to the existing btrfs fi:
> root@backuppc41:~# wipefs /dev/mapper/crypt2 -a
> /dev/mapper/crypt2: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42
> 48 52 66 53 5f 4d
> 
> root@backuppc41:~# btrfs device add /dev/mapper/crypt2 /var/lib/backuppc/
> ERROR: error adding device 'dm-4': No such file or directory
> 
> This is what I see in dmesg:
> [43827.535383] BTRFS info (device dm-5): disk added /dev/drbd2
> [43868.910994] BTRFS info (device dm-5): device deleted: /dev/drbd2
> [48125.323995] BTRFS: device fsid 2b4b631c-b500-4f8d-909c-e88b012eba1e devid
> 1 transid 5 /dev/mapper/crypt2 scanned by mkfs.btrfs (4937)
> [57799.499249] BTRFS: device fsid b111a08e-2969-457a-b9f1-551ff65451d1 devid
> 1 transid 5 /dev/mapper/crypt2 scanned by mkfs.btrfs (5178)

   We had someone on IRC a couple of days ago with exactly the same
kind of problem. I don't think I have a record of the solution in my
IRC logs, though, and I don't think we got to the bottom of it. From
memory, a reboot helped.
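
   If you want to avoid a reboot, one thing that may be worth trying
first (an assumption on my part -- it needs kernel 5.4+ and a
reasonably recent btrfs-progs, which you appear to have) is to drop
the stale device registration from the kernel and rescan:

   # forget stale (unmounted) registrations for this device, then rescan
   btrfs device scan --forget /dev/mapper/crypt2
   btrfs device scan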

   Hugo.

> And these are the mapping in dm:
> 
> root@backuppc41:~# ll /dev/mapper/
> insgesamt 0
> lrwxrwxrwx 1 root root   7 28. Feb 21:08 backuppc41--vg-root -> ../dm-1
> lrwxrwxrwx 1 root root   7 28. Feb 21:08 backuppc41--vg-swap_1 ->
> ../dm-2
> crw--- 1 root root 10, 236 28. Feb 21:08 control
> lrwxrwxrwx 1 root root   7  1. Mär 12:12 crypt2 -> ../dm-4
> lrwxrwxrwx 1 root root   7 28. Feb 20:21 crypt_drbd1 -> ../dm-3
> lrwxrwxrwx 1 root root   7 28. Feb 20:21 crypt_drbd3 -> ../dm-5
> lrwxrwxrwx 1 root root   7 28. Feb 21:08 vda5_crypt -> ../dm-0
> 
> 
> Anyone having an idea why I can not add the device to the existing
> filesystem? The error message is not really helpful...
> 
> Thanks a lot!
> 
> /KNEBB

-- 
Hugo Mills | Releasing out of hours. A Haiku:
hugo@... carfax.org.uk | Simply merge PR
http://carfax.org.uk/  | It is wrong. Buildkite, cancel!
PGP: E2AB1DE4  | gitops now corrupt


Re: BTRFS error (device dm-0): block=711870922752 write time tree block corruption detected

2021-02-18 Thread Hugo Mills
On Thu, Feb 18, 2021 at 08:46:02PM +, Samir Benmendil wrote:
> On Feb 17, 2021 at 16:56, Samir Benmendil wrote:
> > On 17 February 2021 13:45:02 GMT+00:00, Hugo Mills 
> > wrote:
> > > On Wed, Feb 17, 2021 at 01:26:40PM +, Samir Benmendil wrote:
> > > > Any advice on what to do next would be appreciated.
> > > 
> > >   The first thing to do is run memtest for a while (I'd usually
> > > recommend at least overnight) to identify your broken RAM module and
> > > replace it. Don't try using the machine normally until you've done
> > > that.
> > 
> > Memtest just finished its first pass with no errors, but printed a note
> > regarding vulnerability to high freq row hammer bit flips.
> > 
> > I'll keep it running for a while longer.
> 
> 2nd pass flagged a few errors, removed one of the RAM module, tested again
> and it passed. I then booted and ran `btrfs check --readonly` with no
> errors.
> 
> [root@hactar ~]# btrfs check --readonly /dev/mapper/home_ramsi
> Opening filesystem to check...
> Checking filesystem on /dev/mapper/home_ramsi
> UUID: 1e0fea36-a9c9-4634-ba82-1afc3fe711ea
> [1/7] checking root items
> [2/7] checking extents
> [3/7] checking free space cache
> [4/7] checking fs roots
> [5/7] checking only csums items (without verifying data)
> [6/7] checking root refs
> [7/7] checking quota groups skipped (not enabled on this FS)
> found 602514441102 bytes used, no error found
> total csum bytes: 513203560
> total tree bytes: 63535939584
> total fs tree bytes: 58347077632
> total extent tree bytes: 4500455424
>     btree space waste bytes: 15290027113
> file data blocks allocated: 25262661455872
> referenced 4022677716992
> 
> 
> Thanks again for your help Hugo.

   Great news.

   Hugo.


-- 
Hugo Mills | The early bird gets the worm, but the second mouse
hugo@... carfax.org.uk | gets the cheese.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: BTRFS error (device dm-0): block=711870922752 write time tree block corruption detected

2021-02-17 Thread Hugo Mills
On Wed, Feb 17, 2021 at 01:26:40PM +, Samir Benmendil wrote:
> Hello list,
> 
> I've just had my btrfs volume remounted read-only, the logs read as such:
> 
>BTRFS critical (device dm-0): corrupt leaf: root=2 block=711870922752 
> slot=275, bad key order, prev (693626798080 182 702129324032) current 
> (693626798080 182 701861986304)
>BTRFS info (device dm-0): leaf 711870922752 gen 610518 total ptrs 509 free 
> space 276 owner 2
>BTRFS error (device dm-0): block=711870922752 write time tree
> block corruption detected
>BTRFS: error (device dm-0) in btrfs_commit_transaction:2376: errno=-5 IO 
> failure (Error while writing out transaction)
>BTRFS info (device dm-0): forced readonly
>BTRFS warning (device dm-0): Skipping commit of aborted transaction.
>BTRFS: error (device dm-0) in cleanup_transaction:1941: errno=-5 IO failure
> 
> It seems this coincided with a scheduled snapshot creation on that drive.
> 
> Any advice on what to do next would be appreciated.

   The first thing to do is run memtest for a while (I'd usually
recommend at least overnight) to identify your broken RAM module and
replace it. Don't try using the machine normally until you've done
that.

   This looks like a single-bit error (a 1 bit changing to a 0 bit in
this case):

>>> hex(702129324032)
'0xa37a2b4000'
>>> hex(701861986304)
'0xa36a3c0000'

   Note the 7 changing to a 6 -- these values should be increasing in
value, not decreasing.

   The write-time checker is doing exactly what it should be doing,
and preventing obviously-broken metadata reaching your filesystem.
Once you've fixed the hardware (and only then), I'd recommend running
a btrfs check --readonly just to make sure that there aren't any
obvious errors that made it through to disk.

   Hugo.

> Thanks in advance.
> 
> $ uname -a
> Linux hactar 5.10.15-arch1-1 #1 SMP PREEMPT Wed, 10 Feb 2021 18:32:40 + 
> x86_64 GNU/Linux
> $ btrfs version
> btrfs-progs v5.10.1

> -- Journal begins at Sat 2020-03-28 20:39:35 GMT, ends at Wed 2021-02-17 
> 13:15:01 GMT. --
> Feb 17 11:00:34 hactar kernel: BTRFS critical (device dm-0): corrupt leaf: 
> root=2 block=711870922752 slot=275, bad key order, prev (693626798080 182 
> 702129324032) current (693626798080 182 701861986304)
> Feb 17 11:00:34 hactar kernel: BTRFS info (device dm-0): leaf 711870922752 
> gen 610518 total ptrs 509 free space 276 owner 2
> Feb 17 11:00:34 hactar kernel: item 0 key (693626765312 169 0) 
> itemoff 15458 itemsize 825
> Feb 17 11:00:34 hactar kernel: extent refs 89 gen 592321 
> flags 258
> Feb 17 11:00:34 hactar kernel: ref#0: shared block backref 
> parent 693646163968
> Feb 17 11:00:34 hactar kernel: ref#1: shared block backref 
> parent 693628502016
> Feb 17 11:00:34 hactar kernel: ref#2: shared block backref 
> parent 693614460928
> Feb 17 11:00:34 hactar kernel: ref#3: shared block backref 
> parent 693603991552
> Feb 17 11:00:34 hactar kernel: ref#4: shared block backref 
> parent 693527379968
> Feb 17 11:00:34 hactar kernel: ref#5: shared block backref 
> parent 693490483200
> Feb 17 11:00:34 hactar kernel: ref#6: shared block backref 
> parent 693444968448
> Feb 17 11:00:34 hactar kernel: ref#7: shared block backref 
> parent 693442478080
> Feb 17 11:00:34 hactar kernel: ref#8: shared block backref 
> parent 693438906368
> Feb 17 11:00:34 hactar kernel: ref#9: shared block backref 
> parent 693433057280
> Feb 17 11:00:34 hactar kernel: ref#10: shared block backref 
> parent 693408710656
> Feb 17 11:00:34 hactar kernel: ref#11: shared block backref 
> parent 693387526144
> Feb 17 11:00:34 hactar kernel: ref#12: shared block backref 
> parent 693350809600
> Feb 17 11:00:34 hactar kernel: ref#13: shared block backref 
> parent 693310963712
> Feb 17 11:00:34 hactar kernel: ref#14: shared block backref 
> parent 693304655872
> Feb 17 11:00:34 hactar kernel: ref#15: shared block backref 
> parent 693283717120
> Feb 17 11:00:34 hactar kernel: ref#16: shared block backref 
> parent 693280112640
> Feb 17 11:00:34 hactar kernel: ref#17: shared block backref 
> parent 693241184256
> Feb 17 11:00:34 hactar kernel: ref#18: shared block backref 
> parent 693224652800
> Feb 17 11:00:34 hactar kernel: ref#19: shared block backref 
> parent 693221130240
> Feb 17 11:00:34 hactar kernel: ref#20: shared block backref 
> parent 693214494720
> Feb 17 11:00:34 hactar kernel: ref#21: shared block backref 
> parent 693201338368
> Feb 17 11:00:34 hactar kernel: ref#22: shared block backref 
> parent 693192916992
> Feb 17 11:00:34 hactar kernel: ref#23: shared block backref 
> paren

Re: is back and forth incremental send/receive supported/stable?

2021-02-01 Thread Hugo Mills
On Mon, Feb 01, 2021 at 11:51:06PM +0100, Christoph Anton Mitterer wrote:
> On Mon, 2021-02-01 at 10:46 +0000, Hugo Mills wrote:
> >    It'll fail *obviously*. I'm not sure how graceful it is. :)
> 
> Okay that doesn't sound like it was very trustworthy... :-/
> 
> Especially this from the manpage:
>You must not specify clone sources unless you guarantee that these
>snapshots are exactly in the same state on both sides—both for the
>sender and the receiver.
> 
> I mean what should the user ever be able to guarantee... respectively
> what's meant with above?
> 
> If the tools or any option combination thereof would allow one to
> create corrupted send/received snapshots, then there's not much a user
> can do.
> If this sentence just means that the user mustn't have manually hacked
> some UUIDs or so... well then I guess that's anyway clear and the
> sentence is just confusing.

   It means that (a) the snapshots should exist, and (b) you shouldn't
use the tools to make any of them read-write, make modifications, and
make them read-only again. (and (c), as you say, don't modify the
UUIDs).
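
   A quick way to check that the snapshots really are in the same
state on both sides is to compare the UUIDs (paths illustrative):

   # on the sender
   btrfs subvolume show /snaps/snap-2021-01 | grep 'UUID:'
   # on the receiver: "Received UUID" should equal the sender's UUID
   btrfs subvolume show /backup/snap-2021-01 | grep 'Received UUID:'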

   Hugo.

> > but I guess it's not a priority for the devs
> 
> Since it seems to be a valuable feature with probably little chances to
> get it working in the foreseeable future, I've added it as a feature
> request to the long term records ;-)
> https://bugzilla.kernel.org/show_bug.cgi?id=211521
> 
> 
> 
> Cheers,
> Chris.
> 

-- 
Hugo Mills |
hugo@... carfax.org.uk | __(_'>
http://carfax.org.uk/  | Squeak!
PGP: E2AB1DE4  |


Re: is back and forth incremental send/receive supported/stable?

2021-02-01 Thread Hugo Mills
On Sun, Jan 31, 2021 at 11:50:22PM +0100, Christoph Anton Mitterer wrote:
> Hey Hugo.
> 
> 
> Thanks for your explanation.
> I assume such a swapped send/receive would fail at least gracefully?

   It'll fail *obviously*. I'm not sure how graceful it is. :)

> On Fri, 2021-01-29 at 19:20 +, Hugo Mills wrote:
> >    In your scenario with MASTER and COPY-1 swapped, you'd have to
> > match the received_uuid from the sending side (on old COPY-1) to the
> > actual UUID on old MASTER. The code doesn't do this, so you'd have to
> > patch send/receive to do this.
> 
> Well from the mailing list thread you've referenced it seems that the
> whole thing is rather quite non-trivial... so I guess it's nothing for
> someone who has basically no insight into btrfs code ^^
> 
> It's a pity though, that this doesn't work. Especially the use case of
> sending back (backup)snapshots would seem pretty useful.
> 
> Given that this thread is nearly 6 years old, I'd guess the whole idea has
> been abandoned upstream?!

   It can be made to work, in a number of different ways -- the option
above is one way; another would be to add extra history of subvolume
identities -- but I guess it's not a priority for the devs, and at
least the latter approach would require extending the on-disk FS
format. Both approaches would need changes to the send stream format.

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 7:
hugo@... carfax.org.uk | The Simple Truth
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


Re: is back and forth incremental send/receive supported/stable?

2021-01-29 Thread Hugo Mills
On Fri, Jan 29, 2021 at 08:09:49PM +0100, Christoph Anton Mitterer wrote:
> I regularly do the following with btrfs, which seems to work pretty
> stable since years:
> - having n+1 filesystems MASTER and COPY_n
> - creating snapshots on MASTER, e.g. one each month
> - incremental send/receive the new snapshot from MASTER to each of
>   COPY_n (which already have the previous snapshot)
> 
> 
> so for example:
> - MASTER has
>   - snapshot-2020-11/
>   - snapshot-2020-12/
>   and newly gets
>   - snapshot-2021-01/
> - each of COPY_n has only
>   - snapshot-2020-11/
>   - snapshot-2020-12(
> - with:
>   # btrfs send -p MASTER/snapshot-2020-12 MASTER/snapshot-2021-01  |  btrfs 
> receive COPY_n/
>   I incrementally send the new snapshot from MASTER to each of COPY_n
>   using the already available previous snapshot as parent.
> 
> Works(TM)
> 
> 
> 
> Now I basically want to swap a MASTER with a COPY_n (e.g. because
> MASTER's HDD has started to age).
> 
> So the plan is e.g.:
> - COPY_1 becomes NEW_MASTER
> - MASTER becomes OLD_MASTER later known NEW_COPY_1
> 
> a) Can I then start e.g. in February to incrementally send/receive from
> NEW_MASTER back(!!) to OLD_MASTER?

   No.

   When you make an incremental send, you give it a reference
subvolume with -p. This subvol's UUID is sent in the send stream to
the remote side for receive.

   When receive gets told about a reference subvolume in this way, it
looks for the reference and snapshots it (RW) to use as the base to
apply the incremental on top of.

   The way it finds the reference subvol is to look for a subvol with
the "received_uuid" field matching. This field is set by the receiving
process that made it in the first place (as the result of an earlier
send).

   In your scenario with MASTER and COPY-1 swapped, you'd have to
match the received_uuid from the sending side (on old COPY-1) to the
actual UUID on old MASTER. The code doesn't do this, so you'd have to
patch send/receive to do this.

   Your best bet here is to do a new full send and then continue a new
incremental sequence based on that.
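
   In your naming, that re-seeding would look something like this (a
sketch, untested):

   # one-off full send to re-seed the old master
   btrfs send NEW_MASTER/snapshot-2021-02 | btrfs receive NEW_COPY_1/
   # later months go back to incremental, against that snapshot
   btrfs send -p NEW_MASTER/snapshot-2021-02 NEW_MASTER/snapshot-2021-03 \
      | btrfs receive NEW_COPY_1/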

   There's a detailed and fairly formal description of this stuff that
I wrote a few years ago here:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

> Like:
> # btrfs send -p NEW_MASTER/snapshot-2021-01 NEW_MASTER/snapshot-2021-02  |  
> btrfs receive OLD_MASTER/
> 
> b) And the same from NEW_MSTER to all the other COPY_n?
> Like:
> # btrfs send -p NEW_MASTER/snapshot-2021-01 NEW_MASTER/snapshot-2021-02  |  
> btrfs receive COPY_n
> 
> 
> So in other words, does btrfs get, that the new parent (which is no
> longer on the OLD_MASTER but the previous COPY_1, now NEW_MASTER) is
> already present (and identical and usable) on the OLD_MASTER, now
> NEW_COPY_1, and also on the other COPY_n ?
> 
> 
> By the way, I'm talking about *precious* data, so I'd like to be really
> sure that this works... and whether it's intended to work and ideally
> have been tested.
> 
> 
> Thanks,
> Chris.
> 

-- 
Hugo Mills | You shouldn't anthropomorphise computers. They
hugo@... carfax.org.uk | really don't like that.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


Re: super_total_bytes mismatch with fs_devices total_rw_bytes

2021-01-28 Thread Hugo Mills
   I'm not sure I'm confident enough to recommend a course of action
on this one, but one note from something you said:

On Thu, Jan 28, 2021 at 10:03:08PM +1100, Andrew Vaughan wrote:
[...]
> Today I did '# btrfs fi resize 4:max /srv/shared' as preparation for a
> balance to make the extra drive space available.  (The old drives are
> all fairly full.  About 130 GB free space on each.  I initially tried
> btrfs fi resize max /srv/shared as the syntax on the manpage implies
> that devid is optional.  Since that command errored, I assume it
> didn't change the filesystem).

   The devid is indeed optional, but it then assumes that you mean
device 1 (which is what it is on a single-device FS). It looks like
your FS, for historical reasons, no longer has a device 1, hence the
error. That should be completely harmless.

[...]
> # uname -a
> Linux nl40 5.10.0-2-amd64 #1 SMP Debian 5.10.9-1 (2021-01-20) x86_64 GNU/Linux
> 
> # mount -t btrfs /dev/sdd1 /mnt/sdd-tmp
> mount: /mnt/sdd-tmp: wrong fs type, bad option, bad superblock on
> /dev/sdd1, missing codepage or helper program, or other error.
> 
> # dmesg | grep -i btrfs
> [5.799637] Btrfs loaded, crc32c=crc32c-generic
> [6.428245] BTRFS: device label samba.btrfs devid 8 transid 1281994
> /dev/sdb1 scanned by btrfs (172)
> [6.428804] BTRFS: device label samba.btrfs devid 5 transid 1281994
> /dev/sdd1 scanned by btrfs (172)
> [6.429473] BTRFS: device label samba.btrfs devid 4 transid 1281994
> /dev/sde1 scanned by btrfs (172)
> [ 2004.140494] BTRFS info (device sde1): disk space caching is enabled
> [ 2004.790843] BTRFS error (device sde1): super_total_bytes
> 22004298366976 mismatch with fs_devices total_rw_bytes 22004298370048
> [ 2004.790854] BTRFS error (device sde1): failed to read chunk tree: -22
> [ 2004.805043] BTRFS error (device sde1): open_ctree failed
> 
> Note that drive identifiers have changed between reboots.  I haven't
> seen that on this system before.

   It happens sometimes. Sometimes between kernels, sometimes changed
hardware responds slightly faster than the previous device. Sometimes
devices get bumped along by having something new attached to an
earlier controller in the enumeration sequence. I've seen machines
that have had totally stable hardware for years suddenly decide to
flip enumeration order on one reboot. I wouldn't worry about it. :)

   The good news is I don't see any of the usual horribly fatal error
messages here, so it's probably fixable.

> Questions
> =
> 
> Is btrfs rescue fix-device-size  considered the best way to
> recover?  Should I run that once for each device in the filesystem?

   I'm not confident enough to answer anything more than "probably" to
both of those.
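
   If you do go down that route: the command is pointed at a single
member device of the unmounted filesystem, so (using the device name
from your dmesg output) the invocation would be something like:

   # filesystem must be unmounted
   btrfs rescue fix-device-size /dev/sde1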

> Do you want me to run any other commands to help diagnose the cause
> before attempting recovery?

   Looks like a fairly complete report to me (but see above).

   Hugo.

-- 
Hugo Mills | Be pure.
hugo@... carfax.org.uk | Be vigilant.
http://carfax.org.uk/  | Behave.
PGP: E2AB1DE4  |   Torquemada, Nemesis


Re: Cannot resize filesystem: not enough free space

2021-01-24 Thread Hugo Mills
On Sun, Jan 24, 2021 at 08:11:37PM +0100, Jakob Schöttl wrote:
> 
> Hugo Mills  writes:
> 
> > On Sun, Jan 24, 2021 at 07:23:21PM +0100, Jakob Schöttl wrote:
> > > 
> > > Help please, increasing the filesystem size doesn't work.
> > > 
> > > When mounting my btrfs filesystem, I had errors saying, "no space
> > > left
> > > on device". Now I managed to mount the filesystem with -o
> > > skip_balance but:
> > > 
> > > # btrfs fi df /mnt
> > > Data, RAID1: total=147.04GiB, used=147.02GiB
> > > System, RAID1: total=8.00MiB, used=48.00KiB
> > > Metadata, RAID1: total=1.00GiB, used=458.84MiB
> > > GlobalReserve, single: total=181.53MiB, used=0.00B
> > 
> >Can you show the output of "sudo btrfs fi show" as well?
> > 
> >Hugo.
> 
> Thanks, Hugo, for the quick response.
> 
> # btrfs fi show /mnt/
> Label: 'data'  uuid: fc991007-6ef3-4c2c-9ca7-b4d637fccafb
>Total devices 2 FS bytes used 148.43GiB
>devid1 size 232.89GiB used 149.05GiB path /dev/sda
>devid2 size 149.05GiB used 149.05GiB path /dev/sdb
> 
> Oh, now I see! Resize only worked for one sda!
> 
> # btrfs fi resize 1:max /mnt/
> # btrfs fi resize 2:max /mnt/
> # btrfs fi show /mnt/
> Label: 'data'  uuid: fc991007-6ef3-4c2c-9ca7-b4d637fccafb
>Total devices 2 FS bytes used 150.05GiB
>devid1 size 232.89GiB used 151.05GiB path /dev/sda
>devid2 size 465.76GiB used 151.05GiB path /dev/sdb
> 
> Now it works. Thank you!

   Note that the new configuration is going to waste about 232 GiB of
/dev/sdb, because you've got RAID-1, and there won't be spare space to
mirror anything onto once /dev/sda fills up.

   You can add a third device of 232 GiB (250 GB) or more to the FS
and that'll allow the use of the remaining space on /dev/sdb.
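
   As a back-of-the-envelope check (ignoring space already used): with
two copies of everything, usable space is min(total/2, total - largest),
so

   min((232.89 + 465.76)/2, 698.65 - 465.76) = min(349.32, 232.89)
      = 232.89 GiB usable

and the stranded space on /dev/sdb is 465.76 - 232.89 = 232.87 GiB,
hence the "about 232 GiB" above.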

   Hugo.

> > > It is full and resize doesn't work although both block devices sda
> > > and
> > > sdb have more 250 GB and more nominal capacity (I don't have
> > > partitions,
> > > btrfs is directly on sda and sdb):
> > > 
> > > # fdisk -l /dev/sd{a,b}*
> > > Disk /dev/sda: 232.89 GiB, 250059350016 bytes, 488397168 sectors
> > > [...]
> > > Disk /dev/sdb: 465.76 GiB, 500107862016 bytes, 976773168 sectors
> > > [...]
> > > 
> > > I tried:
> > > 
> > > # btrfs fi resize 230G /mnt
> > > runs without errors but has no effect
> > > 
> > > # btrfs fi resize max /mnt
> > > runs without errors but has no effect
> > > 
> > > # btrfs fi resize +1G /mnt
> > > ERROR: unable to resize '/mnt': no enough free space
> > > 
> > > Any ideas? Thank you!
> 
> 

-- 
Hugo Mills | Have found Lost City of Atlantis. High Priest is
hugo@... carfax.org.uk | winning at quoits.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Terry Pratchett


Re: Cannot resize filesystem: not enough free space

2021-01-24 Thread Hugo Mills
On Sun, Jan 24, 2021 at 07:23:21PM +0100, Jakob Schöttl wrote:
> 
> Help please, increasing the filesystem size doesn't work.
> 
> When mounting my btrfs filesystem, I had errors saying, "no space left
> on device". Now I managed to mount the filesystem with -o skip_balance but:
> 
> # btrfs fi df /mnt
> Data, RAID1: total=147.04GiB, used=147.02GiB
> System, RAID1: total=8.00MiB, used=48.00KiB
> Metadata, RAID1: total=1.00GiB, used=458.84MiB
> GlobalReserve, single: total=181.53MiB, used=0.00B

   Can you show the output of "sudo btrfs fi show" as well?

   Hugo.
 
> It is full and resize doesn't work although both block devices sda and
> sdb have more 250 GB and more nominal capacity (I don't have partitions,
> btrfs is directly on sda and sdb):
> 
> # fdisk -l /dev/sd{a,b}*
> Disk /dev/sda: 232.89 GiB, 250059350016 bytes, 488397168 sectors
> [...]
> Disk /dev/sdb: 465.76 GiB, 500107862016 bytes, 976773168 sectors
> [...]
> 
> I tried:
> 
> # btrfs fi resize 230G /mnt
> runs without errors but has no effect
> 
> # btrfs fi resize max /mnt
> runs without errors but has no effect
> 
> # btrfs fi resize +1G /mnt
> ERROR: unable to resize '/mnt': no enough free space
> 
> Any ideas? Thank you!

-- 
Hugo Mills | Attempted murder, now honestly, what is that? Do
hugo@... carfax.org.uk | they give a Nobel Prize for attempted chemistry?
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Sideshow Bob


Re: received uuid not set btrfs send/receive

2021-01-17 Thread Hugo Mills
On Sun, Jan 17, 2021 at 10:49:26AM -0800, Anders Halman wrote:
> Hello,
> 
> I try to backup my laptop over an unreliable slow internet connection to a
> even slower Raspberry Pi.
> 
> To bootstrap the backup I used the following:
> 
> # local
> btrfs send root.send.ro | pigz | split --verbose -d -b 1G
> rsync -aHAXxv --numeric-ids --partial --progress -e "ssh -T -o
> Compression=no -x" x* remote-host:/mnt/backup/btrfs-backup/
> 
> # remote
> cat x* > split.gz
> pigz -d split.gz
> btrfs receive -f split

> worked nicely. But I don't understand why the "received uuid" on the remote
> site in blank.

   Are you doing the receive as root?

   Hugo.

> I tried it locally with smaller volumes and it worked.
> 
> The 'split' file contains the correct uuid, but it is not set (remote).
> 
> remote$ btrfs receive --dump -f split | head
> subvol  ./root.send.ro uuid=99a34963-3506-7e4c-a82d-93e337191684
> transid=1232187
> 
> local$ sudo btrfs sub show root.send.ro| grep -i uuid:
>     UUID:             99a34963-3506-7e4c-a82d-93e337191684
> 
> 
> Questions:
> 
> - Is there a way to set the "received uuid"?
> - Is it a matter of btrfs-progs version difference?
> - What whould be a better approach?
> 
> 
> Thank you
> 
> 
> 
> 
> # local
> 
> root@fos ~$ uname -a
> Linux fos 5.9.16-200.fc33.x86_64 #1 SMP Mon Dec 21 14:08:22 UTC 2020 x86_64
> x86_64 x86_64 GNU/Linux
> 
> root@fos ~$   btrfs --version
> btrfs-progs v5.9
> 
> root@fos ~$   btrfs fi show
> Label: 'DATA'  uuid: b6e675b3-84e3-4869-b858-218c5f0ac5ad
>     Total devices 1 FS bytes used 402.17GiB
>     devid    1 size 464.27GiB used 414.06GiB path
> /dev/mapper/luks-e4e69cfa-faae-4af8-93f5-7b21b25ab4e6
> 
> root@fos ~$   btrfs fi df /btrfs-root/
> Data, single: total=404.00GiB, used=397.80GiB
> System, DUP: total=32.00MiB, used=64.00KiB
> Metadata, DUP: total=5.00GiB, used=4.38GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> # remote
> root@pih:~# uname -a
> Linux pih 5.4.72+ #1356 Thu Oct 22 13:56:00 BST 2020 armv6l GNU/Linux
> 
> root@pih:~#   btrfs --version
> btrfs-progs v4.20.1
> 
> root@pih:~#   btrfs fi show
> Label: 'DATA'  uuid: 6be1e09c-d1a5-469d-932b-a8d1c339afae
>     Total devices 1 FS bytes used 377.57GiB
>     devid    2 size 931.51GiB used 383.06GiB path
> /dev/mapper/luks_open_backup0
> 
> root@pih:~#   btrfs fi df /mnt/backup
> Data, single: total=375.00GiB, used=374.25GiB
> System, DUP: total=32.00MiB, used=64.00KiB
> Metadata, DUP: total=4.00GiB, used=3.32GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> 
> dmesg is empty for the time of import/btrfs receive.

-- 
Hugo Mills | If it ain't broke, hit it again.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Foon


Re: btrfs send / receive via netcat, fails halfway?

2021-01-10 Thread Hugo Mills
   By the way, Cedric, your SMTP server is rejecting mail from mine
with a "forged HELO" error. I'm not sure why, but I've not knowingly
encountered this with anyone else's mail server.

   Hugo.

-- 
Hugo Mills | The last man on Earth sat in a room. Suddenly, there
hugo@... carfax.org.uk | was a knock at the door.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Frederic Brown


Re: Re: Re: cloning a btrfs drive with send and receive: clone is bigger than the original?

2021-01-10 Thread Hugo Mills
On Sun, Jan 10, 2021 at 01:06:44PM +, Graham Cobb wrote:
> On 10/01/2021 07:41, cedric.dew...@eclipso.eu wrote:
> > I've tested some more.
> > 
> > Repeatedly sending the difference between two consecutive snapshots creates 
> > a structure on the target drive where all the snapshots share data. So 10 
> > snapshots of 10 files of 100MB takes up 1GB, as expected.
> > 
> > Repeatedly sending the difference between the first snapshot and each next 
> > snapshot creates a structure on the target drive where the snapshots are 
> > independent, so they don't share any data. How can that be avoided?
> 
> If you send a snapshot B with a parent A, any files not present in A
> will be created in the copy of B. The fact that you already happen to
> have a copy of the files somewhere else on the target is not known to
> either the sender or the receiver - how would it be?
> 
> If you want the send process to take into account *other* snapshots that
> have previously been sent, you need to tell send to also use those
> snapshots as clone sources. That is what the -c option is for.

   And even then, it won't spot files that are identical but which
don't share extents.

> Alternatively, use a deduper on the destination after the receive has
> finished and let it work out what can be shared.

   This is a viable approach.

   Hugo.

-- 
Hugo Mills | The last man on Earth sat in a room. Suddenly, there
hugo@... carfax.org.uk | was a knock at the door.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Frederic Brown


Re: btrfs send / receive via netcat, fails halfway?

2021-01-10 Thread Hugo Mills
On Sun, Jan 10, 2021 at 11:34:27AM +0100,   wrote:
> I'm trying to transfer a btrfs snapshot via the network.
> 
> First attempt: Both NC programs don't exit after the transfer is complete. 
> When I ctrl-C the sending side, the receiving side exits OK.
> 
> btrfs subvolume delete /mnt/rec/snapshots/*
> receive side:
> # nc -l -p 6790 | btrfs receive /mnt/rec/snapshots
> At subvol 0
> 
> sending side:
> # btrfs send  /mnt/send/snapshots/0 | nc -v 127.0.0.1 6790
> At subvol /mnt/send/snapshots/0
> localhost [127.0.0.1] 6790 (hnmp) open

   Use -q 15 on the sending side. That makes nc quit 15 seconds after
it sees EOF on stdin, i.e. once the send process has finished.
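
   The whole pipeline with that flag would look like this (assuming
netcat-traditional; other nc variants spell the option differently):

   # receive side
   nc -l -p 6790 | btrfs receive /mnt/rec/snapshots
   # sending side
   btrfs send /mnt/send/snapshots/0 | nc -q 15 127.0.0.1 6790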

> Second attempt: both nc programs exit ok at snapshot 0,1,2, but snapshot3 
> fails halfway, and 4 fails, as 3 is not complete. 
> receive side:
> # nc -l -p 6790 | btrfs receive /mnt/rec/snapshots
> At subvol 0
> # nc -l -p 6790 | btrfs receive /mnt/rec/snapshots
> At snapshot 1
> # nc -l -p 6790 | btrfs receive /mnt/rec/snapshots
> At snapshot 2
> # nc -l -p 6790 | btrfs receive /mnt/rec/snapshots
> At snapshot 3
> read(net): Connection reset by peer
> ERROR: short read from stream: expected 49183 read 10450

   This failed because of a network disconnect.

> # nc -l -p 6790 | btrfs receive /mnt/rec/snapshots
> At snapshot 4
> ERROR: cannot find parent subvolume
> write(stdout): Broken pipe

   This is expected because the previous one failed.

   There's no btrfs problem here. You just need better error handling
(to retry a failed transfer, for example).

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison


Re: Improve balance command

2021-01-08 Thread Hugo Mills
On Fri, Jan 08, 2021 at 02:30:52PM +, Claudius Ellsel wrote:
> Hello,
> 
> currently I am slowly adding drives to my filesystem (RAID1). This process is 
> incremental, since I am copying files off them to the btrfs filesystem and 
> then adding the free drive to it afterwards. Since RAID1 needs double the 
> space, I added an empty 12TB drive and also had a head start with an empty 
> 1TB and 4TB drive. With that I can go ahead and copy a 4TB drive, then add it 
> to the filesystem until I have three 4TB and one 12TB drives (the 1TB drive 
> will get replaced in the process).
> While I was doing this (still in the process), I have used the `balance` 
> command after adding a drive as described in the Wiki. Unfortunately I now 
> learned that this will, at least by default, rewrite all data and not only the 
> relevant chunks that need to be rewritten to reach a balanced drive. In turn 
> that leads to pretty long process times, and I also don't like that the 
> drives are stressed unnecessarily.
> 
> So now I have the question whether there are better methods to do rebalancing 
> (like some filters?) or whether it is even needed every time. I also created 
> a bug report to suggest improvement of the rebalancing option if you are 
> interested: https://bugzilla.kernel.org/show_bug.cgi?id=211091.
> 
> On a slightly different topic: I was wondering what would happen if I just 
> copied stuff over without adding new drives. The 1TB and 4TB drives would 
> then be full while the 12TB one still had space.

   The algorithm puts new data chunks on the devices with the most
space free. In this case, each data chunk needs two devices.

   With a 12TB, 4TB and 1TB, you'll be able to get 5TB of data on a
RAID-1 array. One copy goes on the 12TB, and the other copy will go on
one of the other two devices. (In this process, the first 3 TB of data
will go exclusively on the two larger ones, and only then will the 1TB
drive be written to as well).
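
   Sketching that fill sequence out (data only, ignoring metadata):

   phase 1: copies go on (12TB, 4TB) until the 4TB has 1TB free  -> 3TB
   phase 2: one copy on 12TB, the other alternating 4TB / 1TB    -> 2TB
   total:   5TB, which matches min(17/2, 17 - 12) = 5TB usable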

   You can keep adding devices to this without balancing, and it will
all work OK, as long as you have at least two devices with free space
on them. If you have only one device with free space on it (or near
that), that's the point that you need to balance. You can cancel the
balance when there's an approximately even distribution of free space
on the devices.

   (When I say "free space" in the above, I'm talking about
unallocated space, as reported by btrfs fi usage).

> I am asking because when running `sudo btrfs filesystem usage /mount/point` I 
> am getting displayed more free space than would be possible with RAID1:

> Overall:
> Device size:19.10TiB
> Device allocated:8.51TiB
> Device unallocated: 10.59TiB
> Device missing:0.00B
> Used:8.40TiB
> Free (estimated):5.35TiB  (min: 5.35TiB)
> Data ratio: 2.00
> Metadata ratio: 2.00
> Global reserve:512.00MiB  (used: 0.00B)
> 
> Data,RAID1: Size:4.25TiB, Used:4.20TiB (98.74%)
>/dev/sdc565.00GiB
>/dev/sdd  3.28TiB
>/dev/sdb  4.25TiB
>/dev/sde430.00GiB
> 
> Metadata,RAID1: Size:5.00GiB, Used:4.78GiB (95.61%)
>/dev/sdc  1.00GiB
>/dev/sdd  4.00GiB
>/dev/sdb  5.00GiB
> 
> System,RAID1: Size:32.00MiB, Used:640.00KiB (1.95%)
>/dev/sdd 32.00MiB
>/dev/sdb 32.00MiB
> 
> Unallocated:
>/dev/sdc365.51GiB
>/dev/sdd364.99GiB
>/dev/sdb  6.66TiB
>/dev/sde  3.22TiB
> 
> It looks a bit like the free size was simply calculated by total disk space - 
> used space and then divided by two since it is RAID1. But that would in 
> reality mean that some chunks are just twice on the 12TB drive and not 
> spread. Is this the way it will work in practice or is the estimated value 
> just wrong?

   A reasonably accurate free space calculation is either complicated
or expensive, and I don't think any of the official tools gets it
right in all cases. You can get a better idea of the usable space on
any given configuration by putting the unallocated space into the tool
at

https://carfax.org.uk/btrfs-usage

or I think there's an accurate implementation as a command-line tool
in Hans's python-btrfs library.

   Hugo.

-- 
Hugo Mills | You are not stuck in traffic: you are traffic
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |German ad campaign


Re: Cloning / getting a full backup of a BTRFS filesystem

2019-09-04 Thread Hugo Mills
On Wed, Sep 04, 2019 at 04:04:44PM +0200, Piotr Szymaniak wrote:
> On Wed, Sep 04, 2019 at 12:03:10PM +0300, Andrei Borzenkov wrote:
> > On Wed, Sep 4, 2019 at 9:16 AM Swâmi Petaramesh  
> > wrote:
> > >
> > > Hi list,
> > >
> > > Is there an advised way to completely “clone” a complete BTRFS
> > > filesystem, I mean to get an exact copy of a BTRFS filesystem including
> > > subvolumes (even readonly snapshots) and complete file attributes
> > > including extended attributes, ACLs and so, to another storage pool,
> > > possibly defined with a different RAID geometry or compression ?
> > >
> > 
> > As long as you do not use top level subvolume directly (all data is
> > located in subolumes), send/receive should work.
> > 
> > > The question boils down to getting an exact backup replica of a given
> > > BTRFS filesystem that could be restored to something logically
> > > absolutely identical.
> > >
> > > The usual backup tools have no clue about share extents, snapshots and
> > > the like, and using btrfs send/receive for individual subvols is a real
> > > pain in a BTRFS filesystem that may contain hundreds of snapshots of
> > > different BTRFS subvols plus deduplication etc.
> > >
> > 
> > Shared extents could be challenging. You can provide this information
> > to "btrfs send", but for one, there is no direct visibility into which
> > subvolumes share extents with given subvolume, so no way to build
> > corresponding list for "btrfs send". I do not even know if this
> > information can be obtained without exhaustive search over all
> > extents. Second, btrfs send/receive only allows sharing of full
> > extents which means there is no guarantee of identical structure on
> > receiving side.
> 
> So right now the only answer is: use good old dd?

   If you want an exact copy, including all of the exact UUIDs, yes.

   Be aware of the problems of making block-level copies of btrfs
filesystems, though:
https://btrfs.wiki.kernel.org/index.php/Gotchas#Block-level_copies_of_devices
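
   A sketch of the block-level route, with the main gotcha from that
page as a reminder (device names illustrative):

   # both filesystems unmounted; afterwards, never leave the source and
   # the copy attached and scanned at the same time -- they share a UUID
   dd if=/dev/sdX of=/dev/sdY bs=1M conv=fsync status=progress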

   Hugo.

-- 
Hugo Mills | I have a step-ladder. My real ladder left when I was
hugo@... carfax.org.uk | a child.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


Re: Need advice for fixing a btrfs volume

2019-08-29 Thread Hugo Mills
On Thu, Aug 29, 2019 at 10:45:37PM +0800, UGlee wrote:
> Dear:
> 
> We are using btrfs in an embedded arm/linux device.
> 
> The bootloader (u-boot) has only limited support for btrfs.
> Occasionally the device lost power supply unexpectedly, leaving an
> inconsistent file system on eMMC. If I dump the partition image from
> eMMC and mount it on linux desktop, the file system is perfectly
> usable.
> 
> My guess is that the linux kernel can fully handle the journalled
> update and garbage data.

   btrfs doesn't have a journal -- if the hardware is telling the
truth about barriers, and about written data reaching permanent
storage, then the FS structures on disk are always consistent. It's
got a log tree which is used for recovery of partial writes in the
case of a crash midway through a transaction, but that doesn't affect
the primary FS structures.

> But the u-boot cannot. So I consider to add a minimal ext4 rootfs
> partition as a fallback. When u-boot cannot read file from btrfs
> partition, it can switch to a minimal Linux system booting from an
> ext4 fs.

> Then I have a chance to use some tool to fix btrfs volume and reboot
> the system. My question is which tools is recommended for this
> purpose?

   It depends on the nature of the failure (if there is one), and on
why u-boot can't read the FS. Maybe it refuses to handle a non-empty
log tree (though additional code would be needed to check for that).
If that were the case, then simply mounting the FS and unmounting it
cleanly would work.

> According to the following page:
> 
> https://btrfs.wiki.kernel.org/index.php/Btrfsck
> 
> btrfsck is said to be deprecated. `btrfs check --repair` seems to be a
> full volume check and time-consuming.

   btrfs check --repair *is* btrfsck, under a different name. They're
the same code.

> All I need is just a good superblock and a few files could be
> loaded. Most frequently complaints from u-boot is the superblock
> issue such as `root_backup not found`.  It there a way to just fix
> the superblock, settle all journalled update, and make sure the
> required several files is OK?

   Mount the FS and unmount it cleanly?
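
   That is, from the fallback system, something like (partition name
illustrative):

   mount /dev/mmcblk0p2 /mnt && umount /mnt

   If mounting isn't possible there, btrfs-progs also has btrfs rescue
zero-log, which clears the log tree of an unmounted filesystem -- but
that throws away the last moments of logged writes, so only use it if
the log tree itself is what's tripping things up.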

   Hugo.

-- 
Hugo Mills | Prisoner unknown: Return to Zenda.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


Re: Unable to delete or change ro flag on subvolume/snapshot

2019-08-07 Thread Hugo Mills
On Wed, Aug 07, 2019 at 10:37:43AM +0200, Jon Ander MB wrote:
> Hi!
> I have a snapshot with the read only flag set and I'm currently unable
> to delete it or change the ro setting
> btrfs property set -ts /path/t/snapshot ro false
> ERROR: failed to set flags for /path/t/snapshot: Operation not permitted
> 
> Deleting the snapshot is also a no-go:
> 
> btrfs subvolume delete /path/t/snapshot
> Delete subvolume (no-commit): '/path/t/snapshot'
> ERROR: cannot delete '/path/t/snapshot': Operation not permitted

   First question: are you running those commands as root?

   Second question: has the FS itself gone read-only for some reason?
(e.g. corruption detected).

   Hugo.

> 
> The snapshot information:
> 
> btrfs subvolume show /path/t/snapshot
> /path/t/snapshot
> Name:   snapshot
> UUID:   66a145da-a20d-a44e-bb7a-3535da400f5d
> Parent UUID:f1866638-f77f-e34e-880d-e2e3bec1c88b
> Received UUID:  66a145da-a20d-a44e-bb7a-3535da400f5d
> Creation time:  2019-07-31 12:00:30 +0200
> Subvolume ID:   23786
> Generation: 1856068
> Gen at creation:1840490
> Parent ID:  517
> Top level ID:   517
> Flags:      readonly
> Snapshot(s):
> 
> 
> Any idea of what can I do?
> 
> Regards!
> 

-- 
Hugo Mills | I'm all for giving people enough rope to shoot
hugo@... carfax.org.uk | themselves in the foot.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Andreas Dilger


Re: [PATCH] btrfs: add an ioctl to force chunk allocation

2019-08-05 Thread Hugo Mills
   Just to throw a can of turquoise paint into the bike-shed works,
this could be handy with a list of devids to say *where* to create the
new block group.

   This allows for some truly horrible things to happen if abused, but
could also allow for some kind of poor-mans directed-balance: Create a
new block group on the devices you want, balance away one block group
on device(s) you don't want -- data should end up going to the new
empty block group in preference to another new one being automatically
allocated.

   (Alternatively, ignore this suggestion, and I'll just wait for a
proper "move this BG to these devices" ioctl...)

   Hugo.

On Mon, Aug 05, 2019 at 08:24:23PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/8/3 上午12:10, Josef Bacik wrote:
> > In testing block group removal it's sometimes handy to be able to create
> > block groups on demand.  Add an ioctl to allow us to force allocation
> > from userspace.
> 
> Not sure if we should add another ioctl just for debug purpose.
> 
> Although I see the usefulness in such debug feature, can we move it to
> something like sysfs so we can hide it more easily?
> 
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/ioctl.c   | 30 ++
> >  include/uapi/linux/btrfs.h |  1 +
> >  2 files changed, 31 insertions(+)
> > 
> > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> > index d0743ec1231d..f100def53c29 100644
> > --- a/fs/btrfs/ioctl.c
> > +++ b/fs/btrfs/ioctl.c
> > @@ -5553,6 +5553,34 @@ static int _btrfs_ioctl_send(struct file *file, void 
> > __user *argp, bool compat)
> > return ret;
> >  }
> >  
> > +static long btrfs_ioctl_alloc_chunk(struct file *file, void __user *arg)
> > +{
> > +   struct btrfs_root *root = BTRFS_I(file_inode(file))->root;
> > +   struct btrfs_trans_handle *trans;
> > +   u64 flags;
> > +   int ret;
> > +
> > +   if (!capable(CAP_SYS_ADMIN))
> > +   return -EPERM;
> > +
> > +   if (copy_from_user(&flags, arg, sizeof(flags)))
> > +   return -EFAULT;
> > +
> > +   /* We can only specify one type at a time. */
> > +   if (flags != BTRFS_BLOCK_GROUP_DATA &&
> > +   flags != BTRFS_BLOCK_GROUP_METADATA &&
> > +   flags != BTRFS_BLOCK_GROUP_SYSTEM)
> > +   return -EINVAL;
> 
> It looks like MIXED bg get less and less love.
> 
> > +
> > +   trans = btrfs_start_transaction(root, 0);
> > +   if (IS_ERR(trans))
> > +   return PTR_ERR(trans);
> > +
> > +   ret = btrfs_chunk_alloc(trans, flags, CHUNK_ALLOC_FORCE);
> 
> And the flags lacks the profile bits, thus default to SINGLE.
> Is it designed or you'd better use btrfs_force_chunk_alloc()?
> 
> Thanks,
> Qu
> 
> > +   btrfs_end_transaction(trans);
> > +   return ret < 0 ? ret : 0;
> > +}
> > +
> >  long btrfs_ioctl(struct file *file, unsigned int
> > cmd, unsigned long arg)
> >  {
> > @@ -5699,6 +5727,8 @@ long btrfs_ioctl(struct file *file, unsigned int
> > return btrfs_ioctl_get_subvol_rootref(file, argp);
> > case BTRFS_IOC_INO_LOOKUP_USER:
> > return btrfs_ioctl_ino_lookup_user(file, argp);
> > +   case BTRFS_IOC_ALLOC_CHUNK:
> > +   return btrfs_ioctl_alloc_chunk(file, argp);
> > }
> >  
> > return -ENOTTY;
> > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
> > index c195896d478f..3a6474c34ad0 100644
> > --- a/include/uapi/linux/btrfs.h
> > +++ b/include/uapi/linux/btrfs.h
> > @@ -943,5 +943,6 @@ enum btrfs_err_code {
> > struct btrfs_ioctl_get_subvol_rootref_args)
> >  #define BTRFS_IOC_INO_LOOKUP_USER _IOWR(BTRFS_IOCTL_MAGIC, 62, \
> > struct btrfs_ioctl_ino_lookup_user_args)
> > +#define BTRFS_IOC_ALLOC_CHUNK _IOR(BTRFS_IOCTL_MAGIC, 63, __u64)
> >  
> >  #endif /* _UAPI_LINUX_BTRFS_H */
> > 
> 




-- 
Hugo Mills | We don't just borrow words; on occasion, English has
hugo@... carfax.org.uk | pursued other languages down alleyways to beat them
http://carfax.org.uk/  | unconscious and rifle their pockets for new
PGP: E2AB1DE4  | vocabulary.   James D. Nicoll




Re: delete recursivly subvolumes?

2019-07-05 Thread Hugo Mills
On Fri, Jul 05, 2019 at 09:56:39PM +0200, Ulli Horlacher wrote:
> On Fri 2019-07-05 (19:51), Hugo Mills wrote:
> 
> > > Is there a command/script/whatever to snapshot (copy) a subvolume which
> > > contains (somewhere) other subvolumes?
> > > 
> > > Example:
> > > 
> > > root@xerus:/test# btrfs_subvolume_list /test/ | grep /tmp
> > > /test/tmp
> > > /test/tmp/xx/ss1
> > > /test/tmp/xx/ss2
> > > /test/tmp/xx/ss3
> > > 
> > > I want to have (with one command):
> > > 
> > > /test/tmp --> /test/tmp2
> > > /test/tmp/xx/ss1 --> /test/tmp2/xx/ss1
> > > /test/tmp/xx/ss2 --> /test/tmp2/xx/ss2
> > > /test/tmp/xx/ss3 --> /test/tmp2/xx/ss3
> > 
> >Remember that this isn't quite so useful, because you can't make
> > read-only snapshots in that structure.
> 
> ss1 ss2 and ss3 are indeed read-only snapshots!
> Of course they do not contain other subvolumes.

   What I'm saying is that you can't make a RO snapshot of test/tmp to
test/tmp2 and have your RO snapshots of ss1-3 in place within it.

   (OK, you could make the snapshot RW initially, snapshot the others
into place and then force it RO, but then you've just broken
send/receive on tmp2).

> >Generally, I'd recommend not having nested subvols at all, but to
> > put every subvol independently, and mount them into the places you
> > want them to be. That avoids a lot of the issues of nested subvols,
> > such as the ones you're trying to deal with here.
> 
> *I* do it this way from the very beginning :-)
> But I have *users* with *strange* ideas :-}
> 
> I need to handle their data.

   That makes it more awkward. :(

   Hugo.

-- 
Hugo Mills | "You know, the British have always been nice to mad
hugo@... carfax.org.uk | people."
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Laura Jesson, Brief Encounter




Re: delete recursivly subvolumes?

2019-07-05 Thread Hugo Mills
On Fri, Jul 05, 2019 at 09:47:20PM +0200, Ulli Horlacher wrote:
> On Fri 2019-07-05 (21:39), Ulli Horlacher wrote:
> 
> > Is there a command/script/whatever to remove subvolume which contains
> > (somewhere) other subvolumes?
> 
> ADD-ON QUESTION! :-)
> 
> Is there a command/script/whatever to snapshot (copy) a subvolume which
> contains (somewhere) other subvolumes?
> 
> Example:
> 
> root@xerus:/test# btrfs_subvolume_list /test/ | grep /tmp
> /test/tmp
> /test/tmp/xx/ss1
> /test/tmp/xx/ss2
> /test/tmp/xx/ss3
> 
> I want to have (with one command):
> 
> /test/tmp --> /test/tmp2
> /test/tmp/xx/ss1 --> /test/tmp2/xx/ss1
> /test/tmp/xx/ss2 --> /test/tmp2/xx/ss2
> /test/tmp/xx/ss3 --> /test/tmp2/xx/ss3

   Remember that this isn't quite so useful, because you can't make
read-only snapshots in that structure.

   Generally, I'd recommend not having nested subvols at all, but to
put every subvol independently, and mount them into the places you
want them to be. That avoids a lot of the issues of nested subvols,
such as the ones you're trying to deal with here.

   Hugo.

-- 
Hugo Mills | "You know, the British have always been nice to mad
hugo@... carfax.org.uk | people."
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Laura Jesson, Brief Encounter




Re: Rebalancing raid1 after adding a device

2019-06-18 Thread Hugo Mills
On Tue, Jun 18, 2019 at 07:14:26PM +, DO NOT USE wrote:
> June 18, 2019 8:45 PM, "Hugo Mills"  wrote:
>
> > On Tue, Jun 18, 2019 at 08:26:32PM +0200, Stéphane Lesimple wrote:
> >> [...]
> >> I tried using the -ddevid option but it only instructs btrfs to work
> >> on the block groups allocated on said device, as it happens, it
> >> tends to move data between the 4 preexisting devices and doesn't fix
> >> my problem. A full balance with -dlimit=100 did no better.
> >
> > -dlimit=100 will only move 100 GiB of data (i.e. 200 GiB), so it'll
> > be a pretty limited change. You'll need to use a larger number than
> > that if you want it to have a significant visible effect.
>
> Yes of course, I wasn't clear here but what I meant to do when starting
> a full balance with -dlimit=100 was to test under a reasonable amount of
> time whether the allocator would prefer to fill the new drive. I observed
> after those 100G (200G) of data moved that it wasn't the case at all.
> Specifically, no single allocation happened on the new drive. I know this
> would be the case at some point, after Terabytes of data would have been
> moved, but that's exactly what I'm trying to avoid.

   It's probably putting the data into empty space first. The solution
here would, as Austin said in his reply to your original post, be to
run some compaction on the FS, which will move data from chunks with
little data in, into existing chunks with space. When that's done,
you'll be able to see the chunks moving onto the new device.
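
   For that compaction step, one option (the usage value is a
judgement call) is a filtered balance that only rewrites block groups
which are at most half full, repacking their data into existing
chunks:

   btrfs balance start -dusage=50 /mountpoint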

[snip]
> > It would be really great if there was an ioctl that allowed you to
> > say things like "take the chunks of this block group and put them on
> > devices 2, 4 and 5 in RAID-5", because you could do a load of
> > optimisation with reshaping the FS in userspace with that. But I
> > suspect it's a long way down the list of things to do.
>
> Exactly, that would be awesome. I would probably even go as far as
> writing some C code myself to call this ioctl to do this "intelligent"
> balance on my system!

   You wouldn't need to. I'd be at the head of the queue to write the
tool. :)

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison




Re: Rebalancing raid1 after adding a device

2019-06-18 Thread Hugo Mills
On Tue, Jun 18, 2019 at 02:50:34PM -0400, Austin S. Hemmelgarn wrote:
> On 2019-06-18 14:45, Hugo Mills wrote:
> >On Tue, Jun 18, 2019 at 08:26:32PM +0200, Stéphane Lesimple wrote:
> >>I've been a btrfs user for quite a number of years now, but it seems
> >>I need the wiseness of the btrfs gurus on this one!
> >>
> >>I have a 5-hdd btrfs raid1 setup with 4x3T+1x10T drives.
> >>A few days ago, I replaced one of the 3T by a new 10T, running btrfs
> >>replace and then resizing the FS to use all the available space of
> >>the new device.
> >>
> >>The filesystem was 90% full before I expanded it so, as expected,
> >>most of the space on the new device wasn't actually allocatable in
> >>raid1, as very few available space was available on the 4 other
> >>devs.
> >>
> >>Of course the solution is to run a balance, but as the filesystem is
> >>now quite big, I'd like to avoid running a full rebalance. This
> >>would be quite i/o intensive, would be running for several days, and
> >>putting and unecessary stress on the drives. This also seems
> >>excessive as in theory only some Tb would need to be moved: if I'm
> >>correct, only one of two block groups of a sufficient amount of
> >>chunks to be moved to the new device so that the sum of the amount
> >>of available space on the 4 preexisting devices would at least equal
> >>the available space on the new device, ~7Tb instead of moving ~22T.
> >>I don't need to have a perfectly balanced FS, I just want all the
> >>space to be allocatable.
> >>
> >>I tried using the -ddevid option but it only instructs btrfs to work
> >>on the block groups allocated on said device, as it happens, it
> >>tends to move data between the 4 preexisting devices and doesn't fix
> >>my problem. A full balance with -dlimit=100 did no better.
> >
> >-dlimit=100 will only move 100 GiB of data (i.e. 200 GiB), so it'll
> >be a pretty limited change. You'll need to use a larger number than
> >that if you want it to have a significant visible effect.
> Last I checked, that's not how the limit filter works.  AFAIUI, it's
> an upper limit on how full a chunk can be to be considered for the
> balance operation.  So, balancing with only `-dlimit=100` should
> actually balance all data chunks (but only data chunks, because you
> haven't asked for metadata balancing).

   That's the usage filter, not limit. The limit filter simply caps
the number of block groups that get moved.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Umpire of the Rising Sun
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Rebalancing raid1 after adding a device

2019-06-18 Thread Hugo Mills
On Tue, Jun 18, 2019 at 08:26:32PM +0200, Stéphane Lesimple wrote:
> I've been a btrfs user for quite a number of years now, but it seems
> I need the wiseness of the btrfs gurus on this one!
> 
> I have a 5-hdd btrfs raid1 setup with 4x3T+1x10T drives.
> A few days ago, I replaced one of the 3T by a new 10T, running btrfs
> replace and then resizing the FS to use all the available space of
> the new device.
> 
> The filesystem was 90% full before I expanded it so, as expected,
> most of the space on the new device wasn't actually allocatable in
> raid1, as very little space was available on the 4 other
> devs.
> 
> Of course the solution is to run a balance, but as the filesystem is
> now quite big, I'd like to avoid running a full rebalance. This
> would be quite i/o intensive, would be running for several days, and
> putting unnecessary stress on the drives. This also seems excessive,
> as in theory only some TB would need to be moved: if I'm correct,
> only one of the two copies of each block group needs moving, and only
> enough chunks need to go to the new device that the sum of the
> available space on the 4 preexisting devices at least equals the
> available space on the new device: ~7TB instead of moving ~22TB.
> I don't need to have a perfectly balanced FS, I just want all the
> space to be allocatable.
> 
> I tried using the -ddevid option but it only instructs btrfs to work
> on the block groups allocated on said device, as it happens, it
> tends to move data between the 4 preexisting devices and doesn't fix
> my problem. A full balance with -dlimit=100 did no better.

   -dlimit=100 will only move 100 GiB of data (i.e. 200 GiB of raw
space, counting both RAID-1 copies), so it'll be a pretty limited
change. You'll need to use a larger number than that if you want it to
have a significant visible effect.

   The -ddevid= option would be my recommendation. The big device has
more chunks on it, so they're likely to have their copies spread
across the other four devices. This should help with the balance.

   Alternatively, just do a full balance and then cancel it when the
amount of unallocated space is reasonably well spread across the
devices (specifically, the new device's unallocated space is less than
the sum of the unallocated space on the other devices).
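
   As a rough sketch of that (device ID and mountpoint assumed;
adjust to suit):

   $ btrfs balance start -ddevid=5 /mnt &
   $ watch btrfs fi usage /mnt    # keep an eye on per-device unallocated
   $ btrfs balance cancel /mnt    # once it's spread well enough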

> Is there a way to ask the block group allocator to prefer writing to
> a specific device during a balance? Something like -ddestdevid=N?
> This would just be a hint to the allocator and the usual constraints
> would always apply (and prevail over the hint when needed).

   No, there isn't. Having control over the allocator (or bypassing
it) would be pretty difficult to implement, I think.

   It would be really great if there was an ioctl that allowed you to
say things like "take the chunks of this block group and put them on
devices 2, 4 and 5 in RAID-5", because you could do a load of
optimisation with reshaping the FS in userspace with that. But I
suspect it's a long way down the list of things to do.
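
   Purely as a sketch of the shape such an interface might take (no
such ioctl exists, and every name below is invented for illustration):

/* hypothetical -- not a real btrfs ioctl */
struct btrfs_ioctl_place_block_group_args {
	__u64 bytenr;       /* logical address of the block group to move */
	__u64 profile;      /* target profile for the new chunk, e.g. RAID-5 */
	__u64 num_devices;  /* number of valid entries in devices[] */
	__u64 devices[8];   /* devids to place the stripes on, e.g. 2, 4, 5 */
};

   A userspace reshaping tool could then walk the block groups and
issue one of these per group, rather than going through the kernel's
allocator.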

> Or is there any obvious solution I'm completely missing?

   I don't think so.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Umpire of the Rising Sun
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies

2019-06-10 Thread Hugo Mills
On Mon, Jun 10, 2019 at 04:02:36PM +0200, David Sterba wrote:
> On Mon, Jun 10, 2019 at 12:42:26PM +0000, Hugo Mills wrote:
> >Hi, David,
> > 
> > On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> > > this patchset brings the RAID1 with 3 and 4 copies as a separate
> > > feature as outlined in V1
> > > (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dste...@suse.com/).
> > [...]
> > > Compatibility
> > > ~~~~~~~~~~~~~
> > > 
> > > The new block group types cost an incompatibility bit, so an old kernel
> > > will refuse to mount a filesystem with the RAID1C3 feature, i.e. any
> > > chunk on the filesystem with the new type.
> > > 
> > > To upgrade existing filesystems use the balance filters eg. from RAID6
> > > 
> > >   $ btrfs balance start -mconvert=raid1c3 /path
> > [...]
> > 
> >If I do:
> > 
> > $ btrfs balance start -mprofiles=raid1c3,convert=raid1 \
> >   -dprofiles=raid1c3,convert=raid6 /path
> > 
> > will that clear the incompatibility bit?
> 
> No, the bit will stay, even though there are no chunks of the raid1c3
> type. Same for raid5/6.
> 
> Dropping the bit would need an extra pass through all chunks after
> balance, which is feasible and I don't see usability surprises. That you
> ask means that the current behaviour is probably opposite to what users
> expect.

   We've had a couple of cases in the past where people have tried out
a new feature on a new kernel, then turned it off again and not been
able to go back to an earlier kernel. Particularly in this case, I can
see people being surprised at the trapdoor. "I don't have any RAID1C3
on this filesystem: why can't I go back to 5.2?"

   Hugo.

-- 
Hugo Mills | Great films about cricket: Forrest Stump
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH v2 0/6] RAID1 with 3- and 4- copies

2019-06-10 Thread Hugo Mills
   Hi, David,

On Mon, Jun 10, 2019 at 02:29:40PM +0200, David Sterba wrote:
> this patchset brings the RAID1 with 3 and 4 copies as a separate
> feature as outlined in V1
> (https://lore.kernel.org/linux-btrfs/cover.1531503452.git.dste...@suse.com/).
[...]
> Compatibility
> ~~~~~~~~~~~~~
> 
> The new block group types cost an incompatibility bit, so an old kernel
> will refuse to mount a filesystem with the RAID1C3 feature, i.e. any
> chunk on the filesystem with the new type.
> 
> To upgrade existing filesystems use the balance filters eg. from RAID6
> 
>   $ btrfs balance start -mconvert=raid1c3 /path
[...]

   If I do:

$ btrfs balance start -mprofiles=raid1c3,convert=raid1 \
  -dprofiles=raid1c3,convert=raid6 /path

will that clear the incompatibility bit?

(I'm not sure if profiles= and convert= work together, but let's
assume that they do for the purposes of this question).

   Hugo.

-- 
Hugo Mills | The enemy have elected for Death by Powerpoint.
hugo@... carfax.org.uk | That's what they shall get.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   gdb




Re: Unable to mount, corrupt leaf

2019-05-28 Thread Hugo Mills
On Tue, May 28, 2019 at 03:39:36PM -0300, Cesar Strauss wrote:
> Hello,
> 
> After a BTRFS partition becoming read-only under use, it cannot be
> mounted anymore.
> 
> The output is:
> 
> # mount /dev/sdb5 /mnt/disk1
> mount: /mnt/disk1: wrong fs type, bad option, bad superblock on
> /dev/sdb5, missing codepage or helper program, or other error.
> 
> Kernel output:
> [ 2042.106654] BTRFS info (device sdb5): disk space caching is enabled
> [ 2042.799537] BTRFS critical (device sdb5): corrupt leaf: root=2
> block=199940210688 slot=31, unexpected item end, have 268450090
> expect 14634

   You have bad RAM.

The item end it's got on the disk:
>>> hex(268450090)
'0x1000392a'

The item end it should have (based on the other items and their
lengths and positions):
>>> hex(14634)
'0x392a'
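
The XOR of the two, showing that exactly one bit differs:
>>> hex(268450090 ^ 14634)
'0x10000000'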

   The good checksum on the block (it hasn't complained about the
csum, so it's good) indicates that the corruption happened in memory
at some point. The bit-flip in the data would strongly suggest that
it's caused by a stuck memory cell -- i.e. bad hardware.

   Run memtest86 for a minimum of 8 hours (preferably 24) and see what
shows up. Then fix the hardware.

   Hugo.

> [ 2042.807879] BTRFS critical (device sdb5): corrupt leaf: root=2
> block=199940210688 slot=31, unexpected item end, have 268450090
> expect 14634
> [ 2042.807947] BTRFS error (device sdb5): failed to read block groups: -5
> [ 2042.832362] BTRFS error (device sdb5): open_ctree failed
> 
> # btrfs check /dev/sdb5
> Opening filesystem to check...
> incorrect offsets 14634 268450090
> incorrect offsets 14634 268450090
> incorrect offsets 14634 268450090
> incorrect offsets 14634 268450090
> ERROR: cannot open file system
> 
> Giving -s and -b options to "btrfs check" made no difference.
> 
> The usebackuproot mount option made no difference.
> 
> "btrfs restore" was successful in recovering most of the files,
> except for a couple instances of "Error copying data".
> 
> System information:
> 
> OS: Arch Linux
> 
> $ uname -a
> Linux rescue 5.1.4-arch1-1-ARCH #1 SMP PREEMPT Wed May 22 08:06:56
> UTC 2019 x86_64 GNU/Linux
> 
> $ btrfs --version
> btrfs-progs v5.1
> 
> I have since updated the kernel, with no difference:
> 
> $ uname -a
> Linux rescue 5.1.5-arch1-2-ARCH #1 SMP PREEMPT Mon May 27 03:37:39
> UTC 2019 x86_64 GNU/Linux
> 
> Before making any recovery attempts, or even restoring from backup,
> I would like to ask for the best option to proceed.
> 
> Thanks,
> 
> Cesar

-- 
Hugo Mills | You've read the project plan. Forget that. We're
hugo@... carfax.org.uk | going to Do Stuff and Have Fun doing it.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Jeremy Frey




Re: Citation Needed: BTRFS Failure Resistance

2019-05-22 Thread Hugo Mills
On Wed, May 22, 2019 at 09:46:42PM +0300, Cerem Cem ASLAN wrote:
> Could you confirm or disclaim the following explanation:
> https://unix.stackexchange.com/a/520063/65781

   Well, the quoted comment at the top is accurate (although I haven't
looked for the IRC conversation in question).

   However, there are some inaccuracies in the detailed comment
below. These aren't particularly relevant to the argument addressing
your question, but do detract somewhat from the authority of the
answer. :)

   Specifically: Btrfs doesn't use Merkle trees. It uses CoW-friendly
B-trees -- there's no csum of tree contents. It also doesn't make a
complete copy of the tree (that would take a long time). Instead,
it'll only update the blocks in the tree that need updating, which
will bubble the changes up through the tree node path to the top
level.
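
   As a toy sketch of that bubbling-up (illustrative Python only,
nothing like the real on-disk format):

# A toy copy-on-write tree: an update rewrites only the nodes on the
# path from the changed position up to the root; every untouched
# subtree is shared between the old and new versions of the tree.
def cow_set(node, path, value):
    if not path:               # reached the position being changed
        return value
    new = dict(node)           # copy just this one node
    new[path[0]] = cow_set(node[path[0]], path[1:], value)
    return new                 # a new version of this node

old = {'a': {'x': 1, 'y': 2}, 'b': {'z': 3}}
new = cow_set(old, ['a', 'y'], 99)
assert old['a']['y'] == 2      # the old tree is untouched
assert new['b'] is old['b']    # the unmodified subtree is shared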

   There's a detailed description of the issues of broken hardware on
the btrfs wiki, here:

https://btrfs.wiki.kernel.org/index.php/FAQ#What_does_.22parent_transid_verify_failed.22_mean.3F

   Hugo.

-- 
Hugo Mills | Why play all the notes, when you need only play the
hugo@... carfax.org.uk | most beautiful?
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Miles Davis




Re: "bad tree block start" when trying to mount on ARM

2019-05-21 Thread Hugo Mills
On Tue, May 21, 2019 at 01:34:42AM -0700, Erik Jensen wrote:
> I have a 5-drive btrfs filesystem. (raid-5 data, dup metadata). I can
> mount it fine on my x86_64 system, and running `btrfs check` there
> reveals no errors. However, I am not able to mount the filesystem on
> my 32-bit ARM board, which I am hoping to use for lower-power file
> serving. dmesg shows the following:
> 
> [   83.066301] BTRFS info (device dm-3): disk space caching is enabled
> [   83.072817] BTRFS info (device dm-3): has skinny extents
> [   83.553973] BTRFS error (device dm-3): bad tree block start, want
> 17628726968320 have 39646195496896
> [   83.554089] BTRFS error (device dm-3): bad tree block start, want
> 17628727001088 have 5606876608493751477
> [   83.601176] BTRFS error (device dm-3): bad tree block start, want
> 17628727001088 have 5606876608493751477
> [   83.610811] BTRFS error (device dm-3): failed to verify dev extents
> against chunks: -5
> [   83.639058] BTRFS error (device dm-3): open_ctree failed
> 
> Is this expected to work? I did notice that there are gotchas on the
> wiki related to filesystems over 8TiB on 32-bit systems, but it
> sounded like they were mostly related to running the tools, as opposed
> to the filesystem driver itself. (Each of the five drives is
> 8TB/7.28TiB)

   Yes, it should work. We had problems with ARM several years ago,
because of its unusual behaviour with unaligned word accesses, but
those were in userspace, and, as far as I know, fixed now. Looking at
the want/have numbers, it doesn't look like an endianness problem or
an ARM-unaligned-access problem.

> If this isn't expected, what should I do to help track down the issue?

   Can you show us the output of "btrfs check --readonly", on both the
x86_64 machine and the ARM machine? It might give some more insight
into the nature of the breakage.

   Possibly also "btrfs inspect dump-super" on both machines.

> Also potentially relevant: The x86_64 system is currently running
> 4.19.27, while the ARM system is running 5.1.3.

   Shouldn't make a difference.

> Finally, just in case it's relevant, I just finished reencrypting the
> array, which involved doing a `btrfs replace` on each device in the
> array.

   If you can still mount on x86_64, then the FS is at least
reasonably complete and undamaged. I don't think this will make a
difference.  However, it's worth checking whether there are any
funnies about your encryption layer on ARM (I wouldn't expect any,
since it's recognising the decrypted device as btrfs, rather than
random crud).

   Hugo.

-- 
Hugo Mills | Prisoner unknown: Return to Zenda.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Used disk size of a received subvolume?

2019-05-16 Thread Hugo Mills
On Thu, May 16, 2019 at 04:54:42PM +0200, Axel Burri wrote:
> Trying to get the size of a subvolume created using "btrfs receive",
> I've come with a cute little script:
> 
>SUBVOL=/path/to/subvolume
>CGEN=$(btrfs subvolume show "$SUBVOL" \
>  | sed -n 's/\s*Gen at creation:\s*//p')
>btrfs subvolume find-new "$SUBVOL" $((CGEN+1)) \
>  | cut -d' ' -f7 \
>  | tr '\n' '+' \
>  | sed 's/\+\+$/\n/' \
>  | bc
> 
> This simply sums up the "len" field from all modified files since the
> creation of the subvolume. Works fine, as btrfs-receive first makes a
> snapshot of the parent subvolume, then adds the files according to the
> send-stream.
> 
> Now this rises some questions:
> 
> 1. How accurate is this? AFAIK "btrfs find-new" prints real length, not
> compressed length.
> 
> 2. If there are clone-sources in the send-stream, the cloned files
> probably also appear in the list.
> 
> 3. Is there a better way? It would be nice to have a btrfs command for
> this. It would be straight-forward to have a "--summary" option in
> "btrfs find-new", another approach would be to calculate and dump the
> size in either "btrfs send" or "btrfs receive".

   btrfs find-new also doesn't tell you about deleted files (fairly
obviously), so if anything's been removed, you'll be overestimating
the overall change in size.

> Any thoughts? I'm willing to implement such a feature in btrfs-progs if
> this sounds reasonable to you.

   If you're looking for the incremental usage of the subvolume, why
not just use the "exclusive" value from btrfs fi du? That's exactly
that information. (And note that it changes over time, as other
subvols it shares with are deleted).
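
   For example (the output numbers here are made up; the columns are
what the tool prints):

$ btrfs fi du -s /path/to/subvolume
     Total   Exclusive  Set shared  Filename
  10.00GiB     1.20GiB     8.80GiB  /path/to/subvolume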

   Hugo.

-- 
Hugo Mills | Your problem is that you've got too much taste to be
hugo@... carfax.org.uk | a web developer.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Steve Harris




Re: [PATCH V9] Btrfs: enhance raid1/10 balance heuristic

2019-05-13 Thread Hugo Mills
On Mon, May 13, 2019 at 09:57:43PM +0200, waxhead wrote:
> David Sterba wrote:
> >On Mon, May 06, 2019 at 05:37:40PM +0300, Timofey Titovets wrote:
> >>From: Timofey Titovets 
> >>
> >>Currently btrfs raid1/10 balance requests to mirrors,
> >>based on pid % num of mirrors.
> >>
> >
> >Regarding the patches to select mirror policy, that Anand sent, I think
> >we first should provide a sane default policy that addresses most
> >commong workloads before we offer an interface for users that see the
> >need to fiddle with it.
> >
> As just a regular btrfs user I would just like to add that I earlier
> made a comment where I think that btrfs should have the ability to
> assign certain DevID's to groups (storage device groups).
> 
> From there I think it would be a good idea to "assign" subvolumes to
> either one (or more) group(s) so that btrfs would prefer (if free
> space permits) to store data from that subvolume on a certain group
> of storage devices.
> 
> If you could also set a weight value for read and write separately
> for a group, then you are, from a humble user's point of view, good
> to go, and any PID% optimization (and management), while very
> interesting, sounds less important.
> 
> As BTRFS scales to more than 32 devices (I think there is a limit of
> 30 or 32), device groups should really be in there from a management
> point of view. Mount options for readmirror policy do not sound good
> the way I understand it, as this would affect the filesystem
> globally.
> 
> Groups could also allow for useful features like making sure
> metadata stays on fast devices, migrating hot data to faster groups
> automatically on read, and when (if?) subvolumes support different
> storage profiles "Raid1/10/5/6" it sounds like an even better idea
> to assign such subvolumes to faster/slower groups depending on the
> storage profile.
> 
> Anyway... I just felt like airing some ideas since the readmirror
> topic has come up a few times on the mailing list recently.

   I did write up a slightly more concrete proposal on how to do this
algorithmically (plus quite a lot more) some years ago. I even started
implementing it, but I ran into problems of available time and
available kernel mad skillz, neither of which I had enough of.

https://www.spinics.net/lists/linux-btrfs/msg33916.html

   Hugo.

-- 
Hugo Mills | Questions are a burden, and answers a prison for
hugo@... carfax.org.uk | oneself
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  The Prisoner




Re: btrfs mount fail after adding new drive to raid1 array

2019-04-15 Thread Hugo Mills
On Mon, Apr 15, 2019 at 12:55:44PM -0700, George Mitchell wrote:
> After adding a new drive to a btrfs raid1 array I cannot remount
> it.  One thing that went wrong for me after having successfully done
> this many times before over the last five years is that in this case
> I failed to format the partition as btrfs and just left it
> unformatted assuming the btrfs command would format it or give me a
> warning.  But I am not sure that is what caused the problem.

   It wouldn't have. That's the correct approach -- btrfs will write
the appropriate superblocks and metadata to the new device,
effectively destroying anything that was on there before.

> So far I have tried physically disconnecting the added drive and
> trying to mount read only but that has been unsuccessful.
> 
> I could forge ahead attempting to recover but I prefer to gather any
> advice or suggestions I can get before making some foolish mistake. 
> I appreciate any comments you might have.
> 
> Here is what I am seeing:
> 
> [root@localhost ghmitch]# btrfs check /dev/sda4
> Checking filesystem on /dev/sda4
> UUID: 4b0983d7-8d85-463d-85c1-c20aa3b4fa3b
> checking extents
> WARNING: unaligned total_bytes detected for devid 4, have
> 41422428 should be aligned to 4096
> WARNING: this is OK for older kernel, but may cause kernel warning
> for newer kernels
> WARNING: this can be fixed by 'btrfs rescue fix-device-size'

   You only need to run btrfs check on one of the devices. The device
only serves to identify the filesystem, and btrfs check will scan all
devices to find the ones that match the device you gave it. btrfs
check on the other devices is redundant -- you're checking the
*filesystem*, not a *device*.

   Have you tried using the command recommended in the btrfs check
output? (But give us the dmesg first, just in case).
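
   For reference, that would be (but hold off until we've seen the
kernel log):

   # btrfs rescue fix-device-size /dev/sda4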

> [root@localhost ghmitch]# mount LABEL=common /common
> mount: can't find LABEL=common
> [root@localhost ghmitch]# mount LABEL=COMMON /common
> mount: wrong fs type, bad option, bad superblock on /dev/sdb4,
>    missing codepage or helper program, or other error
> 
>    In some cases useful info is found in syslog - try
>    dmesg | tail or so.
> [root@localhost ghmitch]# mount LABEL=COMMON /common
> mount: wrong fs type, bad option, bad superblock on /dev/sdb4,
>    missing codepage or helper program, or other error
> 
>    In some cases useful info is found in syslog - try
>    dmesg | tail or so.
> [root@localhost ghmitch]#
> 
> 
> I am attaching the output from journalctl for this.

   I don't see a kernel log attached, just the btrfs check output
again.

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 4:
hugo@... carfax.org.uk | Future Perfect
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: checksum error...

2019-04-08 Thread Hugo Mills
On Mon, Apr 08, 2019 at 11:48:03AM -0400, Scott E. Blomquist wrote:
> 
> Hi All,
> 
> The weekend btrfs scrub/balance came back with this following...
> 
> [Sun Apr  7 06:57:10 2019] BTRFS warning (device sdb1): checksum error at 
> logical 274820497408 on dev /dev/sda1, sector 536758784, root 271471, inode 
> 109421914, offset 491520, length 4096, links 1 (path: /y)
[snip]

   Since there doesn't seem to be anything else wrong (no messages
without a filename, which would imply metadata corruption), this is
most likely a simple case of on-device corruption.

   Delete /y and restore it from backups. At least, do so in
the working copy; the snapshots of it can safely remain until they get
rotated out normally.

   Check your SMART statistics and see if anything looks wrong there
on the hardware side. Also check dmesg and earlier kernel logs for
signs of the hardware showing an error on read -- it may have tried
several times to read that location before giving up and/or returning
bad data.
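
   Something like this, for example (assuming smartmontools is
installed; hardware RAID controllers may need an extra option such as
-d megaraid,N):

   # smartctl -a /dev/sda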

   Hugo.

> 
> Here is what I have...
> 
> root@cbmm-fsb:~# uname -a
> Linux cbmm-fsb 4.14.24-custom #1 SMP Mon Mar 5 10:10:39 EST 2018 x86_64 
> x86_64 x86_64 GNU/Linux
> 
> root@cbmm-fsb:~# btrfs --version
> btrfs-progs v4.15.1
> 
> root@cbmm-fsb:~# btrfs fi show
> Label: none  uuid: d83b1e28-db27-4035-8638-d4b2eb824ff2
>Total devices 2 FS bytes used 80.09TiB
>devid1 size 76.40TiB used 62.49TiB path /dev/sda1
>devid2 size 32.74TiB used 18.83TiB path /dev/sdb1
> 
> root@cbmm-fsb:~# btrfs fi df /home/cbcl
> Data, single: total=79.80TiB, used=79.80TiB
> System, RAID1: total=32.00MiB, used=9.09MiB
> Metadata, RAID1: total=757.00GiB, used=281.34GiB
> Metadata, DUP: total=22.50GiB, used=19.27GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> sda and sdb are megaraid raid6 with BBU and both are optimal.
> 
> Any tips?  Thanks.
> 
> sb. Scott Blomquist
> 

-- 
Hugo Mills | If it ain't broke, hit it again.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Foon




Re: corrupt leaf, bad key order on kernel 5.0

2019-04-05 Thread Hugo Mills
On Fri, Apr 05, 2019 at 10:11:57PM +0300, Nazar Mokrynskyi wrote:
> NOTE: I do not need help with recovery, I have fully automated snapshots, 
> backups and restoration mechanisms, the only purpose of this email is to help 
> developers find the reason of yet another filesystem corruption and hopefully 
> fix it.

   That's good news, at least.

> Yet another corruption of my root BTRFS filesystem happened today.
> Didn't bother to run scrub, balance or check, just created disk image for 
> future investigation and restored everything from backup.
> 
> Here is what corruption looks like:
> [  274.241339] BTRFS info (device dm-0): disk space caching is enabled
> [  274.241344] BTRFS info (device dm-0): has skinny extents
> [  274.283238] BTRFS info (device dm-0): enabling ssd optimizations
> [  310.436672] BTRFS critical (device dm-0): corrupt leaf: root=268 
> block=42044719104 slot=123, bad key order, prev (1240717 108 41447424) 
> current (1240717 76 41451520)

   "Bad key order" is usually an indicator of faulty RAM -- a piece of
metadata gets loaded into RAM for modification, a bit gets flipped in
it (because the bit is stuck on one value), and then the csum is
computed for the page (including the faulty bit), and written out to
disk. In this case, it's not obvious, but I'd suggest that the second
field of the key has been flipped, as 108 is 0x6c, and 76 is 0x4c --
one bit away from each other.
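
   A one-line check in Python confirms the single differing bit:

>>> hex(108 ^ 76)
'0x20'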

   I recommend you check your hardware thoroughly before attempting to
rebuild the FS.

   Hugo.

> [  310.449304] BTRFS critical (device dm-0): corrupt leaf: root=268 
> block=42044719104 slot=123, bad key order, prev (1240717 108 41447424) 
> current (1240717 76 41451520)
> [  310.449309] BTRFS: error (device dm-0) in btrfs_drop_snapshot:9250: 
> errno=-5 IO failure
> [  310.449311] BTRFS info (device dm-0): forced readonly
> [  311.266789] BTRFS info (device dm-0): delayed_refs has NO entry
> [  311.277088] BTRFS error (device dm-0): cleaner transaction attach returned 
> -30
> 
> My system just freezed when I was not looking at it and this is the state it 
> is in now.
> File system survived from March 8th til April 05, one of the fastest 
> corruptions in my experience.
> 
> Looks like this happened during sending incremental snapshot to the other 
> BTRFS filesystem, since last snapshot on that one was not read-only as it 
> should have been otherwise.
> 
> I'm on Ubuntu 19.04 with Linux kernel 5.0.5 and btrfs-progs v4.20.2.
> 
> My filesystem is on top of LUKS on NVMe SSD (SM961), I have 3 snapshots 
> created every 15 minutes from 3 subvolumes with rotation of old snapshots 
> (can be from tens to hundreds of snapshots at any time).
> 
> Mount options: compress=lzo,noatime,ssd
> 
> I have full disk image with corrupted filesystem and will create Qcow2 
> snapshots of it, so if you want me to run any experiments, including 
> potentially destructive, including usage of custom patches to btrfs-progs to 
> find out the reason of corruption, would be happy to help as much as I can.
> 
> P.S. I'm riding latest stable and rc kernels all the time and during last 6 
> months I've got about as many corruptions of different BTRFS filesystems as 
> during 3 years before that, really worrying if you ask me.
> 

-- 
Hugo Mills | I'm always right.
hugo@... carfax.org.uk | But I might be wrong about that.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




IRC logs

2019-04-04 Thread Hugo Mills
   For some time, I've been running a log bot on the #btrfs IRC
channel, and hosting the resulting logs. I'm currently re-assessing
the collection of things I host and manage. Baudot, the "official"
#btrfs log bot, has never really been stable, and is very sensitive to
being disconnected from the IRC server (apparently not fixable with
the IRC library in question). I think it's time to retire that
particular codebase and let someone else run something more reliable
for logging #btrfs.

   Is anyone willing to take over running a log bot and hosting the
logs? I can hand over the existing dataset if you want historical
continuity.

   Hugo.

-- 
Hugo Mills | UDP jokes: It's OK if no-one gets them.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: interest in post-mortem examination of a BTRFS system and improving the btrfs-code?

2019-04-02 Thread Hugo Mills
On Tue, Apr 02, 2019 at 12:28:12PM -0600, Chris Murphy wrote:
> On Tue, Apr 2, 2019 at 7:24 AM Qu Wenruo  wrote:
> > On 2019/4/2 下午9:06, Nik. wrote:
> 
> > > On the larger file system only "btrfs check --repair --readonly ..." was
> > > attempted (without success; most command executions were documented, so
> > > the results can be made available), no writing commands were issued.
> >
> > --repair will cause write, unless it even failed to open the filesystem.
> 
> I consider `--repair --readonly` a contradictory request; it's
> ambiguous what the user wants (it's user error) and the command should
> fail with a "conflicting options" error.

   I already raised that question. :)

   It was a typo in the email. --repair was what was intended.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Forrest Stump
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: interest in post-mortem examination of a BTRFS system and improving the btrfs-code?

2019-04-02 Thread Hugo Mills
On Tue, Apr 02, 2019 at 09:24:03PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/4/2 下午9:06, Nik. wrote:
[snip]
> > On the larger file system only "btrfs check --repair --readonly ..." was
> > attempted (without success; most command executions were documented, so
> > the results can be made available), no writing commands were issued.
> 
> --repair will cause write, unless it even failed to open the filesystem.

   If btrfs check accepted both --repair and --readonly without
complaining, then that's a regression and a bug. --readonly should be
mutually exclusive with any option that might write to the FS, and if
it isn't any more, then it's been broken and needs fixing.
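
   I'd expect a guard along these lines in the option parsing (a
minimal sketch only, not the actual btrfs-progs code; variable names
invented):

	if (opt_readonly && opt_repair) {
		error("--readonly and --repair are mutually exclusive");
		exit(1);
	}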

   Hugo.

-- 
Hugo Mills | Great films about cricket: Interview with the Umpire
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Btrfs wiki is down

2019-03-29 Thread Hugo Mills
On Fri, Mar 29, 2019 at 04:39:48PM +, William Muriithi wrote:
> Hello,
> 
> Not sure I should be raising this here, but I can't find another reliable 
> email.  I suspect someone here would be able to reach out to those 
> responsible for the wiki.
> 
> https://btrfs.wiki.kernel.org/index.php/Status

   It's working for me. Maybe it was a transient while someone was
fiddling with it, that's already been fixed?

   Hugo.

> This link is broken.  When I go there, I see this error.
> 
> ==
> A database query syntax error has occurred. This may indicate a bug in the 
> software. The last attempted database query was:
> (SQL query hidden)
> from within function "Title::getCascadeProtectionSources". Database returned 
> error "1267: Illegal mix of collations (latin1_bin,IMPLICIT) and 
> (utf8_general_ci,COERCIBLE) for operation '=' 
> (pdx-wl-lb-db.web.codeaurora.org)".
> =
> 
> Any chance this can be conveyed to someone who can help please?
> 
> Regards,
> William,
> ​

-- 
Hugo Mills | Would you like an ocelot with that non-sequitur?
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH URGENT v1.1 0/2] btrfs-progs: Fix the nobarrier behavior of write

2019-03-27 Thread Hugo Mills
On Wed, Mar 27, 2019 at 03:07:48PM +0100, Adam Borowski wrote:
> On Wed, Mar 27, 2019 at 05:46:50PM +0800, Qu Wenruo wrote:
> > This urgent patchset can be fetched from github:
> > https://github.com/adam900710/btrfs-progs/tree/flush_super
> > Which is based on v4.20.2.
> > 
> > Before this patch, btrfs-progs writes to the fs had no barrier at all.
> > All metadata and superblock writes are just buffered writes, with no
> > barrier between superblocks and metadata writes at all.
> > 
> > No wonder even clearing the space cache can cause serious transid
> > corruption to an originally good fs.
> > 
> > Please merge this fix as soon as possible as I really don't want to see
> > btrfs-progs corrupting any fs any more.
> 
> How often does this happen in practice?  I'm slightly incredulous about
> btrfs-progs crashing often.   Especially since pwrite() is buffered on the
> kernel side, so we'd need a _kernel_ crash (usually a power loss) to break
> consistency.  Obviously, a potential data loss bug is always something that
> needs fixing, I'm just wondering about severity.

   It's a pretty regular event -- there's often a segfault or other
uncontrolled exit when running btrfs check on a broken filesystem.
It's usually hard to say whether that kind of thing (in --repair mode)
is causing additional corruption, or whether it's not fixing anything,
or whether it's fixing something and exposing the next error down.

> Or do I understand this wrong?
> 
> Asking because Dimitri John Ledkov stepped down as Debian's maintainer of
> this package, and I'm taking up the mantle (with Nicholas D Steeves being
> around) -- modulo any updates other than important bug fixes being on hold
> because of Debian's freeze.  Thus, I wonder if this is important enough to
> ask for a freeze exception.

   My ha'penn'orth: it's probably not worth asking for a freeze
exception -- I don't think it makes normal operation of the btrfs
progs actively dangerous, but it's increasing risk somewhat on what
are generally pretty rare operations in the lifetime of a filesystem.
It's only the offline tools that are going to be affected here anyway
-- most of the use-cases for btrfs-progs are in telling the kernel
what to do, rather than modifying the FS directly.

   I'd say it's definitely worth fixing the issue upstream (which Qu
is doing), and then (if possible) backporting it to your maintained
packages after the Debian release.

[Other opinions are also available from alternative vendors].

   Hugo.

-- 
Hugo Mills | Well, sir, the floor is yours. But remember, the
hugo@... carfax.org.uk | roof is ours!
http://carfax.org.uk/  |
PGP: E2AB1DE4  | The Goons




Re: parent transid verify failed / FS wont mount / help please!

2019-03-25 Thread Hugo Mills
On Mon, Mar 25, 2019 at 10:51:24PM +, berodual_xyz wrote:
> Running "btrfs check" on the 3rd of the 4 devices the volume consists of 
> crashes with a trace:

   Just for the record, it doesn't matter which device you use for
btrfs check. You're running it on the whole filesystem, not just one
device. The device just serves to identify which FS you're running it
on. (The btrfs check code will scan all the available block devices
for the other pieces of the FS).

   Hugo.

> ##
> $ btrfs check --readonly /dev/sdd
> Opening filesystem to check...
> parent transid verify failed on 1048576 wanted 60234 found 60230
> parent transid verify failed on 1048576 wanted 60234 found 60230
> Ignoring transid failure
> volumes.c:1762: btrfs_chunk_readonly: BUG_ON `!ce` triggered, value 1
> btrfs[0x426fdc]
> btrfs(btrfs_chunk_readonly+0x98)[0x429acd]
> btrfs(btrfs_read_block_groups+0x1c1)[0x41cd44]
> btrfs(btrfs_setup_all_roots+0x368)[0x416540]
> btrfs[0x416a8a]
> btrfs(open_ctree_fs_info+0xd0)[0x416bcc]
> btrfs(cmd_check+0x591)[0x45f431]
> btrfs(main+0x24a)[0x40ca02]
> /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fa320a44b35]
> btrfs[0x40c509]
> [1]22848 abort  btrfs check --readonly /dev/sdd
> ##
> 
> Trying to mount "ro,usebackuproot" shows bad superblock and the following errors 
> in /var/log/messages:
> 
> ##
> [33814.360633] BTRFS info (device sdd): trying to use backup root at mount time
> [33814.360637] BTRFS info (device sdd): using free space tree
> [33814.360638] BTRFS info (device sdd): has skinny extents
> [33814.361708] BTRFS error (device sdd): parent transid verify failed on 
> 1048576 wanted 60234 found 60230
> [33814.361764] BTRFS error (device sdd): failed to read chunk root
> [33814.373140] BTRFS error (device sdd): open_ctree failed
> ##
> 
> 
> Again, thank you very much for all help!
> 
> 
> 
> Sent with ProtonMail Secure Email.
> 
> ‐‐‐ Original Message ‐‐‐
> On Monday, March 25, 2019 11:44 PM, berodual_xyz 
>  wrote:
> 
> > Thank you very much Hugo,
> >
> > the underlying devices are based on HW raid6 and effectively "stitched" 
> > together. Losing any of those would mean losing all data, so much is 
> > clear.
> >
> > My concern was not so much bitrot / silent data corruption but I would not 
> > have expected disabled data checksumming to be a disadvantage at recovering 
> > from the supposed corruption now.
> >
> > Does anyone have any input on how to restore files based on inode no. from 
> > the tree dump that I have?
> >
> > "usebackuproot,ro" did not succeed either.
> >
> > Much appreciate the input!
> >
> > Sent with ProtonMail Secure Email.
> >
> > ‐‐‐ Original Message ‐‐‐
> > On Monday, March 25, 2019 11:38 PM, Hugo Mills h...@carfax.org.uk wrote:
> >
> > > On Mon, Mar 25, 2019 at 10:26:29PM +, berodual_xyz wrote:
> > >
> > > > Dear all,
> > > > on a large btrfs based filesystem (multi-device raid0 - all devices 
> > > > okay, nodatacow,nodatasum...)
> > >
> > > Ouch. I think the only thing you could have done to make the FS
> > > more fragile is mounting with nobarrier(*). Frankly, anything you're
> > > getting off it is a bonus. RAID-0 gives you no duplicate copy,
> > > nodatacow implies nodatasum, and nodatasum doesn't even give you the
> > > ability to *detect* data corruption, let alone fix it.
> > > With that configuration, I'd say pretty much by definition the
> > > contents of the FS are considered to be discardable.
> > > Restoring from backups is the recommended approach with transid
> > > failures.
> > > (*) Don't do that.
> > >
> > > > I experienced severe filesystem corruption, most likely due to a hard 
> > > > reset with inflight data.
> > > > The system cannot mount (also not with "ro,nologreplay" / 
> > > > "nospace_cache" etc.).
> > >
> > > Given how close the transids are, have you tried
> > > "ro,usebackuproot"? That's about your only other option at this
> > > point. But, if btrfs restore isn't working, then usebackuproot probably
> > > won't either.
> > >
> > > > Running "btrfs restore" I got a reasonable amount of data backed up, 
> > > > but a large chunk is missing.
> > > > "btrfs check" gives the following error:

-- 
Hugo Mills | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: parent transid verify failed / FS wont mount / help please!

2019-03-25 Thread Hugo Mills
On Mon, Mar 25, 2019 at 10:44:15PM +, berodual_xyz wrote:
> Thank you very much Hugo,
> 
> the underlying devices are based on HW raid6 and effectively "stitched" 
> together. Losing any of those would mean losing all data, so much is clear.
> 
> My concern was not so much bitrot / silent data corruption but I would not 
> have expected disabled data checksumming to be a disadvantage at recovering 
> from the supposed corruption now.

   OK, so it's not quite as bad a case as I painted. Turning off all
of the btrfs data-protection features still isn't something you'd do
to data you're friends with. However, it shouldn't directly affect the
recoverability of the data (assuming you had RAID-1 metadata).

   The main problem is that you've had a transid error, which is
pretty much universally fatal. There's a description of what that
means in the FAQ here:
https://btrfs.wiki.kernel.org/index.php/FAQ#What_does_.22parent_transid_verify_failed.22_mean.3F

> Does anyone have any input on how to restore files based on inode no. from 
> the tree dump that I have?

   I'm not sure what you mean by "tree dump". Do you mean
btrfs-debug-tree? Or btrfs-image? Or something else? In any case, none
of those are likely to help all that much. The metadata is
corrupted in a way that shouldn't ever happen, and where it's really
hard to work out how to fix it, even with an actual human expert
involved.  (It's why there's no btrfs check fix for this situation --
you simply can't take the metadata broken in this way and make much
sense out of it).

   Hugo.

> "usebackuproot,ro" did not succeed either.
> 
> Much appreciate the input!
> 
> 
> Sent with ProtonMail Secure Email.
> 
> ‐‐‐ Original Message ‐‐‐
> On Monday, March 25, 2019 11:38 PM, Hugo Mills  wrote:
> 
> > On Mon, Mar 25, 2019 at 10:26:29PM +, berodual_xyz wrote:
> >
> > > Dear all,
> > > on a large btrfs based filesystem (multi-device raid0 - all devices okay, 
> > > nodatacow,nodatasum...)
> >
> > Ouch. I think the only thing you could have done to make the FS
> > more fragile is mounting with nobarrier(*). Frankly, anything you're
> > getting off it is a bonus. RAID-0 gives you no duplicate copy,
> > nodatacow implies nodatasum, and nodatasum doesn't even give you the
> > ability to detect data corruption, let alone fix it.
> > With that configuration, I'd say pretty much by definition the
> > contents of the FS are considered to be discardable.
> > Restoring from backups is the recommended approach with transid
> > failures.
> > (*) Don't do that.
> >
> > > I experienced severe filesystem corruption, most likely due to a hard 
> > > reset with inflight data.
> > > The system cannot mount (also not with "ro,nologreplay" / "nospace_cache" 
> > > etc.).
> >
> > Given how close the transids are, have you tried
> > "ro,usebackuproot"? That's about your only other option at this
> > point. But, if btrfs restore isn't working, then usebackuproot probably
> > won't either.
> >
> > > Running "btrfs restore" I got a reasonable amount of data backed up, but 
> > > a large chunk is missing.
> > > "btrfs check" gives the following error:
> > >

-- 
Hugo Mills | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: parent transid verify failed / FS wont mount / help please!

2019-03-25 Thread Hugo Mills
On Mon, Mar 25, 2019 at 10:26:29PM +, berodual_xyz wrote:
> Dear all,
> 
> on a large btrfs based filesystem (multi-device raid0 - all devices okay, 
> nodatacow,nodatasum...) 

   Ouch. I think the only thing you could have done to make the FS
more fragile is mounting with nobarrier(*). Frankly, anything you're
getting off it is a bonus. RAID-0 gives you no duplicate copy,
nodatacow implies nodatasum, and nodatasum doesn't even give you the
ability to *detect* data corruption, let alone fix it.

   With that configuration, I'd say pretty much by definition the
contents of the FS are considered to be discardable.

   Restoring from backups is the recommended approach with transid
failures.

(*) Don't do that.

> I experienced severe filesystem corruption, most likely due to a hard reset 
> with inflight data.
> The system cannot mount (also not with "ro,nologreplay" / "nospace_cache" 
> etc.).

   Given how close the transids are, have you tried
"ro,usebackuproot"? That's about your only other option at this
point. But, if btrfs restore isn't working, then usebackuproot probably
won't either.

> Running "btrfs restore" I got a reasonable amount of data backed up, but a 
> large chunk is missing.
> 
> "btrfs check" gives the following error:
> 
> ##
> $ btrfs check -b /dev/sdd
> Opening filesystem to check...
> parent transid verify failed on 1048576 wanted 60234 found 60230
> parent transid verify failed on 1048576 wanted 60234 found 60230
> Ignoring transid failure
> parent transid verify failed on 55432763981824 wanted 60233 found 60235
> parent transid verify failed on 55432763981824 wanted 60233 found 60235
> Ignoring transid failure
> parent transid verify failed on 55432753725440 wanted 60232 found 60235
> parent transid verify failed on 55432753725440 wanted 60232 found 60235
> Ignoring transid failure
> parent transid verify failed on 55432764063744 wanted 60233 found 60235
> parent transid verify failed on 55432764063744 wanted 60233 found 60235
> Ignoring transid failure
> Checking filesystem on /dev/sdd
> UUID: 8b19ff46-3f42-4f51-be6b-5fc8a7d8f2cd
> [1/7] checking root items
> Error: could not find extent items for root 268
> ERROR: failed to repair root items: No such file or directory
> ##
> 
> I have a complete "dump tree" zip but it's a couple of GB.
> 
> Some sources on the net say to run "btrfs check --init-extent-tree" but I 
> would like to reach out first.

   Probably not wise. "Sources on the net" are frequently wrong when
it comes to btrfs recovery.

> btrfs progs version is 4.20.2 and kernel is 4.20.17

   At least those aren't out of date. The only positive thing here...

   Hugo.

-- 
Hugo Mills | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH v5] btrfs-progs: dump-tree: add noscan option

2019-03-12 Thread Hugo Mills
> > int slot;
> > int extent_only = 0;
> > int device_only = 0;
> > @@ -222,6 +224,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
> > int roots_only = 0;
> > int root_backups = 0;
> > int traverse = BTRFS_PRINT_TREE_DEFAULT;
> > +   int dev_optind;
> > unsigned open_ctree_flags;
> > u64 block_only = 0;
> > struct btrfs_root *tree_root_scan;
> > @@ -239,8 +242,8 @@ int cmd_inspect_dump_tree(int argc, char **argv)
> > optind = 0;
> > while (1) {
> > int c;
> > -   enum { GETOPT_VAL_FOLLOW = 256, GETOPT_VAL_DFS,
> > -  GETOPT_VAL_BFS };
> > +   enum { GETOPT_VAL_FOLLOW = 256, GETOPT_VAL_DFS, GETOPT_VAL_BFS,
> > +  GETOPT_VAL_NOSCAN};
> > static const struct option long_options[] = {
> > { "extents", no_argument, NULL, 'e'},
> > { "device", no_argument, NULL, 'd'},
> > @@ -252,6 +255,7 @@ int cmd_inspect_dump_tree(int argc, char **argv)
> > { "follow", no_argument, NULL, GETOPT_VAL_FOLLOW },
> > { "bfs", no_argument, NULL, GETOPT_VAL_BFS },
> > { "dfs", no_argument, NULL, GETOPT_VAL_DFS },
> > +   { "noscan", no_argument, NULL, GETOPT_VAL_NOSCAN },
> > { NULL, 0, NULL, 0 }
> > };
> >  
> > @@ -313,24 +317,49 @@ int cmd_inspect_dump_tree(int argc, char **argv)
> > case GETOPT_VAL_BFS:
> > traverse = BTRFS_PRINT_TREE_BFS;
> > break;
> > +   case GETOPT_VAL_NOSCAN:
> > +   open_ctree_flags |= OPEN_CTREE_NO_DEVICES;
> > +   break;
> > default:
> > usage(cmd_inspect_dump_tree_usage);
> > }
> > }
> >  
> > -   if (check_argc_exact(argc - optind, 1))
> > +   if (check_argc_min(argc - optind, 1))
> > usage(cmd_inspect_dump_tree_usage);
> >  
> > -   ret = check_arg_type(argv[optind]);
> > -   if (ret != BTRFS_ARG_BLKDEV && ret != BTRFS_ARG_REG) {
> > +   dev_optind = optind;
> > +   while (dev_optind < argc) {
> > +   int fd;
> > +   struct btrfs_fs_devices *fs_devices;
> > +   u64 num_devices;
> > +
> > +   ret = check_arg_type(argv[optind]);
> > +   if (ret != BTRFS_ARG_BLKDEV && ret != BTRFS_ARG_REG) {
> > +   if (ret < 0) {
> > +   errno = -ret;
> > +   error("invalid argument %s: %m", 
> > argv[dev_optind]);
> > +   } else {
> > +   error("not a block device or regular file: %s",
> > +  argv[dev_optind]);
> > +   }
> > +   }
> > +   fd = open(argv[dev_optind], O_RDONLY);
> > +   if (fd < 0) {
> > +   error("cannot open %s: %m", argv[dev_optind]);
> > +   return -EINVAL;
> > +   }
> > +   ret = btrfs_scan_one_device(fd, argv[dev_optind], &fs_devices,
> > +   &num_devices,
> > +   BTRFS_SUPER_INFO_OFFSET,
> > +   SBREAD_DEFAULT);
> > +   close(fd);
> > if (ret < 0) {
> > errno = -ret;
> > -   error("invalid argument %s: %m", argv[optind]);
> > -   } else {
> > -   error("not a block device or regular file: %s",
> > - argv[optind]);
> > +   error("device scan %s: %m", argv[dev_optind]);
> > +   return ret;
> > }
> > -   goto out;
> > +   dev_optind++;
> > }
> >  
> > printf("%s\n", PACKAGE_STRING);
> > 

-- 
Hugo Mills | O tempura! O moresushi!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Better distribution of RAID1 data?

2019-02-15 Thread Hugo Mills
On Fri, Feb 15, 2019 at 10:40:56AM -0500, Brian B wrote:
> It looks like the btrfs code currently uses the total space available on
> a disk to determine where it should place the two copies of a file in
> RAID1 mode.  Wouldn't it make more sense to use the _percentage_ of free
> space instead of the number of free bytes?

   I don't think it'll make much difference. I spent a long time a
couple of years ago trying to prove (mathematically) that the current
strategy always produces an optimal usage of the available space -- I
wasn't able to complete the theorem, but a lot of playing around with
it convinced me that at least if there are cases where it's
non-optimal, they're bizarre corner cases.

> For example, I have two disks in my array that are 8 TB, plus an
> assortment of 3,4, and 1 TB disks.  With the current allocation code,
> btrfs will use my two 8 TB drives exclusively until I've written 4 TB of
> files, then it will start using the 4 TB disks, then eventually the 3,
> and finally the 1 TB disks.  If the code used a percentage figure
> instead, it would spread the allocations much more evenly across the
> drives, ideally spreading load and reducing drive wear.
> 
> Is there a reason this is done this way, or is it just something that
> hasn't had time for development?

   I'd guess it's the easiest algorithm to use, plus it seems to
provide optimal space usage (almost?) all of the time. 
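
   As a rough sketch of the current strategy (illustrative Python,
not the kernel code):

# For each new RAID-1 chunk, pick the two devices with the most
# unallocated bytes; as the big devices fill up, the smaller ones get
# drawn in automatically.
def pick_raid1_devices(unallocated):
    """unallocated: dict of devid -> free bytes; returns two devids."""
    return sorted(unallocated, key=unallocated.get, reverse=True)[:2]

free = {1: 8000, 2: 8000, 3: 4000, 4: 3000, 5: 1000}   # GiB, say
print(pick_raid1_devices(free))   # -> [1, 2] until they fall to ~4000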

   Hugo.

-- 
Hugo Mills | Be pure.
hugo@... carfax.org.uk | Be vigilant.
http://carfax.org.uk/  | Behave.
PGP: E2AB1DE4  |   Torquemada, Nemesis




Re: corrupt leaf: root=1 block=57567265079296 slot=83, bad key order

2019-02-14 Thread Hugo Mills
On Thu, Feb 14, 2019 at 08:25:26PM +0800, Qu Wenruo wrote:
> On 2019/2/14 下午7:58, Jesper Utoft wrote:
> > Hello Fellow BTRFS users.
> > 
> > I have run into the bad key order issue.
> > corrupt leaf: root=1 block=57567265079296 slot=83, bad key order, prev
> > (18446744073709551605 0 57707594776576) current (18446726481523507189
> > 0 57709742260224)
> > The lines repeats over and over..
> > 
> > I read a thread between Hugo Mills and Eric Wolf about a similar issue
> > and I have gathered the same info. 
> Now we have all the needed info.
> 
> > 
> > I understand that it probably is hardware related, i have been running
> > memtest for 60h+ to see if i could reproduce it.
> > I also tried to run btrfs check --recover but it did not help.
> > 
> > My questions is if it can be fixed?
> 
> Yes, but only manual patching is possible yet.

   David: What needs to be done to get the bitflip-in-key patches
added to btrfs check? They've been lurking in some patch stack for
literally years, and would have dealt with this one easily.

[snip]
> Thankfully, all keys around give us a pretty good idea what the original
> value should be: (FREE_SPACE UNTYPED 57709742260224).
> 
> And for the raw value:
> bad:  0xeff5
> good: 0xfff5
> ^
> e->f, one bit get flipped.
> (UNTYPED is the same value for UNKNOWN.0, so don't worry about that).
> 
> I have created a special branch for you:
> https://github.com/adam900710/btrfs-progs/tree/dirty_fix
> 
> Just compile that btrfs-progs, no need to install, then excute the
> following command inside btrfs-progs directory:
> 
> # ./btrfs-corrupt-block -X 

   BUT, don't do it until you've found and replaced the bad RAM that
broke it in the first place.

> And your report just reminds me to update the write-time tree block
> checker

   Looking forward to dealing with a whole new type of "btrfs is
broken!" complaints on IRC (followed by "can't I just let it carry on
regardless?"). ;)

   Hugo.

-- 
Hugo Mills | Hickory Dickory Dock,
hugo@... carfax.org.uk | Three mice ran up the clock.
http://carfax.org.uk/  | The clock struck one,
PGP: E2AB1DE4  | The other two escaped with minor injuries




Re: RAID1 filesystem not mounting

2019-02-02 Thread Hugo Mills
On Fri, Feb 01, 2019 at 11:28:27PM -0500, Alan Hardman wrote:
> I have a Btrfs filesystem using 6 partitionless disks in RAID1 that's failing 
> to mount. I've tried the common recommended safe check options, but I haven't 
> gotten the disk to mount at all, even with -o ro,recovery. If necessary, I 
> can try to use the recovery to another filesystem, but I have around 18 TB of 
> data on the filesystem that won't mount, so I'd like to avoid that if there's 
> some other way of recovering it.
> 
> Versions:
> btrfs-progs v4.19.1
> Linux localhost 4.20.6-arch1-1-ARCH #1 SMP PREEMPT Thu Jan 31 08:22:01 UTC 
> 2019 x86_64 GNU/Linux
> 
> Based on my understanding of how RAID1 works with Btrfs, I would expect a 
> single disk failure to not prevent the volume from mounting entirely, but I'm 
> only seeing one disk with errors according to dmesg output, maybe I'm 
> misinterpreting it:
> 
> [  534.519437] BTRFS warning (device sdd): 'recovery' is deprecated, use 
> 'usebackuproot' instead
> [  534.519441] BTRFS info (device sdd): trying to use backup root at mount 
> time
> [  534.519443] BTRFS info (device sdd): disk space caching is enabled
> [  534.519446] BTRFS info (device sdd): has skinny extents
> [  536.306194] BTRFS info (device sdd): bdev /dev/sdc errs: wr 23038942, rd 
> 22208378, flush 1, corrupt 29486730, gen 2933
> [  556.126928] BTRFS critical (device sdd): corrupt leaf: root=2 
> block=25540634836992 slot=45, unexpected item end, have 13882 expect 13898

   It's worth noting that 13898-13882 = 16, which is a power of
two. This means that you most likely have a single-bit error in your
metadata. That, plus the checksum not being warned about, would
strongly suggest that you have bad RAM. I would recommend that you
check your RAM first before trying anything else that would write to
your filesystem (including btrfs check --repair).

   Hugo.

> [  556.134767] BTRFS critical (device sdd): corrupt leaf: root=2 
> block=25540634836992 slot=45, unexpected item end, have 13882 expect 13898
> [  556.150278] BTRFS critical (device sdd): corrupt leaf: root=2 
> block=25540634836992 slot=45, unexpected item end, have 13882 expect 13898
> [  556.150310] BTRFS error (device sdd): failed to read block groups: -5
> [  556.216418] BTRFS error (device sdd): open_ctree failed
> 
> If helpful, here is some lsblk output:
> 
> NAME   TYPE   SIZE FSTYPE MOUNTPOINT UUID
> sdadisk 111.8G   
> ├─sda1 part   1.9M   
> └─sda2 part 111.8G ext4   /  c598dfdf-d6e7-47d3-888a-10f5f53fa338
> sdbdisk   7.3T btrfs 8f26ae2d-84b5-47d7-8f19-64b0ef5a481b
> sdcdisk   7.3T btrfs 8f26ae2d-84b5-47d7-8f19-64b0ef5a481b
> sdddisk   7.3T btrfs 8f26ae2d-84b5-47d7-8f19-64b0ef5a481b
> sdedisk   7.3T btrfs 8f26ae2d-84b5-47d7-8f19-64b0ef5a481b
> sdfdisk   2.7T btrfs 8f26ae2d-84b5-47d7-8f19-64b0ef5a481b
> sdhdisk   2.7T btrfs 8f26ae2d-84b5-47d7-8f19-64b0ef5a481b
> 
> My main system partition on sda mounts fine and is usable to work with the 
> btrfs filesystem that's having issues.
> 
> Running "btrfs check /dev/sdb" exits with this:
> 
> Opening filesystem to check...
> Incorrect offsets 13898 13882
> ERROR: cannot open file system
> 
> Also, "btrfs restore -Dv /dev/sdb /tmp" outputs some of the files on the 
> filesystem but not all of them. I'm not sure if this is limited to the files 
> on that physical disk, or if there's a bigger issue with the filesystem. I'm 
> not sure what the best approach from here is, so any advice would be great.

-- 
Hugo Mills | If it's December 1941 in Casablanca, what time is it
hugo@... carfax.org.uk | in New York?
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Rick Blaine, Casablanca




Re: [PATCH 3/4] Btrfs: check if destination root is read-only for deduplication

2019-01-31 Thread Hugo Mills
On Thu, Jan 31, 2019 at 04:39:22PM +, Filipe Manana wrote:
> On Thu, Dec 13, 2018 at 4:08 PM David Sterba  wrote:
> >
> > On Wed, Dec 12, 2018 at 06:05:58PM +, fdman...@kernel.org wrote:
> > > From: Filipe Manana 
> > >
> > > Checking if the destination root is read-only was being performed only for
> > > clone operations. Make deduplication check it as well, as it does not make
> > > sense to not do it, even if it is an operation that does not change the
> > > file contents (such as defrag for example, which checks first if the root
> > > is read-only).
> >
> > And this is also change in user-visible behaviour of dedupe, so this
> > needs to be verified if it's not breaking existing tools.
> 
> Have you had the chance to do such verification?
> 
> This actually conflicts with send. Send does not expect a root/tree to
> change, and dedupe on read-only roots happening
> in parallel with send is going to cause all sorts of unexpected and
> undesired problems...
> 
> This is a problem introduced by dedupe ioctl when it landed, since
> send existed for a longer time (when nothing else was
> allowed to change read-only roots, including defrag).
> 
> I understand it can break some applications, but adding another solution
> such as preventing send and dedupe from running in parallel
> (erroring out, or blocking and waiting for each other, etc.) is going to be
> really ugly. There's always the workaround for apps to set the
> subvolume
> to RW mode, do the dedupe, then switch it back to RO mode.

   Only if you want your incremental send chain to break on the way
past...

   I think it's fairly clear by now (particularly from the last thread
on this a couple of weeks ago) that making RO subvols RW and then back
again is a fast way to broken incremental receives.

   Hugo.

-- 
Hugo Mills | A clear conscience. Where did you get this taste for
hugo@... carfax.org.uk | luxuries, Bernard?
http://carfax.org.uk/  |  Sir Humphrey
PGP: E2AB1DE4  |   Yes, Prime Minister




Re: Incremental receive completes succesfully despite missing files

2019-01-22 Thread Hugo Mills
On Tue, Jan 22, 2019 at 12:37:34PM -0700, Chris Murphy wrote:
> On Tue, Jan 22, 2019 at 10:57 AM Andrei Borzenkov  wrote:
> > "Related" is in the eye of the beholder. Clone subvolume, delete content
> > of clone, reflink content of another volume into clone. Are original
> > subvolume and clone related now? Clone still have parent UUID ...
> 
> I'm not talking about the -c option. Just -p. Conceptually -c is even more
> complicated and effectively supports multiple "parents" as it can be
> specified more than once.

   I tend to use the term "reference" for the -p subvolume, and the -c
subvolume(s).
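
   To illustrate with a typical incremental send (paths hypothetical):

# btrfs send -p /snaps/day1 /snaps/day2 | btrfs receive /backup

where /snaps/day1 is the reference subvolume that both sides already
hold, so only the differences get streamed.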

   Hugo.

-- 
Hugo Mills | The future isn't what it used to be.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: roadmap for btrfs

2019-01-16 Thread Hugo Mills
   Hi, Stefan,

On Wed, Jan 16, 2019 at 04:34:15PM +0100, Stefan K wrote:
> Hello,
> 
> Does a roadmap exist, or something like a "what to do first/next" list?
> I saw the project ideas[1] and there are a lot of interesting things in it
> (like read/write caches, per-subvolume mount options, block devices, etc),
> but there is no plan or ordering of the ideas. Does btrfs have something like that?

   No, there's no such thing. There's not even a concrete list of
what's currently being worked on, or what people are planning on doing
next. We tried it a while ago, but there was no good way to keep it up
to date.

   Hugo.

-- 
Hugo Mills | Anyone using a computer to generate random numbers
hugo@... carfax.org.uk | is, of course, in a state of sin.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Jon von Neumann




Re: question about creating a raid10

2019-01-16 Thread Hugo Mills
On Wed, Jan 16, 2019 at 03:36:25PM +0100, Stefan K wrote:
> Hello,
> 
> if I create a raid10 it looks like that:
> mkfs.btrfs -m raid10 -d raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde
> 
> but if I have different JBODs and I want every mirror of a raid10 to be on a
> different JBOD, how can I achieve that? In zfs it looks like this:
[snip]
> how can I be sure that btrfs does the same?

   I'm afraid you can't. It would take modifications of the chunk
allocator to achieve this (and you'd also need to store the metadata
somewhere as to which devices were in which failure domain).

   Hugo.

-- 
Hugo Mills | "It was half way to Rivendell when the drugs began
hugo@... carfax.org.uk | to take hold"
http://carfax.org.uk/  |  Hunter S Tolkien
PGP: E2AB1DE4  |Fear and Loathing in Barad Dûr




Re: [PATCH] btrfs: Do mandatory tree block check before submitting bio

2019-01-16 Thread Hugo Mills
On Wed, Jan 16, 2019 at 08:26:35PM +0800, Qu Wenruo wrote:
> 
> 
> On 2019/1/16 8:01 PM, Hugo Mills wrote:
> >Hi, Qu,
> > 
> > On Wed, Jan 16, 2019 at 07:53:08PM +0800, Qu Wenruo wrote:
> >> There are at least 2 reports about memory bit flip sneaking into on-disk
> >> data.
> >>
> >> Currently we only have a relaxed check triggered at
> >> btrfs_mark_buffer_dirty() time, as it's not mandatory, only for
> >> CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build.
> >>
> >> This patch will address the hole by triggering comprehensive check on
> >> tree blocks before writing it back to disk.
> >>
> >> The timing is set to csum_tree_block() where @verify == 0.
> >> At that timing, we're generation csum for tree blocks before submitting
> >> the metadata bio, so we could avoid all the unnecessary calls at
> >> btrfs_mark_buffer_dirty(), but still catch enough error.
> > 
> >I agree wholeheartedly with the idea of this change. Just one
> > question:
> > 
> >How does this get reported to the user/administrator? As I
> > understand it, a detectably corrupt metadata page will generate an I/O
> > error from the filesystem before it's written to disk? How will this
> > show up in kernel logs?
> 
> Well, you caught me.
> 
> I haven't tried the error case, and in fact if it fails, it fails by
> triggering kernel BUG_ON(), thus you may not have a chance to see the
> error message from btrfs module.
> 
> > Is it distinguishable in any way from a
> > similar error that was generated on reading such a corrupt metadata
> > node from the disk?
> 
> Kind of distinguishable for this patch, when you hit kernel BUG_ON() at
> fs/btrfs/extent_io.c:4016 then it's definitely from this patch. :P

   Haha. :)

> >Basically, I want to be able to distinguish this case (error
> > detected when writing) from the existing case (error detected when
> > reading) when someone shows up on IRC with a "broken filesystem".
> 
> Definitely, I'll address this and the BUG_ON() in next version.

   The error-on-read case gives us a pretty good report on what's
wrong and where. Typically something like "bad key order" or "wrong
item size", plus the logical address of the metadata chunk so we can
dump it with debug-tree -b.

   Having the same messages from this call-site to indicate the kind
of error found, and also something to indicate that it was detected
before hitting disk (i.e. which call-site for the checks triggered the
error) would be, I think, the minimal information needed. If we could
also have the human-readable dump of the full metadata page logged as
well, that would be ideal -- we won't be able to use debug-tree to
diagnose the issue afterwards, as it won't have reached the disk.
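
   (For comparison, in the error-on-read case the follow-up is
something like this, with the logical address taken from the error
message -- address and device purely illustrative:

# btrfs-debug-tree -b 30539776 /dev/sda

and that's precisely what we can't do for a block that never reached
the disk.)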

   Thanks,
   Hugo.

> Thanks,
> Qu
> 
> > 
> >Hugo.
> > 
> >> Reported-by: Leonard Lausen 
> >> Signed-off-by: Qu Wenruo 
> >> ---
> >>  fs/btrfs/disk-io.c | 10 ++
> >>  1 file changed, 10 insertions(+)
> >>
> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >> index 8da2f380d3c0..45bf6be9e751 100644
> >> --- a/fs/btrfs/disk-io.c
> >> +++ b/fs/btrfs/disk-io.c
> >> @@ -313,6 +313,16 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
> >>return -EUCLEAN;
> >>}
> >>} else {
> >> +  /*
> >> +   * Here we're calculating csum before writing it to disk,
> >> +   * do comprehensive check here to catch memory corruption
> >> +   */
> >> +  if (btrfs_header_level(buf))
> >> +  err = btrfs_check_node(fs_info, buf);
> >> +  else
> >> +  err = btrfs_check_leaf_full(fs_info, buf);
> >> +  if (err < 0)
> >> +  return err;
> >>write_extent_buffer(buf, result, 0, csum_size);
> >>}
> >>  
> > 
> 




-- 
Hugo Mills | Friends: people who know you well, but like you
hugo@... carfax.org.uk | anyway.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs: Do mandatory tree block check before submitting bio

2019-01-16 Thread Hugo Mills
   Hi, Qu,

On Wed, Jan 16, 2019 at 07:53:08PM +0800, Qu Wenruo wrote:
> There are at least 2 reports about memory bit flip sneaking into on-disk
> data.
> 
> Currently we only have a relaxed check triggered at
> btrfs_mark_buffer_dirty() time, as it's not mandatory, only for
> CONFIG_BTRFS_FS_CHECK_INTEGRITY enabled build.
> 
> This patch will address the hole by triggering comprehensive check on
> tree blocks before writing it back to disk.
> 
> The timing is set to csum_tree_block() where @verify == 0.
> At that timing, we're generating csums for tree blocks before submitting
> the metadata bio, so we could avoid all the unnecessary calls at
> btrfs_mark_buffer_dirty(), but still catch enough error.

   I agree wholeheartedly with the idea of this change. Just one
question:

   How does this get reported to the user/administrator? As I
understand it, a detectably corrupt metadata page will generate an I/O
error from the filesystem before it's written to disk? How will this
show up in kernel logs? Is it distinguishable in any way from a
similar error that was generated on reading such a corrupt metadata
node from the disk?

   Basically, I want to be able to distinguish this case (error
detected when writing) from the existing case (error detected when
reading) when someone shows up on IRC with a "broken filesystem".

   Hugo.

> Reported-by: Leonard Lausen 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/disk-io.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 8da2f380d3c0..45bf6be9e751 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -313,6 +313,16 @@ static int csum_tree_block(struct btrfs_fs_info *fs_info,
>   return -EUCLEAN;
>   }
>   } else {
> + /*
> +  * Here we're calculating csum before writing it to disk,
> +  * do comprehensive check here to catch memory corruption
> +  */
> + if (btrfs_header_level(buf))
> + err = btrfs_check_node(fs_info, buf);
> + else
> + err = btrfs_check_leaf_full(fs_info, buf);
> + if (err < 0)
> + return err;
>   write_extent_buffer(buf, result, 0, csum_size);
>   }
>  

-- 
Hugo Mills | ... one ping(1) to rule them all, and in the
hugo@... carfax.org.uk | darkness bind(2) them.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Illiad




Re: BTRFS crash: what can I do more?

2019-01-05 Thread Hugo Mills
> [1676572.902040] DR3:  DR6: fffe0ff0 DR7: 0400
> [1676572.902042] Call Trace:
> [1676572.902051] ? __slab_free+0x225/0x340
> [1676572.902107] ? btrfs_merge_delayed_refs+0x31d/0x360 [btrfs]
> [1676572.902148] __btrfs_run_delayed_refs+0x20e/0x1010 [btrfs]
> [1676572.902193] ? btree_set_page_dirty+0xe/0x10 [btrfs]
> [1676572.902233] btrfs_run_delayed_refs+0x80/0x190 [btrfs]
> [1676572.902274] btrfs_start_dirty_block_groups+0x2c3/0x400 [btrfs]
> [1676572.902320] btrfs_commit_transaction+0xcb/0x870 [btrfs]
> [1676572.902364] ? start_transaction+0xa0/0x410 [btrfs]
> [1676572.902409] transaction_kthread+0x15c/0x190 [btrfs]
> [1676572.902416] kthread+0x120/0x140
> [1676572.902458] ? btrfs_cleanup_transaction+0x560/0x560 [btrfs]
> [1676572.902463] ? kthread_bind+0x40/0x40
> [1676572.902469] ret_from_fork+0x35/0x40
> [1676572.902474] ---[ end trace f2212539a1b94aed ]---
> [1676572.902490] BTRFS: error (device sda) in __btrfs_free_extent:6953: errno=-28 No space left
> [1676572.902505] BTRFS info (device sda): forced readonly
> [1676572.902511] BTRFS: error (device sda) in btrfs_run_delayed_refs:3057: errno=-28 No space left
> [1683350.961567] kauditd_printk_skb: 1140 callbacks suppressed

-- 
Hugo Mills | Great films about cricket: 200/1: A Pace Odyssey
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Specifying block group size

2018-12-22 Thread Hugo Mills
On Fri, Dec 21, 2018 at 10:18:40PM -0800, Raymond Jennings wrote:
> How do I specify the size of a block group at mkfs?

   You don't -- there's no explicit control over it. The FS will
decide based on the overall size of the filesystem in question.

   Typically, data groups are made of 1 GiB chunks, and metadata
groups are made of 256 MiB chunks (where the RAID level will determine
the number of chunks in a group and the amount of usable space of the
group).

   Hugo.

> 
> Like, for example, saying that data groups will be 1GiB, but metadata
> groups will be 1MiB?
> 
> I noticed that they had different default sizes based on the profile
> (dup vs single vs raid)

-- 
Hugo Mills | Beware geeks bearing GIFs
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Understanding "btrfs filesystem usage"

2018-10-29 Thread Hugo Mills
On Mon, Oct 29, 2018 at 05:57:10PM -0400, Remi Gauvin wrote:
> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
> > I want to know how many free space is left and have problems in
> > interpreting the output of: 
> > 
> > btrfs filesystem usage
> > btrfs filesystem df
> > btrfs filesystem show
> > 
> >
> 
> In my not so humble opinion, the filesystem usage command has the
> easiest to understand output.  It' lays out all the pertinent information.

   Opinions are divided. I find it almost impossible to read, and
always use btrfs fi df and btrfs fi show together.

   There are short tutorials on how to read the output in both cases in
the FAQ, which is where I start out by directing people in this
instance.

   Hugo.

> You can clearly see 825GiB is allocated, with 494GiB used, therefore,
> filesystem show is actually using the "Allocated" value as "Used".
> Allocated can be thought of as "Reserved For".  As the output of the Usage
> command and df command clearly show, you have almost 400GiB space available.
> 
> Note that the btrfs commands are clearly and explicitly displaying
> values in binary units (Mi and Gi prefixes, respectively).  If you want
> the df command to match, use -h instead of -H (see man df).
> 
> An observation:
> 
> The disparity between 494GiB used and 825GiB allocated is pretty high.  This is
> probably the result of using an SSD with an older kernel.  If your
> kernel is not very recent (sorry, I forget where this was fixed,
> somewhere around 4.14 or 4.15), then consider mounting with the nossd
> option.  You can improve this by running a balance.
> 
> Something like:
> btrfs balance start -dusage=55 /mountpoint
> 
> You do *not* want to end up with all your space allocated to Data, but
> not actually used by data.  Bad things can happen if you run out of
> Unallocated space for more metadata. (not catastrophic, but awkward and
> unexpected downtime that can be a little tricky to sort out.)
> 
> 



-- 
Hugo Mills | Great oxymorons of the world, no. 8:
hugo@... carfax.org.uk | The Latest In Proven Technology
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Urgent: Need BTRFS-Expert

2018-10-17 Thread Hugo Mills
   Hi, Michael,

On Wed, Oct 17, 2018 at 09:58:31AM +0200, Michael Post wrote:
> Hello everyone,
> 
> I need a BTRFS expert for remote support.
> 
> Is there anyone who can assist me?

   This is generally the wrong approach to take in open-source
circles. Instead, if you describe your problem here on this mailing
list, you'll get *most* of the experts looking at it, rather than just
the one, and you'll generally get a much better (and easier to use)
service.

   Hugo.

-- 
Hugo Mills | The early bird gets the worm, but the second mouse
hugo@... carfax.org.uk | gets the cheese.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Hugo Mills
On Mon, Oct 15, 2018 at 05:40:40PM +0300, Anton Shepelev wrote:
> Hugo Mills to Anton Shepelev:
> 
> >>While trying to resolve free space problems, I found
> >>that I cannot interpret the output of:
> >>
> >>> btrfs filesystem show
> >>
> >>Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> >>Total devices 1 FS bytes used 34.06GiB
> >>devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> >>
> >>How come the total used value is less than the value
> >>listed for the only device?
> >
> >   "Used" on the device is the mount of space allocated.
> >"Used" on the FS is the total amount of actual data and
> >metadata in that allocation.
> >
> >   You will also need to look at the output of "btrfs fi
> >df" to see the breakdown of the 37.82 GiB into data,
> >metadata and currently unused.
> >
> >See
> >https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
> > for the details
> 
> Thank you, Hugo, understood.  mount/amount is a very fitting
> typo :-)
> 
> Do the standard `du' and `df' tools report correct values
> with btrfs?

   Well...

   du will tell you the size of the files you asked it about, but it
doesn't know about reflinks, so it'll double-count if you've got a
reflink copy of something. Other than that, it should be accurate, I
think. There's also a "btrfs fi du" which can tell you the amount of
shared and unique data as well, so you can know, for example, how much
space you'll reclaim if you delete those files.
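
   For example (path and sizes purely illustrative):

# btrfs fi du -s /home/user
     Total   Exclusive  Set shared  Filename
  10.00GiB     1.20GiB     8.80GiB  /home/user

where "Exclusive" is roughly the space you'd get back by deleting
those files.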

   df should be mostly OK, but it does sometimes get its estimate of
the total usable size of the FS wrong, particularly if the FS is
unbalanced. However, as the FS fills up, the estimate gets better,
because it gets more evenly balanced across devices over time.

   Hugo.

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Hugo Mills
On Mon, Oct 15, 2018 at 02:26:41PM +, Hugo Mills wrote:
> On Mon, Oct 15, 2018 at 05:24:08PM +0300, Anton Shepelev wrote:
> > Hello, all
> > 
> > While trying to resolve free space problems, I found that
> > I cannot interpret the output of:
> > 
> > > btrfs filesystem show
> > 
> > Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> > Total devices 1 FS bytes used 34.06GiB
> > devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> > 
> > How come the total used value is less than the value listed
> > for the only device?
> 
>"Used" on the device is the mount of space allocated. "Used" on the

s/mount/amount/

> FS is the total amount of actual data and metadata in that allocation.
> 
>You will also need to look at the output of "btrfs fi df" to see
> the breakdown of the 37.82 GiB into data, metadata and currently
> unused.
> 
>See 
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>  for the details.
> 
>Hugo.
> 

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Hugo Mills
On Mon, Oct 15, 2018 at 05:24:08PM +0300, Anton Shepelev wrote:
> Hello, all
> 
> While trying to resolve free space problems, I found that
> I cannot interpret the output of:
> 
> > btrfs filesystem show
> 
> Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> Total devices 1 FS bytes used 34.06GiB
> devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> 
> How come the total used value is less than the value listed
> for the only device?

   "Used" on the device is the mount of space allocated. "Used" on the
FS is the total amount of actual data and metadata in that allocation.

   You will also need to look at the output of "btrfs fi df" to see
the breakdown of the 37.82 GiB into data, metadata and currently
unused.

   See 
https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
 for the details.

   Hugo.

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5




Re: Which device is missing ?

2018-10-08 Thread Hugo Mills
On Mon, Oct 08, 2018 at 11:01:35PM +0200, Pierre Couderc wrote:
> On 10/08/2018 06:14 PM, Hugo Mills wrote:
> >On Mon, Oct 08, 2018 at 04:10:55PM +0000, Hugo Mills wrote:
> >>On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> >>>I am trying to make a "RAID1" with /dev/sda2 and /dev/sdb (or similar).
> >>>
> >>>But I have strange statuses or errors about "missing devices" and I
> >>>do not understand the current situation:
> >>>
> >>>
> >>>root@server:~# btrfs fi show
> >>>Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
> >>>     Total devices 1 FS bytes used 190.91GiB
> >>>     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> >>>
> >>>warning, device 1 is missing
> >>>Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
> >>>     Total devices 2 FS bytes used 116.18GiB
> >>>     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
> >>>     *** Some devices missing
> >>This looks like you've created a RAID-1 array with /dev/sda2 and
> >>/dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
> >>original [part of a] filesystem on /dev/sda2, and replacing it with a
> >>wholly different filesystem. Since the new FS on /dev/sda2 (UUID
> >>28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
> >>and the original FS was made of two devices, btrfs fi show is telling
> >>you that there's some devices missing -- /dev/sda2 is no longer part
> >>of that FS, and is therefore a missing device.
> >>
> >>I note that you've got data on both filesystems, so they must both
> >>have been mounted somewhere and had stuff put on them.
> >>
> >>I recommend doing something like this:
> >>
> >># mkdir /media/btrfs/myraid1 /media/btrfs/tmp   # create the mount points
> >># mount /dev/sdb /media/btrfs/myraid1/
> >># mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
> >># cp /media/btrfs/tmp/* /media/btrfs/myraid1 # put it where you want it
> >># umount /media/btrfs/tmp/
> >># wipefs -a /dev/sda2   # destroy the FS on sda2
> >># btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/
> >>
> >>This will copy all the data from the filesystem on /dev/sda2 into
> >>the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
> >>as the second device for the main FS.
> >>
> >>*WARNING!*
> >>
> >>Note that, since the main FS is missing a device, it will probably
> >>need to be mounted in degraded mode (-o degraded), and that on kernels
> >>earlier than (IIRC) 4.14, this can only be done *once* without the FS
> >>becoming more or less permanently read-only. On recent kernels, it
> >>_should_ be OK.
> >>
> >>*WARNING ENDS*
> >Oh, and for the record, to make a RAID-1 filesystem from scratch,
> >you simply need this:
> >
> ># mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb
> >
> >You do not need to run mkfs.btrfs on each device separately.
> >
> >Hugo.
> Thank you very much. I understand a bit better. I think that I have
> nothing of interest on /dev/sdb and that its contents are the result
> of previous trials.
> And that my system is on /dev/sda2, as:
> 
> root@server:~# df -h
> Filesystem  Size  Used Avail Use% Mounted on
> udev    3.9G 0  3.9G   0% /dev
> tmpfs   787M  8.8M  778M   2% /run
> /dev/sda2   1.9T  193G  1.7T  11% /
> tmpfs   3.9G 0  3.9G   0% /dev/shm
> tmpfs   5.0M 0  5.0M   0% /run/lock
> tmpfs   3.9G 0  3.9G   0% /sys/fs/cgroup
> /dev/sda1   511M  5.7M  506M   2% /boot/efi
> tmpfs   100K 0  100K   0% /var/lib/lxd/shmounts
> tmpfs   100K 0  100K   0% /var/lib/lxd/devlxd
> root@server:~#
> 
> Is that correct?

   Yes, it looks like you're running / from the FS on /dev/sda2.

> If yes, I suppose I should wipe data on /dev/sdb, then build the
> RAID by expanding /dev/sda2.

   Correct.

   I would recommend putting a partition table on /dev/sdb, because it
doesn't take up much space, and it's always easier to have one already
there when you need it (and there's a few things that can get confused
if there isn't a partition table).

> So I should :
> 
> wipefs -a /dev/sdb
> btrfs device add /dev/sdb /
> btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /

> Does it sound correct ? (my kernel is boot/vmlinuz-4.18.0-1-amd64)

   Yes, exactly.
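
   Once the balance finishes, you can check the result with:

# btrfs fi df /

and both the Data and Metadata lines should then show RAID1.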

   Hugo.

-- 
Hugo Mills | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4  |Derek Lowe




Re: Which device is missing ?

2018-10-08 Thread Hugo Mills
On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:
> On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> > I am trying to make a "RAID1" with /dev/sda2 and /dev/sdb (or similar).
> > 
> > But I have strange statuses or errors about "missing devices" and I
> > do not understand the current situation:
> > 
> > 
> > root@server:~# btrfs fi show
> > Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
> >     Total devices 1 FS bytes used 190.91GiB
> >     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> > 
> > warning, device 1 is missing
> > Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
> >     Total devices 2 FS bytes used 116.18GiB
> >     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
> >     *** Some devices missing
> 
>This looks like you've created a RAID-1 array with /dev/sda2 and
> /dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
> original [part of a] filesystem on /dev/sda2, and replacing it with a
> wholly different filesystem. Since the new FS on /dev/sda2 (UUID
> 28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
> and the original FS was made of two devices, btrfs fi show is telling
> you that there's some devices missing -- /dev/sda2 is no longer part
> of that FS, and is therefore a missing device.
> 
>I note that you've got data on both filesystems, so they must both
> have been mounted somewhere and had stuff put on them.
> 
>I recommend doing something like this:
> 
> # mkdir /media/btrfs/myraid1 /media/btrfs/tmp   # create the mount points
> # mount /dev/sdb /media/btrfs/myraid1/
> # mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
> # cp /media/btrfs/tmp/* /media/btrfs/myraid1 # put it where you want it
> # umount /media/btrfs/tmp/
> # wipefs -a /dev/sda2   # destroy the FS on sda2
> # btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/
> 
>This will copy all the data from the filesystem on /dev/sda2 into
> the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
> as the second device for the main FS.
> 
> *WARNING!*
> 
>Note that, since the main FS is missing a device, it will probably
> need to be mounted in degraded mode (-o degraded), and that on kernels
> earlier than (IIRC) 4.14, this can only be done *once* without the FS
> becoming more or less permanently read-only. On recent kernels, it
> _should_ be OK.
> 
> *WARNING ENDS*

   Oh, and for the record, to make a RAID-1 filesystem from scratch,
you simply need this:

# mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb

   You do not need to run mkfs.btrfs on each device separately.

   Hugo.

-- 
Hugo Mills | Welcome to Rivendell, Mr Anderson...
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Machinae Supremacy, Hybrid




Re: Which device is missing ?

2018-10-08 Thread Hugo Mills
On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> I am trying to make a "RAID1" with /dev/sda2 and /dev/sdb (or similar).
> 
> But I have strange statuses or errors about "missing devices" and I
> do not understand the current situation:
> 
> 
> root@server:~# btrfs fi show
> Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
>     Total devices 1 FS bytes used 190.91GiB
>     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> 
> warning, device 1 is missing
> Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
>     Total devices 2 FS bytes used 116.18GiB
>     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
>     *** Some devices missing

   This looks like you've created a RAID-1 array with /dev/sda2 and
/dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
original [part of a] filesystem on /dev/sda2, and replacing it with a
wholly different filesystem. Since the new FS on /dev/sda2 (UUID
28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
and the original FS was made of two devices, btrfs fi show is telling
you that there's some devices missing -- /dev/sda2 is no longer part
of that FS, and is therefore a missing device.

   I note that you've got data on both filesystems, so they must both
have been mounted somewhere and had stuff put on them.

   I recommend doing something like this:

# mkdir /media/btrfs/myraid1 /media/btrfs/tmp   # create the mount points
# mount /dev/sdb /media/btrfs/myraid1/
# mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
# cp /media/btrfs/tmp/* /media/btrfs/myraid1 # put it where you want it
# umount /media/btrfs/tmp/
# wipefs -a /dev/sda2   # destroy the FS on sda2
# btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/

   This will copy all the data from the filesystem on /dev/sda2 into
the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
as the second device for the main FS.

*WARNING!*

   Note that, since the main FS is missing a device, it will probably
need to be mounted in degraded mode (-o degraded), and that on kernels
earlier than (IIRC) 4.14, this can only be done *once* without the FS
becoming more or less permanently read-only. On recent kernels, it
_should_ be OK.

*WARNING ENDS*
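
   For reference, a degraded mount looks something like this (device
and mountpoint illustrative):

# mount -o degraded /dev/sdb /mnt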

   Hugo.

[snip]

-- 
Hugo Mills | UNIX: Japanese brand of food containers
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Hugo Mills
On Tue, Sep 18, 2018 at 06:28:37PM +, Gervais, Francois wrote:
> > No. It is already possible (by setting received UUID); it should not be
> made too open to easy abuse.
> 
> 
> Do you mean edit the UUID in the byte stream before btrfs receive?

   No, there's an ioctl to change the received UUID of a
subvolume. It's used by receive, at the very end of the receive
operation.

   Messing around in this area is basically a recipe for ending up
with a half-completed send/receive full of broken data because the
receiving subvolume isn't quite as identical as you thought. It
enforces the rules for a reason.

   Now, it's possible to modify the send stream and the logic around
it a bit to support a number of additional modes of operation
(bidirectional send, for example), but that's queued up waiting for
(a) a definitive list of send stream format changes, and (b) David's
bandwidth to put them together in one patch set.

   If you want to see more on the underlying UUID model, and how it
could be (ab)used and modified, there's a write-up here, in a thread
on pretty much exactly the same proposal that you've just made:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

-- 
Hugo Mills | Great films about cricket: Monster's No-Ball
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread Hugo Mills
On Wed, Aug 29, 2018 at 09:58:58AM +, Duncan wrote:
> Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:
> 
> > Thinking again, this is totally acceptable. If the requirement is a
> > healthy disk, then I must check the disk's health myself.
> > I may trust that the disk is in a good state, or run a quick test, or
> > run some very detailed tests to be sure.
> 
> For testing you might try badblocks.  It's most useful on a device that 
> doesn't have a filesystem on it you're trying to save, so you can use the 
> -w write-test option.  See the manpage for details.
> 
> The -w option should force the device to remap bad blocks where it can as 
> well, and you can take your previous smartctl read and compare it to a 
> new one after the test.
> 
> Hint if testing multiple spinning-rust devices:  Try running multiple 
> tests at once.  While this might have been slower on old EIDE, at least 
> with spinning rust, on SATA and similar you should be able to test 
> multiple devices at once without them slowing down significantly, because 
> the bottleneck is the spinning rust, not the bus, controller or CPU.  I 
> used badblocks years ago to test my new disks before setting up mdraid on 
> them, and with full disk tests on spinning rust taking (at the time) 
> nearly a day a pass and four passes for the -w test, the multiple tests 
> at once trick saved me quite a bit of time!

   Hah. Only a day? It's up to 2 days now.

   The devices get bigger. The interfaces don't get faster at the same
rate. Back in the late '90s, it was only an hour or so to run a
badblocks pass on a big disk...
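
   If you're testing several disks in parallel, something like this
does the job (the -w test is destructive, so only use it on disks
with nothing you want to keep; device names illustrative):

# badblocks -wsv /dev/sdb > sdb.badblocks 2>&1 &
# badblocks -wsv /dev/sdc > sdc.badblocks 2>&1 &
# wait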

   Hugo.

-- 
Hugo Mills | Nostalgia isn't what it used to be.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: BTRFS and databases

2018-08-01 Thread Hugo Mills
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> I know it's a decade-old question, but I'd like to hear your thoughts
> of today. By now, I have become a heavy BTRFS user. Almost everywhere I use
> BTRFS, except in situations when it is obvious there is no benefit
> (e.g. /var/log, /boot). At home, all my desktop, laptop and server
> computers are mainly running on BTRFS with only a few file systems on
> ext4. I have even installed BTRFS in corporate production systems (in those
> cases, the systems were mainly on ext4, but there were some specific
> file systems that exploited BTRFS features).
> 
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?

   Personally, I'd start with btrfs with autodefrag. It has some
degree of I/O overhead, but if the database isn't performance-critical
and already near the limits of the hardware, it's unlikely to make
much difference. Autodefrag should keep the fragmentation down to a
minimum.
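
   Concretely, it's just a mount option (device and mountpoint
illustrative):

# mount -o autodefrag /dev/sdb1 /var/lib/mysql

or the equivalent autodefrag entry in /etc/fstab.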

   Hugo.

> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?
> 
> 
> Kind regards,
> MegaBrutal

-- 
Hugo Mills | In theory, theory and practice are the same. In
hugo@... carfax.org.uk | practice, they're different.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs filesystem corruptions with 4.18. git kernels

2018-07-20 Thread Hugo Mills
On Fri, Jul 20, 2018 at 11:28:42PM +0200, Alexander Wetzel wrote:
> Hello,
> 
> I'm running my normal workstation with git kernels from 
> git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-testing.git
> and just got the second file system corruption in three weeks. I do
> not have issues with stable kernels, and just want to give you a
> heads up that there might be something seriously broken in current
> development kernels.
> 
> The first corruption was with a kernel based on 4.18.0-rc1
> (wt-2018-06-20) and the second one today based on 4.18.0-rc4
> (wt-2018-07-09).
> The first corruption definitely destroyed data; the second one has
> not been examined at all yet.
> 
> After the reinstall I ran some scrubs, the last successful one a
> week ago.
> 
> Of course this could be unrelated to the development kernels or even
> btrfs, but two corruptions within weeks after years without problems
> are very suspect.
> And since btrfs also allowed to read corrupted data (with a stable
> ubuntu kernel, see below for more details) it looks like this is
> indeed an issue in btrfs, correct?
> 
> A btrfs subvolume is used as the rootfs on a "Samsung SSD 850 EVO
> mSATA 1TB" and I'm running Gentoo ~amd64 on a Thinkpad W530. Discard
> is enabled as mount option and there were roughly 5 other
> subvolumes.
> 
> I'm currently backing up the full btrfs partition after the second
> corruption which announced itself with the following log entries:
> 
> [  979.223767] BTRFS critical (device sdc2): corrupt leaf: root=2
> block=1029783552 slot=1, unexpected item end, have 16161 expect
> 16250

   This means that the metadata block matches the checksum in its
header, but is internally inconsistent. This means that the error in
the block was made before the csum was computed -- i.e., it was that
way in RAM. This can happen in a couple of different ways, but the
most likely cause is bad RAM.

   In this case, it's not a single bitflip in the metadata page
itself, so it's more likely to be something writing spurious data on
the page in RAM that was holding this metadata block. This is either a
bug in the kernel, or a hardware problem.

   I would strongly recommend checking your RAM (memtest86 for a
minimum of 8 hours, preferably 24).

> [  979.223808] BTRFS: error (device sdc2) in __btrfs_cow_block:1080:
> errno=-5 IO failure
> [  979.223810] BTRFS info (device sdc2): forced readonly
> [  979.224599] BTRFS warning (device sdc2): Skipping commit of
> aborted transaction.
> [  979.224603] BTRFS: error (device sdc2) in
> cleanup_transaction:1847: errno=-5 IO failure
> 
> I'll restore the system from a backup - and stick to stable kernels
> for now - after that, but if needed I can of course also restore the
> partition backup to another disk for testing.

   It may be a kernel issue, but it's not necessarily in btrfs. It
could be a bug in some other kernel component where it does some
pointer arithmetic wrong, or uses some uninitialised data as a
pointer. My money's is on bad RAM, though (by a small margin).

   Hugo.

-- 
Hugo Mills | Stick them with the pointy end.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Jon Snow




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Hugo Mills
On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote:
> 20.07.2018 20:16, Goffredo Baroncelli wrote:
[snip]
> > Limiting the number of disks per raid in BTRFS would be quite simple to
> > implement in the "chunk allocator"
> > 
> 
> You mean that currently the RAID5 stripe size is equal to the number of disks?
> Well, I suppose nobody is using btrfs with disk pools of two- or
> three-digit size.

   But they are (even if not very many of them) -- we've seen at least
one person with something like 40 or 50 devices in the array. They'd
definitely got into /dev/sdac territory. I don't recall what RAID level
they were using. I think it was either RAID-1 or -10.

   That's the largest I can recall seeing mention of, though.

   Hugo.

-- 
Hugo Mills | Have found Lost City of Atlantis. High Priest is
hugo@... carfax.org.uk | winning at quoits.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Terry Pratchett




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Hugo Mills
On Wed, Jul 18, 2018 at 08:39:48AM +, Duncan wrote:
> Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:
> 
> >> As implemented in BTRFS, raid1 doesn't have striping.
> > 
> > The argument is that because there's only two copies, on multi-device
> > btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> > alternate device pairs, it's effectively striped at the macro level,
> > with the 1 GiB device-level chunks effectively being huge individual
> > device strips of 1 GiB.
> > 
> > At 1 GiB strip size it doesn't have the typical performance advantage of
> > striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> > strips/chunks.
> 
> I forgot this bit...
> 
> Similarly, multi-device single is regarded by some to be conceptually 
> equivalent to raid0 with really huge GiB strips/chunks.
> 
> (As you may note, "the argument is" and "regarded by some" are distancing 
> phrases.  I've seen the argument made on-list, but while I understand the 
> argument and agree with it to some extent, I'm still a bit uncomfortable 
> with it and don't normally make it myself, this thread being a noted 
> exception tho originally I simply repeated what someone else already said 
> in-thread, because I too agree it's stretching things a bit.  But it does 
> appear to be a useful conceptual equivalency for some, and I do see the 
> similarity.
> 
> Perhaps it's a case of coder's view (no code doing it that way, it's just 
> a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
> (code or not, accidental or not, it's a reasonably accurate high-level 
> description of how it ends up working most of the time with equivalent 
> sized devices).)

   Well, it's an *accurate* observation. It's just not a particularly
*useful* one. :)

   Hugo.

-- 
Hugo Mills | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-15 Thread Hugo Mills
On Fri, Jul 13, 2018 at 08:46:28PM +0200, David Sterba wrote:
[snip]
> An interesting question is the naming of the extended profiles. I picked
> something that can be easily understood but it's not a final proposal.
> Years ago, Hugo proposed a naming scheme that described the
> non-standard raid varieties of the btrfs flavor:
> 
> https://marc.info/?l=linux-btrfs&m=136286324417767
> 
> Switching to this naming would be a good addition to the extended raid.

   I'd suggest using lower-case letters for the c, s, p, rather than
upper, as it makes them much easier to read. The upper-case version
tends to make the letters and numbers merge into each other. With
lower-case c, s, p, the taller digits (or M) stand out:

  1c
  1cMs2p
  2c3s8p (OK, just kidding about this one)

   Hugo.

-- 
Hugo Mills | The English language has the mot juste for every
hugo@... carfax.org.uk | occasion.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: unsolvable technical issues?

2018-06-25 Thread Hugo Mills
On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote:
[snip]
> I hope I am not asking for too much (but I know I probably am), but
> I suggest that having a small snippet of information on the status
> page showing a little bit about what is either currently the
> development focus , or what people are known for working at would be
> very valuable for users and it may of course work both ways, such as
> exciting people or calming them down. ;)
> 
> For example something simple like a "development focus" list...
> 2018-Q4: (planned) Renaming the grotesque "RAID" terminology
> 2018-Q3: (planned) Magical feature X
> 2018-Q2: N-Way mirroring
> 2018-Q1: Feature work "RAID"5/6
> 
> I think it would be good for people outside the project, as it
> would perhaps spark some attention from developers and perhaps even
> the media as well.

   I started doing this a couple of years ago, but it turned out to be
impossible to keep even vaguely accurate or up to date, without going
round and bugging the developers individually on a per-release
basis. I don't think it's going to happen.

   Hugo.

-- 
Hugo Mills | emacs: Emacs Makes A Computer Slow.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs: Add more details while checking tree block

2018-06-22 Thread Hugo Mills
On Fri, Jun 22, 2018 at 05:26:02PM +0200, Hans van Kranenburg wrote:
> On 06/22/2018 01:48 PM, Nikolay Borisov wrote:
> > 
> > 
> > On 22.06.2018 04:52, Su Yue wrote:
> >> For easier debug, print eb->start if level is invalid.
> >> Also make print clear if bytenr found is not expected.
> >>
> >> Signed-off-by: Su Yue 
> >> ---
> >>  fs/btrfs/disk-io.c | 8 
> >>  1 file changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >> index c3504b4d281b..a90dab84f41b 100644
> >> --- a/fs/btrfs/disk-io.c
> >> +++ b/fs/btrfs/disk-io.c
> >> @@ -615,8 +615,8 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
> >>  
> >>found_start = btrfs_header_bytenr(eb);
> >>if (found_start != eb->start) {
> >> -  btrfs_err_rl(fs_info, "bad tree block start %llu %llu",
> >> -   found_start, eb->start);
> >> +  btrfs_err_rl(fs_info, "bad tree block start want %llu have %llu",
> > 
> > nit: I'd rather have the want/have in brackets (want %llu have% llu)
> 
> From a user support point of view, this text should really be improved.
> There are a few places where 'want' and 'have' are reported in error
> strings, and it's totally unclear what they mean.
> 
> Intuitively I'd say when checking a csum, the "want" would be what's on
> disk now, since you want that to be correct, and the "have" would be
> what you have calculated, but it's actually the other way round, or
> wasn't it? Or was it?
> 
> Every time someone pastes such a message when we help on IRC for
> example, there's confusion, and I have to look up the source again,
> because I always forget.
> 
> What about (%llu stored on disk, %llu calculated now) or something similar?

   Yes, definitely this. I experience the same confusion as Hans, and
I think a lot of other people do, too. I usually read "want" and
"have" the wrong way round, so more clarity would be really helpful.

   Hugo.

> >> +   eb->start, found_start);
> >>ret = -EIO;
> >>goto err;
> >>}
> >> @@ -628,8 +628,8 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
> >>}
> >>found_level = btrfs_header_level(eb);
> >>if (found_level >= BTRFS_MAX_LEVEL) {
> >> -  btrfs_err(fs_info, "bad tree block level %d",
> >> -(int)btrfs_header_level(eb));
> >> +  btrfs_err(fs_info, "bad tree block level %d on %llu",
> >> +(int)btrfs_header_level(eb), eb->start);
> >>ret = -EIO;
> >>goto err;
> >>}
> >>
> 

-- 
Hugo Mills | "There's a Martian war machine outside -- they want
hugo@... carfax.org.uk | to talk to you about a cure for the common cold."
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Stephen Franklin, Babylon 5




Re: About more loose parameter sequence requirement

2018-06-18 Thread Hugo Mills
On Mon, Jun 18, 2018 at 01:34:32PM +0200, David Sterba wrote:
> On Thu, Jun 14, 2018 at 03:17:45PM +0800, Qu Wenruo wrote:
> > I understand that btrfs-progs introduced a strict parameter/option order
> > to distinguish global and sub-command parameters/options.
> > 
> > However it's really annoying if one just wants to append some new options
> > to a previous command:
> > 
> > E.g.
> > # btrfs check /dev/data/btrfs
> > # !! --check-data-csum
> > 
> > The last command will fail as current btrfs-progs doesn't allow any
> > option after a parameter.
> > 
> > 
> > Despite the requirement to distinguish global and subcommand
> > option/parameter, is there any other requirement for such a strict
> > option-first-parameter-last policy?
> 
> I'd say that it's a common and recommended pattern. Getopt is able to
> reorder the parameters so mixed options and non-options are accepted,
> unless POSIXLY_CORRECT (see man getopt(3)) is set. With the more
> strict requirement, the 'btrfs' option parser works the same regardless
> of that.

   I got bitten by this the other day. I put an option flag at the end
of the line, after the mountpoint, and it refused to work.

   I would definitely prefer it if it parsed options in any
position. (Or at least, any position after the group/command
parameters).

   Hugo.

> > If I could implement an enhanced getopt to allow a looser order inside
> > subcommands while still distinguishing global options, would it be accepted
> > (if its quality is acceptable)?
> 
> I think it's not worth updating the parser just to support an IMHO
> narrow usecase.

-- 
Hugo Mills | Turning, pages turning in the widening bath,
hugo@... carfax.org.uk | The spine cannot bear the humidity.
http://carfax.org.uk/  | Books fall apart; the binding cannot hold.
PGP: E2AB1DE4  | Page 129 is loosed upon the world.   Zarf




Re: status page

2018-04-25 Thread Hugo Mills
On Wed, Apr 25, 2018 at 02:30:42PM +0200, Gandalf Corvotempesta wrote:
> 2018-04-25 13:39 GMT+02:00 Austin S. Hemmelgarn :
> > Define 'stable'.
> 
> Something ready for production use, like ext or xfs, without critical
> bugs or easy data loss.
> 
> > If you just want 'safe for critical data', it's mostly there already
> > provided that your admins and operators are careful.  Assuming you avoid
> > qgroups and parity raid, don't run the filesystem near full all the time,
> > and keep an eye on the chunk allocations (which is easy to automate with
> > newer kernels), you will generally be fine.  We've been using it in
> > production where I work for a couple of years now, with the only issues
> > we've encountered arising from the fact that we're stuck using an older
> > kernel which doesn't automatically deallocate empty chunks.
> 
> For me, RAID56 is mandatory. Any ETA for a stable RAID56 ?
> Is it something we should expect this year, next year, in the next 10 years, ...?

   There's not really any ETAs for anything in the kernel, in general,
unless the relevant code has already been committed and accepted (when
it has a fairly deterministic path from then onwards). ETAs for
finding even known bugs are pretty variable, depending largely on how
easily the bug can be reproduced by the reporter and by the developer.

   As for a stable version -- you'll have to define "stable" in a way
that's actually measurable to get any useful answer, and even then,
see my previous comment about ETAs.

   There have been example patches in the last few months on the
subject of closing the write hole, so there's clear ongoing work on
that particular item, but again, see the comment on ETAs. It'll be
done when it's done.

   Hugo.

-- 
Hugo Mills | Nothing wrong with being written in Perl... Some of
hugo@... carfax.org.uk | my best friends are written in Perl.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  dark




Re: Recovery from full metadata with all device space consumed?

2018-04-19 Thread Hugo Mills
On Thu, Apr 19, 2018 at 04:12:39PM -0700, Drew Bloechl wrote:
> On Thu, Apr 19, 2018 at 10:43:57PM +0000, Hugo Mills wrote:
> >Given that both data and metadata levels here require paired
> > chunks, try adding _two_ temporary devices so that it can allocate a
> > new block group.
> 
> Thank you very much, that seems to have done the trick:
> 
> # fallocate -l 4GiB /var/tmp/btrfs-temp-1
> # fallocate -l 4GiB /var/tmp/btrfs-temp-2
> # losetup -f /var/tmp/btrfs-temp-1
> # losetup -f /var/tmp/btrfs-temp-2
> # btrfs device add /dev/loop0 /broken
> Performing full device TRIM (4.00GiB) ...
> # btrfs device add /dev/loop1 /broken
> Performing full device TRIM (4.00GiB) ...
> # btrfs balance start -v -dusage=1 /broken
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=1

   Excellent. Don't forget to "btrfs dev delete" the devices after
you're finished the balance. You could damage the FS (possibly
irreparably) if you destroy the devices without doing so.
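
   Something like this, one device at a time, once the balance has
finished:

# btrfs device delete /dev/loop0 /broken
# btrfs device delete /dev/loop1 /broken
# losetup -d /dev/loop0
# losetup -d /dev/loop1
# rm /var/tmp/btrfs-temp-1 /var/tmp/btrfs-temp-2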

> I'm guessing that'll take a while to complete, but meanwhile, in another
> terminal:
> 
> # btrfs fi show /broken
> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>   Total devices 6 FS bytes used 69.53GiB
>   devid1 size 931.51GiB used 731.02GiB path /dev/sda1
>   devid2 size 931.51GiB used 731.02GiB path /dev/sdb1
>   devid3 size 931.51GiB used 730.03GiB path /dev/sdc1
>   devid4 size 931.51GiB used 730.03GiB path /dev/sdd1
>   devid5 size 4.00GiB used 1.00GiB path /dev/loop0
>   devid6 size 4.00GiB used 1.00GiB path /dev/loop1
> 
> # btrfs fi df /broken
> Data, RAID0: total=2.77TiB, used=67.00GiB
> System, RAID1: total=8.00MiB, used=192.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.49GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Do I understand correctly that this could require up to 3 extra devices,
> if for instance you arrived in this situation with a RAID6 data profile?
> Or is the number even higher for profiles like RAID10?

   The minimum number of devices for each RAID level is:

single, DUP: 1
RAID-0, -1, -5:  2
RAID-6:  3
RAID-10: 4

   Hugo.

-- 
Hugo Mills | Gentlemen! You can't fight here! This is the War
hugo@... carfax.org.uk | Room!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Dr Strangelove




Re: Recovery from full metadata with all device space consumed?

2018-04-19 Thread Hugo Mills
On Thu, Apr 19, 2018 at 03:08:48PM -0700, Drew Bloechl wrote:
> I've got a btrfs filesystem that I can't seem to get back to a useful
> state. The symptom I started with is that rename() operations started
> dying with ENOSPC, and it looks like the metadata allocation on the
> filesystem is full:
> 
> # btrfs fi df /broken
> Data, RAID0: total=3.63TiB, used=67.00GiB
> System, RAID1: total=8.00MiB, used=224.00KiB
> Metadata, RAID1: total=3.00GiB, used=2.50GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> All of the consumable space on the backing devices also seems to be in
> use:
> 
> # btrfs fi show /broken
> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>   Total devices 4 FS bytes used 69.50GiB
>   devid1 size 931.51GiB used 931.51GiB path /dev/sda1
>   devid2 size 931.51GiB used 931.51GiB path /dev/sdb1
>   devid3 size 931.51GiB used 931.51GiB path /dev/sdc1
>   devid4 size 931.51GiB used 931.51GiB path /dev/sdd1
> 
> Even the smallest balance operation I can start fails (this doesn't
> change even with an extra temporary device added to the filesystem):

   Given that both data and metadata levels here require paired
chunks, try adding _two_ temporary devices so that it can allocate a
new block group.

   Hugo.

> # btrfs balance start -v -dusage=1 /broken
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=1
> ERROR: error during balancing '/broken': No space left on device
> There may be more info in syslog - try dmesg | tail
> # dmesg | tail -1
> [11554.296805] BTRFS info (device sdc1): 757 enospc errors during
> balance
> 
> The current kernel is 4.15.0 from Debian's stretch-backports
> (specifically linux-image-4.15.0-0.bpo.2-amd64), but it was Debian's
> 4.9.30 when the filesystem got into this state. I upgraded it in the
> hopes that a newer kernel would be smarter, but no dice.
> 
> btrfs-progs is currently at v4.7.3.
> 
> Most of what this filesystem stores is Prometheus 1.8's TSDB for its
> metrics, which are constantly written at around 50MB/second. The
> filesystem never really gets full as far as data goes, but there's a lot
> of never-ending churn for what data is there.
> 
> Question 1: Are there other steps that can be tried to rescue a
> filesystem in this state? I still have it mounted in the same state, and
> I'm willing to try other things or extract debugging info.
> 
> Question 2: Is there something I could have done to prevent this from
> happening in the first place?
> 
> Thanks!

-- 
Hugo Mills | Always be sincere, whether you mean it or not.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |  Flanders & Swann
PGP: E2AB1DE4  |The Reluctant Cannibal




Re: [wiki] Please clarify how to check whether barriers are properly implemented in hardware

2018-04-02 Thread Hugo Mills
On Mon, Apr 02, 2018 at 06:03:00PM -0400, Fedja Beader wrote:
> Is there some testing utility for this? Is there a way to tell, with a
> high enough certainty, from datasheets or other material before purchase?

   Given that not implementing barriers is basically a bug in the
hardware [for SATA or SAS], I don't think anyone's going to specify
anything other than "fully supports barriers" in their datasheets.

   I don't know of a testing tool. It may not be obvious that barriers
aren't being honoured without doing things like power-failure testing.

   Hugo.

> https://btrfs.wiki.kernel.org/index.php/FAQ#How_does_this_happen.3F

-- 
Hugo Mills | "Damn and blast British Telecom!" said Dirk,
hugo@... carfax.org.uk | the words coming easily from force of habit.
http://carfax.org.uk/  |Douglas Adams,
PGP: E2AB1DE4  |   Dirk Gently's Holistic Detective Agency




Re: Out of space and incorrect size reported

2018-03-21 Thread Hugo Mills
On Wed, Mar 21, 2018 at 09:53:39PM +, Shane Walton wrote:
> > uname -a
> Linux rockstor 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> btrfs --version
> btrfs-progs v4.4.1
> 
> > btrfs fi df /mnt2/pool_homes
> Data, RAID1: total=240.00GiB, used=239.78GiB
> System, RAID1: total=8.00MiB, used=64.00KiB
> Metadata, RAID1: total=8.00GiB, used=5.90GiB
> GlobalReserve, single: total=512.00MiB, used=59.31MiB
> 
> > btrfs filesystem show /mnt2/pool_homes
> Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
>   Total devices 2 FS bytes used 245.75GiB
>   devid1 size 465.76GiB used 248.01GiB path /dev/sda
>   devid2 size 465.76GiB used 248.01GiB path /dev/sdb
> 
> Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB" almost
> full and limited to 240 GiB when I have 2x 500 GB HDDs?  This was all
> created/implemented with the Rockstor platform, and it says the "share" should
> be 400 GB.
> 
> What can I do to make this larger or closer to the full size of 465 GiB 
> (minus the System and Metadata overhead)?

   Most likely, you need to upgrade your kernel to get past the known
bug (fixed in about 4.6 or so, if I recall correctly), and then mount
with -o clear_cache to force the free space cache to be rebuilt.
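
   For example, as a one-off (the option doesn't need to stay in
fstab; device and mountpoint taken from your output above):

# umount /mnt2/pool_homes
# mount -o clear_cache /dev/sda /mnt2/pool_homes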

   Hugo.

-- 
Hugo Mills | Q: What goes, "Pieces of seven! Pieces of seven!"?
hugo@... carfax.org.uk | A: A parroty error.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs-progs: mkfs: add uuid and otime to ROOT_ITEM of FS_TREE

2018-03-19 Thread Hugo Mills
On Mon, Mar 19, 2018 at 02:02:23PM +0100, David Sterba wrote:
> On Mon, Mar 19, 2018 at 08:20:10AM +0000, Hugo Mills wrote:
> > On Mon, Mar 19, 2018 at 05:16:42PM +0900, Misono, Tomohiro wrote:
> > > Currently, the top-level subvolume lacks the UUID. As a result, both
> > > non-snapshot subvolume and snapshot of top-level subvolume do not have
> > > Parent UUID and cannot be distinguished. Therefore "fi show" of
> > > top-level lists all the subvolumes which lack the UUID in the
> > > "Snapshot(s)" field.  Also, it lacks the otime information.
> > > 
> > > Fix this by adding the UUID and otime at the mkfs time.  As a
> > > consequence, snapshots of top-level subvolume now have a Parent UUID and
> > > UUID tree will create an entry for top-level subvolume at mount time.
> > > This should not cause the problem for current kernel, but user program
> > > which relies on the empty Parent UUID may be affected by this change.
> > 
> >Is there any way of adding a UUID to the top level subvol on an
> > existing filesystem? It would be helpful not to have to rebuild every
> > filesystem in the world to fix this.
> 
> We can do that by a special purpose tool. The easiest way is to set the
> uuid on an unmouted filesystem, but as this is a one-time action I hope
> this is acceptable. Added to todo, thanks for the suggestion.

   Sounds good to me.

   Hugo.

-- 
Hugo Mills | Talking about music is like dancing about
hugo@... carfax.org.uk | architecture
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Frank Zappa




Re: [PATCH] btrfs-progs: mkfs: add uuid and otime to ROOT_ITEM of FS_TREE

2018-03-19 Thread Hugo Mills
On Mon, Mar 19, 2018 at 05:16:42PM +0900, Misono, Tomohiro wrote:
> Currently, the top-level subvolume lacks the UUID. As a result, both
> non-snapshot subvolume and snapshot of top-level subvolume do not have
> Parent UUID and cannot be distinguished. Therefore "fi show" of
> top-level lists all the subvolumes which lack the UUID in the
> "Snapshot(s)" field.  Also, it lacks the otime information.
> 
> Fix this by adding the UUID and otime at the mkfs time.  As a
> consequence, snapshots of top-level subvolume now have a Parent UUID and
> UUID tree will create an entry for top-level subvolume at mount time.
> This should not cause the problem for current kernel, but user program
> which relies on the empty Parent UUID may be affected by this change.

   Is there any way of adding a UUID to the top level subvol on an
existing filesystem? It would be helpful not to have to rebuild every
filesystem in the world to fix this.

   Hugo.

> Signed-off-by: Tomohiro Misono 
> ---
> This is also needed in order that "sub list -s" works properly for
> non-privileged user[1] even if there are snapshots of toplevel subvolume.
> 
> Currently the check whether a subvolume is a snapshot is done by looking at
> the key offset of the subvolume's ROOT_ITEM (non-zero for a snapshot), using
> the tree search ioctl.
> However, the non-privileged version of "sub list" won't use the tree search
> ioctl and just looks at whether the parent uuid is null or not. Therefore
> there is no way to recognize snapshots of the top-level subvolume.
> 
> [1] https://marc.info/?l=linux-btrfs&m=152144463907830&w=2
> 
>  mkfs/common.c | 14 ++
>  mkfs/main.c   |  3 +++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/mkfs/common.c b/mkfs/common.c
> index 16916ca2..6924d9b7 100644
> --- a/mkfs/common.c
> +++ b/mkfs/common.c
> @@ -44,6 +44,7 @@ static int btrfs_create_tree_root(int fd, struct 
> btrfs_mkfs_config *cfg,
>   u32 itemoff;
>   int ret = 0;
>   int blk;
> + u8 uuid[BTRFS_UUID_SIZE];
>  
>   memset(buf->data + sizeof(struct btrfs_header), 0,
>   cfg->nodesize - sizeof(struct btrfs_header));
> @@ -77,6 +78,19 @@ static int btrfs_create_tree_root(int fd, struct 
> btrfs_mkfs_config *cfg,
>   btrfs_set_item_offset(buf, btrfs_item_nr(nritems), itemoff);
>   btrfs_set_item_size(buf, btrfs_item_nr(nritems),
>   sizeof(root_item));
> + if (blk == MKFS_FS_TREE) {
> + time_t now = time(NULL);
> +
> + uuid_generate(uuid);
> + memcpy(root_item.uuid, uuid, BTRFS_UUID_SIZE);
> + btrfs_set_stack_timespec_sec(&root_item.otime, now);
> + btrfs_set_stack_timespec_sec(&root_item.ctime, now);
> + } else {
> + memset(uuid, 0, BTRFS_UUID_SIZE);
> + memcpy(root_item.uuid, uuid, BTRFS_UUID_SIZE);
> + btrfs_set_stack_timespec_sec(&root_item.otime, 0);
> + btrfs_set_stack_timespec_sec(&root_item.ctime, 0);
> + }
>   write_extent_buffer(buf, &root_item,
>   btrfs_item_ptr_offset(buf, nritems),
>   sizeof(root_item));
> diff --git a/mkfs/main.c b/mkfs/main.c
> index 5a717f70..52d92581 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -315,6 +315,7 @@ static int create_tree(struct btrfs_trans_handle *trans,
>   struct btrfs_key location;
>   struct btrfs_root_item root_item;
>   struct extent_buffer *tmp;
> + u8 uuid[BTRFS_UUID_SIZE] = {0};
>   int ret;
>  
>   ret = btrfs_copy_root(trans, root, root->node, &tmp, objectid);
> @@ -325,6 +326,8 @@ static int create_tree(struct btrfs_trans_handle *trans,
>   btrfs_set_root_bytenr(&root_item, tmp->start);
>   btrfs_set_root_level(&root_item, btrfs_header_level(tmp));
>   btrfs_set_root_generation(&root_item, trans->transid);
> + /* clear uuid of source tree */
> + memcpy(root_item.uuid, uuid, BTRFS_UUID_SIZE);
>   free_extent_buffer(tmp);
>  
>   location.objectid = objectid;

-- 
Hugo Mills | This chap Anon is writing some perfectly lovely
hugo@... carfax.org.uk | stuff at the moment.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] Improve error stats message

2018-03-07 Thread Hugo Mills
On Wed, Mar 07, 2018 at 08:02:51PM +0100, Diego wrote:
> El miércoles, 7 de marzo de 2018 19:24:53 (CET) Hugo Mills escribió:
> >On multi-device filesystems, the two are not necessarily the same.
> 
> Ouch. FWIW, I was moved to do this because I saw this conversation on
> IRC which made me think that people aren't understanding what the
> message means:
> 
>hi! I noticed bdev rd 13  as a kernel message
>what does it mean
>Well, that's not the whole message.
>Can you paste the whole line in here? (Just one line)
   ^^ nick2... that would be me. :)

>[3.404959] BTRFS info (device sda4): bdev /dev/sda4 errs: 
> wr 0, rd 13, flush 0, corrupt 0, gen 0
> 
> 
> Maybe something like this would be better:
> 
> BTRFS info (device sda4): disk /dev/sda4 errors: write 0, read 13, flush 0, 
> corrupt 0, generation 0

   I think the single most helpful modification here would be to
change "device" to "fs on", to show that it's only an indicator of the
filesystem ID, rather than actually the device on which the errors
occurred. The others I'm not really bothered about, personally.

   Hugo.

> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 2ceb924ca0d6..cfa029468585 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7239,7 +7239,7 @@ static void btrfs_dev_stat_print_on_error(struct 
> btrfs_device *dev)
>   if (!dev->dev_stats_valid)
>   return;
>   btrfs_err_rl_in_rcu(dev->fs_info,
> - "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
> + "disk %s errors: write %u, read %u, flush %u, corrupt %u, 
> generation %u",
>  rcu_str_deref(dev->name),
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),

-- 
Hugo Mills | Q: What goes, "Pieces of seven! Pieces of seven!"?
hugo@... carfax.org.uk | A: A parroty error.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] Improve error stats message

2018-03-07 Thread Hugo Mills
On Wed, Mar 07, 2018 at 06:37:29PM +0100, Diego wrote:
> A typical notification of filesystem errors looks like this:
> 
> BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 1, flush 0, corrupt 
> 0, gen 0
> 
> The device name is being printed twice.

   For good reason -- the first part ("device sda2") indicates the
filesystem, and is the arbitrarily-selected device used by the kernel
to represent the FS. The second part ("bdev /dev/sda2") indicates the
_actual_ device for which the errors are being reported.

   On multi-device filesystems, the two are not necessarily the same.
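
   If you want the per-device counters on demand, rather than waiting
for the log message, "btrfs device stats" prints the same five
counters for every device in the FS (mountpoint illustrative):

# btrfs device stats /mnt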

   Hugo.

> Also, these abbreviatures
> feel unnecesary. Make the message look like this instead:
> 
> BTRFS error (device sda2): errors: write 0, read 1, flush 0, corrupt 0, 
> generation 0
> 
> 
> Signed-off-by: Diego Calleja 
> ---
>  fs/btrfs/volumes.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 2ceb924ca0d6..52fee5bb056f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7238,9 +7238,8 @@ static void btrfs_dev_stat_print_on_error(struct 
> btrfs_device *dev)
>  {
>   if (!dev->dev_stats_valid)
>   return;
> - btrfs_err_rl_in_rcu(dev->fs_info,
> - "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
> -rcu_str_deref(dev->name),
> + btrfs_err_rl(dev->fs_info,
> + "errors: write %u, read %u, flush %u, corrupt %u, generation 
> %u",
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
>      btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),

-- 
Hugo Mills | Would you like an ocelot with that non-sequitur?
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs send/receive in reverse possible?

2018-02-16 Thread Hugo Mills
On Fri, Feb 16, 2018 at 10:43:54AM +0800, Sampson Fung wrote:
> I have snapshot A on Drive_A.
> I send snapshot A to an empty Drive_B.  Then keep Drive_A as backup.
> I use Drive_B as active.
> I create new snapshot B on Drive_B.
> 
> Can I use btrfs send/receive to send incremental differences back to Drive_A?
> What is the correct way of doing this?

   You can't do it with the existing tools -- it needs a change to the
send stream format. Here's a write-up of what's going on behind the
scenes, and what needs to change:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

-- 
Hugo Mills | I can't foretell the future, I just work there.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |The Doctor




Re: Metadata / Data on Heterogeneous Media

2018-02-15 Thread Hugo Mills
On Thu, Feb 15, 2018 at 12:15:49PM -0500, Ellis H. Wilson III wrote:
> In discussing the performance of various metadata operations over
> the past few days I've had this idea in the back of my head, and
> wanted to see if anybody had already thought about it before
> (likely, I would guess).
> 
> It appears based on this page:
> https://btrfs.wiki.kernel.org/index.php/Btrfs_design
> that data and metadata in BTRFS are fairly well isolated from one
> another, particularly in the case of large files.  This appears
> reinforced by a recent comment from Qu ("...btrfs strictly
> split metadata and data usage...").
> 
> Yet, while there are plenty of options to RAID0/1/10/etc across
> generally homogeneous media types, there doesn't appear to be any
> functionality (at least that I can find) to segment different BTRFS
> internals to different types of devices.  E.G., place metadata trees
> and extent block groups on SSD, and data trees and extent block
> groups on HDD(s).
> 
> Is this something that has already been considered (and if so,
> implemented, which would make me extremely happy)?  Is it feasible
> it is hasn't been approached yet?  I admit my internal knowledge of
> BTRFS is fleeting, though I'm trying to work on that daily at this
> time, so forgive me if this is unapproachable for obvious
> architectural reasons.

   Well, it's been discussed, and I wrote up a theoretical framework
which should cover a wide range of use-cases:

https://www.spinics.net/lists/linux-btrfs/msg33916.html

   I never got round to implementing it, though -- I ran into issues
over storing the properties/metadata needed to configure it.

   Hugo.

-- 
Hugo Mills | Dullest spy film ever: The Eastbourne Ultimatum
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   The Thick of It




Re: [PATCH 00/26] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"

2018-01-26 Thread Hugo Mills
>  libbtrfsutil/python/module.c |  321 ++
>  libbtrfsutil/python/qgroup.c |  141 +++
>  libbtrfsutil/python/setup.py |  103 ++
>  libbtrfsutil/python/subvolume.c  |  665 +
>  libbtrfsutil/python/tests/__init__.py|   66 ++
>  libbtrfsutil/python/tests/test_filesystem.py |   73 ++
>  libbtrfsutil/python/tests/test_qgroup.py |   57 ++
>  libbtrfsutil/python/tests/test_subvolume.py  |  383 +++
>  libbtrfsutil/qgroup.c|   86 ++
>  libbtrfsutil/subvolume.c | 1383 
> ++
>  messages.h   |   14 +
>  props.c  |   69 +-
>  qgroup.c |  106 --
>  qgroup.h |4 -
>  send-utils.c |   25 +-
>  utils.c  |  152 +--
>  utils.h  |6 -
>  41 files changed, 6188 insertions(+), 1754 deletions(-)
>  create mode 100644 libbtrfsutil/COPYING
>  create mode 100644 libbtrfsutil/COPYING.LESSER
>  create mode 100644 libbtrfsutil/README.md
>  create mode 100644 libbtrfsutil/btrfsutil.h
>  create mode 100644 libbtrfsutil/errors.c
>  create mode 100644 libbtrfsutil/filesystem.c
>  create mode 100644 libbtrfsutil/internal.h
>  create mode 100644 libbtrfsutil/python/.gitignore
>  create mode 100644 libbtrfsutil/python/btrfsutilpy.h
>  create mode 100644 libbtrfsutil/python/error.c
>  create mode 100644 libbtrfsutil/python/filesystem.c
>  create mode 100644 libbtrfsutil/python/module.c
>  create mode 100644 libbtrfsutil/python/qgroup.c
>  create mode 100755 libbtrfsutil/python/setup.py
>  create mode 100644 libbtrfsutil/python/subvolume.c
>  create mode 100644 libbtrfsutil/python/tests/__init__.py
>  create mode 100644 libbtrfsutil/python/tests/test_filesystem.py
>  create mode 100644 libbtrfsutil/python/tests/test_qgroup.py
>  create mode 100644 libbtrfsutil/python/tests/test_subvolume.py
>  create mode 100644 libbtrfsutil/qgroup.c
>  create mode 100644 libbtrfsutil/subvolume.c
> 

-- 
Hugo Mills | And what rough beast, its hour come round at last /
hugo@... carfax.org.uk | slouches towards Bethlehem, to be born?
http://carfax.org.uk/  |
PGP: E2AB1DE4  | W.B. Yeats, The Second Coming




Re: bad key ordering - repairable?

2018-01-22 Thread Hugo Mills
 currently installed packages, and restoring all current system
> settings, would probably take some time for me to do.
> If it is currently not repairable, it would be nice if this kind of
> corruption could be repaired in the future, even if losing a few
> files. Or if the corruptions could be avoided in the first place.

   Given that the current tools crash, the answer's a definite
no. However, if you can get a developer interested, they may be able
to write a fix for it, given an image of the FS (using btrfs-image).

[snip]
> I have never noticed any corruptions on the NTFS and Ext4 file systems
> on the laptop, only on the Btrfs file systems.

   You've never _noticed_ them. :)

   Hugo.

-- 
Hugo Mills | ... one ping(1) to rule them all, and in the
hugo@... carfax.org.uk | darkness bind(2) them.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Illiad




Re: Fwd: Fwd: Question regarding to Btrfs patchwork /2831525

2018-01-14 Thread Hugo Mills
On Sun, Jan 14, 2018 at 12:32:25PM +0200, Ilan Schwarts wrote:
> Thank you for clarification.
> Just 2 quick questions,
> 1. Sub volumes - 2 sub volumes cannot have 2 same inode numbers ?

   Incorrect. You can have two subvolumes of the same filesystem, and
you can have files with the same inode number in each subvolume. Each
subvolume has its own inode number space. So an inode number on its
own is not enough to uniquely identify a file -- you also need the
subvolid to uniquely identify a specific file in the filesystem.
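
   A quick illustration, with hypothetical paths (the actual inode
numbers will depend on the FS):

# btrfs sub create /mnt/a
# btrfs sub create /mnt/b
# touch /mnt/a/f /mnt/b/f
# stat -c '%i %n' /mnt/a/f /mnt/b/f       (inode numbers can collide)
# btrfs inspect-internal rootid /mnt/a/f  (the subvolid disambiguates)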

   Hugo.

> 2. Why does fsInfo fsid return u8 while a traditional file system returns
> dev_t, usually a 32-bit integer?
> 
> 
> On Sun, Jan 14, 2018 at 12:22 PM, Qu Wenruo  wrote:
> >
> >
> > On 2018年01月14日 18:13, Ilan Schwarts wrote:
> >> both btrfs filesystems will have same fsid ?
> >>
> >>
> >> On Sun, Jan 14, 2018 at 12:06 PM, Ilan Schwarts  wrote:
> >>> But both filesystems will have same fsid?
> >>>
> >>> On Jan 14, 2018 12:04, "Nikolay Borisov"  wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 14.01.2018 12:02, Ilan Schwarts wrote:
> >>>>> First of all, Thanks for response !
> >>>>> So if I have 2 btrfs file systems on the same machine (not your
> >>>>> everyday scenario, I know)
> >
> > Not a problem, the 2 filesystems will have 2 different fsid.
> >
> > (And it's my everyday scenario, since fstests needs TEST_DEV and
> > SCRATCH_DEV_POOL)
> >
> >>>>> Lets say a file is created on device A, the file gets inode number X
> >>>>> is it possible on device B to have inode number X also ?
> >>>>> or each device has its own Inode number range ?
> >
> > Forget the mess about device.
> >
> > Inode is bounded to a filesystem, not bounded to a device.
> >
> > Just traditional filesytems are normally bounded to a single device.
> > (Although even traditional filesystems can have external journal devices)
> >
> > So there is nothing to do with device at all.
> >
> > And you can have same inode numbers in different filesystems, but
> > BTRFS_I(inode)->root->fs_info will point to different fs_infos, with
> > different fsid.
> >
> > So return to your initial question:
> >> both btrfs filesystems will have same fsid ?
> >
> > No, different filesystems will have different fsid.
> >
> > (Unless you're SUUUPER lucky to have 2 filesystems with
> > same fsid)
> >
> > Thanks,
> > Qu
> >
> >
> >>>>
> >>>> Of course it is possible. Inodes are guaranteed to be unique only across
> >>>> filesystem instances. In your case you are going to have 2 fs instances.
> >>>>
> >>>>>
> >>>>> I need to create a unique identifier for a file, and I need to understand if
> >>>>> the identifier would be: GlobalFSID_DeviceID_Inode or DeviceID_Inode
> >>>>> is enough.
> >>>>>
> >>>>> Thanks
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> On Sun, Jan 14, 2018 at 11:13 AM, Qu Wenruo 
> >>>>> wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 2018年01月14日 16:33, Ilan Schwarts wrote:
> >>>>>>> Hello btrfs developers/users,
> >>>>>>>
> >>>>>>> I was wondering regarding to fetching the correct fsid on btrfs from
> >>>>>>> the context of a kernel module.
> >>>>>>
> >>>>>> There are two IDs for btrfs. (in fact more, but you probably won't need
> >>>>>> the extra ids)
> >>>>>>
> >>>>>> FSID: Global one, one fs one FSID.
> >>>>>> Device ID: Bonded to device, each device will have one.
> >>>>>>
> >>>>>> So in the case of a 2-device btrfs, each device will have its own device id,
> >>>>>> while both of the devices have the same fsid.
> >>>>>>
> >>>>>> And I think you're talking about the global fsid instead of device id.
> >>>>>>
> >>>>>>> if on suse11.3 kernel 3.0.101-0.47.71-default in order to get fsid, I
> >>>>>>> do the following:
> >>>>>>> convert inode struct to btrfs_inode struct (use btrfs

Re: Recommendations for balancing as part of regular maintenance?

2018-01-08 Thread Hugo Mills
There's also the limit=N option. This gives you precise control
over the number of chunks to balance, but doesn't specify which
chunks, so you may end up moving N GiB of data (whereas usage=N could
move much less actual data).

   Personally, I recommend using limit=N, where N is something like
(Allocated - Used)*3/4 GiB.
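
   As a sketch, with invented numbers: if "fi show" says 60 GiB is
allocated and "fi df" says only 40 GiB of that is used, then
N = (60-40)*3/4 = 15, so:

# btrfs balance start -dlimit=15 /mnt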

   Note the caveat below, which is that using the "ssd" mount option on
earlier kernels could prevent the balance from doing a decent job.

> The other mystery is how the data allocation became so large.

   You have a non-rotational device. That means that it'd be mounted
automatically with the "ssd" mount option. Up to 4.13 (or 4.14, I
always forget), the behaviour of "ssd" leads to highly fragmented
allocation of extents, which in turn results in new data chunks being
allocated when there's theoretically loads of space available to use
(but which it may not be practical to use, due to the fragmented free
space).

   After 4.13 (or 4.14), the "ssd" mount option has been fixed, and it
no longer has the bad long-term effects that we've seen before, but it
won't deal with the existing fragmented free space without a data
balance.

   If you're running an older kernel, it's definitely recommended to
mount all filesystems with "nossd" to avoid these issues.
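
   That's just an fstab entry along these lines (UUID and mountpoint
are placeholders):

UUID=<fs uuid>  /mnt  btrfs  defaults,nossd  0  0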

   Hugo.

-- 
Hugo Mills | As long as you're getting different error messages,
hugo@... carfax.org.uk | you're making progress.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 1/2] btrfs-progs: Fix progs_extra build dependencies

2017-12-23 Thread Hugo Mills
On Sat, Dec 23, 2017 at 09:52:37PM +0100, Hans van Kranenburg wrote:
> The Makefile does not have a dependency path that builds dependencies
> for tools listed in progs_extra.
> 
> E.g. doing make btrfs-show-super in a clean build environment results in:
> gcc: error: cmds-inspect-dump-super.o: No such file or directory
> Makefile:389: recipe for target 'btrfs-show-super' failed
> 
> Signed-off-by: Hans van Kranenburg 

   Hans and I worked this one out between us on IRC. Not sure if you
need this, but here it is:

Signed-off-by: Hugo Mills 

   Hugo.

> ---
>  Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Makefile b/Makefile
> index 30a0ee22..390b138f 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -220,7 +220,7 @@ cmds_restore_cflags = 
> -DBTRFSRESTORE_ZSTD=$(BTRFSRESTORE_ZSTD)
>  CHECKER_FLAGS += $(btrfs_convert_cflags)
>  
>  # collect values of the variables above
> -standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter 
> btrfs-%, $(progs)))),$($(dep)))
> +standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter 
> btrfs-%, $(progs) $(progs_extra)))),$($(dep)))
>  
>  SUBDIRS =
>  BUILDDIRS = $(patsubst %,build-%,$(SUBDIRS))

-- 
Hugo Mills | My code is never released, it escapes from the git
hugo@... carfax.org.uk | repo and kills a few beta testers on the way out.
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Diablo-D3




Re: broken btrfs filesystem

2017-12-12 Thread Hugo Mills
On Tue, Dec 12, 2017 at 04:18:09PM +, Neal Becker wrote:
> Is it possible to check while it is mounted?

   Certainly not while mounted read-write. While mounted read-only --
I'm not certain. Possibly.

   Hugo.

> On Tue, Dec 12, 2017 at 9:52 AM Hugo Mills  wrote:
> 
> > On Tue, Dec 12, 2017 at 09:02:56AM -0500, Neal Becker wrote:
> > > sudo ls -la ~/
> > > [sudo] password for nbecker:
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or
> > directory
> > > total 11652
> > > drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
> > > drwxr-xr-x. 1 root    root      48 Aug  2 19:32  ..
> > > [...]
> > > -rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
> > > [...]
> >
> >Could you show the result of btrfs check --readonly on this FS? The
> > rest, below, doesn't show up anything unusual to me.
> >
> >Hugo.
> >
> > > uname -a
> > > Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
> > > 2017 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > >  btrfs --version
> > > btrfs-progs v4.11.1
> > >
> > > sudo btrfs fi show
> > > Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
> > > Total devices 1 FS bytes used 80.96GiB
> > > devid 1 size 230.00GiB used 230.00GiB path /dev/sda3
> > >
> > > sudo btrfs fi df /home
> > > Data, single: total=226.99GiB, used=78.89GiB
> > > System, single: total=4.00MiB, used=48.00KiB
> > > Metadata, single: total=3.01GiB, used=2.07GiB
> > > GlobalReserve, single: total=222.36MiB, used=0.00B
> > >
> > > dmesg.log is here:
> > > https://nbecker.fedorapeople.org/dmesg.txt
> > >
> > > mount | grep btrfs
> > > /dev/sda3 on / type btrfs
> > > (rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
> > > /dev/sda3 on /home type btrfs
> > > (rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)
> > >
> >

-- 
Hugo Mills | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4  |  Ford Prefect




Re: broken btrfs filesystem

2017-12-12 Thread Hugo Mills
On Tue, Dec 12, 2017 at 09:02:56AM -0500, Neal Becker wrote:
> sudo ls -la ~/
> [sudo] password for nbecker:
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> total 11652
> drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
> drwxr-xr-x. 1 root    root      48 Aug  2 19:32  ..
> [...]
> -rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
> [...]

   Could you show the result of btrfs check --readonly on this FS? The
rest, below, doesn't show up anything unusual to me.

   Hugo.

> uname -a
> Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
>  btrfs --version
> btrfs-progs v4.11.1
> 
> sudo btrfs fi show
> Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
> Total devices 1 FS bytes used 80.96GiB
> devid 1 size 230.00GiB used 230.00GiB path /dev/sda3
> 
> sudo btrfs fi df /home
> Data, single: total=226.99GiB, used=78.89GiB
> System, single: total=4.00MiB, used=48.00KiB
> Metadata, single: total=3.01GiB, used=2.07GiB
> GlobalReserve, single: total=222.36MiB, used=0.00B
> 
> dmesg.log is here:
> https://nbecker.fedorapeople.org/dmesg.txt
> 
> mount | grep btrfs
> /dev/sda3 on / type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
> /dev/sda3 on /home type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)
> 

-- 
Hugo Mills | Hey, Virtual Memory! Now I can have a *really big*
hugo@... carfax.org.uk | ramdisk!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Odd behaviour of replace -- unknown resulting state

2017-12-09 Thread Hugo Mills
On Sat, Dec 09, 2017 at 05:43:48PM +, Hugo Mills wrote:
>This is on 4.10, so there may have been fixes made to this since
> then. If so, apologies for the noise.
> 
>I had a filesystem on 6 devices with a badly failing drive in it
> (/dev/sdi). I replaced the drive with a new one:
> 
> # btrfs replace start /dev/sdi /dev/sdj /media/video

Sorry, that should, of course, read:

# btrfs replace start /dev/sdi2 /dev/sdj2 /media/video

   Hugo.

>Once it had finished(*), I resized the device from 6 TB to 8 TB:
> 
> # btrfs fi resize 2:max /media/video
> 
>I also removed another, smaller, device:
> 
> # btrfs dev del 7 /media/video
> 
>Following this, btrfs fi show was reporting the correct device
> size, but still the same device node in the filesystem:
> 
> Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
>    Total devices 5 FS bytes used 9.15TiB
>    devid 2 size 7.28TiB used 6.44TiB path /dev/sdi2
>    devid 3 size 3.63TiB used 3.46TiB path /dev/sde2
>    devid 4 size 3.63TiB used 3.45TiB path /dev/sdd2
>    devid 5 size 1.81TiB used 1.65TiB path /dev/sdh2
>    devid 6 size 3.63TiB used 3.43TiB path /dev/sdc2
> 
>Note that device 2 definitely isn't /dev/sdi2, because /dev/sdi2
> was on a 6 TB device, not an 8 TB device.
> 
>Finally, I physically removed the two deleted devices from the
> machine. The second device came out fine, but the first (/dev/sdi) has
> now resulted in this from btrfs fi show:
> 
> Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
>    Total devices 5 FS bytes used 9.15TiB
>    devid 3 size 3.63TiB used 3.46TiB path /dev/sde2
>    devid 4 size 3.63TiB used 3.45TiB path /dev/sdd2
>    devid 5 size 1.81TiB used 1.65TiB path /dev/sdh2
>    devid 6 size 3.63TiB used 3.43TiB path /dev/sdc2
>    *** Some devices missing
> 
>So, what's the *actual* current state of this filesystem? It's not
> throwing write errors in the kernel logs from having a missing device,
> so it seems like it's probably OK. However, the FS's idea of which
> devices it's got seems to be confused.
> 
>I suspect that if I reboot, it'll all be fine, but I'd be happier
> if it hadn't got into this state in the first place.
> 
>Is this bug fixed in later versions of the kernel? Can anyone think
> of any issues I might have if I leave it in this state for a while?
> Likewise, any issues I might have from a reboot? (Probably into 4.14)
> 
>Hugo.
> 
> (*) as an aside, it was reporting over 300% complete when it finally
> completed. Not sure if that's been fixed since 4.10, either.
>  

-- 
Hugo Mills | I'm on a 30-day diet. So far I've lost 18 days.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Odd behaviour of replace -- unknown resulting state

2017-12-09 Thread Hugo Mills
   This is on 4.10, so there may have been fixes made to this since
then. If so, apologies for the noise.

   I had a filesystem on 6 devices with a badly failing drive in it
(/dev/sdi). I replaced the drive with a new one:

# btrfs replace start /dev/sdi /dev/sdj /media/video

   Once it had finished(*), I resized the device from 6 TB to 8 TB:

# btrfs fi resize 2:max /media/video

   I also removed another, smaller, device:

# btrfs dev del 7 /media/video

   Following this, btrfs fi show was reporting the correct device
size, but still the same device node in the filesystem:

Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
   Total devices 5 FS bytes used 9.15TiB
   devid 2 size 7.28TiB used 6.44TiB path /dev/sdi2
   devid 3 size 3.63TiB used 3.46TiB path /dev/sde2
   devid 4 size 3.63TiB used 3.45TiB path /dev/sdd2
   devid 5 size 1.81TiB used 1.65TiB path /dev/sdh2
   devid 6 size 3.63TiB used 3.43TiB path /dev/sdc2

   Note that device 2 definitely isn't /dev/sdi2, because /dev/sdi2
was on a 6 TB device, not an 8 TB device.

   Finally, I physically removed the two deleted devices from the
machine. The second device came out fine, but the first (/dev/sdi) has
now resulted in this from btrfs fi show:

Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
   Total devices 5 FS bytes used 9.15TiB
   devid 3 size 3.63TiB used 3.46TiB path /dev/sde2
   devid 4 size 3.63TiB used 3.45TiB path /dev/sdd2
   devid 5 size 1.81TiB used 1.65TiB path /dev/sdh2
   devid 6 size 3.63TiB used 3.43TiB path /dev/sdc2
   *** Some devices missing

   So, what's the *actual* current state of this filesystem? It's not
throwing write errors in the kernel logs from having a missing device,
so it seems like it's probably OK. However, the FS's idea of which
devices it's got seems to be confused.

   I suspect that if I reboot, it'll all be fine, but I'd be happier
if it hadn't got into this state in the first place.

   Is this bug fixed in later versions of the kernel? Can anyone think
of any issues I might have if I leave it in this state for a while?
Likewise, any issues I might have from a reboot? (Probably into 4.14)

   Hugo.

(*) as an aside, it was reporting over 300% complete when it finally
completed. Not sure if that's been fixed since 4.10, either.
 
-- 
Hugo Mills | Biphocles: Plato's optician
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [RFC] Improve subvolume usability for a normal user

2017-12-07 Thread Hugo Mills
On Thu, Dec 07, 2017 at 07:21:46AM -0500, Austin S. Hemmelgarn wrote:
> On 2017-12-07 06:55, Duncan wrote:
> >Misono, Tomohiro posted on Thu, 07 Dec 2017 16:15:47 +0900 as excerpted:
> >
> >>On 2017/12/07 11:56, Duncan wrote:
> >>>Austin S. Hemmelgarn posted on Wed, 06 Dec 2017 07:39:56 -0500 as
> >>>excerpted:
> >>>
> >>>>Somewhat OT, but the only operation that's remotely 'instant' is
> >>>>creating an empty subvolume.  Snapshot creation has to walk the tree
> >>>>in the subvolume being snapshotted, which can take a long time (and as
> >>>>a result of it's implementation, also means BTRFS snapshots are _not_
> >>>>atomic). Subvolume deletion has to do a bunch of cleanup work in the
> >>>>background (though it may be fairly quick if it was a snapshot and the
> >>>>source subvolume hasn't changed much).
> >>>
> >>>Indeed, while btrfs in general has taken a strategy of making
> >>>/creating/ snapshots and subvolumes fast, snapshot deletion in
> >>>particular can take some time[1].
> >>>
> >>>And in that regard a question just occurred to me regarding this whole
> >>>very tough problem of a user being able to create but not delete
> >>>subvolumes and snapshots:
> >>>
> >>>Given that at least snapshot deletion (not so sure about non-snapshot
> >>>subvolume deletion, tho I strongly suspect it would depend on the
> >>>number of cross-subvolume reflinks) is already a task that can take
> >>>some time, why /not/ just bite the bullet and make the behavior much
> >>>more like the directory deletion, given that subvolumes already behave
> >>>much like directories.  Yes, for non-root, that /does/ mean tracing the
> >>>entire subtree and checking permissions, and yes, that's going to take
> >>>time and lower performance somewhat, but subvolume and in particular
> >>>snapshot deletion is already an operation that takes time, so this
> >>>wouldn't be unduly changing the situation, and it would eliminate the
> >>>entire class of security issues that come with either asymmetrically
> >>>restricting deletion (but not creation) to root on the one hand,
> >>
> >>>or possible data loss due to allowing a user to delete a subvolume they
> >>>couldn't delete were it an ordinary directory due to not owning stuff
> >>>further down the tree.
> >>
> >>But, this is also the very reason I'm for "sub del" instead of unlink().
> >>Since snapshot creation won't check the permissions of the containing
> >>files/dirs, it can copy a directory which cannot be deleted by the user.
> >>Therefore if we won't allow "sub del" for the user, he couldn't remove
> >>the snapshot.
> >
> >Maybe snapshot creation /should/ check all that, in ordered to allow
> >permissions to allow deletion.
> >
> >Tho that would unfortunately increase the creation time, and btrfs is
> >currently optimized for fast creation time.
> >
> >Hmm... What about creating a "temporary" snapshot if not root, then
> >walking the tree to check perms and deleting it without ever showing it
> >to userspace if the perms wouldn't let the user delete it.  That would
> >retain fast creation logic, tho it wouldn't show up until the perms walk
> >was completed.
> >
> I would argue that it makes more sense to keep snapshot creation as
> is, keep the subvolume deletion command as is (with some proper
> permissions checks of course), and just make unlink() work for
> subvolumes like it does for directories.

   Definitely this.

   Principle of least surprise.

   Hugo.

-- 
Hugo Mills | ... one ping(1) to rule them all, and in the
hugo@... carfax.org.uk | darkness bind(2) them.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Illiad




Re: exclusive subvolume space missing

2017-12-01 Thread Hugo Mills
On Fri, Dec 01, 2017 at 05:15:55PM +0100, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl /
> qgroupid     rfer       excl
> --------     ----       ----
> 0/5      16.00KiB   16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333    24.53GiB  305.79MiB
> 0/298    13.44GiB  312.74MiB
> 0/327    23.79GiB  427.13MiB
> 0/331    23.93GiB  930.51MiB
> 0/260    12.25GiB    3.22GiB
> 0/312    19.70GiB    4.56GiB
> 0/388    28.75GiB    7.15GiB
> 0/291    30.60GiB    9.01GiB <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

   The thing I'd first go looking for here is some rogue process
writing lots of data. I've had something like this happen to me
before, a few times. First, I'd look for large files with "du -ms /* |
sort -n", then work down into the tree until you find them.

   If that doesn't show up anything unusually large, then use lsof to look
for open but deleted files (orphans) which are still being written to
by some process.
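
   For the orphan case, if your lsof supports it, this narrows things
down quickly:

# lsof +L1    (open files with a link count of zero, i.e. deleted)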

   This is very likely _not_ to be a btrfs problem, but instead some
runaway process writing lots of crap very fast. Log files are probably
the most plausible location, but not the only one.

> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID:   388
> # btrfs fi du -s ./snapshot-171125 
>      Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?

   Personally, I'd trust qgroups' output about as far as I could spit
Belgium(*).

   Hugo.

(*) No offence indended to Belgium.

-- 
Hugo Mills | I used to live in hope, but I got evicted.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs-image hash collision option, super slow

2017-11-11 Thread Hugo Mills
On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote:
> OK this might be in the stupid questions category, but I'm not
> understanding the purpose of computing hash collisions with -ss. Or
> more correctly, why it's taking so much longer than -s.
> 
> It seems like what we'd want is every filename to have the same hash,
> but for the file to go through a PBKDF so the hashes we get aren't
> (easily) brute forced. So I totally understand that -ss should take
> much longer than -s, but this is at least two orders magnitude longer
> (so far). That's why I'm confused.
> 
> -s option on this file system took 5 minutes, start to finish.
> -ss option is at 8 hours and counting.
> 
> The other part I'm not grokking is that some filenames fail with:
> 
> WARNING: cannot find a hash collision for 'Tool', generating garbage,
> it won't match indexes
> 
> So? That seems like an undesirable outcome. And if it were just being
> pushed through a PBKDF function, it wouldn't fail. Every
> file/directory "Tool" would get the same hash on *this* run of
> btrfs-image. If I run it again, or someone else runs it, they'd get
> some other hash (same hashes for each instance of "Tool" on their
> filesystem).

   In the FS tree, you can go from the inode of the file to its name
(where the inode is in the index, and the name is stored in the
corresponding data item). Alternatively, you can go from the filename
to the inode. In the latter case, since the keys are a structured 17
byte object, you obviously can't fit the whole filename into the key,
so the filename is hashed (using, IIRC, CRC32), and it's the hash that
appears in the key of the index.

   When an image is made without the -s options, the whole metadata is
stored, including all the filenames in the data items. For some
people, that's a security risk, and they don't want their filenames
leaking out, so -s exists to put junk in the filename records.
However, it doesn't change the hashes in the index to correspond with
the modified filenames, because that would at minimum require the
whole tree to be rebuilt (because all the items would have different
hashes, and hence different ordering in the index). This is a bad
thing for debugging, because you're not getting the details of the
tree as it was in the broken filesystem. So, in this case, the image
is actually broken, because the filenames don't match the hashes.

   Most of the time, that's absolutely fine, because the thing being
debugged is somewhere else, and it doesn't matter that "ls" on the
restored FS won't work right.
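
   In command form (image filenames invented, source unmounted):

# btrfs-image -s /dev/sdb img.s     (names junked, hashes left alone)
# btrfs-image -ss /dev/sdb img.ss   (names searched to match the hashes)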

   However, in some (possibly hypothetical) cases, it _does_ matter,
and you do need the hashes to match the filenames. This is where -ss
comes in. We can't generate random filenames and then take the hashes
of those, because of the undesirability of rewriting the whole FS tree
to reindex it with the changed hashes. So, what -ss tries to do is
stick with the original hashes and find arbitrary filenames which
match them. It's (I think) CRC32, so it shouldn't be too hard, but
it's still non-trivial amounts of work to reverse engineer a
human-readable ASCII filename which hashes to a given value.
Particularly if, as was the case when Josef wrote it, a simple
brute-force algorithm was used.

   It could definitely be improved -- I believe there are some good
(but non-trivial) algorithms for finding preimages for CRC32 checksums
out there. It's just that btrfs-image doesn't use them. However, it's
not an option that's needed very often, so it's probably not worth
putting in the effort to fix it up. (I definitely remember Josef
commenting on IRC when he wrote -s and -ss that it could almost
certainly be done more efficiently, but he had bigger fish to fry at
the time, like fixing the broken FS he was working on)

   As to the thing where it's not finding a pre-image at all -- I'm
guessing here, but it's possible that this is a case where two of the
original filenames hashed to the same value. If that happens, one of
the hashes is incremented by a small integer in a predictable way
before storage. So it may be that the resulting value isn't mappable
to an ASCII pre-image, or that the search just gives up before finding
one.

   Hugo.

-- 
Hugo Mills | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4  |Derek Lowe




Re: Problem with file system

2017-11-08 Thread Hugo Mills
On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
>  wrote:
> 
> >> It definitely does fix ups during normal operations. During reads, if
> >> there's a UNC or there's corruption detected, Btrfs gets the good
> >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
> >> don't just happen with scrubbing. Even raid56 supports these kinds of
> >> passive fixups back to disk.
> >
> > I could have sworn it didn't rewrite the data on-disk during normal usage.
> > I mean, I know for certain that it will return the correct data to userspace
> > if at all possible, but I was under the impression it will just log the
> > error during normal operation.
> 
> No, everything except raid56 has had it since a long time, I can't
> even think how far back, maybe even before 3.0. Whereas raid56 got it
> in 4.12.

   Yes, I'm pretty sure it's been like that ever since I've been using
btrfs (somewhere around the early neolithic).

   Hugo.

-- 
Hugo Mills | Turning, pages turning in the widening bath,
hugo@... carfax.org.uk | The spine cannot bear the humidity.
http://carfax.org.uk/  | Books fall apart; the binding cannot hold.
PGP: E2AB1DE4  | Page 129 is loosed upon the world.   Zarf




Re: Seeking Help on Corruption Issues

2017-10-04 Thread Hugo Mills
On Tue, Oct 03, 2017 at 03:49:25PM -0700, Stephen Nesbitt wrote:
> 
> On 10/3/2017 2:11 PM, Hugo Mills wrote:
> >Hi, Stephen,
> >
> >On Tue, Oct 03, 2017 at 08:52:04PM +, Stephen Nesbitt wrote:
> >>Here it is. There are a couple of out-of-order entries beginning at 117. And
> >>yes I did uncover a bad stick of RAM:
> >>
> >>btrfs-progs v4.9.1
> >>leaf 2589782867968 items 134 free space 6753 generation 3351574 owner 2
> >>fs uuid 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> >>chunk uuid 19ce12f0-d271-46b8-a691-e0d26c1790c6
> >[snip]
> >>item 116 key (1623012749312 EXTENT_ITEM 45056) itemoff 10908 itemsize 53
> >>extent refs 1 gen 3346444 flags DATA
> >>extent data backref root 271 objectid 2478 offset 0 count 1
> >>item 117 key (1621939052544 EXTENT_ITEM 8192) itemoff 10855 itemsize 53
> >>extent refs 1 gen 3346495 flags DATA
> >>extent data backref root 271 objectid 21751764 offset 6733824 count 1
> >>item 118 key (1623012450304 EXTENT_ITEM 8192) itemoff 10802 itemsize 53
> >>extent refs 1 gen 3351513 flags DATA
> >>extent data backref root 271 objectid 5724364 offset 680640512 count 1
> >>item 119 key (1623012802560 EXTENT_ITEM 12288) itemoff 10749 itemsize 53
> >>extent refs 1 gen 3346376 flags DATA
> >>extent data backref root 271 objectid 21751764 offset 6701056 count 1
> >>>>hex(1623012749312)
> >'0x179e3193000'
> >>>>hex(1621939052544)
> >'0x179a319e000'
> >>>>hex(1623012450304)
> >'0x179e314a000'
> >>>>hex(1623012802560)
> >'0x179e31a0000'
> >
> >That's "e" -> "a" in the fourth hex digit, which is a single-bit
> >flip, and should be fixable by btrfs check (I think). However, even
> >fixing that, it's not ordered, because 118 is then before 117, which
> >could be another bitflip ("9" -> "4" in the 7th digit), but two bad
> >bits that close to each other seems unlikely to me.
> >
> >Hugo.
> 
> Hope this is a duplicate reply - I might have fat fingered something.
> 
> The underlying file is disposable/replaceable. Any way to zero
> out/zap the bad BTRFS entry?

   Not really. Even trying to delete the related file(s), it's going
to fall over when reading the metadata in, in the first place. (The key
order check is a metadata invariant, like the csum checks and transid
checks).

   At best, you'd have to get btrfs check to fix it. It should be able
to manage a single-bit error, but you've got two single-bit errors in
close proximity, and I'm not sure it'll be able to deal with it. Might
be worth trying it. The FS _might_ blow up as a result of an attempted
fix, but you say it's replaceable, so that's kind of OK. The worst I'd
_expect_ to happen with btrfs check --repair is that it just won't be
able to deal with it and you're left where you started.
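
   That is, roughly (with the FS unmounted, and accepting that it may
make things worse):

# btrfs check --repair /dev/sdc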

   Go for it.

   Hugo.

-- 
Hugo Mills | You shouldn't anthropomorphise computers. They
hugo@... carfax.org.uk | really don't like that.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Seeking Help on Corruption Issues

2017-10-03 Thread Hugo Mills
   Hi, Stephen,

On Tue, Oct 03, 2017 at 08:52:04PM +, Stephen Nesbitt wrote:
> Here it is. There are a couple of out-of-order entries beginning at 117. And
> yes I did uncover a bad stick of RAM:
> 
> btrfs-progs v4.9.1
> leaf 2589782867968 items 134 free space 6753 generation 3351574 owner 2
> fs uuid 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> chunk uuid 19ce12f0-d271-46b8-a691-e0d26c1790c6
[snip]
> item 116 key (1623012749312 EXTENT_ITEM 45056) itemoff 10908 itemsize 53
> extent refs 1 gen 3346444 flags DATA
> extent data backref root 271 objectid 2478 offset 0 count 1
> item 117 key (1621939052544 EXTENT_ITEM 8192) itemoff 10855 itemsize 53
> extent refs 1 gen 3346495 flags DATA
> extent data backref root 271 objectid 21751764 offset 6733824 count 1
> item 118 key (1623012450304 EXTENT_ITEM 8192) itemoff 10802 itemsize 53
> extent refs 1 gen 3351513 flags DATA
> extent data backref root 271 objectid 5724364 offset 680640512 count 1
> item 119 key (1623012802560 EXTENT_ITEM 12288) itemoff 10749 itemsize 53
> extent refs 1 gen 3346376 flags DATA
> extent data backref root 271 objectid 21751764 offset 6701056 count 1

>>> hex(1623012749312)
'0x179e3193000'
>>> hex(1621939052544)
'0x179a319e000'
>>> hex(1623012450304)
'0x179e314a000'
>>> hex(1623012802560)
'0x179e31a0000'

   That's "e" -> "a" in the fourth hex digit, which is a single-bit
flip, and should be fixable by btrfs check (I think). However, even
fixing that, it's not ordered, because 118 is then before 117, which
could be another bitflip ("9" -> "4" in the 7th digit), but two bad
bits that close to each other seems unlikely to me.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Silly Point Break
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Seeking Help on Corruption Issues

2017-10-03 Thread Hugo Mills
On Tue, Oct 03, 2017 at 01:06:50PM -0700, Stephen Nesbitt wrote:
> All:
> 
> I came back to my computer yesterday to find my filesystem in read
> only mode. Running a btrfs scrub start -dB aborts as follows:
> 
> btrfs scrub start -dB /mnt
> ERROR: scrubbing /mnt failed for device id 4: ret=-1, errno=5
> (Input/output error)
> ERROR: scrubbing /mnt failed for device id 5: ret=-1, errno=5
> (Input/output error)
> scrub device /dev/sdb (id 4) canceled
>     scrub started at Mon Oct  2 21:51:46 2017 and was aborted after
> 00:09:02
>     total bytes scrubbed: 75.58GiB with 1 errors
>     error details: csum=1
>     corrected errors: 0, uncorrectable errors: 1, unverified errors: 0
> scrub device /dev/sdc (id 5) canceled
>     scrub started at Mon Oct  2 21:51:46 2017 and was aborted after
> 00:11:11
>     total bytes scrubbed: 50.75GiB with 0 errors
> 
> The resulting dmesg is:
> [  699.534066] BTRFS error (device sdc): bdev /dev/sdb errs: wr 0,
> rd 0, flush 0, corrupt 6, gen 0
> [  699.703045] BTRFS error (device sdc): unable to fixup (regular)
> error at logical 1609808347136 on dev /dev/sdb
> [  783.306525] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116

   This error usually means bad RAM. Can you show us the output of
"btrfs-debug-tree -b 2589782867968 /dev/sdc"?

   Hugo.

> [  789.776132] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116
> [  911.529842] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116
> [  918.365225] BTRFS critical (device sdc): corrupt leaf, bad key
> order: block=2589782867968, root=1, slot=116
> 
> Running btrfs check /dev/sdc results in:
> btrfs check /dev/sdc
> Checking filesystem on /dev/sdc
> UUID: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> checking extents
> bad key ordering 116 117
> bad block 2589782867968
> ERROR: errors found in extent allocation tree or chunk allocation
> checking free space cache
> There is no free space entry for 1623012450304-1623012663296
> There is no free space entry for 1623012450304-1623225008128
> cache appears valid but isn't 1622151266304
> found 288815742976 bytes used err is -22
> total csum bytes: 0
> total tree bytes: 350781440
> total fs tree bytes: 0
> total extent tree bytes: 350027776
> btree space waste bytes: 115829777
> file data blocks allocated: 156499968
> 
> uname -a:
> Linux sysresccd 4.9.24-std500-amd64 #2 SMP Sat Apr 22 17:14:43 UTC
> 2017 x86_64 Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz GenuineIntel
> GNU/Linux
> 
> btrfs --version: btrfs-progs v4.9.1
> 
> btrfs fi show:
> Label: none  uuid: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
>     Total devices 2 FS bytes used 475.08GiB
>     devid    4 size 931.51GiB used 612.06GiB path /dev/sdb
>     devid    5 size 931.51GiB used 613.09GiB path /dev/sdc
> 
> btrfs fi df /mnt:
> Data, RAID1: total=603.00GiB, used=468.03GiB
> System, RAID1: total=64.00MiB, used=112.00KiB
> System, single: total=32.00MiB, used=0.00B
> Metadata, RAID1: total=9.00GiB, used=7.04GiB
> Metadata, single: total=1.00GiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> What is the recommended procedure at this point? Run btrfs check
> --repair? I have backups so losing a file or two isn't critical, but
> I really don't want to go through the effort of a bare metal
> reinstall.
> 
> In the process of researching this I did uncover a bad DIMM. Am I
> correct that the problems I'm seeing are likely linked to the
> resulting memory errors?
> 
> Thx in advance,
> 
> -steve
> 

-- 
Hugo Mills | Quidquid latine dictum sit, altum videtur
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |



