Re: Understanding "btrfs filesystem usage"

2018-10-29 Thread Hugo Mills
On Mon, Oct 29, 2018 at 05:57:10PM -0400, Remi Gauvin wrote:
> On 2018-10-29 02:11 PM, Ulli Horlacher wrote:
> > I want to know how much free space is left, and I have problems
> > interpreting the output of:
> > 
> > btrfs filesystem usage
> > btrfs filesystem df
> > btrfs filesystem show
> > 
> >
> 
> In my not-so-humble opinion, the filesystem usage command has the
> easiest-to-understand output.  It lays out all the pertinent information.

   Opinions are divided. I find it almost impossible to read, and
always use btrfs fi df and btrfs fi show together.

   There are short tutorials on how to read the output in both cases in
the FAQ, which is where I start by directing people in this
instance.

   Hugo.

> You can clearly see 825GiB is allocated, with 494GiB used; therefore,
> filesystem show is actually reporting the "Allocated" value as "Used".
> Allocated can be thought of as "Reserved For".  As the output of the usage
> and df commands clearly shows, you have almost 400GiB of space available.
> 
> Note that the btrfs commands are clearly and explicitly displaying
> values in binary units (Mi and Gi prefixes, respectively).  If you want
> the df command to match, use -h instead of -H (see man df).
> 
> An observation:
> 
> The disparity between 498GiB used and 823GiB allocated is pretty high.
> This is probably the result of using an SSD with an older kernel.  If
> your kernel is not very recent (sorry, I forget where this was fixed,
> somewhere around 4.14 or 4.15), then consider mounting with the nossd
> option.  You can improve this by running a balance.
> 
> Something like this (with the filesystem's mount point as the last
> argument):
> btrfs balance start -dusage=55 <mountpoint>
> 
> You do *not* want to end up with all your space allocated to Data but
> not actually used by data.  Bad things can happen if you run out of
> Unallocated space for more metadata (not catastrophic, but awkward and
> unexpected downtime that can be a little tricky to sort out).
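>
> A minimal way to check for that first (a sketch, assuming the
> filesystem is mounted at /mnt):
>
> # btrfs fi usage /mnt | grep -i unallocated   # make sure this isn't near zero
> # btrfs balance start -dusage=55 /mnt
>
> The -dusage=55 filter only rewrites data block groups that are at most
> 55% full, so this is much cheaper than a full balance.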
> 
> 



-- 
Hugo Mills | Great oxymorons of the world, no. 8:
hugo@... carfax.org.uk | The Latest In Proven Technology
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Urgent: Need BTRFS-Expert

2018-10-17 Thread Hugo Mills
   Hi, Michael,

On Wed, Oct 17, 2018 at 09:58:31AM +0200, Michael Post wrote:
> Hello everyone,
> 
> I need a BTRFS expert for remote support.
> 
> Is there anyone who can assist me?

   This is generally the wrong approach to take in open-source
circles. Instead, if you describe your problem here on this mailing
list, you'll get *most* of the experts looking at it, rather than just
the one, and you'll generally get a much better (and easier to use)
service.

   Hugo.

-- 
Hugo Mills | The early bird gets the worm, but the second mouse
hugo@... carfax.org.uk | gets the cheese.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Hugo Mills
On Mon, Oct 15, 2018 at 05:40:40PM +0300, Anton Shepelev wrote:
> Hugo Mills to Anton Shepelev:
> 
> >>While trying to resolve free space problems, I found
> >>that I cannot interpret the output of:
> >>
> >>> btrfs filesystem show
> >>
> >>Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> >>Total devices 1 FS bytes used 34.06GiB
> >>devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> >>
> >>How come the total used value is less than the value
> >>listed for the only device?
> >
> >   "Used" on the device is the mount of space allocated.
> >"Used" on the FS is the total amount of actual data and
> >metadata in that allocation.
> >
> >   You will also need to look at the output of "btrfs fi
> >df" to see the breakdown of the 37.82 GiB into data,
> >metadata and currently unused.
> >
> >See
> >https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
> > for the details
> 
> Thank you, Hugo, understood.  mount/amount is a very fitting
> typo :-)
> 
> Do the standard `du' and `df' tools report correct values
> with btrfs?

   Well...

   du will tell you the size of the files you asked it about, but it
doesn't know about reflinks, so it'll double-count if you've got a
reflink copy of something. Other than that, it should be accurate, I
think. There's also a "btrfs fi du" which can tell you the amount of
shared and unique data as well, so you can know, for example, how much
space you'll reclaim if you delete those files.

   df should be mostly OK, but it does sometimes get its estimate of
the total usable size of the FS wrong, particularly if the FS is
unbalanced. However, as the FS fills up, the estimate gets better,
because it gets more evenly balanced across devices over time.
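
   For illustration, a sketch of the first one (the path and the
numbers are hypothetical, but the columns are what the tool prints):

# btrfs fi du -s /mnt/data
     Total   Exclusive  Set shared  Filename
  10.00GiB     2.00GiB     8.00GiB  /mnt/data

"Exclusive" is roughly the space you'd get back by deleting it;
"Set shared" is data also referenced from elsewhere (snapshots or
reflink copies).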

   Hugo.

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Hugo Mills
On Mon, Oct 15, 2018 at 02:26:41PM +, Hugo Mills wrote:
> On Mon, Oct 15, 2018 at 05:24:08PM +0300, Anton Shepelev wrote:
> > Hello, all
> > 
> > While trying to resolve free space problems, I found that
> > I cannot interpret the output of:
> > 
> > > btrfs filesystem show
> > 
> > Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> > Total devices 1 FS bytes used 34.06GiB
> > devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> > 
> > How come the total used value is less than the value listed
> > for the only device?
> 
>"Used" on the device is the mount of space allocated. "Used" on the

s/mount/amount/

> FS is the total amount of actual data and metadata in that allocation.
> 
>You will also need to look at the output of "btrfs fi df" to see
> the breakdown of the 37.82 GiB into data, metadata and currently
> unused.
> 
>See 
> https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
>  for the details.
> 
>Hugo.
> 

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5




Re: Interpreting `btrfs filesystem show'

2018-10-15 Thread Hugo Mills
On Mon, Oct 15, 2018 at 05:24:08PM +0300, Anton Shepelev wrote:
> Hello, all
> 
> While trying to resolve free space problems, I found that
> I cannot interpret the output of:
> 
> > btrfs filesystem show
> 
> Label: none  uuid: 8971ce5b-71d9-4e46-ab25-ca37485784c8
> Total devices 1 FS bytes used 34.06GiB
> devid1 size 40.00GiB used 37.82GiB path /dev/sda2
> 
> How come the total used value is less than the value listed
> for the only device?

   "Used" on the device is the mount of space allocated. "Used" on the
FS is the total amount of actual data and metadata in that allocation.

   You will also need to look at the output of "btrfs fi df" to see
the breakdown of the 37.82 GiB into data, metadata and currently
unused.
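
   For example, a breakdown consistent with the numbers above might
look like this (illustrative values only):

# btrfs fi df /
Data, single: total=35.00GiB, used=33.50GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=2.79GiB, used=0.56GiB
GlobalReserve, single: total=128.00MiB, used=0.00B

   The "total" figures are the allocation (summing to roughly the
37.82GiB shown against the device), and the "used" figures are the
actual data and metadata inside it (summing to the 34.06GiB).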

   See 
https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_original_tools
 for the details.

   Hugo.

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5




Re: Which device is missing ?

2018-10-08 Thread Hugo Mills
On Mon, Oct 08, 2018 at 11:01:35PM +0200, Pierre Couderc wrote:
> On 10/08/2018 06:14 PM, Hugo Mills wrote:
> >On Mon, Oct 08, 2018 at 04:10:55PM +0000, Hugo Mills wrote:
> >>On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> >>>I am trying to make a "RAID1" with /dev/sda2 and /dev/sdb (or similar).
> >>>
> >>>But I have strange status or error messages about "missing devices", and I
> >>>do not understand the current situation:
> >>>
> >>>
> >>>root@server:~# btrfs fi show
> >>>Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
> >>>     Total devices 1 FS bytes used 190.91GiB
> >>>     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> >>>
> >>>warning, device 1 is missing
> >>>Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
> >>>     Total devices 2 FS bytes used 116.18GiB
> >>>     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
> >>>     *** Some devices missing
> >>This looks like you've created a RAID-1 array with /dev/sda2 and
> >>/dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
> >>original [part of a] filesystem on /dev/sda2, and replacing it with a
> >>wholly different filesystem. Since the new FS on /dev/sda2 (UUID
> >>28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
> >>and the original FS was made of two devices, btrfs fi show is telling
> >>you that there's some devices missing -- /dev/sda2 is no longer part
> >>of that FS, and is therefore a missing device.
> >>
> >>I note that you've got data on both filesystems, so they must both
> >>have been mounted somewhere and had stuff put on them.
> >>
> >>I recommend doing something like this:
> >>
> >># mkdir /media/btrfs/myraid1 /media/btrfs/tmp   # create the mount points
> >># mount /dev/sdb /media/btrfs/myraid1/
> >># mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
> >># cp -a /media/btrfs/tmp/. /media/btrfs/myraid1/ # put it where you want it
> >># umount /media/btrfs/tmp/
> >># wipefs -a /dev/sda2   # destroy the FS on sda2
> >># btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/
> >>
> >>This will copy all the data from the filesystem on /dev/sda2 into
> >>the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
> >>as the second device for the main FS.
> >>
> >>*WARNING!*
> >>
> >>Note that, since the main FS is missing a device, it will probably
> >>need to be mounted in degraded mode (-o degraded), and that on kernels
> >>earlier than (IIRC) 4.14, this can only be done *once* without the FS
> >>becoming more or less permanently read-only. On recent kernels, it
> >>_should_ be OK.
> >>
> >>*WARNING ENDS*
> >Oh, and for the record, to make a RAID-1 filesystem from scratch,
> >you simply need this:
> >
> ># mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb
> >
> >You do not need to run mkfs.btrfs on each device separately.
> >
> >Hugo.
> Thank you very much. I understand a bit better. I think that I have
> nothing of interest on /dev/sdb and that its contents are the result
> of previous trials.
> And that my system is on /dev/sda2, as shown here:
> 
> root@server:~# df -h
> Filesystem  Size  Used Avail Use% Mounted on
> udev    3.9G 0  3.9G   0% /dev
> tmpfs   787M  8.8M  778M   2% /run
> /dev/sda2   1.9T  193G  1.7T  11% /
> tmpfs   3.9G 0  3.9G   0% /dev/shm
> tmpfs   5.0M 0  5.0M   0% /run/lock
> tmpfs   3.9G 0  3.9G   0% /sys/fs/cgroup
> /dev/sda1   511M  5.7M  506M   2% /boot/efi
> tmpfs   100K 0  100K   0% /var/lib/lxd/shmounts
> tmpfs   100K 0  100K   0% /var/lib/lxd/devlxd
> root@server:~#
> 
> Is that correct?

   Yes, it looks like you're running / from the FS on /dev/sda2.

> If yes, I suppose I should wipe data on /dev/sdb, then build the
> RAID by expanding /dev/sda2.

   Correct.

   I would recommend putting a partition table on /dev/sdb, because it
doesn't take up much space, and it's always easier to have one already
there when you need it (and there are a few things that can get confused
if there isn't a partition table).

> So I should :
> 
> wipefs -a /dev/sdb
> btrfs device add /dev/sdb /
> btrfs balance start -v -mconvert=raid1 -dconvert=raid1 /

> Does that sound correct? (My kernel is boot/vmlinuz-4.18.0-1-amd64.)

   Yes, exactly.
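
   Once the balance completes, something like this should confirm the
conversion (a sketch):

# btrfs balance status /
# btrfs fi df /

   The Data, System and Metadata lines in the df output should all show
RAID1 afterwards.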

   Hugo.

-- 
Hugo Mills | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4  |Derek Lowe




Re: Which device is missing ?

2018-10-08 Thread Hugo Mills
On Mon, Oct 08, 2018 at 04:10:55PM +, Hugo Mills wrote:
> On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> > I am trying to make a "RAID1" with /dev/sda2 and /dev/sdb (or similar).
> > 
> > But I have strange status or error messages about "missing devices", and I
> > do not understand the current situation:
> > 
> > 
> > root@server:~# btrfs fi show
> > Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
> >     Total devices 1 FS bytes used 190.91GiB
> >     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> > 
> > warning, device 1 is missing
> > Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
> >     Total devices 2 FS bytes used 116.18GiB
> >     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
> >     *** Some devices missing
> 
>This looks like you've created a RAID-1 array with /dev/sda2 and
> /dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
> original [part of a] filesystem on /dev/sda2, and replacing it with a
> wholly different filesystem. Since the new FS on /dev/sda2 (UUID
> 28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
> and the original FS was made of two devices, btrfs fi show is telling
> you that there's some devices missing -- /dev/sda2 is no longer part
> of that FS, and is therefore a missing device.
> 
>I note that you've got data on both filesystems, so they must both
> have been mounted somewhere and had stuff put on them.
> 
>I recommend doing something like this:
> 
> # mkdir /media/btrfs/myraid1 /media/btrfs/tmp   # create the mount points
> # mount /dev/sdb /media/btrfs/myraid1/
> # mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
> # cp -a /media/btrfs/tmp/. /media/btrfs/myraid1/ # put it where you want it
> # umount /media/btrfs/tmp/
> # wipefs -a /dev/sda2   # destroy the FS on sda2
> # btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/
> 
>This will copy all the data from the filesystem on /dev/sda2 into
> the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
> as the second device for the main FS.
> 
> *WARNING!*
> 
>Note that, since the main FS is missing a device, it will probably
> need to be mounted in degraded mode (-o degraded), and that on kernels
> earlier than (IIRC) 4.14, this can only be done *once* without the FS
> becoming more or less permanently read-only. On recent kernels, it
> _should_ be OK.
> 
> *WARNING ENDS*

   Oh, and for the record, to make a RAID-1 filesystem from scratch,
you simply need this:

# mkfs.btrfs -m raid1 -d raid1 /dev/sda2 /dev/sdb

   You do not need to run mkfs.btrfs on each device separately.

   Hugo.

-- 
Hugo Mills | Welcome to Rivendell, Mr Anderson...
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Machinae Supremacy, Hybrid




Re: Which device is missing ?

2018-10-08 Thread Hugo Mills
On Mon, Oct 08, 2018 at 03:49:53PM +0200, Pierre Couderc wrote:
> I am trying to make a "RAID1" with /dev/sda2 and /dev/sdb (or similar).
> 
> But I have strange status or error messages about "missing devices", and I
> do not understand the current situation:
> 
> 
> root@server:~# btrfs fi show
> Label: none  uuid: 28c2b7ab-631c-40a3-bab7-00dac5dd20eb
>     Total devices 1 FS bytes used 190.91GiB
>     devid    1 size 1.82TiB used 196.02GiB path /dev/sda2
> 
> warning, device 1 is missing
> Label: none  uuid: 2d45149a-fb97-4c2a-bae2-4cfe4e01a8aa
>     Total devices 2 FS bytes used 116.18GiB
>     devid    2 size 1.82TiB used 118.03GiB path /dev/sdb
>     *** Some devices missing

   This looks like you've created a RAID-1 array with /dev/sda2 and
/dev/sdb, and then run mkfs.btrfs again on /dev/sda2, overwriting the
original [part of a] filesystem on /dev/sda2, and replacing it with a
wholly different filesystem. Since the new FS on /dev/sda2 (UUID
28c2...) doesn't have the same UUID as the original FS (UUID 2d45...),
and the original FS was made of two devices, btrfs fi show is telling
you that there's some devices missing -- /dev/sda2 is no longer part
of that FS, and is therefore a missing device.

   I note that you've got data on both filesystems, so they must both
have been mounted somewhere and had stuff put on them.

   I recommend doing something like this:

# mkdir /media/btrfs/myraid1 /media/btrfs/tmp   # create the mount points
# mount /dev/sdb /media/btrfs/myraid1/
# mount /dev/sda2 /media/btrfs/tmp/  # mount both filesystems
# cp -a /media/btrfs/tmp/. /media/btrfs/myraid1/ # put it where you want it
# umount /media/btrfs/tmp/
# wipefs -a /dev/sda2   # destroy the FS on sda2
# btrfs replace start 1 /dev/sda2 /media/btrfs/myraid1/

   This will copy all the data from the filesystem on /dev/sda2 into
the filesystem on /dev/sdb, destroy the FS on sda2, and then use sda2
as the second device for the main FS.

*WARNING!*

   Note that, since the main FS is missing a device, it will probably
need to be mounted in degraded mode (-o degraded), and that on kernels
earlier than (IIRC) 4.14, this can only be done *once* without the FS
becoming more or less permanently read-only. On recent kernels, it
_should_ be OK.

*WARNING ENDS*
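
   The replace runs in the background; its progress can be checked with
something like this (a sketch):

# btrfs replace status -1 /media/btrfs/myraid1/

   (Without -1, the command keeps printing the status until the
operation finishes.)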

   Hugo.

[snip]

-- 
Hugo Mills | UNIX: Japanese brand of food containers
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Hugo Mills
On Tue, Sep 18, 2018 at 06:28:37PM +, Gervais, Francois wrote:
> > No. It is already possible (by setting received UUID); it should not be
> > made too open to easy abuse.
> 
> 
> Do you mean edit the UUID in the byte stream before btrfs receive?

   No, there's an ioctl to change the received UUID of a
subvolume. It's used by receive, at the very end of the receive
operation.

   Messing around in this area is basically a recipe for ending up
with a half-completed send/receive full of broken data because the
receiving subvolume isn't quite as identical as you thought. Receive
enforces the rules for a reason.

   Now, it's possible to modify the send stream and the logic around
it a bit to support a number of additional modes of operation
(bidirectional send, for example), but that's queued up waiting for
(a) a definitive list of send stream format changes, and (b) David's
bandwidth to put them together in one patch set.

   If you want to see more on the underlying UUID model, and how it
could be (ab)used and modified, there's a write-up here, in a thread
on pretty much exactly the same proposal that you've just made:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

-- 
Hugo Mills | Great films about cricket: Monster's No-Ball
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: DRDY errors are not consistent with scrub results

2018-08-29 Thread Hugo Mills
On Wed, Aug 29, 2018 at 09:58:58AM +, Duncan wrote:
> Cerem Cem ASLAN posted on Wed, 29 Aug 2018 09:58:21 +0300 as excerpted:
> 
> > Thinking again, this is totally acceptable. If the requirement was a
> > disk in good health, then I think I must check the disk's health myself.
> > I may trust that the disk is in a good state, or run a quick test, or
> > run some very detailed tests to be sure.
> 
> For testing you might try badblocks.  It's most useful on a device that 
> doesn't have a filesystem on it you're trying to save, so you can use the 
> -w write-test option.  See the manpage for details.
> 
> The -w option should force the device to remap bad blocks where it can as 
> well, and you can take your previous smartctl read and compare it to a 
> new one after the test.
> 
> Hint if testing multiple spinning-rust devices:  Try running multiple 
> tests at once.  While this might have been slower on old EIDE, at least 
> with spinning rust, on SATA and similar you should be able to test 
> multiple devices at once without them slowing down significantly, because 
> the bottleneck is the spinning rust, not the bus, controller or CPU.  I 
> used badblocks years ago to test my new disks before setting up mdraid on 
> them, and with full disk tests on spinning rust taking (at the time) 
> nearly a day a pass and four passes for the -w test, the multiple tests 
> at once trick saved me quite a bit of time!

   Hah. Only a day? It's up to 2 days now.

   The devices get bigger. The interfaces don't get faster at the same
rate. Back in the late '90s, it was only an hour or so to run a
badblocks pass on a big disk...
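
   For reference, the run-in-parallel trick is just backgrounded shell
jobs, something like this (a sketch -- note that -w is destructive, so
only use it on disks holding nothing you want to keep):

# badblocks -wsv /dev/sdb > sdb.log 2>&1 &
# badblocks -wsv /dev/sdc > sdc.log 2>&1 &
# wait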

   Hugo.

-- 
Hugo Mills | Nostalgia isn't what it used to be.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: BTRFS and databases

2018-08-01 Thread Hugo Mills
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> I know it's a decade-old question, but I'd like to hear your thoughts
> of today. By now, I became a heavy BTRFS user. Almost everywhere I use
> BTRFS, except in situations when it is obvious there is no benefit
> (e.g. /var/log, /boot). At home, all my desktop, laptop and server
> computers are mainly running on BTRFS with only a few file systems on
> ext4. I even installed BTRFS on corporate production systems (in those
> cases, the systems were mainly on ext4, but there were some specific
> file systems that exploited BTRFS features).
> 
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?

   Personally, I'd start with btrfs with autodefrag. It has some
degree of I/O overhead, but unless the database is performance-critical
and already near the limits of the hardware, it's unlikely to make
much difference. Autodefrag should keep the fragmentation down to a
minimum.
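
   As a sketch of the two approaches (paths hypothetical; note that the
+C attribute only takes effect on files created after it is set on the
directory):

# mount -o autodefrag /dev/sdb1 /srv/db   # CoW kept, fragmentation managed
# chattr +C /srv/db/mysql                 # or: nodatacow for just this directory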

   Hugo.

> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are exactly what hurts database performance – the CoW nature
> that is elsewhere a blessing is, with databases, a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?
> 
> 
> Kind regards,
> MegaBrutal

-- 
Hugo Mills | In theory, theory and practice are the same. In
hugo@... carfax.org.uk | practice, they're different.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs filesystem corruptions with 4.18. git kernels

2018-07-20 Thread Hugo Mills
On Fri, Jul 20, 2018 at 11:28:42PM +0200, Alexander Wetzel wrote:
> Hello,
> 
> I'm running my normal workstation with git kernels from 
> git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-testing.git
> and just got the second file system corruption in three weeks. I do
> not have issues with stable kernels, and just want to give you a
> heads up that there might be something seriously broken in current
> development kernels.
> 
> The first corruption was with a kernel based on 4.18.0-rc1
> (wt-2018-06-20) and the second one today based on 4.18.0-rc4
> (wt-2018-07-09).
> The first corruption definitely destroyed data, the second one has
> not been looked at all, yet.
> 
> After the reinstall I ran some scrubs, the last successful one a
> week ago.
> 
> Of course this could be unrelated to the development kernels or even
> btrfs, but two corruptions within weeks after years without problems
> is very suspect.
> And since btrfs also allowed corrupted data to be read (with a stable
> Ubuntu kernel, see below for more details), it looks like this is
> indeed an issue in btrfs, correct?
> 
> A btrfs subvolume is used as the rootfs on a "Samsung SSD 850 EVO
> mSATA 1TB" and I'm running Gentoo ~amd64 on a Thinkpad W530. Discard
> is enabled as mount option and there were roughly 5 other
> subvolumes.
> 
> I'm currently backing up the full btrfs partition after the second
> corruption which announced itself with the following log entries:
> 
> [  979.223767] BTRFS critical (device sdc2): corrupt leaf: root=2
> block=1029783552 slot=1, unexpected item end, have 16161 expect
> 16250

   This means that the metadata block matches the checksum in its
header, but is internally inconsistent. In other words, the error in
the block was introduced before the csum was computed -- i.e., it was
already that way in RAM. This can happen in a couple of different ways,
but the most likely cause is bad RAM.

   In this case, it's not a single bitflip in the metadata page
itself, so it's more likely to be something writing spurious data on
the page in RAM that was holding this metadata block. This is either a
bug in the kernel, or a hardware problem.

   I would strongly recommend checking your RAM (memtest86 for a
minimum of 8 hours, preferably 24).
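
   (If a long memtest86 run is hard to schedule, a userspace pass with
memtester can serve as a first smoke test, although it can't touch the
memory the kernel itself is using -- a sketch:

# memtester 2048M 3   # lock and test 2GiB of RAM, three passes

Only the full offline test really clears the RAM of suspicion, though.)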

> [  979.223808] BTRFS: error (device sdc2) in __btrfs_cow_block:1080:
> errno=-5 IO failure
> [  979.223810] BTRFS info (device sdc2): forced readonly
> [  979.224599] BTRFS warning (device sdc2): Skipping commit of
> aborted transaction.
> [  979.224603] BTRFS: error (device sdc2) in
> cleanup_transaction:1847: errno=-5 IO failure
> 
> I'll restore the system from a backup - and stick to stable kernels
> for now - after that, but if needed I can of course also restore the
> partition backup to another disk for testing.

   It may be a kernel issue, but it's not necessarily in btrfs. It
could be a bug in some other kernel component where it does some
pointer arithmetic wrong, or uses some uninitialised data as a
pointer. My money's on bad RAM, though (by a small margin).

   Hugo.

-- 
Hugo Mills | Stick them with the pointy end.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Jon Snow




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-20 Thread Hugo Mills
On Fri, Jul 20, 2018 at 09:38:14PM +0300, Andrei Borzenkov wrote:
> 20.07.2018 20:16, Goffredo Baroncelli wrote:
[snip]
> > Limiting the number of disks per RAID in BTRFS would be quite simple to
> > implement in the "chunk allocator".
> > 
> 
> You mean that currently the RAID5 stripe size is equal to the number of
> disks? Well, I suppose nobody is using btrfs with disk pools of two- or
> three-digit size.

   But they are (even if not very many of them) -- we've seen at least
one person with something like 40 or 50 devices in the array. They'd
definitely got into /dev/sdac territory. I don't recall what RAID level
they were using. I think it was either RAID-1 or -10.

   That's the largest I can recall seeing mention of, though.

   Hugo.

-- 
Hugo Mills | Have found Lost City of Atlantis. High Priest is
hugo@... carfax.org.uk | winning at quoits.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Terry Pratchett




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-18 Thread Hugo Mills
On Wed, Jul 18, 2018 at 08:39:48AM +, Duncan wrote:
> Duncan posted on Wed, 18 Jul 2018 07:20:09 + as excerpted:
> 
> >> As implemented in BTRFS, raid1 doesn't have striping.
> > 
> > The argument is that because there's only two copies, on multi-device
> > btrfs raid1 with 4+ devices of equal size so chunk allocations tend to
> > alternate device pairs, it's effectively striped at the macro level,
> > with the 1 GiB device-level chunks effectively being huge individual
> > device strips of 1 GiB.
> > 
> > At 1 GiB strip size it doesn't have the typical performance advantage of
> > striping, but conceptually, it's equivalent to raid10 with huge 1 GiB
> > strips/chunks.
> 
> I forgot this bit...
> 
> Similarly, multi-device single is regarded by some to be conceptually 
> equivalent to raid0 with really huge GiB strips/chunks.
> 
> (As you may note, "the argument is" and "regarded by some" are distancing 
> phrases.  I've seen the argument made on-list, and while I understand the 
> argument and agree with it to some extent, I'm still a bit uncomfortable 
> with it and don't normally make it myself -- this thread being a noted 
> exception, though originally I simply repeated what someone else had 
> already said in-thread -- because I too agree it's stretching things a 
> bit.  But it does appear to be a useful conceptual equivalency for some, 
> and I do see the similarity.
> 
> Perhaps it's a case of coder's view (no code doing it that way, it's just 
> a coincidental oddity conditional on equal sizes), vs. sysadmin's view 
> (code or not, accidental or not, it's a reasonably accurate high-level 
> description of how it ends up working most of the time with equivalent 
> sized devices).)

   Well, it's an *accurate* observation. It's just not a particularly
*useful* one. :)

   Hugo.

-- 
Hugo Mills | I gave up smoking, drinking and sex once. It was the
hugo@... carfax.org.uk | scariest 20 minutes of my life.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 0/4] 3- and 4- copy RAID1

2018-07-15 Thread Hugo Mills
On Fri, Jul 13, 2018 at 08:46:28PM +0200, David Sterba wrote:
[snip]
> An interesting question is the naming of the extended profiles. I picked
> something that can be easily understood but it's not a final proposal.
> Years ago, Hugo proposed a naming scheme that described the
> non-standard raid varieties of the btrfs flavor:
> 
> https://marc.info/?l=linux-btrfs&m=136286324417767
> 
> Switching to this naming would be a good addition to the extended raid.

   I'd suggest using lower-case letters for the c, s, p, rather than
upper, as it makes it much easier to read. The upper-case version
tends to make the letters and numbers merge into each other. With
lower-case c, s, p, the taller digits (or M) stand out:

  1c
  1cMs2p
  2c3s8p (OK, just kidding about this one)

   Hugo.

-- 
Hugo Mills | The English language has the mot juste for every
hugo@... carfax.org.uk | occasion.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: unsolvable technical issues?

2018-06-25 Thread Hugo Mills
On Mon, Jun 25, 2018 at 06:43:38PM +0200, waxhead wrote:
[snip]
> I hope I am not asking for too much (but I know I probably am), but
> I suggest that having a small snippet of information on the status
> page, showing a little bit about what the current development focus
> is, or what people are known to be working on, would be very
> valuable for users. It may of course work both ways, such as
> exciting people or calming them down. ;)
> 
> For example something simple like a "development focus" list...
> 2018-Q4: (planned) Renaming the grotesque "RAID" terminology
> 2018-Q3: (planned) Magical feature X
> 2018-Q2: N-Way mirroring
> 2018-Q1: Feature work "RAID"5/6
> 
> I think it would be good for people living their lives outside the
> project, as it would perhaps spark some attention from developers and
> perhaps even the media as well.

   I started doing this a couple of years ago, but it turned out to be
impossible to keep even vaguely accurate or up to date, without going
round and bugging the developers individually on a per-release
basis. I don't think it's going to happen.

   Hugo.

-- 
Hugo Mills | emacs: Emacs Makes A Computer Slow.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs: Add more details while checking tree block

2018-06-22 Thread Hugo Mills
On Fri, Jun 22, 2018 at 05:26:02PM +0200, Hans van Kranenburg wrote:
> On 06/22/2018 01:48 PM, Nikolay Borisov wrote:
> > 
> > 
> > On 22.06.2018 04:52, Su Yue wrote:
> >> For easier debugging, print eb->start if the level is invalid.
> >> Also make the message clear when the bytenr found is not the expected one.
> >>
> >> Signed-off-by: Su Yue 
> >> ---
> >>  fs/btrfs/disk-io.c | 8 
> >>  1 file changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> >> index c3504b4d281b..a90dab84f41b 100644
> >> --- a/fs/btrfs/disk-io.c
> >> +++ b/fs/btrfs/disk-io.c
> >> @@ -615,8 +615,8 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
> >>  
> >>found_start = btrfs_header_bytenr(eb);
> >>if (found_start != eb->start) {
> >> -  btrfs_err_rl(fs_info, "bad tree block start %llu %llu",
> >> -   found_start, eb->start);
> >> +  btrfs_err_rl(fs_info, "bad tree block start want %llu have %llu",
> > 
> > nit: I'd rather have the want/have in brackets (want %llu have% llu)
> 
> From a user support point of view, this text should really be improved.
> There are a few places where 'want' and 'have' are reported in error
> strings, and it's totally unclear what they mean.
> 
> Intuitively I'd say when checking a csum, the "want" would be what's on
> disk now, since you want that to be correct, and the "have" would be
> what you have calculated, but it's actually the other way round, or
> wasn't it? Or was it?
> 
> Every time someone pastes such a message when we help on IRC for
> example, there's confusion, and I have to look up the source again,
> because I always forget.
> 
> What about (%llu stored on disk, %llu calculated now) or something similar?

   Yes, definitely this. I experience the same confusion as Hans, and
I think a lot of other people do, too. I usually read "want" and
"have" the wrong way round, so more clarity would be really helpful.

   Hugo.

> >> +   eb->start, found_start);
> >>ret = -EIO;
> >>goto err;
> >>}
> >> @@ -628,8 +628,8 @@ static int btree_readpage_end_io_hook(struct btrfs_io_bio *io_bio,
> >>}
> >>found_level = btrfs_header_level(eb);
> >>if (found_level >= BTRFS_MAX_LEVEL) {
> >> -  btrfs_err(fs_info, "bad tree block level %d",
> >> -(int)btrfs_header_level(eb));
> >> +  btrfs_err(fs_info, "bad tree block level %d on %llu",
> >> +(int)btrfs_header_level(eb), eb->start);
> >>ret = -EIO;
> >>goto err;
> >>}
> >>
> 

-- 
Hugo Mills | "There's a Martian war machine outside -- they want
hugo@... carfax.org.uk | to talk to you about a cure for the common cold."
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Stephen Franklin, Babylon 5




Re: About more loose parameter sequence requirement

2018-06-18 Thread Hugo Mills
On Mon, Jun 18, 2018 at 01:34:32PM +0200, David Sterba wrote:
> On Thu, Jun 14, 2018 at 03:17:45PM +0800, Qu Wenruo wrote:
> > I understand that btrfs-progs introduced a strict parameter/option order
> > to distinguish global and sub-command parameters/options.
> > 
> > However, it's really annoying if one just wants to append some new
> > options to the previous command:
> > 
> > E.g.
> > # btrfs check /dev/data/btrfs
> > # !! --check-data-csum
> > 
> > The last command will fail, as current btrfs-progs doesn't allow any
> > option after a parameter.
> > 
> > 
> > Apart from the requirement to distinguish global and subcommand
> > options/parameters, is there any other reason for such a strict
> > options-first-parameters-last policy?
> 
> I'd say that it's a common and recommended pattern. Getopt is able to
> reorder the parameters so mixed options and non-options are accepted,
> unless POSIXLY_CORRECT (see getopt(3)) is set. With the stricter
> requirement, the 'btrfs' option parser works the same regardless of
> that.

   I got bitten by this the other day. I put an option flag at the end
of the line, after the mountpoint, and it refused to work.

   I would definitely prefer it if it parsed options in any
position. (Or at least, any position after the group/command
parameters).
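
   David's getopt(3) point is easy to see with any GNU tool (a sketch;
ls is just a convenient example of a getopt user):

$ ls /tmp -l                     # GNU getopt permutes: same as "ls -l /tmp"
$ POSIXLY_CORRECT=1 ls /tmp -l   # option parsing stops at /tmp, so "-l"
                                 # is treated as a file name and this fails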

   Hugo.

> > If I could implement an enhanced getopt to allow a looser order inside
> > subcommands while still distinguishing global options, would it be
> > accepted (if its quality is acceptable)?
> 
> I think it's not worth updating the parser just to support an IMHO
> narrow usecase.

-- 
Hugo Mills | Turning, pages turning in the widening bath,
hugo@... carfax.org.uk | The spine cannot bear the humidity.
http://carfax.org.uk/  | Books fall apart; the binding cannot hold.
PGP: E2AB1DE4  | Page 129 is loosed upon the world.   Zarf




Re: status page

2018-04-25 Thread Hugo Mills
On Wed, Apr 25, 2018 at 02:30:42PM +0200, Gandalf Corvotempesta wrote:
> 2018-04-25 13:39 GMT+02:00 Austin S. Hemmelgarn <ahferro...@gmail.com>:
> > Define 'stable'.
> 
> Something ready for production use, like ext or xfs: with no critical
> bugs and no easy data loss.
> 
> > If you just want 'safe for critical data', it's mostly there already
> > provided that your admins and operators are careful.  Assuming you avoid
> > qgroups and parity raid, don't run the filesystem near full all the time,
> > and keep an eye on the chunk allocations (which is easy to automate with
> > newer kernels), you will generally be fine.  We've been using it in
> > production where I work for a couple of years now, with the only issues
> > we've encountered arising from the fact that we're stuck using an older
> > kernel which doesn't automatically deallocate empty chunks.
> 
> For me, RAID56 is mandatory. Any ETA for a stable RAID56?
> Is it something we should expect this year, next year, or in the next 10 years?

   There's not really any ETAs for anything in the kernel, in general,
unless the relevant code has already been committed and accepted (when
it has a fairly deterministic path from then onwards). ETAs for
finding even known bugs are pretty variable, depending largely on how
easily the bug can be reproduced by the reporter and by the developer.

   As for a stable version -- you'll have to define "stable" in a way
that's actually measurable to get any useful answer, and even then,
see my previous comment about ETAs.

   There have been example patches in the last few months on the
subject of closing the write hole, so there's clear ongoing work on
that particular item, but again, see the comment on ETAs. It'll be
done when it's done.

   Hugo.

-- 
Hugo Mills | Nothing wrong with being written in Perl... Some of
hugo@... carfax.org.uk | my best friends are written in Perl.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  dark




Re: Recovery from full metadata with all device space consumed?

2018-04-19 Thread Hugo Mills
On Thu, Apr 19, 2018 at 04:12:39PM -0700, Drew Bloechl wrote:
> On Thu, Apr 19, 2018 at 10:43:57PM +0000, Hugo Mills wrote:
> >Given that both data and metadata levels here require paired
> > chunks, try adding _two_ temporary devices so that it can allocate a
> > new block group.
> 
> Thank you very much, that seems to have done the trick:
> 
> # fallocate -l 4GiB /var/tmp/btrfs-temp-1
> # fallocate -l 4GiB /var/tmp/btrfs-temp-2
> # losetup -f /var/tmp/btrfs-temp-1
> # losetup -f /var/tmp/btrfs-temp-2
> # btrfs device add /dev/loop0 /broken
> Performing full device TRIM (4.00GiB) ...
> # btrfs device add /dev/loop1 /broken
> Performing full device TRIM (4.00GiB) ...
> # btrfs balance start -v -dusage=1 /broken
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=1

   Excellent. Don't forget to "btrfs dev delete" the devices after
you've finished the balance. You could damage the FS (possibly
irreparably) if you destroy the devices without doing so.
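
   For example (a sketch, matching the loop devices set up above):

# btrfs device delete /dev/loop0 /dev/loop1 /broken
# losetup -d /dev/loop0
# losetup -d /dev/loop1
# rm /var/tmp/btrfs-temp-1 /var/tmp/btrfs-temp-2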

> I'm guessing that'll take a while to complete, but meanwhile, in another
> terminal:
> 
> # btrfs fi show /broken
> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>   Total devices 6 FS bytes used 69.53GiB
>   devid1 size 931.51GiB used 731.02GiB path /dev/sda1
>   devid2 size 931.51GiB used 731.02GiB path /dev/sdb1
>   devid3 size 931.51GiB used 730.03GiB path /dev/sdc1
>   devid4 size 931.51GiB used 730.03GiB path /dev/sdd1
>   devid5 size 4.00GiB used 1.00GiB path /dev/loop0
>   devid6 size 4.00GiB used 1.00GiB path /dev/loop1
> 
> # btrfs fi df /broken
> Data, RAID0: total=2.77TiB, used=67.00GiB
> System, RAID1: total=8.00MiB, used=192.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.49GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Do I understand correctly that this could require up to 3 extra devices,
> if for instance you arrived in this situation with a RAID6 data profile?
> Or is the number even higher for profiles like RAID10?

   The minimum number of devices for each RAID level is:

single, DUP: 1
RAID-0, -1, -5:  2
RAID-6:  3
RAID-10: 4

   Hugo.

-- 
Hugo Mills | Gentlemen! You can't fight here! This is the War
hugo@... carfax.org.uk | Room!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Dr Strangelove




Re: Recovery from full metadata with all device space consumed?

2018-04-19 Thread Hugo Mills
On Thu, Apr 19, 2018 at 03:08:48PM -0700, Drew Bloechl wrote:
> I've got a btrfs filesystem that I can't seem to get back to a useful
> state. The symptom I started with is that rename() operations started
> dying with ENOSPC, and it looks like the metadata allocation on the
> filesystem is full:
> 
> # btrfs fi df /broken
> Data, RAID0: total=3.63TiB, used=67.00GiB
> System, RAID1: total=8.00MiB, used=224.00KiB
> Metadata, RAID1: total=3.00GiB, used=2.50GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> All of the consumable space on the backing devices also seems to be in
> use:
> 
> # btrfs fi show /broken
> Label: 'mon_data'  uuid: 85e52555-7d6d-4346-8b37-8278447eb590
>   Total devices 4 FS bytes used 69.50GiB
>   devid1 size 931.51GiB used 931.51GiB path /dev/sda1
>   devid2 size 931.51GiB used 931.51GiB path /dev/sdb1
>   devid3 size 931.51GiB used 931.51GiB path /dev/sdc1
>   devid4 size 931.51GiB used 931.51GiB path /dev/sdd1
> 
> Even the smallest balance operation I can start fails (this doesn't
> change even with an extra temporary device added to the filesystem):

   Given that both data and metadata levels here require paired
chunks, try adding _two_ temporary devices so that it can allocate a
new block group.

   Hugo.

> # btrfs balance start -v -dusage=1 /broken
> Dumping filters: flags 0x1, state 0x0, force is off
>   DATA (flags 0x2): balancing, usage=1
> ERROR: error during balancing '/broken': No space left on device
> There may be more info in syslog - try dmesg | tail
> # dmesg | tail -1
> [11554.296805] BTRFS info (device sdc1): 757 enospc errors during
> balance
> 
> The current kernel is 4.15.0 from Debian's stretch-backports
> (specifically linux-image-4.15.0-0.bpo.2-amd64), but it was Debian's
> 4.9.30 when the filesystem got into this state. I upgraded it in the
> hopes that a newer kernel would be smarter, but no dice.
> 
> btrfs-progs is currently at v4.7.3.
> 
> Most of what this filesystem stores is Prometheus 1.8's TSDB for its
> metrics, which are constantly written at around 50MB/second. The
> filesystem never really gets full as far as data goes, but there's a lot
> of never-ending churn for what data is there.
> 
> Question 1: Are there other steps that can be tried to rescue a
> filesystem in this state? I still have it mounted in the same state, and
> I'm willing to try other things or extract debugging info.
> 
> Question 2: Is there something I could have done to prevent this from
> happening in the first place?
> 
> Thanks!

-- 
Hugo Mills | Always be sincere, whether you mean it or not.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |  Flanders & Swann
PGP: E2AB1DE4  |The Reluctant Cannibal




Re: [wiki] Please clarify how to check whether barriers are properly implemented in hardware

2018-04-02 Thread Hugo Mills
On Mon, Apr 02, 2018 at 06:03:00PM -0400, Fedja Beader wrote:
> Is there some testing utility for this? Is there a way to tell this
> with high enough certainty from datasheets/other material before purchase?

   Given that not implementing barriers is basically a bug in the
hardware [for SATA or SAS], I don't think anyone's going to specify
anything other than "fully suppors barriers" in their datasheets.

   I don't know of a testing tool. It may not be obvious that barriers
aren't being honoured without doing things like power-failure testing.

   Hugo.

> https://btrfs.wiki.kernel.org/index.php/FAQ#How_does_this_happen.3F

-- 
Hugo Mills | "Damn and blast British Telecom!" said Dirk,
hugo@... carfax.org.uk | the words coming easily from force of habit.
http://carfax.org.uk/  |Douglas Adams,
PGP: E2AB1DE4  |   Dirk Gently's Holistic Detective Agency




Re: Out of space and incorrect size reported

2018-03-21 Thread Hugo Mills
On Wed, Mar 21, 2018 at 09:53:39PM +, Shane Walton wrote:
> > uname -a
> Linux rockstor 4.4.5-1.el7.elrepo.x86_64 #1 SMP Thu Mar 10 11:45:51 EST 2016 
> x86_64 x86_64 x86_64 GNU/Linux
> 
> btrfs --version
> btrfs-progs v4.4.1
> 
> > btrfs fi df /mnt2/pool_homes
> Data, RAID1: total=240.00GiB, used=239.78GiB
> System, RAID1: total=8.00MiB, used=64.00KiB
> Metadata, RAID1: total=8.00GiB, used=5.90GiB
> GlobalReserve, single: total=512.00MiB, used=59.31MiB
> 
> > btrfs filesystem show /mnt2/pool_homes
> Label: 'pool_homes'  uuid: 0987930f-8c9c-49cc-985e-de6383863070
>   Total devices 2 FS bytes used 245.75GiB
>   devid1 size 465.76GiB used 248.01GiB path /dev/sda
>   devid2 size 465.76GiB used 248.01GiB path /dev/sdb
> 
> Why is the line above "Data, RAID1: total=240.00GiB, used=239.78GiB" almost
> full and limited to 240 GiB when I have 2x 500 GB HDDs?  This is all
> created/implemented with the Rockstor platform, and it says the "share"
> should be 400 GB.
> 
> What can I do to make this larger or closer to the full size of 465 GiB 
> (minus the System and Metadata overhead)?

   Most likely, you need to upgrade your kernel to get past the known
bug (fixed in about 4.6 or so, if I recall correctly), and then mount
with -o clear_cache to force the free space cache to be rebuilt.
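
   Something like this, as a sketch (clear_cache is a one-time option;
the cache is rebuilt in the background after the mount):

# umount /mnt2/pool_homes
# mount -o clear_cache /dev/sda /mnt2/pool_homes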

   Hugo.

-- 
Hugo Mills | Q: What goes, "Pieces of seven! Pieces of seven!"?
hugo@... carfax.org.uk | A: A parroty error.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs-progs: mkfs: add uuid and otime to ROOT_ITEM of FS_TREE

2018-03-19 Thread Hugo Mills
On Mon, Mar 19, 2018 at 02:02:23PM +0100, David Sterba wrote:
> On Mon, Mar 19, 2018 at 08:20:10AM +0000, Hugo Mills wrote:
> > On Mon, Mar 19, 2018 at 05:16:42PM +0900, Misono, Tomohiro wrote:
> > > Currently, the top-level subvolume lacks a UUID. As a result, both
> > > non-snapshot subvolumes and snapshots of the top-level subvolume have
> > > no Parent UUID and cannot be distinguished. Therefore "fi show" of
> > > the top level lists all the subvolumes which lack a UUID in the
> > > "Snapshot(s)" field.  Also, it lacks the otime information.
> > > 
> > > Fix this by adding the UUID and otime at mkfs time.  As a
> > > consequence, snapshots of the top-level subvolume now have a Parent UUID,
> > > and the UUID tree will create an entry for the top-level subvolume at
> > > mount time. This should not cause problems for current kernels, but user
> > > programs which rely on the empty Parent UUID may be affected by this change.
> > 
> >Is there any way of adding a UUID to the top level subvol on an
> > existing filesystem? It would be helpful not to have to rebuild every
> > filesystem in the world to fix this.
> 
> We can do that with a special-purpose tool. The easiest way is to set the
> uuid on an unmounted filesystem, but as this is a one-time action I hope
> this is acceptable. Added to todo, thanks for the suggestion.

   Sounds good to me.

   Hugo.

-- 
Hugo Mills | Talking about music is like dancing about
hugo@... carfax.org.uk | architecture
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Frank Zappa




Re: [PATCH] btrfs-progs: mkfs: add uuid and otime to ROOT_ITEM of FS_TREE

2018-03-19 Thread Hugo Mills
On Mon, Mar 19, 2018 at 05:16:42PM +0900, Misono, Tomohiro wrote:
> Currently, the top-level subvolume lacks a UUID. As a result, both
> non-snapshot subvolumes and snapshots of the top-level subvolume have
> no Parent UUID and cannot be distinguished. Therefore "fi show" of
> the top level lists all the subvolumes which lack a UUID in the
> "Snapshot(s)" field.  Also, it lacks the otime information.
> 
> Fix this by adding the UUID and otime at mkfs time.  As a
> consequence, snapshots of the top-level subvolume now have a Parent UUID,
> and the UUID tree will create an entry for the top-level subvolume at
> mount time. This should not cause problems for current kernels, but user
> programs which rely on the empty Parent UUID may be affected by this change.

   Is there any way of adding a UUID to the top level subvol on an
existing filesystem? It would be helpful not to have to rebuild every
filesystem in the world to fix this.

   Hugo.

> Signed-off-by: Tomohiro Misono <misono.tomoh...@jp.fujitsu.com>
> ---
> This is also needed so that "sub list -s" works properly for a
> non-privileged user[1] even if there are snapshots of the top-level subvolume.
> 
> Currently the check of whether a subvolume is a snapshot is done by looking
> at the key offset of the subvolume's ROOT_ITEM (non-zero for a snapshot),
> using the tree search ioctl.
> However, the non-privileged version of "sub list" won't use the tree search
> ioctl and just looks at whether the parent uuid is null or not. Therefore
> there is no way to recognize snapshots of the top-level subvolume.
> 
> [1] https://marc.info/?l=linux-btrfs&m=152144463907830&w=2
> 
>  mkfs/common.c | 14 ++
>  mkfs/main.c   |  3 +++
>  2 files changed, 17 insertions(+)
> 
> diff --git a/mkfs/common.c b/mkfs/common.c
> index 16916ca2..6924d9b7 100644
> --- a/mkfs/common.c
> +++ b/mkfs/common.c
> @@ -44,6 +44,7 @@ static int btrfs_create_tree_root(int fd, struct btrfs_mkfs_config *cfg,
>   u32 itemoff;
>   int ret = 0;
>   int blk;
> + u8 uuid[BTRFS_UUID_SIZE];
>  
>   memset(buf->data + sizeof(struct btrfs_header), 0,
>   cfg->nodesize - sizeof(struct btrfs_header));
> @@ -77,6 +78,19 @@ static int btrfs_create_tree_root(int fd, struct btrfs_mkfs_config *cfg,
>   btrfs_set_item_offset(buf, btrfs_item_nr(nritems), itemoff);
>   btrfs_set_item_size(buf, btrfs_item_nr(nritems),
>   sizeof(root_item));
> + if (blk == MKFS_FS_TREE) {
> + time_t now = time(NULL);
> +
> + uuid_generate(uuid);
> + memcpy(root_item.uuid, uuid, BTRFS_UUID_SIZE);
> > + btrfs_set_stack_timespec_sec(&root_item.otime, now);
> > + btrfs_set_stack_timespec_sec(&root_item.ctime, now);
> + } else {
> + memset(uuid, 0, BTRFS_UUID_SIZE);
> + memcpy(root_item.uuid, uuid, BTRFS_UUID_SIZE);
> > + btrfs_set_stack_timespec_sec(&root_item.otime, 0);
> > + btrfs_set_stack_timespec_sec(&root_item.ctime, 0);
> + }
> >   write_extent_buffer(buf, &root_item,
>   btrfs_item_ptr_offset(buf, nritems),
>   sizeof(root_item));
> diff --git a/mkfs/main.c b/mkfs/main.c
> index 5a717f70..52d92581 100644
> --- a/mkfs/main.c
> +++ b/mkfs/main.c
> @@ -315,6 +315,7 @@ static int create_tree(struct btrfs_trans_handle *trans,
>   struct btrfs_key location;
>   struct btrfs_root_item root_item;
>   struct extent_buffer *tmp;
> + u8 uuid[BTRFS_UUID_SIZE] = {0};
>   int ret;
>  
> >   ret = btrfs_copy_root(trans, root, root->node, &tmp, objectid);
> @@ -325,6 +326,8 @@ static int create_tree(struct btrfs_trans_handle *trans,
> >   btrfs_set_root_bytenr(&root_item, tmp->start);
> >   btrfs_set_root_level(&root_item, btrfs_header_level(tmp));
> >   btrfs_set_root_generation(&root_item, trans->transid);
> + /* clear uuid of source tree */
> + memcpy(root_item.uuid, uuid, BTRFS_UUID_SIZE);
>   free_extent_buffer(tmp);
>  
>   location.objectid = objectid;

-- 
Hugo Mills | This chap Anon is writing some perfectly lovely
hugo@... carfax.org.uk | stuff at the moment.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] Improve error stats message

2018-03-07 Thread Hugo Mills
On Wed, Mar 07, 2018 at 08:02:51PM +0100, Diego wrote:
> On Wednesday, 7 March 2018 at 19:24:53 (CET), Hugo Mills wrote:
> >On multi-device filesystems, the two are not necessarily the same.
> 
> Ouch. FWIW, I was moved to do this because I saw this conversation on
> IRC which made me think that people aren't understanding what the
> message means:
> 
>hi! I noticed bdev rd 13  as a kernel message
>what does it mean
>Well, that's not the whole message.
>Can you paste the whole line in here? (Just one line)
   ^^ nick2... that would be me. :)

>[3.404959] BTRFS info (device sda4): bdev /dev/sda4 errs: 
> wr 0, rd 13, flush 0, corrupt 0, gen 0
> 
> 
> Maybe something like this would be better:
> 
> BTRFS info (device sda4): disk /dev/sda4 errors: write 0, read 13, flush 0, 
> corrupt 0, generation 0

   I think the single most helpful modification here would be to
change "device" to "fs on", to show that it's only an indicator of the
filesystem ID, rather than actually the device on which the errors
occurred. The others I'm not really bothered about, personally.

   Hugo.

> ---
>  fs/btrfs/volumes.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 2ceb924ca0d6..cfa029468585 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7239,7 +7239,7 @@ static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev)
>   if (!dev->dev_stats_valid)
>   return;
>   btrfs_err_rl_in_rcu(dev->fs_info,
> - "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
> + "disk %s errors: write %u, read %u, flush %u, corrupt %u, 
> generation %u",
>  rcu_str_deref(dev->name),
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),

-- 
Hugo Mills | Q: What goes, "Pieces of seven! Pieces of seven!"?
hugo@... carfax.org.uk | A: A parroty error.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] Improve error stats message

2018-03-07 Thread Hugo Mills
On Wed, Mar 07, 2018 at 06:37:29PM +0100, Diego wrote:
> A typical notification of filesystem errors looks like this:
> 
> BTRFS error (device sda2): bdev /dev/sda2 errs: wr 0, rd 1, flush 0, corrupt 
> 0, gen 0
> 
> The device name is being printed twice.

   For good reason -- the first part ("device sda2") indicates the
filesystem, and is the arbitrarily-selected device used by the kernel
to represent the FS. The second part ("bdev /dev/sda2") indicates the
_actual_ device for which the errors are being reported.

   On multi-device filesystems, the two are not necessarily the same.
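
   For reference, the same per-device counters can be read at any time
with "btrfs device stats", which names the actual device on each line
(illustrative output):

# btrfs device stats /mnt
[/dev/sda2].write_io_errs   0
[/dev/sda2].read_io_errs    1
[/dev/sda2].flush_io_errs   0
[/dev/sda2].corruption_errs 0
[/dev/sda2].generation_errs 0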

   Hugo.

> Also, these abbreviations
> feel unnecessary. Make the message look like this instead:
> 
> BTRFS error (device sda2): errors: write 0, read 1, flush 0, corrupt 0, 
> generation 0
> 
> 
> Signed-off-by: Diego Calleja <dieg...@gmail.com>
> ---
>  fs/btrfs/volumes.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 2ceb924ca0d6..52fee5bb056f 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -7238,9 +7238,8 @@ static void btrfs_dev_stat_print_on_error(struct btrfs_device *dev)
>  {
>   if (!dev->dev_stats_valid)
>   return;
> - btrfs_err_rl_in_rcu(dev->fs_info,
> - "bdev %s errs: wr %u, rd %u, flush %u, corrupt %u, gen %u",
> -rcu_str_deref(dev->name),
> + btrfs_err_rl(dev->fs_info,
> + "errors: write %u, read %u, flush %u, corrupt %u, generation 
> %u",
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_WRITE_ERRS),
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_READ_ERRS),
>  btrfs_dev_stat_read(dev, BTRFS_DEV_STAT_FLUSH_ERRS),

-- 
Hugo Mills | Would you like an ocelot with that non-sequitur?
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs send/receive in reverse possible?

2018-02-16 Thread Hugo Mills
On Fri, Feb 16, 2018 at 10:43:54AM +0800, Sampson Fung wrote:
> I have snapshot A on Drive_A.
> I send snapshot A to an empty Drive_B.  Then keep Drive_A as backup.
> I use Drive_B as active.
> I create new snapshot B on Drive_B.
> 
> Can I use btrfs send/receive to send incremental differences back to Drive_A?
> What is the correct way of doing this?

   You can't do it with the existing tools -- it needs a change to the
send stream format. Here's a write-up of what's going on behind the
scenes, and what needs to change:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

-- 
Hugo Mills | I can't foretell the future, I just work there.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |The Doctor




Re: Metadata / Data on Heterogeneous Media

2018-02-15 Thread Hugo Mills
On Thu, Feb 15, 2018 at 12:15:49PM -0500, Ellis H. Wilson III wrote:
> In discussing the performance of various metadata operations over
> the past few days I've had this idea in the back of my head, and
> wanted to see if anybody had already thought about it before
> (likely, I would guess).
> 
> It appears based on this page:
> https://btrfs.wiki.kernel.org/index.php/Btrfs_design
> that data and metadata in BTRFS are fairly well isolated from one
> another, particularly in the case of large files.  This appears
> reinforced by a recent comment from Qu ("...btrfs strictly
> split metadata and data usage...").
> 
> Yet, while there are plenty of options to RAID0/1/10/etc across
> generally homogeneous media types, there doesn't appear to be any
> functionality (at least that I can find) to segment different BTRFS
> internals to different types of devices.  E.G., place metadata trees
> and extent block groups on SSD, and data trees and extent block
> groups on HDD(s).
> 
> Is this something that has already been considered (and if so,
> implemented, which would make me extremely happy)?  Is it feasible
> it is hasn't been approached yet?  I admit my internal knowledge of
> BTRFS is fleeting, though I'm trying to work on that daily at this
> time, so forgive me if this is unapproachable for obvious
> architectural reasons.

   Well, it's been discussed, and I wrote up a theoretical framework
which should cover a wide range of use-cases:

https://www.spinics.net/lists/linux-btrfs/msg33916.html

   I never got round to implementing it, though -- I ran into issues
over storing the properties/metadata needed to configure it.

   Hugo.

-- 
Hugo Mills | Dullest spy film ever: The Eastbourne Ultimatum
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   The Thick of It




Re: [PATCH 00/26] btrfs-progs: introduce libbtrfsutil, "btrfs-progs as a library"

2018-01-26 Thread Hugo Mills
|  321 ++
>  libbtrfsutil/python/qgroup.c |  141 +++
>  libbtrfsutil/python/setup.py |  103 ++
>  libbtrfsutil/python/subvolume.c  |  665 +
>  libbtrfsutil/python/tests/__init__.py|   66 ++
>  libbtrfsutil/python/tests/test_filesystem.py |   73 ++
>  libbtrfsutil/python/tests/test_qgroup.py |   57 ++
>  libbtrfsutil/python/tests/test_subvolume.py  |  383 +++
>  libbtrfsutil/qgroup.c|   86 ++
>  libbtrfsutil/subvolume.c | 1383 ++
>  messages.h   |   14 +
>  props.c  |   69 +-
>  qgroup.c |  106 --
>  qgroup.h |4 -
>  send-utils.c |   25 +-
>  utils.c  |  152 +--
>  utils.h  |6 -
>  41 files changed, 6188 insertions(+), 1754 deletions(-)
>  create mode 100644 libbtrfsutil/COPYING
>  create mode 100644 libbtrfsutil/COPYING.LESSER
>  create mode 100644 libbtrfsutil/README.md
>  create mode 100644 libbtrfsutil/btrfsutil.h
>  create mode 100644 libbtrfsutil/errors.c
>  create mode 100644 libbtrfsutil/filesystem.c
>  create mode 100644 libbtrfsutil/internal.h
>  create mode 100644 libbtrfsutil/python/.gitignore
>  create mode 100644 libbtrfsutil/python/btrfsutilpy.h
>  create mode 100644 libbtrfsutil/python/error.c
>  create mode 100644 libbtrfsutil/python/filesystem.c
>  create mode 100644 libbtrfsutil/python/module.c
>  create mode 100644 libbtrfsutil/python/qgroup.c
>  create mode 100755 libbtrfsutil/python/setup.py
>  create mode 100644 libbtrfsutil/python/subvolume.c
>  create mode 100644 libbtrfsutil/python/tests/__init__.py
>  create mode 100644 libbtrfsutil/python/tests/test_filesystem.py
>  create mode 100644 libbtrfsutil/python/tests/test_qgroup.py
>  create mode 100644 libbtrfsutil/python/tests/test_subvolume.py
>  create mode 100644 libbtrfsutil/qgroup.c
>  create mode 100644 libbtrfsutil/subvolume.c
> 

-- 
Hugo Mills | And what rough beast, its hour come round at last /
hugo@... carfax.org.uk | slouches towards Bethlehem, to be born?
http://carfax.org.uk/  |
PGP: E2AB1DE4  | W.B. Yeats, The Second Coming




Re: bad key ordering - repairable?

2018-01-22 Thread Hugo Mills
g all current system
> settings, would probably take some time for me to do.
> If it is currently not repairable, it would be nice if this kind of
> corruption could be repaired in the future, even if losing a few
> files. Or if the corruptions could be avoided in the first place.

   Given that the current tools crash, the answer's a definite
no. However, if you can get a developer interested, they may be able
to write a fix for it, given an image of the FS (using btrfs-image).

[snip]
> I have never noticed any corruptions on the NTFS and Ext4 file systems
> on the laptop, only on the Btrfs file systems.

   You've never _noticed_ them. :)

   Hugo.

-- 
Hugo Mills | ... one ping(1) to rule them all, and in the
hugo@... carfax.org.uk | darkness bind(2) them.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Illiad




Re: Fwd: Fwd: Question regarding to Btrfs patchwork /2831525

2018-01-14 Thread Hugo Mills
> >>>>>>> convert inode struct to btrfs_inode struct (use btrfsInode =
> >>>>>>> BTRFS_I(inode)), then from btrfs_inode struct i go to root field, and
> >>>>>>> from root i take anon_dev or anon_super.s_dev.
> >>>>>>> struct btrfs_inode *btrfsInode;
> >>>>>>> btrfsInode = BTRFS_I(inode);
> >>>>>>>btrfsInode->root->anon_super.s_devor
> >>>>>>>btrfsInode->root->anon_dev- depend on kernel.
> >>>>>>
> >>>>>> The most directly method would be:
> >>>>>>
> >>>>>> btrfs_inode->root->fs_info->fsid.
> >>>>>> (For newer kernel, as I'm not familiar with older kernels)
> >>>>>>
> >>>>>> Or from superblock:
> >>>>>> btrfs_inode->root->fs_info->super_copy->fsid.
> >>>>>> (The most reliable one, no matter which kernel version you're using, as
> >>>>>> long as the super block format didn't change)
> >>>>>>
> >>>>>> For device id, it's not that commonly used unless you're dealing with
> >>>>>> chunk mapping, so I'm assuming you're referring to fsid.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Qu
> >>>>>>
> >>>>>>>
> >>>>>>> In kernel 3.12.28-4-default in order to get the fsid, i need to go
> >>>>>>> to the inode -> superblock -> device id (inode->i_sb->s_dev)
> >>>>>>>
> >>>>>>> Why is this ? and is there a proper/an official way to get it ?

-- 
Hugo Mills | Gentlemen! You can't fight here! This is the War
hugo@... carfax.org.uk | Room!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Dr Strangelove




Re: Recommendations for balancing as part of regular maintenance?

2018-01-08 Thread Hugo Mills
ch
chunks, so you may end up moving N GiB of data (whereas usage=N could
move much less actual data).

   Personally, I recommend using limit=N, where N is something like
(Allocated - Used)*3/4 GiB.
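
   A worked example with illustrative numbers (filesystem assumed to
be mounted at /mnt): if the FS shows 60GiB allocated with 40GiB used,
then N = (60 - 40) * 3/4 = 15, giving:

# btrfs balance start -dlimit=15 /mnt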

   Note the caveat below, which is that using "ssd" mount option on
earlier kernels could prevent the balance from doing a decent job.

> The other mystery is how the data allocation became so large.

   You have a non-rotational device. That means that it'd be mounted
automatically with the "ssd" mount option. Up to 4.13 (or 4.14, I
always forget), the behaviour of "ssd" leads to highly fragmented
allocation of extents, which in turn results in new data chunks being
allocated when there's theoretically loads of space available to use
(but which it may not be practical to use, due to the fragmented free
space).

   After 4.13 (or 4.14), the "ssd" mount option has been fixed, and it
no longer has the bad long-term effects that we've seen before, but it
won't deal with the existing fragmented free space without a data
balance.

   If you're running an older kernel, it's definitely recommended to
mount all filesystems with "nossd" to avoid these issues.
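
   (For instance -- assuming the filesystem is mounted at /mnt -- a

# mount -o remount,nossd /mnt

should switch the allocator behaviour on a live system, without
waiting for a reboot or an fstab edit.)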

   Hugo.

-- 
Hugo Mills | As long as you're getting different error messages,
hugo@... carfax.org.uk | you're making progress.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH 1/2] btrfs-progs: Fix progs_extra build dependencies

2017-12-23 Thread Hugo Mills
On Sat, Dec 23, 2017 at 09:52:37PM +0100, Hans van Kranenburg wrote:
> The Makefile does not have a dependency path that builds dependencies
> for tools listed in progs_extra.
> 
> E.g. doing make btrfs-show-super in a clean build environment results in:
> gcc: error: cmds-inspect-dump-super.o: No such file or directory
> Makefile:389: recipe for target 'btrfs-show-super' failed
> 
> Signed-off-by: Hans van Kranenburg <h...@knorrie.org>

   Hans and I worked this one out between us on IRC. Not sure if you
need this, but here it is:

Signed-off-by: Hugo Mills <h...@carfax.org.uk>

   Hugo.

> ---
>  Makefile | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/Makefile b/Makefile
> index 30a0ee22..390b138f 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -220,7 +220,7 @@ cmds_restore_cflags = -DBTRFSRESTORE_ZSTD=$(BTRFSRESTORE_ZSTD)
>  CHECKER_FLAGS += $(btrfs_convert_cflags)
>  
>  # collect values of the variables above
> -standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs)))),$($(dep)))
> +standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs) $(progs_extra)))),$($(dep)))
>  
>  SUBDIRS =
>  BUILDDIRS = $(patsubst %,build-%,$(SUBDIRS))

-- 
Hugo Mills | My code is never released, it escapes from the git
hugo@... carfax.org.uk | repo and kills a few beta testers on the way out.
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Diablo-D3




Re: broken btrfs filesystem

2017-12-12 Thread Hugo Mills
On Tue, Dec 12, 2017 at 04:18:09PM +, Neal Becker wrote:
> Is it possible to check while it is mounted?

   Certainly not while mounted read-write. While mounted read-only --
I'm not certain. Possibly.

   Hugo.

> On Tue, Dec 12, 2017 at 9:52 AM Hugo Mills <h...@carfax.org.uk> wrote:
> 
> > On Tue, Dec 12, 2017 at 09:02:56AM -0500, Neal Becker wrote:
> > > sudo ls -la ~/
> > > [sudo] password for nbecker:
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> > > ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> > > total 11652
> > > drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
> > > drwxr-xr-x. 1 root    root      48 Aug  2 19:32  ..
> > > [...]
> > > -rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -?? ? ?   ?  ??  .bash_history
> > > -rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
> > > [...]
> >
> >Could you show the result of btrfs check --readonly on this FS? The
> > rest, below, doesn't show up anything unusual to me.
> >
> >Hugo.
> >
> > > uname -a
> > > Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
> > > 2017 x86_64 x86_64 x86_64 GNU/Linux
> > >
> > >  btrfs --version
> > > btrfs-progs v4.11.1
> > >
> > > sudo btrfs fi show
> > > Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
> > > Total devices 1 FS bytes used 80.96GiB
> > > devid    1 size 230.00GiB used 230.00GiB path /dev/sda3
> > >
> > > sudo btrfs fi df /home
> > > Data, single: total=226.99GiB, used=78.89GiB
> > > System, single: total=4.00MiB, used=48.00KiB
> > > Metadata, single: total=3.01GiB, used=2.07GiB
> > > GlobalReserve, single: total=222.36MiB, used=0.00B
> > >
> > > dmesg.log is here:
> > > https://nbecker.fedorapeople.org/dmesg.txt
> > >
> > > mount | grep btrfs
> > > /dev/sda3 on / type btrfs
> > > (rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
> > > /dev/sda3 on /home type btrfs
> > > (rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)
> > >
> >

-- 
Hugo Mills | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4  |  Ford Prefect




Re: broken btrfs filesystem

2017-12-12 Thread Hugo Mills
On Tue, Dec 12, 2017 at 09:02:56AM -0500, Neal Becker wrote:
> sudo ls -la ~/
> [sudo] password for nbecker:
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> ls: cannot access '/home/nbecker/.bash_history': No such file or directory
> total 11652
> drwxr-xr-x. 1 nbecker nbecker 5826 Dec 12 08:48  .
> drwxr-xr-x. 1 root    root      48 Aug  2 19:32  ..
> [...]
> -rwxrwxr-x. 1 nbecker nbecker  207 Dec  3  2015  BACKUP.sh
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -?? ? ?   ?  ??  .bash_history
> -rw-r--r--. 1 nbecker nbecker   18 Oct  8  2014  .bash_logout
> [...]

   Could you show the result of btrfs check --readonly on this FS? The
rest, below, doesn't show up anything unusual to me.
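
   (That is, from a rescue environment or with the filesystem not
mounted read-write, something like: btrfs check --readonly /dev/sda3
-- the --readonly mode only reads, so it's safe to run.)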

   Hugo.

> uname -a
> Linux nbecker2 4.14.3-300.fc27.x86_64 #1 SMP Mon Dec 4 17:18:27 UTC
> 2017 x86_64 x86_64 x86_64 GNU/Linux
> 
>  btrfs --version
> btrfs-progs v4.11.1
> 
> sudo btrfs fi show
> Label: 'fedora'  uuid: 93c586fa-6d86-4148-a528-e61e644db0c8
> Total devices 1 FS bytes used 80.96GiB
> devid    1 size 230.00GiB used 230.00GiB path /dev/sda3
> 
> sudo btrfs fi df /home
> Data, single: total=226.99GiB, used=78.89GiB
> System, single: total=4.00MiB, used=48.00KiB
> Metadata, single: total=3.01GiB, used=2.07GiB
> GlobalReserve, single: total=222.36MiB, used=0.00B
> 
> dmesg.log is here:
> https://nbecker.fedorapeople.org/dmesg.txt
> 
> mount | grep btrfs
> /dev/sda3 on / type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=257,subvol=/root)
> /dev/sda3 on /home type btrfs
> (rw,relatime,seclabel,ssd,space_cache,subvolid=318,subvol=/home)
> 

-- 
Hugo Mills | Hey, Virtual Memory! Now I can have a *really big*
hugo@... carfax.org.uk | ramdisk!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Odd behaviour of replace -- unknown resulting state

2017-12-09 Thread Hugo Mills
On Sat, Dec 09, 2017 at 05:43:48PM +, Hugo Mills wrote:
>This is on 4.10, so there may have been fixes made to this since
> then. If so, apologies for the noise.
> 
>I had a filesystem on 6 devices with a badly failing drive in it
> (/dev/sdi). I replaced the drive with a new one:
> 
> # btrfs replace start /dev/sdi /dev/sdj /media/video

Sorry, that should, of course, read:

# btrfs replace start /dev/sdi2 /dev/sdj2 /media/video

   Hugo.

>Once it had finished(*), I resized the device from 6 TB to 8 TB:
> 
> # btrfs fi resize 2:max /media/video
> 
>I also removed another, smaller, device:
> 
> # btrfs dev del 7 /media/video
> 
>Following this, btrfs fi show was reporting the correct device
> size, but still the same device node in the filesystem:
> 
> Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
>    Total devices 5 FS bytes used 9.15TiB
>    devid    2 size 7.28TiB used 6.44TiB path /dev/sdi2
>    devid    3 size 3.63TiB used 3.46TiB path /dev/sde2
>    devid    4 size 3.63TiB used 3.45TiB path /dev/sdd2
>    devid    5 size 1.81TiB used 1.65TiB path /dev/sdh2
>    devid    6 size 3.63TiB used 3.43TiB path /dev/sdc2
> 
>Note that device 2 definitely isn't /dev/sdi2, because /dev/sdi2
> was on a 6 TB device, not an 8 TB device.
> 
>Finally, I physically removed the two deleted devices from the
> machine. The second device came out fine, but the first (/dev/sdi) has
> now resulted in this from btrfs fi show:
> 
> Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
>    Total devices 5 FS bytes used 9.15TiB
>    devid    3 size 3.63TiB used 3.46TiB path /dev/sde2
>    devid    4 size 3.63TiB used 3.45TiB path /dev/sdd2
>    devid    5 size 1.81TiB used 1.65TiB path /dev/sdh2
>    devid    6 size 3.63TiB used 3.43TiB path /dev/sdc2
>    *** Some devices missing
> 
>So, what's the *actual* current state of this filesystem? It's not
> throwing write errors in the kernel logs from having a missing device,
> so it seems like it's probably OK. However, the FS's idea of which
> devices it's got seems to be confused.
> 
>I suspect that if I reboot, it'll all be fine, but I'd be happier
> if it hadn't got into this state in the first place.
> 
>Is this bug fixed in later versions of the kernel? Can anyone think
> of any issues I might have if I leave it in this state for a while?
> Likewise, any issues I might have from a reboot? (Probably into 4.14)
> 
>    Hugo.
> 
> (*) as an aside, it was reporting over 300% complete when it finally
> completed. Not sure if that's been fixed since 4.10, either.
>  

-- 
Hugo Mills | I'm on a 30-day diet. So far I've lost 18 days.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Odd behaviour of replace -- unknown resulting state

2017-12-09 Thread Hugo Mills
   This is on 4.10, so there may have been fixes made to this since
then. If so, apologies for the noise.

   I had a filesystem on 6 devices with a badly failing drive in it
(/dev/sdi). I replaced the drive with a new one:

# btrfs replace start /dev/sdi /dev/sdj /media/video

   Once it had finished(*), I resized the device from 6 TB to 8 TB:

# btrfs fi resize 2:max /media/video

   I also removed another, smaller, device:

# btrfs dev del 7 /media/video

   Following this, btrfs fi show was reporting the correct device
size, but still the same device node in the filesystem:

Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
   Total devices 5 FS bytes used 9.15TiB
   devid    2 size 7.28TiB used 6.44TiB path /dev/sdi2
   devid    3 size 3.63TiB used 3.46TiB path /dev/sde2
   devid    4 size 3.63TiB used 3.45TiB path /dev/sdd2
   devid    5 size 1.81TiB used 1.65TiB path /dev/sdh2
   devid    6 size 3.63TiB used 3.43TiB path /dev/sdc2

   Note that device 2 definitely isn't /dev/sdi2, because /dev/sdi2
was on a 6 TB device, not an 8 TB device.

   Finally, I physically removed the two deleted devices from the
machine. The second device came out fine, but the first (/dev/sdi) has
now resulted in this from btrfs fi show:

Label: 'amelia'  uuid: f7409f7d-bea2-4818-b937-9e45d754b5f1
   Total devices 5 FS bytes used 9.15TiB
   devid    3 size 3.63TiB used 3.46TiB path /dev/sde2
   devid    4 size 3.63TiB used 3.45TiB path /dev/sdd2
   devid    5 size 1.81TiB used 1.65TiB path /dev/sdh2
   devid    6 size 3.63TiB used 3.43TiB path /dev/sdc2
   *** Some devices missing

   So, what's the *actual* current state of this filesystem? It's not
throwing write errors in the kernel logs from having a missing device,
so it seems like it's probably OK. However, the FS's idea of which
devices it's got seems to be confused.

   I suspect that if I reboot, it'll all be fine, but I'd be happier
if it hadn't got into this state in the first place.

   Is this bug fixed in later versions of the kernel? Can anyone think
of any issues I might have if I leave it in this state for a while?
Likewise, any issues I might have from a reboot? (Probably into 4.14)

   Hugo.

(*) as an aside, it was reporting over 300% complete when it finally
completed. Not sure if that's been fixed since 4.10, either.
 
-- 
Hugo Mills | Biphocles: Plato's optician
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [RFC] Improve subvolume usability for a normal user

2017-12-07 Thread Hugo Mills
On Thu, Dec 07, 2017 at 07:21:46AM -0500, Austin S. Hemmelgarn wrote:
> On 2017-12-07 06:55, Duncan wrote:
> >Misono, Tomohiro posted on Thu, 07 Dec 2017 16:15:47 +0900 as excerpted:
> >
> >>On 2017/12/07 11:56, Duncan wrote:
> >>>Austin S. Hemmelgarn posted on Wed, 06 Dec 2017 07:39:56 -0500 as
> >>>excerpted:
> >>>
> >>>>Somewhat OT, but the only operation that's remotely 'instant' is
> >>>>creating an empty subvolume.  Snapshot creation has to walk the tree
> >>>>in the subvolume being snapshotted, which can take a long time (and as
> >>>>a result of it's implementation, also means BTRFS snapshots are _not_
> >>>>atomic). Subvolume deletion has to do a bunch of cleanup work in the
> >>>>background (though it may be fairly quick if it was a snapshot and the
> >>>>source subvolume hasn't changed much).
> >>>
> >>>Indeed, while btrfs in general has taken a strategy of making
> >>>/creating/ snapshots and subvolumes fast, snapshot deletion in
> >>>particular can take some time[1].
> >>>
> >>>And in that regard a question just occurred to me regarding this whole
> >>>very tough problem of a user being able to create but not delete
> >>>subvolumes and snapshots:
> >>>
> >>>Given that at least snapshot deletion (not so sure about non-snapshot
> >>>subvolume deletion, tho I strongly suspect it would depend on the
> >>>number of cross-subvolume reflinks) is already a task that can take
> >>>some time, why /not/ just bite the bullet and make the behavior much
> >>>more like the directory deletion, given that subvolumes already behave
> >>>much like directories.  Yes, for non-root, that /does/ mean tracing the
> >>>entire subtree and checking permissions, and yes, that's going to take
> >>>time and lower performance somewhat, but subvolume and in particular
> >>>snapshot deletion is already an operation that takes time, so this
> >>>wouldn't be unduly changing the situation, and it would eliminate the
> >>>entire class of security issues that come with either asymmetrically
> >>>restricting deletion (but not creation) to root on the one hand,
> >>
> >>>or possible data loss due to allowing a user to delete a subvolume they
> >>>couldn't delete were it an ordinary directory due to not owning stuff
> >>>further down the tree.
> >>
> >>But, this is also the very reason I'm for "sub del" instead of unlink().
> >>Since snapshot creation won't check the permissions of the containing
> >>files/dirs, it can copy a directory which cannot be deleted by the user.
> >>Therefore if we don't allow "sub del" for the user, he couldn't remove
> >>the snapshot.
> >
> >Maybe snapshot creation /should/ check all that, in order to allow
> >permissions to allow deletion.
> >
> >Tho that would unfortunately increase the creation time, and btrfs is
> >currently optimized for fast creation time.
> >
> >Hmm... What about creating a "temporary" snapshot if not root, then
> >walking the tree to check perms and deleting it without ever showing it
> >to userspace if the perms wouldn't let the user delete it.  That would
> >retain fast creation logic, tho it wouldn't show up until the perms walk
> >was completed.
> >
> I would argue that it makes more sense to keep snapshot creation as
> is, keep the subvolume deletion command as is (with some proper
> permissions checks of course), and just make unlink() work for
> subvolumes like it does for directories.

   Definitely this.

   Principle of least surprise.

   Hugo.

-- 
Hugo Mills | ... one ping(1) to rule them all, and in the
hugo@... carfax.org.uk | darkness bind(2) them.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Illiad




Re: exclusive subvolume space missing

2017-12-01 Thread Hugo Mills
On Fri, Dec 01, 2017 at 05:15:55PM +0100, Tomasz Pala wrote:
> Hello,
> 
> I got a problem with btrfs running out of space (not THE
> Internet-wide, well known issues with interpretation).
> 
> The problem is: something eats the space while not running anything that
> justifies this. There were 18 GB free space available, suddenly it
> dropped to 8 GB and then to 63 MB during one night. I recovered 1 GB
> with rebalance -dusage=5 -musage=5 (or sth about), but it is being eaten
> right now, just as I'm writing this e-mail:
> 
> /dev/sda2        64G   63G  452M 100% /
> /dev/sda2        64G   63G  365M 100% /
> /dev/sda2        64G   63G  316M 100% /
> /dev/sda2        64G   63G  287M 100% /
> /dev/sda2        64G   63G  268M 100% /
> /dev/sda2        64G   63G  239M 100% /
> /dev/sda2        64G   63G  230M 100% /
> /dev/sda2        64G   63G  182M 100% /
> /dev/sda2        64G   63G  163M 100% /
> /dev/sda2        64G   64G  153M 100% /
> /dev/sda2        64G   64G  143M 100% /
> /dev/sda2        64G   64G   96M 100% /
> /dev/sda2        64G   64G   88M 100% /
> /dev/sda2        64G   64G   57M 100% /
> /dev/sda2        64G   64G   25M 100% /
> 
> while my rough calculations show, that there should be at least 10 GB of
> free space. After enabling quotas it is somehow confirmed:
> 
> # btrfs qgroup sh --sort=excl / 
> qgroupid      rfer       excl
> --------      ----       ----
> 0/5       16.00KiB   16.00KiB
> [30 snapshots with about 100 MiB excl]
> 0/333     24.53GiB  305.79MiB
> 0/298     13.44GiB  312.74MiB
> 0/327     23.79GiB  427.13MiB
> 0/331     23.93GiB  930.51MiB
> 0/260     12.25GiB    3.22GiB
> 0/312     19.70GiB    4.56GiB
> 0/388     28.75GiB    7.15GiB
> 0/291     30.60GiB    9.01GiB <- this is the running one
> 
> This is about 30 GB total excl (didn't find a switch to sum this up). I
> know I can't just add 'excl' to get usage, so tried to pinpoint the
> exact files that occupy space in 0/388 exclusively (this is the last
> snapshots taken, all of the snapshots are created from the running fs).

   The thing I'd first go looking for here is some rogue process
writing lots of data. I've had something like this happen to me
before, a few times. First, I'd look for large files with "du -ms /* |
sort -n", then work down into the tree until you find them.

   If that doesn't show up anything unusually large, then lsof to look
for open but deleted files (orphans) which are still being written to
by some process.

   This is very likely _not_ to be a btrfs problem, but instead some
runaway process writing lots of crap very fast. Log files are probably
the most plausible location, but not the only one.
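
   Concretely, the two checks could look something like this (assuming
root and a stock lsof):

# du -ms /* 2>/dev/null | sort -n | tail
# lsof -nP +L1

The +L1 form lists open files with a link count below one -- i.e.
deleted, but still held open (and possibly still growing) somewhere.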

> Now, the weird part for me is exclusive data count:
> 
> # btrfs sub sh ./snapshot-171125
> [...]
> Subvolume ID:   388
> # btrfs fi du -s ./snapshot-171125 
>  Total   Exclusive  Set shared  Filename
>   21.50GiB    63.35MiB    20.77GiB  snapshot-171125
> 
> 
> How is that possible? This doesn't even remotely relate to 7.15 GiB
> from qgroup. The same amount differs in total: 28.75-21.50=7.25 GiB.
> And the same happens with other snapshots, much more exclusive data
> shown in qgroup than actually found in files. So if not files, where
> is that space wasted? Metadata?

   Personally, I'd trust qgroups' output about as far as I could spit
Belgium(*).

   Hugo.

(*) No offence indended to Belgium.

-- 
Hugo Mills | I used to live in hope, but I got evicted.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs-image hash collision option, super slow

2017-11-11 Thread Hugo Mills
On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote:
> OK this might be in the stupid questions category, but I'm not
> understanding the purpose of computing hash collisions with -ss. Or
> more correctly, why it's taking so much longer than -s.
> 
> It seems like what we'd want is every filename to have the same hash,
> but for the file to go through a PBKDF so the hashes we get aren't
> (easily) brute forced. So I totally understand that -ss should take
> much longer than -s, but this is at least two orders magnitude longer
> (so far). That's why I'm confused.
> 
> -s option on this file system took 5 minutes, start to finish.
> -ss option is at 8 hours and counting.
> 
> The other part I'm not groking is that some filenames fail with:
> 
> WARNING: cannot find a hash collision for 'Tool', generating garbage,
> it won't match indexes
> 
> So? That seems like an undesirable outcome. And if it were just being
> pushed through a PBKDF function, it wouldn't fail. Every
> file/directory "Tool" would get the same hash on *this* run of
> btrs-image. If I run it again, or someone else runs it, they'd get
> some other hash (same hashes for each instance of "Tool" on their
> filesystem).

   In the FS tree, you can go from the inode of the file to its name
(where the inode is in the index, and the name is stored in the
corresponding data item). Alternatively, you can go from the filename
to the inode. In the latter case, since the keys are a structured 17
byte object, you obviously can't fit the whole filename into the key,
so the filename is hashed (using, IIRC, CRC32), and it's the hash that
appears in the key of the index.

   When an image is made without the -s options, the whole metadata is
stored, including all the filenames in the data items. For some
people, that's a security risk, and they don't want their filenames
leaking out, so -s exists to put junk in the filename records.
However, it doesn't change the hashes in the index to correspond with
the modified filenames, because that would at minimum require the
whole tree to be rebuilt (because all the items would have different
hashes, and hence different ordering in the index). This is a bad
thing for debugging, because you're not getting the details of the
tree as it was in the broken filesystem. So, in this case, the image
is actually broken, because the filenames don't match the hashes.

   Most of the time, that's absolutely fine, because the thing being
debugged is somewhere else, and it doesn't matter that "ls" on the
restored FS won't work right.

   However, in some (possibly hypothetical) cases, it _does_ matter,
and you do need the hashes to match the filenames. This is where -ss
comes in. We can't generate random filenames and then take the hashes
of those, because of the undesirability of rewriting the whole FS tree
to reindex it with the changed hashes. So, what -ss tries to do is
stick with the original hashes and find arbitrary filenames which
match them. It's (I think) CRC32, so it shouldn't be too hard, but
it's still a non-trivial amount of work to reverse-engineer a
human-readable ASCII filename which hashes to a given value.
Particularly if, as was the case when Josef wrote it, a simple
brute-force algorithm was used.

   It could definitely be improved -- I believe there are some good
(but non-trivial) algorithms for finding preimages for CRC32 checksums
out there. It's just that btrfs-image doesn't use them. However, it's
not an option that's needed very often, so it's probably not worth
putting in the effort to fix it up. (I definitely remember Josef
commenting on IRC when he wrote -s and -ss that it could almost
certainly be done more efficiently, but he had bigger fish to fry at
the time, like fixing the broken FS he was working on)

   As to the thing where it's not finding a pre-image at all -- I'm
guessing here, but it's possible that this is a case where two of the
original filenames hashed to the same value. If that happens, one of
the hashes is incremented by a small integer in a predictable way
before storage. So it may be that the resulting value isn't mappable
to an ASCII pre-image, or that the search just gives up before finding
one.
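
   To make the brute-force search concrete, here's a toy sketch in
Python, with zlib's CRC32 as a stand-in for the hash btrfs actually
applies to names (so purely illustrative):

import itertools
import string
import zlib

def find_preimage(target, maxlen):
    # Exhaustively try lowercase ASCII names until one hashes to target.
    for n in range(1, maxlen + 1):
        for cand in itertools.product(string.ascii_lowercase, repeat=n):
            name = ''.join(cand).encode()
            if zlib.crc32(name) == target:
                return name
    return None  # nothing found -- cf. the "generating garbage" warning

# 'tool' lies inside the search space, so this run finishes quickly; an
# arbitrary 32-bit target needs a vastly larger space to guarantee a
# hit, which is why -ss runs so much longer than -s.
print(find_preimage(zlib.crc32(b'tool'), 4))

The candidate space grows exponentially with name length, so a naive
search like this one gets expensive very quickly.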

   Hugo.

-- 
Hugo Mills | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4  |Derek Lowe




Re: Problem with file system

2017-11-08 Thread Hugo Mills
On Wed, Nov 08, 2017 at 10:17:28AM -0700, Chris Murphy wrote:
> On Wed, Nov 8, 2017 at 5:13 AM, Austin S. Hemmelgarn
> <ahferro...@gmail.com> wrote:
> 
> >> It definitely does fix ups during normal operations. During reads, if
> >> there's a UNC or there's corruption detected, Btrfs gets the good
> >> copy, and does a (I think it's an overwrite, not COW) fixup. Fixups
> >> don't just happen with scrubbing. Even raid56 supports these kinds of
> >> passive fixups back to disk.
> >
> > I could have sworn it didn't rewrite the data on-disk during normal usage.
> > I mean, I know for certain that it will return the correct data to userspace
> > if at all possible, but I was under the impression it will just log the
> > error during normal operation.
> 
> No, everything except raid56 has had it since a long time, I can't
> even think how far back, maybe even before 3.0. Whereas raid56 got it
> in 4.12.

   Yes, I'm pretty sure it's been like that ever since I've been using
btrfs (somewhere around the early neolithic).

   Hugo.

-- 
Hugo Mills | Turning, pages turning in the widening bath,
hugo@... carfax.org.uk | The spine cannot bear the humidity.
http://carfax.org.uk/  | Books fall apart; the binding cannot hold.
PGP: E2AB1DE4  | Page 129 is loosed upon the world.   Zarf




Re: Seeking Help on Corruption Issues

2017-10-04 Thread Hugo Mills
On Tue, Oct 03, 2017 at 03:49:25PM -0700, Stephen Nesbitt wrote:
> 
> On 10/3/2017 2:11 PM, Hugo Mills wrote:
> >Hi, Stephen,
> >
> >On Tue, Oct 03, 2017 at 08:52:04PM +, Stephen Nesbitt wrote:
> >>Here it is. There are a couple of out-of-order entries beginning at 117. And
> >>yes I did uncover a bad stick of RAM:
> >>
> >>btrfs-progs v4.9.1
> >>leaf 2589782867968 items 134 free space 6753 generation 3351574 owner 2
> >>fs uuid 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> >>chunk uuid 19ce12f0-d271-46b8-a691-e0d26c1790c6
> >[snip]
> >>item 116 key (1623012749312 EXTENT_ITEM 45056) itemoff 10908 itemsize 53
> >>extent refs 1 gen 3346444 flags DATA
> >>extent data backref root 271 objectid 2478 offset 0 count 1
> >>item 117 key (1621939052544 EXTENT_ITEM 8192) itemoff 10855 itemsize 53
> >>extent refs 1 gen 3346495 flags DATA
> >>extent data backref root 271 objectid 21751764 offset 6733824 count 1
> >>item 118 key (1623012450304 EXTENT_ITEM 8192) itemoff 10802 itemsize 53
> >>extent refs 1 gen 3351513 flags DATA
> >>extent data backref root 271 objectid 5724364 offset 680640512 count 1
> >>item 119 key (1623012802560 EXTENT_ITEM 12288) itemoff 10749 itemsize 53
> >>extent refs 1 gen 3346376 flags DATA
> >>extent data backref root 271 objectid 21751764 offset 6701056 count 1
> >>>>hex(1623012749312)
> >'0x179e3193000'
> >>>>hex(1621939052544)
> >'0x179a319e000'
> >>>>hex(1623012450304)
> >'0x179e314a000'
> >>>>hex(1623012802560)
> >'0x179e31a'
> >
> >That's "e" -> "a" in the fourth hex digit, which is a single-bit
> >flip, and should be fixable by btrfs check (I think). However, even
> >fixing that, it's not ordered, because 118 is then before 117, which
> >could be another bitflip ("9" -> "4" in the 7th digit), but two bad
> >bits that close to each other seems unlikely to me.
> >
> >Hugo.
> 
> Hope this isn't a duplicate reply - I might have fat-fingered something.
> 
> The underlying file is disposable/replaceable. Any way to zero
> out/zap the bad BTRFS entry?

   Not really. Even trying to delete the related file(s), it's going
to fall over when reading the metadata in, in the first place. (The key
order check is a metadata invariant, like the csum checks and transid
checks).

   At best, you'd have to get btrfs check to fix it. It should be able
to manage a single-bit error, but you've got two single-bit errors in
close proximity, and I'm not sure it'll be able to deal with it. Might
be worth trying it. The FS _might_ blow up as a result of an attempted
fix, but you say it's replacable, so that's kind of OK. The worst I'd
_expect_ to happen with btrfs check --repair is that it just won't be
able to deal with it and you're left where you started.

   Go for it.

   Hugo.

-- 
Hugo Mills | You shouldn't anthropomorphise computers. They
hugo@... carfax.org.uk | really don't like that.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Seeking Help on Corruption Issues

2017-10-03 Thread Hugo Mills
   Hi, Stephen,

On Tue, Oct 03, 2017 at 08:52:04PM +, Stephen Nesbitt wrote:
> Here it is. There are a couple of out-of-order entries beginning at 117. And
> yes I did uncover a bad stick of RAM:
> 
> btrfs-progs v4.9.1
> leaf 2589782867968 items 134 free space 6753 generation 3351574 owner 2
> fs uuid 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> chunk uuid 19ce12f0-d271-46b8-a691-e0d26c1790c6
[snip]
> item 116 key (1623012749312 EXTENT_ITEM 45056) itemoff 10908 itemsize 53
> extent refs 1 gen 3346444 flags DATA
> extent data backref root 271 objectid 2478 offset 0 count 1
> item 117 key (1621939052544 EXTENT_ITEM 8192) itemoff 10855 itemsize 53
> extent refs 1 gen 3346495 flags DATA
> extent data backref root 271 objectid 21751764 offset 6733824 count 1
> item 118 key (1623012450304 EXTENT_ITEM 8192) itemoff 10802 itemsize 53
> extent refs 1 gen 3351513 flags DATA
> extent data backref root 271 objectid 5724364 offset 680640512 count 1
> item 119 key (1623012802560 EXTENT_ITEM 12288) itemoff 10749 itemsize 53
> extent refs 1 gen 3346376 flags DATA
> extent data backref root 271 objectid 21751764 offset 6701056 count 1

>>> hex(1623012749312)
'0x179e3193000'
>>> hex(1621939052544)
'0x179a319e000'
>>> hex(1623012450304)
'0x179e314a000'
>>> hex(1623012802560)
'0x179e31a'

   That's "e" -> "a" in the fourth hex digit, which is a single-bit
flip, and should be fixable by btrfs check (I think). However, even
fixing that, it's not ordered, because 118 is then before 117, which
could be another bitflip ("9" -> "4" in the 7th digit), but two bad
bits that close to each other seems unlikely to me.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Silly Point Break
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Seeking Help on Corruption Issues

2017-10-03 Thread Hugo Mills
On Tue, Oct 03, 2017 at 01:06:50PM -0700, Stephen Nesbitt wrote:
> All:
> 
> I came back to my computer yesterday to find my filesystem in read
> only mode. Running a btrfs scrub start -dB aborts as follows:
> 
> btrfs scrub start -dB /mnt
> ERROR: scrubbing /mnt failed for device id 4: ret=-1, errno=5
> (Input/output error)
> ERROR: scrubbing /mnt failed for device id 5: ret=-1, errno=5
> (Input/output error)
> scrub device /dev/sdb (id 4) canceled
>     scrub started at Mon Oct  2 21:51:46 2017 and was aborted after
> 00:09:02
>     total bytes scrubbed: 75.58GiB with 1 errors
>     error details: csum=1
>     corrected errors: 0, uncorrectable errors: 1, unverified errors: 0
> scrub device /dev/sdc (id 5) canceled
>     scrub started at Mon Oct  2 21:51:46 2017 and was aborted after
> 00:11:11
>     total bytes scrubbed: 50.75GiB with 0 errors
> 
> The resulting dmesg is:
> [  699.534066] BTRFS error (device sdc): bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 6, gen 0
> [  699.703045] BTRFS error (device sdc): unable to fixup (regular) error at logical 1609808347136 on dev /dev/sdb
> [  783.306525] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116

   This error usually means bad RAM. Can you show us the output of
"btrfs-debug-tree -b 2589782867968 /dev/sdc"?

   Hugo.

> [  789.776132] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116
> [  911.529842] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116
> [  918.365225] BTRFS critical (device sdc): corrupt leaf, bad key order: block=2589782867968, root=1, slot=116
> 
> Running btrfs check /dev/sdc results in:
> btrfs check /dev/sdc
> Checking filesystem on /dev/sdc
> UUID: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
> checking extents
> bad key ordering 116 117
> bad block 2589782867968
> ERROR: errors found in extent allocation tree or chunk allocation
> checking free space cache
> There is no free space entry for 1623012450304-1623012663296
> There is no free space entry for 1623012450304-1623225008128
> cache appears valid but isn't 1622151266304
> found 288815742976 bytes used err is -22
> total csum bytes: 0
> total tree bytes: 350781440
> total fs tree bytes: 0
> total extent tree bytes: 350027776
> btree space waste bytes: 115829777
> file data blocks allocated: 156499968
> 
> uname -a:
> Linux sysresccd 4.9.24-std500-amd64 #2 SMP Sat Apr 22 17:14:43 UTC
> 2017 x86_64 Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz GenuineIntel
> GNU/Linux
> 
> btrfs --version: btrfs-progs v4.9.1
> 
> btrfs fi show:
> Label: none  uuid: 24b768c3-2141-44bf-ae93-1c3833c8c8e3
>     Total devices 2 FS bytes used 475.08GiB
>     devid    4 size 931.51GiB used 612.06GiB path /dev/sdb
>     devid    5 size 931.51GiB used 613.09GiB path /dev/sdc
> 
> btrfs fi df /mnt:
> Data, RAID1: total=603.00GiB, used=468.03GiB
> System, RAID1: total=64.00MiB, used=112.00KiB
> System, single: total=32.00MiB, used=0.00B
> Metadata, RAID1: total=9.00GiB, used=7.04GiB
> Metadata, single: total=1.00GiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> What is the recommended procedure at this point? Run btrfs check
> --repair? I have backups so losing a file or two isn't critical, but
> I really don't want to go through the effort of a bare metal
> reinstall.
> 
> In the process of researching this I did uncover a bad DIMM. Am I
> correct that the problems I'm seeing are likely linked to the
> resulting memory errors.
> 
> Thx in advance,
> 
> -steve
> 

-- 
Hugo Mills | Quidquid latine dictum sit, altum videtur
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Lost about 3TB

2017-10-03 Thread Hugo Mills
On Tue, Oct 03, 2017 at 05:45:54PM +0200, fred.lar...@free.fr wrote:
> Hi,
> 
> 
> >   What does "btrfs sub list -a /RAID01/" say?
> Nothing (no lines displayed)
> 
> >   Also "grep /RAID01/ /proc/self/mountinfo"?
> Nothing (no lines displayed)
> 
> 
> Also server has been rebooted many times and no process has left "deleted 
> open files" on the volume (lsof...).

   OK. The second command (the grep) was incorrect -- I should have
omitted the slashes. However, it doesn't matter too much, because the
first command indicates that you don't have any subvolumes or
snapshots anyway.

   This means that you're probably looking at the kind of issue
Timofey mentioned in his mail, where writes into the middle of an
existing extent don't free up the overwritten data. This is most
likely to happen on database or VM files, but could happen on others,
depending on the application and how it uses files.

   Since you don't seem to have any snapshots, I _think_ you can deal
with the issue most easily by defragmenting the affected files. It's
worth just getting a second opinion on this one before you try it for
the whole FS. I'm not 100% sure about what defrag will do in this
case, and there are some people round here who have investigated the
behaviour of partially-overwritten extents in more detail than I have.
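
   (If you do go ahead, a cautious way to start would be a single
file, e.g.

# btrfs filesystem defragment -v /RAID01/path/to/one-backup-file

and then check the effect with btrfs fi df before doing the rest. Note
that on a filesystem *with* snapshots, defrag breaks reflinks and can
balloon space usage -- not a concern here, since you have none.)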

   Hugo.

> Fred.
> 
> 
> - Mail original -
> De: "Hugo Mills - h...@carfax.org.uk" 
> <btrfs.fredo.d1c3ddb588.hugo#carfax.org...@ob.0sg.net>
> À: "btrfs fredo" <btrfs.fr...@xoxy.net>
> Cc: linux-btrfs@vger.kernel.org
> Envoyé: Mardi 3 Octobre 2017 12:54:05
> Objet: Re: Lost about 3TB
> 
> On Tue, Oct 03, 2017 at 12:44:29PM +0200, btrfs.fr...@xoxy.net wrote:
> > Hi,
> > 
> > I can't figure out where 3TB on a 36 TB BTRFS volume (on LVM) are gone !
> > 
> > I know BTRFS can be tricky when speaking about space usage when using many 
> > physical drives in a RAID setup, but my conf is a very simple BTRFS volume 
> > without RAID(single Data type) using the whole disk (perhaps did I do 
> > something wrong with the LVM setup ?).
> > 
> > My BTRFS volume is mounted on /RAID01/.
> > 
> > There's only one folder in /RAID01/ shared with Samba; Windows also sees a 
> > total of 28 TB used.
> > 
> > It only contains 443 files (big backup files created by Veeam), most of the 
> > file sizes are greater than 1GB and can be up to 5TB.
> > 
> > ##> du -hs /RAID01/
> > 28T /RAID01/
> > 
> > If I sum up the result of : ##> find . -printf '%s\n'
> > I also find 28TB.
> > 
> > I extracted btrfs binary from rpm version v4.9.1 and used ##> btrfs fi 
> > du
> > on each file and the result is 28TB.
> 
>The conclusion here is that there are things that aren't being
> found by these processes. This is usually in the form of dot-files
> (but I think you've covered that case in what you did above) or
> snapshots/subvolumes outside the subvol you've mounted.
> 
>What does "btrfs sub list -a /RAID01/" say?
>Also "grep /RAID01/ /proc/self/mountinfo"?
> 
>There are other possibilities for missing space, but let's cover
> the obvious ones first.
> 
>Hugo.
> 
> > OS : CentOS Linux release 7.3.1611 (Core)
> > btrfs-progs v4.4.1
> > 
> > 
> > ##> ssm list
> > 
> > -
> > Device        Free      Used      Total     Pool                 Mount point
> > -
> > /dev/sda   36.39 TB   PARTITIONED
> > /dev/sda1 200.00 MB   /boot/efi
> > /dev/sda2   1.00 GB   /boot
> > /dev/sda3  0.00 KB  36.32 TB   36.32 TB  lvm_pool
> > /dev/sda4  0.00 KB  54.00 GB   54.00 GB  cl_xxx-xxxamrepo-01
> > -
> > ---
> > Pool                    Type   Devices  Free      Used      Total
> > ---
> > cl_xxx-xxxamrepo-01     lvm    1        0.00 KB   54.00 GB  54.00 GB
> > lvm_pool                lvm    1        0.00 KB   36.32 TB  36.32 TB
> > btrfs_lvm_pool-lvol001  btrfs  14.84 TB  36.32 TB  36.32 TB
> > ---
> > --

Re: Lost about 3TB

2017-10-03 Thread Hugo Mills
> checking root refs
> found 34600611349019 bytes used err is 0
> total csum bytes: 33752513152
> total tree bytes: 38037848064
> total fs tree bytes: 583942144
> total extent tree bytes: 653754368
> btree space waste bytes: 2197658704
> file data blocks allocated: 183716661284864 ?? what's this ??
>  referenced 30095956975616 = 27.3 TB !!
> 
> 
> 
> Tried the "new usage" display but the problem is the same : 31 TB used but 
> total file size is 28TB
> 
> Overall:
> Device size:  36.32TiB
> Device allocated: 31.65TiB
> Device unallocated:4.67TiB
> Device missing:  0.00B
> Used: 31.52TiB
> Free (estimated):  4.80TiB  (min: 2.46TiB)
> Data ratio:   1.00
> Metadata ratio:   2.00
> Global reserve:  512.00MiB  (used: 0.00B)
> 
> Data,single: Size:31.58TiB, Used:31.45TiB
>/dev/mapper/lvm_pool-lvol001   31.58TiB
> 
> Metadata,DUP: Size:38.00GiB, Used:35.37GiB
>/dev/mapper/lvm_pool-lvol001   76.00GiB
> 
> System,DUP: Size:8.00MiB, Used:3.69MiB
>/dev/mapper/lvm_pool-lvol001   16.00MiB
> 
> Unallocated:
>    /dev/mapper/lvm_pool-lvol001    4.67TiB
> The only btrfs tool speaking about 28TB is btrfs check (but I'm not sure if 
> it's bytes because it speaks about "referenced blocks" and I don't understand 
> the meaning of "file data blocks allocated")
> Code:
> file data blocks allocated: 183716661284864 ?? what's this ??
>  referenced 30095956975616 = 27.3 TB !!
> 
> 
> 
> I also used the verbose option of https://github.com/knorrie/btrfs-heatmap/ 
> to sum up the total size of all DATA EXTENT and found 32TB.
> 
> I did scrub, balance up to -dusage=90 (and also dusage=0) and ended up with 
> 32TB used.
> No snapshots nor subvolumes nor TB hidden under the mount point after 
> unmounting the BTRFS volume  
> 
> 
> What did I do wrong, or what am I missing?
> 
> Thanks in advance.
> Frederic Larive.
> 

-- 
Hugo Mills | Beware geeks bearing GIFs
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH v2] btrfs-progs: subvol: change subvol set-default to also accept subvol path

2017-10-02 Thread Hugo Mills
 struct root_info ri;
> > +   char *fullpath;
> >
> > -   objectid = arg_strtou64(subvolid);
> > +   path = argv[optind];
> > +   if (path[0] != '/') {
> > +   error("only absolute path is allowed");
> > +   return 1;
> > +   }
> > +
> > +   fullpath = realpath(path, NULL);
> > +   if (!fullpath) {
> > +   error("cannot find real path for '%s': %s",
> > +   path, strerror(errno));
> > +   return 1;
> > +   }
> > +
> > +   ret = get_subvol_info(fullpath, &ri);
> > +   free(fullpath);
> > +
> > +   if (ret)
> > +   return 1;
> > +
> > +   objectid = ri.root_id;
> > +   } else {
> > +   /* subvol id and path to the filesystem are specified */
> > +   subvolid = argv[optind];
> > +   path = argv[optind + 1];
> > +   objectid = arg_strtou64(subvolid);
> > +   }
> >
> > fd = btrfs_open_dir(path, &dirstream, 1);
> > if (fd < 0)

-- 
Hugo Mills | Great oxymorons of the world, no. 4:
hugo@... carfax.org.uk | Future Perfect
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds

2017-09-25 Thread Hugo Mills
On Mon, Sep 25, 2017 at 04:04:03PM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年09月25日 15:52, Hugo Mills wrote:
> >On Mon, Sep 25, 2017 at 03:46:15PM +0800, Qu Wenruo wrote:
> >>
> >>
> >>On 2017年09月25日 15:42, Marat Khalili wrote:
> >>>On 25/09/17 10:30, Nikolay Borisov wrote:
> >>>>On 19.09.2017 10:41, Misono, Tomohiro wrote:
> >>>>>"btrfs subvolume create/delete" outputs the message of "Create/Delete
> >>>>>subvolume ..." even when an operation fails.
> >>>>>Since it is confusing, let's output the message only when an
> >>>>>operation succeeds.
> >>>>Please change the verb to past tense, more strongly signaling success -
> >>>>i.e. "Created subvolume"
> >>>What about recalling some UNIX standards and returning to NOT
> >>>outputting any message when operation succeeds? My scripts are
> >>>full of grep -v calls after each btrfs command, and this sucks
> >>>(and I don't think I'm alone in this situation).
> >>
> >>Isn't the correct way to catch the return value instead of grepping
> >>the output?
> >
> >It is, but if, for example, you're using the command in a cron
> >script which is expected to work, you don't want it producing output
> >because then you get a mail every time the script runs. So you have to
> >grep -v on the "success" output to make the successful script silent.
> 
> What about redirecting stdout to /dev/null and redirecting stderr to
> mail if return value is not 0?
> As for the expected-to-work case, the stdout doesn't have much meaning
> and the return value should be good enough to judge the result.
> 
> >
> >>If it's some command not returning value properly, would you please
> >>report it as a bug so we can fix it.
> >
> >It's not the return value that's problematic (although those used
> >to be a real mess). It's the fact that a successful run of the command
> >produces noise on stdout, which most commands don't.
> 
> Yes, a lot of tried-and-true tools don't output anything for a
> successful run, but also a lot of other tools do output something by
> default, especially for complex tools like LVM.

   btrfs sub create and btrfs sub delete, though, aren't complex.
They're about as complex as mkdir and rmdir, from a user point of
view. What's more, and like mkdir/rmdir, the effects of those commands
show up in the filesystem at the path given, so manual verification
could be as simple as "ls -d !$" or "ls !$/..". It's really, really
not necessary to have this command unconditionally print "yes, I
created a directory for you" to stdout.

   Yes, there's ways to deal with it in shell scripts, but wouldn't
life be so much better if you didn't have to? Like you don't have to
filter out success reports from mkdir.

> Maybe we can introduce a global --quiet option to silence some output.

   Or drop the spurious output unless it's asked for with --verbose.
Because then it makes it so much easier to compose tools together into
bigger and more complex things, which is, after all, one of the
fundamental things that UNIX does right.

   Hugo.

> Thanks,
> Qu
> >
> >Hugo.
> >>Thanks,
> >>Qu
> >>
> >>>If you change the message a lot of scripts will have to be
> >>>changed, at least make it worth it.
> >>>
> >>>  --
> >>>
> >>>With Best Regards,
> >>>Marat Khalili
> >>>
> >

-- 
Hugo Mills | Great films about cricket: The Fantastic Four
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds

2017-09-25 Thread Hugo Mills
On Mon, Sep 25, 2017 at 03:46:15PM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年09月25日 15:42, Marat Khalili wrote:
> >On 25/09/17 10:30, Nikolay Borisov wrote:
> >>On 19.09.2017 10:41, Misono, Tomohiro wrote:
> >>>"btrfs subvolume create/delete" outputs the message of "Create/Delete
> >>>subvolume ..." even when an operation fails.
> >>>Since it is confusing, let's output the message only when an
> >>>operation succeeds.
> >>Please change the verb to past tense, more strongly signaling success -
> >>i.e. "Created subvolume"
> >What about recalling some UNIX standards and returning to NOT
> >outputting any message when operation succeeds? My scripts are
> >full of grep -v calls after each btrfs command, and this sucks
> >(and I don't think I'm alone in this situation).
> 
> Isn't the correct way to catch the return value instead of grepping
> the output?

   It is, but if, for example, you're using the command in a cron
script which is expected to work, you don't want it producing output
because then you get a mail every time the script runs. So you have to
grep -v on the "success" output to make the successful script silent.
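
   (For the record, the kind of boilerplate this forces -- the success
line being the one the patch under discussion prints:

btrfs subvolume delete "$snap" | grep -v '^Delete subvolume'

which is exactly the noise-filtering that shouldn't be needed.)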

> If it's some command not returning value properly, would you please
> report it as a bug so we can fix it.

   It's not the return value that's problematic (although those used
to be a real mess). It's the fact that a successful run of the command
produces noise on stdout, which most commands don't.

   Hugo.
 
> Thanks,
> Qu
> 
> >If you change the message a lot of scripts will have to be
> >changed, at least make it worth it.
> >
> >  --
> >
> >With Best Regards,
> >Marat Khalili
> >

-- 
Hugo Mills | If you see something, say nothing and drink to
hugo@... carfax.org.uk | forget
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Welcome to Night Vale




Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds

2017-09-25 Thread Hugo Mills
On Mon, Sep 25, 2017 at 10:42:06AM +0300, Marat Khalili wrote:
> On 25/09/17 10:30, Nikolay Borisov wrote:
> >On 19.09.2017 10:41, Misono, Tomohiro wrote:
> >>"btrfs subvolume create/delete" outputs the message of "Create/Delete
> >>subvolume ..." even when an operation fails.
> >>Since it is confusing, let's outputs the message only when an operation 
> >>succeeds.
> >Please change the verb to past tense, more strongly signaling success -
> >i.e. "Created subvolume"
> What about recalling some UNIX standards and returning to NOT
> outputting any message when operation succeeds? My scripts are full
> of grep -v calls after each btrfs command, and this sucks (and I
> don't think I'm alone in this situation). If you change the message
> a lot of scripts will have to be changed, at least make it worth it.

   Seconded. Make sure the return code reflects the result, and drop
the printed message (or keep it if there's a --verbose flag, maybe).
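
   (Scripts could then rely on the exit status alone, along the lines
of:

btrfs subvolume delete "$snap" || echo "failed to delete $snap" >&2

with nothing to filter on the success path.)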

   Hugo.

-- 
Hugo Mills | If you see something, say nothing and drink to
hugo@... carfax.org.uk | forget
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Welcome to Night Vale




Re: Does btrfs use crc32 for error correction?

2017-09-19 Thread Hugo Mills
On Tue, Sep 19, 2017 at 06:35:48PM +0300, Timofey Titovets wrote:
> Stupid question:
> Does btrfs use crc32 for error correction?

   It uses it for error _detection_. On read, it'll verify the data
(or metadata) against the checksum.

   With no redundancy (single, RAID-0), a bad csum check will return
I/O error.

   With redundancy (RAID-1, 10, 5, 6), a bad csum check will try
reading the other copy. If that's good, it will use it and repair the
broken copy.
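
   A toy model of that read path, for illustration -- zlib's CRC32
standing in here for the checksum btrfs actually stores:

import zlib

block = bytearray(b'some data.' * 400)   # stand-in for one data block
stored_csum = zlib.crc32(block)          # btrfs keeps this in metadata

block[17] ^= 0x01                        # a single bit flips on "disk"

if zlib.crc32(block) != stored_csum:
    # single/RAID-0: return EIO to the reader;
    # RAID-1/10/5/6: read the other copy, and if that one verifies,
    # return it and rewrite the bad copy
    print("csum mismatch detected")

So: detection always, but repair only where a second copy exists.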

   Hugo.

> If no, why?
> 
> (AFAIK when using a CRC it is possible to fix a single-bit flip)
> 
> P.S. I tried to check that (I created an image, created a text file,
> flipped a bit, tried to read it, and btrfs showed an IO error)
> 
> Thanks!

-- 
Hugo Mills | Dullest spy film ever: The Eastbourne Ultimatum
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   The Thick of It




Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Hugo Mills
On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote:
> On 09/15/2017 12:18 AM, Hugo Mills wrote:
> >As far as I know, both of these are basically known issues, with no
> > good solution, other than not using O_DIRECT. Certainly the first
> > issue is one I recognise. The second isn't one I recognise directly,
> > but is unsurprising to me.
> > 
> >There have been discussions -- including developers -- on this list
> > as recent as a month or so ago. The general outcome seems to be that
> > any problems with O_DIRECT are not going to be fixed.
> 
> I missed this thread; could you point it to me ?

   No, you didn't miss it -- you were part of it. :)

   http://www.spinics.net/lists/linux-btrfs/msg68244.html

   Hugo.

> If csum and O_DIRECT are not reliable, why not disallow one of them: i.e.
> allow O_DIRECT only on nodatasum files... ZFS (on Linux) does not support 
> O_DIRECT at all...
> 
> In fact most of the applications which benefit from O_DIRECT (VMs and
> DBs come to mind) are the ones which also need nodatasum for good performance.
> 
> One of the strongest points of BTRFS was the checksums; but these are not
> effective when the file is opened with O_DIRECT; worse, there are cases where
> the file is corrupted and the application gets -EIO; not to mention that
> dmesg is filled with "csum failed ..." messages
> 
> 
> > 
> >Hugo.
> > 
> > On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> >> Hi all,
> >>
> >> I discovered two bugs when O_DIRECT is used...
> >>
> >> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> >>
> >> Normally BTRFS prevents access to the contents of a corrupted file;
> >> however I was able to read the content of a corrupted file simply using 
> >> O_DIRECT
> >>
> >> # in a new btrfs filesystem, create a file
> >> $ sudo mkfs.btrfs -f /dev/sdd5
> >> $ mount /dev/sdd5 t
> >> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> >> bs=$((16*1024)) iflag=fullblock count=1024
> >>
> >> # corrupt the file
> >> $ sudo filefrag -v t/abcd 
> >> Filesystem type is: 9123683e
> >> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
> >>  ext: logical_offset:physical_offset: length:   expected: 
> >> flags:
> >>0:0..3475:  70656.. 74131:   3476:
> >>1: 3476..4095:  74212.. 74831:620:  74132: 
> >> last,eof
> >> t/abcd: 2 extents found
> >> $ sudo umount t
> >> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> >> /dev/sdd5
> >> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> >> corrupting 289406976 copy 1
> >>
> >> # try to access the file; expected result: -EIO
> >> $ sudo mount /dev/sdd5 t
> >> $ dd if=t/abcd | hexdump -c | head
> >> dd: error reading 't/abcd': Input/output error
> >> 0+0 records in
> >> 0+0 records out
> >> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> >>
> >>
> >> # try to access the file using O_DIRECT; expected result: -EIO, instead 
> >> the file is accessible
> >> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> >> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> >> *
> >> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >>
> >> (dmesg report the checksum mismatch)
> >> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 
> >> 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> >>
> >> Note the first 4k filled by 0x01 !
> >>
> >> Conclusion: even if the file is corrupted, and normally BTRFS prevents
> >> access to it, using O_DIRECT
> >> a) no error is returned to the caller
> >> b) instead of the page stored on the disk, a page filled with 0x01 is returned

Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Hugo Mills
> b) instead of the page stored on the disk, a page filled with 0x01 is returned
> Even worse than a)
> 
> Note1: even using O_DIRECT with O_SYNC, the problem still persist.
> Note2: the man page of open(2) is filled with a lot of notes about O_DIRECT,
> but it also states that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legal.
> Note3: even "ZFS on Linux" has its troubles with O_DIRECT: in fact ZFS doesn't 
> support it; see https://github.com/zfsonlinux/zfs/issues/224
> 
> BR
> G.Baroncelli
> 
> - cut --- cut --- cut 
> 
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <errno.h>
> #include <assert.h>
> #include <unistd.h>
> #include <fcntl.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <sys/mman.h>
> 
> #define FILESIZE  (4096*4)
> 
> int fd;
> char *buffer = NULL;
> 
> void read_thread(const char *nf) {
>   
>   void *data = mmap(NULL,  FILESIZE,
>   PROT_READ|PROT_WRITE, 
>   MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>   
>   assert(data);
>   fprintf(stderr, "read_thread:  data = %p\n", data);
>   int rfd;
>   rfd = open(nf, O_RDONLY);
>   
>   for(;;) {
>   ssize_t r = pread(rfd, data, FILESIZE, 0);
>   if (r < 0) {
>   int e = errno;
>   fprintf(stderr, "ERROR: read thread; e = %d - %s\n", 
>  e, strerror(e));
> 
>   } else if (r != FILESIZE) {
>   fprintf(stderr, "ERROR: read thread; r = %ld, expected 
> = %d\n", 
>  r, FILESIZE);
>   }
>   }
> }
> 
> void write_thread(void) {
> 
>   for(;;) {
>   ssize_t r = pwrite(fd, buffer, FILESIZE, 0);
>   assert(r == FILESIZE);
>   }
> }
> 
> void update_thread(void) {
> 
>   for(;;) {
>   int i;
>   for (i = 0 ; i < FILESIZE ; i++)
>   buffer[i] += i+10;
>   }
> }
> 
> 
> int main(int argc, char **argv) {
>   
>   if (argc < 2) {
>   fprintf(stderr, "usage: %s <file>\n", argv[0]);
>   exit(100);
>   }
>   
>   
>   buffer = mmap(NULL,  FILESIZE,
>   PROT_READ|PROT_WRITE, 
>   MAP_SHARED|MAP_ANONYMOUS, -1, 0);
>   
>   assert(buffer);
>   fprintf(stderr, "main:  data = %p\n", buffer);
>   
>   fd = open(argv[1], O_RDWR|O_DIRECT|O_CREAT, 0660);
>   assert(fd>=0);
>   
>   ssize_t r = pwrite(fd, buffer, FILESIZE, 0);
>   assert(r == FILESIZE);
>   
>   pid_t child;
>   
>   child = fork();
>   assert(child >= 0);
>   if (child == 0)
>   write_thread();
>   fprintf(stderr, "write_thread pid = %d\n", child);
>   
>   child = fork();
>   assert(child >= 0);
>   if (child == 0)
>   read_thread(argv[1]);
>   fprintf(stderr, "read_thread pid = %d\n", child);
>   
>   child = fork();
>   assert(child >= 0);
>   if (child == 0)
>   update_thread();
>   fprintf(stderr, "update_thread pid = %d\n", child);
>   
>   for(;;)
>   sleep(100*100*100);
> 
>   
>   return 0;
> }
> 
> - cut --- cut --- cut -- 

-- 
Hugo Mills | "There's more than one way to do it" is not a
hugo@... carfax.org.uk | commandment. It is a dire warning.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: snapshots of encrypted directories?

2017-09-14 Thread Hugo Mills
On Thu, Sep 14, 2017 at 04:57:39PM +0200, Ulli Horlacher wrote:
> I use encfs on top of btrfs.
> I can create btrfs snapshots, but I have no usable access to the files
> in these snapshots, because they look like:
> 
> drwx--  framstag users- 2017-09-08 11:47:18 
> uHjprldmxo3-nSfLmcH54HMW
> drwxr-xr-x  framstag users- 2017-09-08 11:47:18 
> wNEWaDCgyXTj0d-Myk8wXZfh
> -rw-r--r--  framstag users  377 2015-06-12 14:02:53 
> -zDmc7xfobKDkbl8z7oKOHxv
> -rw-r--r--  framstag users2,367 2012-07-10 14:32:30 
> 7pfKs27K9k5zANE4WOQEuFa2
> -rw---  framstag users  692 2009-10-20 13:45:41 
> 8SQElYCph85kDdcFasUHybVr
> -rw---  framstag users2,872 2017-08-31 16:21:52 
> bm,yNi1e4fsAClDv7lNxxSfJ
> lrwxrwxrwx  framstag users- 2017-06-01 15:53:00 
> GZxNYI0Gy96R18fz40f7k5rl -> 
> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
> -rw-r--r--  framstag users  182 2016-12-01 13:34:31 
> rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1
> 
> I have to mount the snapshot with encfs, to have access to the (decrypted)
> files. 
> 
> Any better ideas?

   I'd say it's doing exactly what it should be doing. You're making a
copy of an encrypted data store, and the result is encrypted. In order
to read it, it needs to have the decryption layer applied to it with
the correct key (which is the need to mount the snapshot with encfs).
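
   Concretely, something like this (a sketch -- the snapshot path and
encfs root directory are made up for illustration):

      mkdir -p /tmp/decrypted
      encfs /snapshots/2017-09-08/crypt /tmp/decrypted  # asks for the passphrase
      ls -l /tmp/decrypted                              # plaintext names again
      fusermount -u /tmp/decrypted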

   Would you _really_ want a system where the encrypted contents of a
subvolume can be decrypted by simply snapshotting it?

   Hugo.

-- 
Hugo Mills | Great films about cricket: Umpire of the Rising Sun
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: generic name for volume and subvolume root?

2017-09-09 Thread Hugo Mills
On Sat, Sep 09, 2017 at 06:58:38PM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年09月09日 18:48, Ulli Horlacher wrote:
> >On Sat 2017-09-09 (18:40), Qu Wenruo wrote:
> >
> >>>Is there a generic name for both volume and subvolume root?
> >>
> >>Nope, subvolume (including snapshot) is not distinguished by its
> >>filename/path/directory name.
> >>
> >>And you can only do snapshot on subvolume (snapshot is one kind of
> >>subvolume) boundary.
> >
> >So, I can also call a btrfs root volume a btrfs subvolume?
> 
> Yes, root volume is also a subvolume, so just call "btrfs root volume"
> a "subvolume".

   I find it's best to avoid the word "root" entirely, as it's got
several meanings, and it tends to get confusing in conversation.
Instead, we have:

 - "the top level" (subvolid=5)
 - "/" (what you see at / in your running system)
 - "/@" or similar names
   (the subvolume that's mounted at /)
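
   As a sketch of how those relate on the command line (device name
assumed):

      mount -o subvolid=5 /dev/sda1 /mnt/top     # "the top level"
      mount -o subvol=@   /dev/sda1 /mnt/atroot  # the subvolume mounted at /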

> >I am talking about documentation, not coding!
> >
> >I just want yo use the correct terms.
> 
> If you're referring to the term, I think "subvolume" is good enough; it
> represents your original term, "directories one can snapshot".
> 
> 
> For the whole btrfs "volume", I would just call it "filesystem" to
> avoid the name "volume" or "subvolume" at all.

   Yes, it's a filesystem. (Although that does occasionally cause
confusion between "the conceptual filesystem implemented by btrfs.ko"
and "the concrete filesystem stored on /dev/sda1", but it's generally
far less confusing than the overloading of "root").

   Hugo.

-- 
Hugo Mills | Well, you don't get to be a kernel hacker simply by
hugo@... carfax.org.uk | looking good in Speedos.
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Rusty Russell


signature.asc
Description: Digital signature


Re: generic name for volume and subvolume root?

2017-09-09 Thread Hugo Mills
On Sat, Sep 09, 2017 at 10:35:51AM +0200, Ulli Horlacher wrote:
> As I am writing some documentation about creating snapshots:
> Is there a generic name for both volume and subvolume root?
> 
> Example:
> 
> root@fex:~# btrfs subvol show /mnt
> ERROR: not a subvolume: /mnt
> 
> root@fex:~# btrfs subvol show /mnt/test
> /mnt/test is toplevel subvolume
> 
> root@fex:~# btrfs subvol show /mnt/test/data
> /mnt/test/data
> Name:   data
> UUID:   b32a5949-dfd6-ef45-8616-34ae4cdf6fb8
> (...)
> 
> root@fex:~# btrfs subvol show /mnt/test/data/sw
> ERROR: not a subvolume: /mnt/test/data/sw
> 
> 
> I can create snapshots of /mnt/test and /mnt/test/data, but not of /mnt
> and /mnt/test/data/sw
> 
> Is there a simple name for directories I can snapshot?

   Subvolume. If you can snapshot it, it's a subvolume. Some
subvolumes are also snapshots. (And all snapshots are subvolumes).

   The subvolume with ID 5 (or ID 0, which is an alias) is the "top
level subvolume", and has the unique property that it can't be
renamed, deleted or replaced, where all other subvolumes can be.

   Hugo.

-- 
Hugo Mills | Well, you don't get to be a kernel hacker simply by
hugo@... carfax.org.uk | looking good in Speedos.
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Rusty Russell


signature.asc
Description: Digital signature


Re: test if a subvolume is a snapshot?

2017-09-08 Thread Hugo Mills
On Fri, Sep 08, 2017 at 05:12:11PM +0100, Tomasz Kłoczko wrote:
> On 8 September 2017 at 16:38, Hugo Mills <h...@carfax.org.uk> wrote:
> [..]
> >> sometimes I'm really thinking about start rewrite btrfs-progs to make
> >> btrfs basic tools syntax as similar as it is only possible to ZFS zfs,
> >> zpool and zdb commands on using which in +90% cases you can guess how
> >> necessary syntax must look like without looking on man pages.
> >>
> >> Any volunteers want to join to help implement something like this?
> >> Maybe someone already started doing this?
> >
> >The main complaint that can be directed at the btrfs command is
> > that its output is rarely machine-processable. It would therefore make
> > sense to have a "--table" or "--structured" mode for output, which
> > would be more trivially parsable by shell tools.
> 
> Output of the btrfs command is a completely different pair of shoes.
> On making btrfs tools similar to their ZFS analogues, *obviously* the
> output should be made similar as well.
> That would solve the complaints about unreadable output in one go.
> 
> For example zfs command parseable output is possible to generate by
> add -p switch in those subcommands where it is needed (no --tables or
> --structures .. just one switch).

   --tables _is_ one switch.

> Instead reinventing the wheel just please try to look first how it is

   What in what I said was reinventing a wheel? Literally the *only*
thing I was suggesting was adding some option to make the btrfs tool
output more machine-parsable.

   Call the option whatever you like. However, note that there are
probably very few single-letter options which are not used in at least
one of the btrfs tool subcommands.

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison


signature.asc
Description: Digital signature


Re: test if a subvolume is a snapshot?

2017-09-08 Thread Hugo Mills
On Fri, Sep 08, 2017 at 04:25:55PM +0100, Tomasz Kłoczko wrote:
> On 8 September 2017 at 14:10, David Sterba <dste...@suse.cz> wrote:
> > On Fri, Sep 08, 2017 at 10:54:46AM +0200, Ulli Horlacher wrote:
> > > How can I test if a subvolume is a snapshot?
> >
> > The inode number is 256 on a btrfs filesystem:
> >
> > if [ "$(stat -f --format=%T $path)" = btrfs -a "$(stat --format=%i $path)" = 256 ]; ...
> 
> This oneliner shows how badly the really basic btrfs tools' command syntax
> is broken by design :(
> Looking at how freakishly overcomplicated the btrfs command syntax is,
> a command like the above is completely unintuitive and unreadable

   This is nothing to do with btrfs tooling. The two commands involved
here are test (aka "[") and stat.
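
   Spelled out a little more readably, the same test might look like
this (a sketch):

      is_subvol() {
          [ "$(stat -f --format=%T -- "$1")" = btrfs ] &&
          [ "$(stat --format=%i -- "$1")" = 256 ]
      }
      is_subvol /mnt/test/data && echo subvolume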

> sometimes I'm really thinking about start rewrite btrfs-progs to make
> btrfs basic tools syntax as similar as it is only possible to ZFS zfs,
> zpool and zdb commands on using which in +90% cases you can guess how
> necessary syntax must look like without looking on man pages.
> 
> Any volunteers want to join to help implement something like this?
> Maybe someone already started doing this?

   The main complaint that can be directed at the btrfs command is
that its output is rarely machine-processable. It would therefore make
sense to have a "--table" or "--structured" mode for output, which
would be more trivially parsable by shell tools.

   Hugo.

-- 
Hugo Mills | Ceci est un travail pour l'Australien.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Louison, Delicatessen


signature.asc
Description: Digital signature


Re: Is autodefrag recommended? -- re-duplication???

2017-09-05 Thread Hugo Mills
On Tue, Sep 05, 2017 at 05:01:10PM +0300, Marat Khalili wrote:
> Dear experts,
> 
> At first my reaction to just switching autodefrag on was positive, but
> mentions of re-duplication are very scary. Main use of BTRFS here is
> backup snapshots, so re-duplication would be disastrous.
> 
> In order to stick to concrete example, let there be two files, 4KB
> and 4GB in size, referenced in read-only snapshots 100 times each,
> and some 4KB of both files are rewritten each night and then another
> snapshot is created (let's ignore snapshots deletion here). AFAIU
> 8KB of additional space (+metadata) will be allocated each night
> without autodefrag. With autodefrag will it be perhaps 4KB+128KB or
> something much worse?

   I'm going for 132 KiB (4+128).

   Of course, if there's two 4 KiB writes close together, then there's
less overhead, as they'll share the range.

   Hugo.

-- 
Hugo Mills | Once is happenstance; twice is coincidence; three
hugo@... carfax.org.uk | times is enemy action.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: Is autodefrag recommended?

2017-09-04 Thread Hugo Mills
On Mon, Sep 04, 2017 at 12:31:54PM +0300, Marat Khalili wrote:
> Hello list,
> good time of the day,
> 
> More than once I see mentioned in this list that autodefrag option
> solves problems with no apparent drawbacks, but it's not the
> default. Can you recommend to just switch it on indiscriminately on
> all installations?
> 
> I'm currently on kernel 4.4, can switch to 4.10 if necessary (it's
> Ubuntu that gives us this strange choice, no idea why it's not 4.9).
> Only spinning rust here, no SSDs.

   autodefrag effectively works by taking a small region around every
write or cluster of writes and making that into a stand-alone extent.

   This has two consequences:

 - You end up duplicating more data than is strictly necessary. This
   is, IIRC, something like 128 KiB for a write.

 - There's an I/O overhead for enabling autodefrag, because it's
   increasing the amount of data written.
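
   For reference, it's an ordinary mount option, so it's cheap to
experiment with (a sketch):

      mount -o remount,autodefrag /mnt     # switch it on
      mount -o remount,noautodefrag /mnt   # and off again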

   Hugo.

-- 
Hugo Mills | The future isn't what it used to be.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: joining to contribute

2017-09-01 Thread Hugo Mills
On Fri, Sep 01, 2017 at 01:15:45PM +0800, Qu Wenruo wrote:
> On 2017年09月01日 11:36, Anthony Riley wrote:
> >Hey folks,
> >
> >I thought I would finally take a swing at this; I've wanted to be a
> >kernel/fs dev for a few years now. My current $job is as an
> >Infrastructure Engineer. I'm currently teaching myself C and have
> >background in shell scripting & python. I love doing deep dives and
> >learning about linux internals.  I've read the btrfs.wiki and can't
> >really decide which project to choose to start.
> >
> >Also should I go through this https://kernelnewbies.org/FirstKernelPatch 
> >first?
> >Or should i start with something in Userspace?
> 
> Well, personally I strongly recommended to start with btrfs on-disk
> format first, and then btrfs-progs/test cases, and kernel
> contribution as final objective.

   Pick a project which bothers you -- is there some feature that you
want to have, or a particular bug, or a sharp corner that needs
rounding off?

   This next bit is purely my opinion. Feel free to ignore it if it
doesn't float your boat.

   For a relatively easy example (from a code point of view), there's
a bug with send-receive where stuff breaks if you try snapshotting and
then sending a subvolume which already has a received-uuid set.
There's probably several ways of dealing with this.

   Alongside this, there's also a requirement for being able to do
round-trip send/receive while preserving the ability to do incremental
sends. This is likely to be related to the above bug-fix. I did a
complete write-up of what's happening, and what needs to happen, here:

http://www.spinics.net/lists/linux-btrfs/msg44089.html

   If you can fix the first bug in a way that doesn't rule out the
round-trip work, that's great. If you can also get the round-trip
stuff working, that's even better (but be warned that it will get
postponed until kdave is ready for all the stream format changes to
happen at once).

> BTW, if you want to start with btrfs on-disk format, print-tree.c
> from btrfs-progs is a good start point and btrfs wiki has relatively
> well documented entry for it.

   Seconded on this. For the on-disk format, it's also useful to
create a small test filesystem and use btrfs-debug-tree on it as you
do stuff to it (create files, create subvols, make links, modify
files). Do this in conjunction with the "Data Structures" page, and
you can see how it all actually fits together.
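
   A minimal sketch of such a playground, using a loop file (on older
progs the dump command is the separate btrfs-debug-tree binary):

      truncate -s 1G /tmp/play.img
      mkfs.btrfs -f /tmp/play.img
      mkdir -p /mnt/play
      mount -o loop /tmp/play.img /mnt/play
      btrfs subvolume create /mnt/play/subvol
      echo hello >/mnt/play/subvol/file
      umount /mnt/play
      btrfs inspect-internal dump-tree /tmp/play.img | less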

   It took me a couple of weeks to learn how it all worked the first
time round, but I didn't have much detailed documentation to work from
back then.

   Since you speak python, there's also Hans's python-btrfs:

https://github.com/knorrie/python-btrfs

> https://btrfs.wiki.kernel.org/index.php/Btrfs_design
> https://btrfs.wiki.kernel.org/index.php/Btree_Items

And, more generally: 
https://btrfs.wiki.kernel.org/index.php/Main_Page#Developer_documentation

I'd also point out the Data Structures and Trees pages linked
there. Some of the information is a bit out of date, or represents a
prototype of what it's describing. The source code is canonical -- use
the documentation as a guide to help you see where in the source to
look.

   Hugo.

-- 
Hugo Mills | I'm always right. But I might be wrong about that.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11

2017-08-31 Thread Hugo Mills
On Thu, Aug 31, 2017 at 03:21:07PM -0400, Eric Wolf wrote:
> I've previously confirmed it's a bad ram module which I have already
> submitted an RMA for. Any advice for manually fixing the bits?

   What I'd do... use a hex editor and the contents of ctree.h as
documentation to find the byte in question, change it back to what it
should be, mount the FS, try reading the directory again, look up the
csum failure in dmesg, edit the block again to fix up the csum, and
it's done. (Yes, I've done this before, and I'm a massive nerd).

   It's also possible to use Hans van Kranenberg's btrfs-python to fix
up this kind of thing, but I've not done it myself. There should be a
couple of talk-throughs from Hans in various archives -- both this
list (find it on, say, http://www.spinics.net/lists/linux-btrfs/), and
on the IRC archives (http://logs.tvrrug.org.uk/logs/%23btrfs/latest.html).

> Sorry for top leveling, not sure how mailing lists work (again sorry
> if this message is top leveled, how do I ensure it's not?)

   Just write your answers _after_ the quoted text that you're
replying to, not before. It's a convention, rather than a technical
thing...

   Hugo.

> ---
> Eric Wolf
> (201) 316-6098
> 19w...@gmail.com
> 
> 
> On Thu, Aug 31, 2017 at 2:59 PM, Hugo Mills <h...@carfax.org.uk> wrote:
> >(Please don't top-post; edited for conversation flow)
> >
> > On Thu, Aug 31, 2017 at 02:44:39PM -0400, Eric Wolf wrote:
> >> On Thu, Aug 31, 2017 at 2:33 PM, Hugo Mills <h...@carfax.org.uk> wrote:
> >> > On Thu, Aug 31, 2017 at 01:53:58PM -0400, Eric Wolf wrote:
> >> >> I'm having issues with a bad block(?) on my root ssd.
> >> >>
> >> >> dmesg is consistently outputting "BTRFS critical (device sda2):
> >> >> corrupt leaf, bad key order: block=293438636032, root=1, slot=11"
> >> >>
> >> >> "btrfs scrub stat /" outputs "scrub status for 
> >> >> b2c9ff7b-[snip]-48a02cc4f508
> >> >> scrub started at Wed Aug 30 11:51:49 2017 and finished after 00:02:55
> >> >> total bytes scrubbed: 53.41GiB with 2 errors
> >> >> error details: verify=2
> >> >> corrected errors: 0, uncorrectable errors: 2, unverified errors: 0"
> >> >>
> >> >> Running "btrfs check --repair /dev/sda2" from a live system stalls
> >> >> after telling me corrupt leaf etc etc then "11 12". CPU usage hits
> >> >> 100% and disk activity remains at 0.
> >> >
> >> >This error is usually attributable to bad hardware. Typically RAM,
> >> > but might also be marginal power regulation (blown capacitor
> >> > somewhere) or a slightly broken CPU.
> >> >
> >> >Can you show us the output of "btrfs-debug-tree -b 293438636032 
> >> > /dev/sda2"?
> >
> >Here's the culprit:
> >
> > [snip]
> >> item 10 key (890553 EXTENT_DATA 0) itemoff 14685 itemsize 269
> >>inline extent data size 248 ram 248 compress 0
> >> item 11 key (890554 INODE_ITEM 0) itemoff 14525 itemsize 160
> >>inode generation 5386763 transid 5386764 size 135 nbytes 135
> >>block group 0 mode 100644 links 1 uid 10 gid 10
> >>rdev 0 flags 0x0
> >> item 12 key (856762 INODE_REF 31762) itemoff 14496 itemsize 29
> >>inode ref index 2745 namelen 19 name: dpkg.statoverride.0
> >> item 13 key (890554 EXTENT_DATA 0) itemoff 14340 itemsize 156
> >>inline extent data size 135 ram 135 compress 0
> > [snip]
> >
> >Note the objectid field -- the first number in the brackets after
> > "key" for each item. This sequence of values should be non-decreasing.
> > Thus, item 12 should have an objectid of 890554 to match the items
> > either side of it, and instead it has 856762.
> >
> >In hex, these are:
> >
> >>>> hex(890554)
> > '0xd96ba'
> >>>> hex(856762)
> > '0xd12ba'
> >
> >Which means you've had two bitflips close together:
> >
> >>>> hex(856762 ^ 890554)
> > '0x8400'
> >
> >Given that everything else is OK, and it's just one byte affected
> > in the middle of a load of data that's really quite sensitive to
> > errors, it's very unlikely that it's the result of a misplaced pointer
> > in the kernel, or some other subsystem accidentally walking over that
> > piece of RAM. It is, therefore, almost certainly your hardware that's
> > at fault.
> >
> > I would strongly suggest running memtest86 on your machine

Re: BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11

2017-08-31 Thread Hugo Mills
   (Please don't top-post; edited for conversation flow)

On Thu, Aug 31, 2017 at 02:44:39PM -0400, Eric Wolf wrote:
> On Thu, Aug 31, 2017 at 2:33 PM, Hugo Mills <h...@carfax.org.uk> wrote:
> > On Thu, Aug 31, 2017 at 01:53:58PM -0400, Eric Wolf wrote:
> >> I'm having issues with a bad block(?) on my root ssd.
> >>
> >> dmesg is consistently outputting "BTRFS critical (device sda2):
> >> corrupt leaf, bad key order: block=293438636032, root=1, slot=11"
> >>
> >> "btrfs scrub stat /" outputs "scrub status for b2c9ff7b-[snip]-48a02cc4f508
> >> scrub started at Wed Aug 30 11:51:49 2017 and finished after 00:02:55
> >> total bytes scrubbed: 53.41GiB with 2 errors
> >> error details: verify=2
> >> corrected errors: 0, uncorrectable errors: 2, unverified errors: 0"
> >>
> >> Running "btrfs check --repair /dev/sda2" from a live system stalls
> >> after telling me corrupt leaf etc etc then "11 12". CPU usage hits
> >> 100% and disk activity remains at 0.
> >
> >This error is usually attributable to bad hardware. Typically RAM,
> > but might also be marginal power regulation (blown capacitor
> > somewhere) or a slightly broken CPU.
> >
> >Can you show us the output of "btrfs-debug-tree -b 293438636032 
> > /dev/sda2"?

   Here's the culprit:

[snip]
> item 10 key (890553 EXTENT_DATA 0) itemoff 14685 itemsize 269
>inline extent data size 248 ram 248 compress 0
> item 11 key (890554 INODE_ITEM 0) itemoff 14525 itemsize 160
>inode generation 5386763 transid 5386764 size 135 nbytes 135
>block group 0 mode 100644 links 1 uid 10 gid 10
>rdev 0 flags 0x0
> item 12 key (856762 INODE_REF 31762) itemoff 14496 itemsize 29
>inode ref index 2745 namelen 19 name: dpkg.statoverride.0
> item 13 key (890554 EXTENT_DATA 0) itemoff 14340 itemsize 156
>inline extent data size 135 ram 135 compress 0
[snip]

   Note the objectid field -- the first number in the brackets after
"key" for each item. This sequence of values should be non-decreasing.
Thus, item 12 should have an objectid of 890554 to match the items
either side of it, and instead it has 856762.

   In hex, these are:

>>> hex(890554)
'0xd96ba'
>>> hex(856762)
'0xd12ba'

   Which means you've had two bitflips close together:

>>> hex(856762 ^ 890554)
'0x8400'

   Given that everything else is OK, and it's just one byte affected
in the middle of a load of data that's really quite sensitive to
errors, it's very unlikely that it's the result of a misplaced pointer
in the kernel, or some other subsystem accidentally walking over that
piece of RAM. It is, therefore, almost certainly your hardware that's
at fault.

   I would strongly suggest running memtest86 on your machine -- I'd
usually say a minimum of 8 hours, or longer if you possibly can (24
hours), or until you have errors reported. If you get errors reported
in the same place on multiple passes, then it's the RAM. If you have
errors scattered around seemingly at random, then it's probably your
power regulation (PSU or motherboard).

   Sadly, btrfs check on its own won't be able to fix this, as it's
two bits flipped. (It can cope with one bit flipped in the key, most
of the time, but not two). It can be fixed manually, if you're
familiar with a hex editor and the on-disk data structures.

   Hugo.

-- 
Hugo Mills | "You got very nice eyes, Deedee. Never noticed them
hugo@... carfax.org.uk | before. They real?"
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Don Logan, Sexy Beast


signature.asc
Description: Digital signature


Re: BTRFS critical (device sda2): corrupt leaf, bad key order: block=293438636032, root=1, slot=11

2017-08-31 Thread Hugo Mills
On Thu, Aug 31, 2017 at 01:53:58PM -0400, Eric Wolf wrote:
> I'm having issues with a bad block(?) on my root ssd.
> 
> dmesg is consistently outputting "BTRFS critical (device sda2):
> corrupt leaf, bad key order: block=293438636032, root=1, slot=11"
> 
> "btrfs scrub stat /" outputs "scrub status for b2c9ff7b-[snip]-48a02cc4f508
> scrub started at Wed Aug 30 11:51:49 2017 and finished after 00:02:55
> total bytes scrubbed: 53.41GiB with 2 errors
> error details: verify=2
> corrected errors: 0, uncorrectable errors: 2, unverified errors: 0"
> 
> Running "btrfs check --repair /dev/sda2" from a live system stalls
> after telling me corrupt leaf etc etc then "11 12". CPU usage hits
> 100% and disk activity remains at 0.

   This error is usually attributable to bad hardware. Typically RAM,
but might also be marginal power regulation (blown capacitor
somewhere) or a slightly broken CPU.

   Can you show us the output of "btrfs-debug-tree -b 293438636032 /dev/sda2"?

   Hugo.

-- 
Hugo Mills | "You got very nice eyes, Deedee. Never noticed them
hugo@... carfax.org.uk | before. They real?"
http://carfax.org.uk/  |
PGP: E2AB1DE4  | Don Logan, Sexy Beast


signature.asc
Description: Digital signature


Re: deleted subvols don't go away?

2017-08-28 Thread Hugo Mills
On Mon, Aug 28, 2017 at 03:03:47PM +0300, Nikolay Borisov wrote:
> 
> 
> On 28.08.2017 11:07, Christoph Anton Mitterer wrote:
> > Thanks...
> > 
> > Still a bit strange that it displays that entry... especially with a
> > generation that seems newer than what I thought was the actually last
> > generation on the fs.
> 
> Snapshot destroy is a 2-phase process. The first phase deletes just the
> root references. After it you see what you've described. Then, later,
> when the cleaner thread runs again the snapshot's root item is going to
> be deleted for good and you no longer will see it.

   It's worth noting also that if the subvol is still used in some way
(still mounted, nested subvol, processes with CWD in it, open files),
then it won't be cleaned up until the usage stops. Basically the same
behaviour as deleting a file. This could also explain the more recent
than expected generation values.
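
   A sketch of checking for that, and of waiting for the cleaner
(paths assumed):

      lsof +D /mnt/subvol          # who still has open files / CWDs in it?
      btrfs subvolume delete /mnt/subvol
      btrfs subvolume sync /mnt    # block until deleted subvols are cleaned up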

   Hugo.

-- 
Hugo Mills | "Big data" doesn't just mean increasing the font
hugo@... carfax.org.uk | size.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Hugo Mills
On Tue, Aug 22, 2017 at 10:12:25AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-08-22 09:53, Ulli Horlacher wrote:
> >On Tue 2017-08-22 (09:37), Austin S. Hemmelgarn wrote:
> >
> >>>root@fex:~# df -T /local/.backup/home
> >>>Filesystem Type  1K-blocks  Used Available Use% Mounted on
> >>>-  -1073740800 104252160 967766336  10% /local/.backup/home
> >>
> >>Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and
> >>16.04.3 VM's I have (I only run current and the most recent LTS
> >>version), and neither of them behave like this.
> >
> >I have this kind of output on all of my Ubuntu hosts:
> >
> >root@moep:~# grep PRETTY_NAME /etc/os-release
> >PRETTY_NAME="Ubuntu 16.04.3 LTS"
> >
> >root@moep:~# df -T /usb/UF/tmp/blubb
> >Filesystem Type 1K-blocksUsed Available Use% Mounted on
> >-  - 12581888 3690524   7253700  34% /usb/UF/tmp/blubb
> >
> >root@moep:~# btrfs subvolume show /usb/UF/tmp/blubb
> >/usb/UF/tmp/blubb
> > Name:   blubb
> > UUID:   ecf8c804-d4a3-9948-89fe-b0c1971c25cb
> > Parent UUID:-
> > Received UUID:  -
> > Creation time:  2017-08-22 12:54:16 +0200
> > Subvolume ID:   262
> > Generation: 23
> > Gen at creation:22
> > Parent ID:  5
> > Top level ID:   5
> > Flags:  -
> > Snapshot(s):
> >
> >root@moep:~# dpkg -l | grep btrfs
> >ii  btrfs-tools 4.4-1ubuntu1 
> >amd64Checksumming Copy on Write Filesystem utilities
> >
> Hmm, interesting.  Are you using qgroups by chance?

   I get this behaviour (the "- -") only if it's a non-mounted
subvolume:

hrm@amelia:~ $ df -T .
Filesystem Type  1K-blocks Used Available Use% Mounted on
/dev/sdb1  btrfs 117220284 95271852  18611060  84% /home

hrm@amelia:~ $ sudo btrfs sub crea foo
Create subvolume './foo'

hrm@amelia:~ $ df -T ./foo
Filesystem Type 1K-blocks Used Available Use% Mounted on
-  -117220284 95271880  18611032  84% /home/hrm/foo

hrm@amelia:~ $ sudo mkdir foo/bar
hrm@amelia:~ $ df -T foo/bar
Filesystem Type 1K-blocks Used Available Use% Mounted on
-  -117220284 95271852  18611060  84% /home/hrm/foo

hrm@amelia:~ $ mkdir foo2

hrm@amelia:~ $ sudo mount /dev/sdb1 ./foo2 -o subvol=home/hrm/foo

hrm@amelia:~ $ df -T foo2
Filesystem Type  1K-blocks Used Available Use% Mounted on
/dev/sdb1  btrfs 117220284 95272384  18610528  84% /home/hrm/foo2

   Hugo.

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5


signature.asc
Description: Digital signature


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Hugo Mills
On Tue, Aug 22, 2017 at 02:23:50PM +0200, Ulli Horlacher wrote:
> How do I find the root filesystem of a subvolume?
> Example:
> 
> root@fex:~# df -T 
> Filesystem Type  1K-blocks  Used Available Use% Mounted on
> -  -1073740800 104244552 967773976  10% /local/.backup/home

   I've never seen the "- -" output from df before. Is this a bind
mount or something?

> root@fex:~# btrfs subvolume show /local/.backup/home
> /local/.backup/home
> Name:   home
> uuid:   f86a2db0-6a82-124f-9a71-1cd4c20fd6fb
> Parent uuid:ba4d388f-44bf-7b46-b2b8-00e2a9a87181
> Creation time:  2017-08-10 22:19:15
> Object ID:  383
> Generation (Gen):   148
> Gen at creation:148
> Parent: 5
> Top Level:  5
> Flags:  readonly
> Snapshot(s):
> 
> 
> I know the root filesystem is /local, but how can I show it by command?

   Probably in /proc/self/mountinfo -- that should give you the full
set of applied mount options, plus the original source for the mount
(which will be a block device for most filesystem mounts, a path for
bind mounts, or something FS-specific for network filesystems).
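
   A sketch of two ways of asking the question:

      findmnt -T /local/.backup/home    # the mount the path lives under
      grep btrfs /proc/self/mountinfo   # raw view, including subvol= options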

   Hugo.

-- 
Hugo Mills | And what rough beast, its hour come round at last /
hugo@... carfax.org.uk | slouches towards Bethlehem, to be born?
http://carfax.org.uk/  |
PGP: E2AB1DE4  | W.B. Yeats, The Second Coming


signature.asc
Description: Digital signature


Re: Btrfs data recovery

2017-08-13 Thread Hugo Mills
On Sun, Aug 13, 2017 at 07:12:48PM +0200, Christian Rene Thelen wrote:
> I have formatted an encrypted disk containing an LVM with a btrfs system.

   What did you format it as? (i.e. what are the locations of the
damaged blocks?)

> All superblocks appear to be destroyed; the btrfs-progs tools can't
> find the root tree anymore and scalpel, binwalk, foremost & co
> return only scrap. The filesystem was on an ssd and mounted with -o
> compression=lzo.

   The compression would explain the junk you're getting from the
carving tools. They tend to rely on being able to identify sequences
of bytes as something recognisable -- compression defeats that by
reducing everything to (statistically) random bits.

> How screwed am I?

   Quite badly.

> Any chances to recover some files?

   The compression isn't helping, as noted above.

   The metadata will be uncompressed, though, so that should be
readable, depending on how much was formatted/damaged in the original
incident.

> Is there a plausible way to rebuild the superblock manually?
> Checking the raw image with xxd gives me not a single readable word.

   That's unsurprising. Metadata isn't human-readable, and nor is
compressed data.

   Did you ever balance this filesystem? More particularly, did you
ever balance the metadata? If you did, then there's a good chance it
wasn't at the front of the device, and so has a much smaller chance of
being damaged.

> I managed to decrypt the LV and dd it to an image. What can I do?

   btrfs-find-root may be able to find some of the tree heads. That at
minimum is the information you need in order to reconstruct the
superblock (well, that plus the UUID, but the UUID is going to be all
over the place -- it shouldn't be hard to find that if the rest is
discoverable).
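
   A sketch of that workflow (the tree address is whatever find-root
turns up):

      btrfs-find-root /path/to/decrypted.img
      btrfs restore -t <bytenr> -iv /path/to/decrypted.img /mnt/rescue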

   That said, recovering this is going to be somewhere between very
hard and miraculous.

   Hugo.

-- 
Hugo Mills | But somewhere along the line, it seems
hugo@... carfax.org.uk | That pimp became cool, and punk mainstream.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Machinae Supremacy, Rise


signature.asc
Description: Digital signature


Re: btrfs issue with mariadb incremental backup

2017-08-12 Thread Hugo Mills
On Sat, Aug 12, 2017 at 03:34:01PM -0600, Chris Murphy wrote:
> On Fri, Aug 11, 2017 at 11:08 PM,  <siranee...@tpc.co.th> wrote:
> 
> 
> > The backup script has the btrfs sync command since Aug 3
> 
> 
> From your script:
> > system btrfs sub snap -r $basepath $snappath
> > system btrfs sub sync $basepath
> 
> From the man page: sync  [subvolid...]
>Wait until given subvolume(s) are completely removed from the
>filesystem after deletion.
> 
> 
> This 'subvolume sync' command, per the man page, is only about
> subvolume deletion. I suggest replacing it with a regular sync
> command.
> 
> I think the problem is that the script does things so fast that the
> snapshot is not always consistent on disk before btrfs send starts.
> It's just a guess though. If I'm right, this means the rsync mismatches
> mean the destination snapshots are bad. Here's what I would do:

   I don't see how that can happen. Snapshots are atomic -- they're
either there or not there. It's not a matter even of copying the
metadata part of the subvol. It's literally just adding a pointer to
point at an existing FS tree.

   Hugo.

-- 
Hugo Mills | If it's December 1941 in Casablanca, what time is it
hugo@... carfax.org.uk | in New York?
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Rick Blaine, Casablanca


signature.asc
Description: Digital signature


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-12 Thread Hugo Mills
On Sat, Aug 12, 2017 at 01:51:46PM +0200, Christoph Anton Mitterer wrote:
> On Sat, 2017-08-12 at 00:42 -0700, Christoph Hellwig wrote:
> > And how are you going to write your data and checksum atomically when
> > doing in-place updates?
> 
> Maybe I misunderstand something, but what's the big deal with not doing
> it atomically (I assume you mean in terms of actually writing to the
> physical medium)? Isn't that already a problem anyway in case of a
> crash?

   With normal CoW operations, the atomicity is achieved by
constructing a completely new metadata tree containing both changes
(references to the data, and the csum metadata), and then atomically
changing the superblock to point to the new tree, so it really is
atomic.

   With nodatacow, that approach doesn't work, because the new data
replaces the old on the physical medium, so you'd have to make the
data write atomic with the superblock write -- which can't be done,
because it's (at least) two distinct writes.

> And isn't that the case also with all forms of e.g. software RAID (when
> not having a journal)?
> 
> And as I've said, what's the worst thing that can happen? Either the
> data would not have been completely written - with or without
> checksumming. Then what's the difference to try the checksumming (and
> do it successfully in all non crash cases)?
> My understanding was (but that may be wrong of course, I'm not a
> filesystem expert at all), that the worst that can happen is that data and
> csum aren't *both* fully written (in all possible combinations), so
> we'd have four cases in total:
> 
> data=good csum=good => fine
> data=bad  csum=bad  => doesn't matter whether csum or not and whether atomic 
> or not
> data=bad  csum=good => the csum will tell us, that the data is bad
> data=good csum=bad  => the only real problem, data would be actually
>   good, but csum is not

   I don't think this is a particularly good description of the
problem. I'd say it's more like this:

   If you write data and metadata separately (which you have to do in
the nodatacow case), and the system halts between the two writes, then
you either have the new data with the old csum, or the old csum with
the new data. Both data and csum are "good", but good from different
states of the FS. In both cases (data first or metadata first), the
csum doesn't match the data, and so you now have an I/O error reported
when trying to read that data.

   You can't easily fix this, because when the data and csum don't
match, you need to know the _reason_ they don't match -- is it because
the machine was interrupted during write (in which case you can fix
it), or is it because the hard disk has had someone write data to it
directly, and the data is now toast (in which case you shouldn't fix
the I/O error)?

   Basically, nodatacow bypasses the very mechanisms that are meant to
provide consistency in the filesystem.
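
   For reference, nodatacow can also be set per file, and files
created that way carry no data csums at all -- which at least makes
the trade-off explicit. A sketch (the attribute only sticks on an
empty file):

      touch /mnt/vm.img
      chattr +C /mnt/vm.img   # no CoW, hence no data csums, for this file
      lsattr /mnt/vm.img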

   Hugo.

-- 
Hugo Mills | vi vi vi: the Editor of the Beast.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Hugo Mills
On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
> On 08/10/2017 04:30 AM, Eric Biggers wrote:
> >
> >These benchmarks are misleading because they compress the whole file as a
> >single stream without resetting the dictionary, which isn't how data will
> >typically be compressed in kernel mode.  With filesystem compression the data
> >has to be divided into small chunks that can each be decompressed 
> >independently.
> >That eliminates one of the primary advantages of Zstandard (support for large
> >dictionary sizes).
> 
> I did btrfs benchmarks of kernel trees and other normal data sets as
> well.  The numbers were in line with what Nick is posting here.
> zstd is a big win over both lzo and zlib from a btrfs point of view.
> 
> It's true Nick's patches only support a single compression level in
> btrfs, but that's because btrfs doesn't have a way to pass in the
> compression ratio.  It could easily be a mount option, it was just
> outside the scope of Nick's initial work.

   Could we please not add more mount options? I get that they're easy
to implement, but it's a very blunt instrument. What we tend to see
(with both nodatacow and compress) is people using the mount options,
then asking for exceptions, discovering that they can't do that, and
then falling back to doing it with attributes or btrfs properties.
Could we just start with btrfs properties this time round, and cut out
the mount option part of this cycle.
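
   A sketch of what that looks like through the existing property
interface (assuming zstd gets wired in as a value alongside zlib and
lzo):

      btrfs property set /mnt/data compression zstd
      btrfs property get /mnt/data compression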

   In the long run, it'd be great to see most of the btrfs-specific
mount options get deprecated and ultimately removed entirely, in
favour of attributes/properties, where feasible.

   Hugo.

-- 
Hugo Mills | Klytus! Are your men on the right pills? Maybe you
hugo@... carfax.org.uk | should execute their trainer!
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Ming the Merciless, Flash Gordon


signature.asc
Description: Digital signature


Re: [PATCH v3] Btrfs: fix out of bounds array access while reading extent buffer

2017-08-09 Thread Hugo Mills
On Wed, Aug 09, 2017 at 11:10:16AM -0600, Liu Bo wrote:
> There is a cornel case that slip through the checkers in functions
 ^^ corner

   Sorry, that's been bugging me every time it goes past. A cornel is
a kind of tree, apparently.

   Hugo.

> reading extent buffer, ie.
> 
> if (start < eb->len) and (start + len > eb->len),
> then
> 
> a) map_private_extent_buffer() returns immediately because
> it's thinking the range spans across two pages,
> 
> b) and the checkers in read_extent_buffer(), WARN_ON(start > eb->len)
> and WARN_ON(start + len > eb->start + eb->len), both are OK in this
> corner case, but it'd actually try to access the eb->pages out of
> bounds because of (start + len > eb->len).
> 
> The case is found by switching extent inline ref type from shared data
> ref to non-shared data ref, which is a kind of metadata corruption.
> 
> It'd use the wrong helper to access the eb,
> eg. btrfs_extent_data_ref_root(eb, ref) is used but the %ref passing
> here is "struct btrfs_shared_data_ref".  And if the extent item
> happens to be the first item in the eb, then offset/length will get
> over eb->len which ends up an invalid memory access.
> 
> This is adding proper checks in order to avoid invalid memory access,
> ie. 'general protection fault', before it's too late.
> 
> Reviewed-by: Filipe Manana <fdman...@suse.com>
> Signed-off-by: Liu Bo <bo.li@oracle.com>
> ---
> 
> v3: Remove the unnecessary ASSERT and num_pages.
> 
> v2: Improve the commit log to clarify that this can only happen if
> metadata is corrupted.
> 
>  fs/btrfs/extent_io.c | 20 
>  1 file changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index 0aff9b2..e6c6853 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5417,8 +5417,12 @@ void read_extent_buffer(struct extent_buffer *eb, void 
> *dstv,
>   size_t start_offset = eb->start & ((u64)PAGE_SIZE - 1);
>   unsigned long i = (start_offset + start) >> PAGE_SHIFT;
>  
> - WARN_ON(start > eb->len);
> - WARN_ON(start + len > eb->start + eb->len);
> + if (start + len > eb->len) {
> + WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, 
> wanted %lu %lu\n",
> +  eb->start, eb->len, start, len);
> + memset(dst, 0, len);
> + return;
> + }
>  
>   offset = (start_offset + start) & (PAGE_SIZE - 1);
>  
> @@ -5491,6 +5495,12 @@ int map_private_extent_buffer(struct extent_buffer 
> *eb, unsigned long start,
>   unsigned long end_i = (start_offset + start + min_len - 1) >>
>   PAGE_SHIFT;
>  
> + if (start + min_len > eb->len) {
> + WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, 
> wanted %lu %lu\n",
> +eb->start, eb->len, start, min_len);
> + return -EINVAL;
> + }
> +
>   if (i != end_i)
>   return 1;
>  
> @@ -5502,12 +5512,6 @@ int map_private_extent_buffer(struct extent_buffer 
> *eb, unsigned long start,
>   *map_start = ((u64)i << PAGE_SHIFT) - start_offset;
>   }
>  
> - if (start + min_len > eb->len) {
> - WARN(1, KERN_ERR "btrfs bad mapping eb start %llu len %lu, 
> wanted %lu %lu\n",
> -eb->start, eb->len, start, min_len);
> - return -EINVAL;
> - }
> -
>   p = eb->pages[i];
>   kaddr = page_address(p);
>   *map = kaddr + offset;

-- 
Hugo Mills | Jazz is the sort of music where no-one plays
hugo@... carfax.org.uk | anything the same way once.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: Crashed filesystem, nothing helps

2017-08-02 Thread Hugo Mills
> parent transid verify failed on 292552704 wanted 1486826 found 1486086
> parent transid verify failed on 292585472 wanted 1486826 found 1486086
> parent transid verify failed on 292585472 wanted 1486826 found 1486086
> parent transid verify failed on 292585472 wanted 1486826 found 1486086
> parent transid verify failed on 292585472 wanted 1486826 found 1486086
> Ignoring transid failure
> parent transid verify failed on 290766848 wanted 1486826 found 1486085
> Ignoring transid failure
> leaf parent key incorrect 290766848
> bad block 290766848
> ERROR: errors found in extent allocation tree or chunk allocation
> parent transid verify failed on 290766848 wanted 1486826 found 1486085
> Ignoring transid failure
> leaf parent key incorrect 290766848
> ERROR: failed to repair root items: Operation not permitted
> mainframe:~ # btrfs rescue super-recover /dev/sdb1
> All supers are valid, no need to recover
> mainframe:~ # btrfs rescue zero-log /dev/sdb1
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> parent transid verify failed on 29392896 wanted 1486833 found 1486836
> Ignoring transid failure
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> parent transid verify failed on 29409280 wanted 1486829 found 1486833
> Ignoring transid failure
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> parent transid verify failed on 29376512 wanted 1327723 found 1486833
> Ignoring transid failure
> Clearing log on /dev/sdb1, previous log_root 0, level 0
> parent transid verify failed on 29474816 wanted 1486833 found 1486837
> Ignoring transid failure
> leaf parent key incorrect 29474816
> disk-io.c:524: update_cowonly_root: BUG_ON `ret` triggered, value -1
> btrfs(+0x26304)[0x555686369304]
> btrfs(btrfs_commit_transaction+0x98)[0x55568636aed8]
> btrfs(+0x69efa)[0x5556863acefa]
> btrfs(main+0x84)[0x555686361e34]
> /lib64/libc.so.6(__libc_start_main+0xf1)[0x7f43653df291]
> btrfs(_start+0x2a)[0x555686361f4a]
> Aborted (core dumped)
> mainframe:~ # btrfs rescue chunk-recover /dev/sdb1
> This command still runs, even after 15 hours.

> Is there anything else i could do?

   Well, a good start would be to take a time machine and go back a
day or so and not run all the random recovery tools you can get your
hands on... Most of them are quite specialised (super-recover,
chunk-recover, zero-log). It's fairly unlikely that they've made
things worse, but it's not certain. Certainly zero-log will have
modified the filesystem. I don't know about the other two.

   More productively, it's definitely worth trying to mount with the
-o usebackuproot option. (And -o usebackuproot,ro as well). The
transid verify failure is a small difference in generations, and it's
likely that the older metadata is still there. If that's the case, the
mount with usebackuproot will work, and you should be good to continue
as normal.
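
   Concretely (a sketch):

      mount -o ro,usebackuproot /dev/sdb1 /mnt

   If that comes up and the data looks sane, unmount and try again
read-write.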

   If that doesn't work, then btrfs restore to refresh your backups is
probably the way to go, followed by mkfs and restore from backup.

   Hugo.

-- 
Hugo Mills | "How deep will this sub go?"
hugo@... carfax.org.uk | "Oh, she'll go all the way to the bottom if we don't
http://carfax.org.uk/  | stop her."
PGP: E2AB1DE4  |  U571


signature.asc
Description: Digital signature


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Hugo Mills
On Tue, Aug 01, 2017 at 10:56:39AM -0600, Liu Bo wrote:
> On Tue, Aug 01, 2017 at 05:28:57PM +0000, Hugo Mills wrote:
> >Hi,
> > 
> >Great to see something addressing the write hole at last.
> > 
> > On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> > > This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> > > separate disk as a journal (aka raid5/6 log), so that after unclean
> > > shutdown we can make sure data and parity are consistent on the raid
> > > array by replaying the journal.
> > 
> >What's the behaviour of the FS if the log device dies during use?
> >
> 
> Error handling on IOs is still under construction (belongs to known
> limitations).
> 
> If the log device dies suddenly, I think we could skip the writeback
> to backend raid arrays and follow the rule in btrfs, flip the FS to
> readonly as it may expose data loss.  What do you think?

   I think the key thing for me is that the overall behaviour of the
redundancy in the FS is not compromised by the logging solution. That
is, the same guarantees still hold: For RAID-5, you can lose up to one
device of the FS (*including* any log devices), and the FS will
continue to operate normally, but degraded. For RAID-6, you can lose
up to two devices without losing any capabilities of the FS. Dropping
to read-only if the (single) log device fails would break those
guarantees.

   I quite like the idea of embedding the log chunks into the
allocated structure of the FS -- although as pointed out, this is
probably going to need a new chunk type, and (to retain the guarantees
of the RAID-6 behaviour above) the ability to do 3-way RAID-1 on those
chunks. You'd also have to be able to balance the log structures while
in flight. It sounds like a lot more work for you, though.

   Hmm... if 3-way RAID-1 (3c) is available, then you could also have
RAID-1*3 on metadata, RAID-6 on data, and have 2-device redundancy
throughout. That's also a very attractive configuration in many
respects. (Analogous to RAID-1 metadata and RAID-5 data).

   Hugo.

-- 
Hugo Mills | That's not rain, that's a lake with slots in it.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: [PATCH 00/14 RFC] Btrfs: Add journal for raid5/6 writes

2017-08-01 Thread Hugo Mills
   Hi,

   Great to see something addressing the write hole at last.

On Tue, Aug 01, 2017 at 10:14:23AM -0600, Liu Bo wrote:
> This aims to fix write hole issue on btrfs raid5/6 setup by adding a
> separate disk as a journal (aka raid5/6 log), so that after unclean
> shutdown we can make sure data and parity are consistent on the raid
> array by replaying the journal.

   What's the behaviour of the FS if the log device dies during use?

   Hugo.

> The idea and the code are similar to the write-through mode of md
> raid5-cache, so ppl(partial parity log) is also feasible to implement.
> (If you've been familiar with md, you may find this patch set is
> boring to read...)
> 
> Patch 1-3 are about adding a log disk, patch 5-8 are the main part of
> the implementation, the rest patches are improvements and bugfixes,
> eg. readahead for recovery, checksum.
> 
> Two btrfs-progs patches are required to play with this patch set, one
> is to enhance 'btrfs device add' to add a disk as raid5/6 log with the
> option '-L', the other is to teach 'btrfs-show-super' to show
> %journal_tail.
> 
> This is currently based on 4.12-rc3.
> 
> The patch set is tagged with RFC, and comments are always welcome,
> thanks.
> 
> Known limitations:
> - Deleting a log device is not implemented yet.
> 
> 
> Liu Bo (14):
>   Btrfs: raid56: add raid56 log via add_dev v2 ioctl
>   Btrfs: raid56: do not allocate chunk on raid56 log
>   Btrfs: raid56: detect raid56 log on mount
>   Btrfs: raid56: add verbose debug
>   Btrfs: raid56: add stripe log for raid5/6
>   Btrfs: raid56: add reclaim support
>   Btrfs: raid56: load r5log
>   Btrfs: raid56: log recovery
>   Btrfs: raid56: add readahead for recovery
>   Btrfs: raid56: use the readahead helper to get page
>   Btrfs: raid56: add csum support
>   Btrfs: raid56: fix error handling while adding a log device
>   Btrfs: raid56: initialize raid5/6 log after adding it
>   Btrfs: raid56: maintain IO order on raid5/6 log
> 
>  fs/btrfs/ctree.h|   16 +-
>  fs/btrfs/disk-io.c  |   16 +
>  fs/btrfs/ioctl.c|   48 +-
>  fs/btrfs/raid56.c   | 1429 ++-
>  fs/btrfs/raid56.h   |   82 +++
>  fs/btrfs/transaction.c  |2 +
>  fs/btrfs/volumes.c  |   56 +-
>  fs/btrfs/volumes.h  |7 +-
>  include/uapi/linux/btrfs.h  |3 +
>  include/uapi/linux/btrfs_tree.h |4 +
>  10 files changed, 1487 insertions(+), 176 deletions(-)
> 

-- 
Hugo Mills | Some days, it's just not worth gnawing through the
hugo@... carfax.org.uk | straps
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


signature.asc
Description: Digital signature


Re: Massive loss of disk space

2017-08-01 Thread Hugo Mills
> UUID: c46df8fa-03db-4b32-8beb-5521d9931a31
> checking extents
> checking free space cache
> checking fs roots
> checking csums
> checking root refs
> found 5057294639104 bytes used err is 0
> total csum bytes: 4529856120
> total tree bytes: 5170151424
> total fs tree bytes: 178700288
> total extent tree bytes: 209616896
> btree space waste bytes: 182357204
> file data blocks allocated: 5073330888704
>  referenced 5052040339456
> 
> 
> 
> pwm@europium:~$ sudo btrfs scrub status /mnt/snap_04/
> scrub status for c46df8fa-03db-4b32-8beb-5521d9931a31
> scrub started at Mon Jul 31 21:26:50 2017 and finished after
> 06:53:47
> total bytes scrubbed: 4.60TiB with 0 errors
> 
> 
> 
> So where have my 5TB disk space gone lost?
> And what should I do to be able to get it back again?
> 
> I could obviously reformat the partition and rebuild the parity
> since I still have one good parity, but that doesn't feel like a
> good route. It isn't impossible this might happen again.
> 
> /Per W

-- 
Hugo Mills | Well, sir, the floor is yours. But remember, the
hugo@... carfax.org.uk | roof is ours!
http://carfax.org.uk/  |
PGP: E2AB1DE4  | The Goons




Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Hugo Mills
On Fri, Jul 28, 2017 at 06:20:14PM +, William Muriithi wrote:
> Hi Roman,
> 
> > autodefrag
> 
> This sure sounded like a good thing to enable? on paper? right?...
> 
> The moment you see anything remotely weird about btrfs, this is the first 
> thing you have to disable and retest without. Oh wait, the first would be 
> qgroups, this one is second.
> 
> What's the problem with autodefrag?  I am also using it, so you caught my 
> attention when you implied that it shouldn't be used.  According to docs, it 
> seem like one of the very mature feature of the filesystem.  See below for 
> the doc I am referring to 
> 
> https://btrfs.wiki.kernel.org/index.php/Status
> 
> I am using it as I assumed it could prevent the filesystem being too 
> fragmented long term, but never thought there was a price to pay for using it

   It introduces additional I/O on writes, as it modifies a small area
surrounding any write or cluster of writes.

   I'm not aware of it causing massive slowdowns, in the way that
qgroups do in some situations.

   If your system is already marginal in terms of being able to
support the I/O required, then turning on autodefrag will make things
worse (but you may be heading for _much_ worse performance in the
future as the FS becomes more fragmented -- depending on your write
patterns and use case).
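
   If you want to try running without it for comparison, autodefrag
can be switched off on a live system -- a sketch, with the mountpoint
assumed:

mount -o remount,noautodefrag /mnt

(or just drop the option from fstab and remount).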

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 6:
hugo@... carfax.org.uk | Mature Student
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Raid0 rescue

2017-07-27 Thread Hugo Mills
On Thu, Jul 27, 2017 at 03:43:37PM -0400, Alan Brand wrote:
> Correct, I should have said 'superblock'.
> It is/was raid0.  Funny thing is that this all happened when I was
> prepping to convert to raid1.

   If your metadata was also RAID-0, then your filesystem is almost
certainly toast. If any part of the btrfs metadata was overwritten by
some of the NTFS metadata, then the FS will be broken (somewhere) and
probably not in a fixable way.

> running a btrfs-find-root shows this (which gives me hope)
> Well block 4871870791680(gen: 73257 level: 1) seems good, but
> generation/level doesn't match, want gen: 73258 level: 1
> Well block 4639933562880(gen: 73256 level: 1) seems good, but
> generation/level doesn't match, want gen: 73258 level: 1
> Well block 4639935168512(gen: 73255 level: 1) seems good, but
> generation/level doesn't match, want gen: 73258 level: 1
> Well block 4639926239232(gen: 73242 level: 0) seems good, but
> generation/level doesn't match, want gen: 73258 level: 1
> 
> but when I run btrfs
> inspect-internal dump-tree -r /dev/sdc1
> 
> checksum verify failed on 874856448 found 5A85B5D9 wanted 17E3CB7D
> checksum verify failed on 874856448 found 5A85B5D9 wanted 17E3CB7D
> checksum verify failed on 874856448 found 2204C752 wanted C6ADDF7E
> checksum verify failed on 874856448 found 2204C752 wanted C6ADDF7E
> bytenr mismatch, want=874856448, have=8568478783891655077

   This would suggest that some fairly important part of the metadata
was damaged. You'll probably spend far less effort recovering the data
by restoring your backups than trying to fix this.
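
   (If you want to try salvaging some files first before rebuilding,
btrfs restore can sometimes copy data out of an unmountable FS without
writing to the source device -- a sketch, with the destination
directory assumed:

btrfs restore /dev/sdc1 /mnt/recovery

-- but given RAID-0 data with one device wiped, expect most files to
come out incomplete anyway.)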

   Hugo.

> root tree: 4871875543040 level 1
> chunk tree: 20971520 level 1
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 4871875559424 level 2
> device tree key (DEV_TREE ROOT_ITEM 0) 4635801976832 level 1
> fs tree key (FS_TREE ROOT_ITEM 0) 4871870414848 level 3
> checksum tree key (CSUM_TREE ROOT_ITEM 0) 4871876034560 level 3
> uuid tree key (UUID_TREE ROOT_ITEM 0) 29376512 level 0
> checksum verify failed on 728891392 found 75E2752C wanted D6CA4FB4
> checksum verify failed on 728891392 found 75E2752C wanted D6CA4FB4
> checksum verify failed on 728891392 found F4F3A4AD wanted E6D063C7
> checksum verify failed on 728891392 found 75E2752C wanted D6CA4FB4
> bytenr mismatch, want=728891392, have=269659807399918462
> total bytes 5000989728768
> bytes used 3400345264128
> 
> 
> 
> On Thu, Jul 27, 2017 at 11:10 AM, Hugo Mills <h...@carfax.org.uk> wrote:
> > On Thu, Jul 27, 2017 at 10:49:37AM -0400, Alan Brand wrote:
> >> I know I am screwed but hope someone here can point at a possible solution.
> >>
> >> I had a pair of btrfs drives in a raid0 configuration.  One of the
> >> drives was pulled by mistake, put in a windows box, and a quick NTFS
> >> format was done.  Then much screaming occurred.
> >>
> >> I know the data is still there.
> >
> >Well, except for all the parts overwritten by a blank NTFS metadata
> > structure.
> >
> >>   Is there anyway to rebuild the raid
> >> bringing in the bad disk?  I know some info is still good, for example
> >> metadata0 is corrupt but 1 and 2 are good.
> >
> >I assume you mean superblock there.
> >
> >> The trees look bad which is probably the killer.
> >
> >We really should improve the error messages at some point. Whatever
> > you're inferring from the kernel logs is probably not quite right. :)
> >
> >What's the metadata configuration on this FS? Also RAID-0? or RAID-1?
> >
> >> I can't run a normal recovery as only half of each file is there.
> >
> >Welcome to RAID-0...
> >
> >Hugo.
> >

-- 
Hugo Mills | Great oxymorons of the world, no. 1:
hugo@... carfax.org.uk | Family Holiday
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Raid0 rescue

2017-07-27 Thread Hugo Mills
On Thu, Jul 27, 2017 at 10:49:37AM -0400, Alan Brand wrote:
> I know I am screwed but hope someone here can point at a possible solution.
> 
> I had a pair of btrfs drives in a raid0 configuration.  One of the
> drives was pulled by mistake, put in a windows box, and a quick NTFS
> format was done.  Then much screaming occurred.
> 
> I know the data is still there.

   Well, except for all the parts overwritten by a blank NTFS metadata
structure.

>   Is there anyway to rebuild the raid
> bringing in the bad disk?  I know some info is still good, for example
> metadata0 is corrupt but 1 and 2 are good.

   I assume you mean superblock there.

> The trees look bad which is probably the killer.

   We really should improve the error messages at some point. Whatever
you're inferring from the kernel logs is probably not quite right. :)

   What's the metadata configuration on this FS? Also RAID-0? or RAID-1?

> I can't run a normal recovery as only half of each file is there.

   Welcome to RAID-0...

   Hugo.

-- 
Hugo Mills | We don't just borrow words; on occasion, English has
hugo@... carfax.org.uk | pursued other languages down alleyways to beat them
http://carfax.org.uk/  | unconscious and rifle their pockets for new
PGP: E2AB1DE4  | vocabulary.   James D. Nicoll




Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 08:36:54AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-26 08:27, Hugo Mills wrote:
> >On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> >>On 2017-07-25 17:45, Hugo Mills wrote:
> >>>On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> >>>>
> >>>>
> >>>>Hugo Mills wrote:
> >>>>>
> >>>>>>>You can see about the disk usage in different scenarios with the
> >>>>>>>online tool at:
> >>>>>>>
> >>>>>>>http://carfax.org.uk/btrfs-usage/
> >>>>>>>
> >>>>>>>Hugo.
> >>>>>>>
> >>>>As a side note, have you ever considered making this online tool
> >>>>(that should never go away just for the record) part of btrfs-progs
> >>>>e.g. a proper tool? I use it quite often (at least several times
> >>>>per month) and I would love for this to be a visual tool
> >>>>'btrfs-space-calculator' would be a great name for it I think.
> >>>>
> >>>>Imagine how nice it would be to run
> >>>>
> >>>>btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> >>>>/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> >>>>something similar to my example below (no accuracy intended)
> >>>
> >>>It's certainly a thought. I've already got the algorithm written
> >>>up. I'd have to resurrect my C skills, though, and it's a long way
> >>>down my list of things to do. :/
> >>>
> >>>Also on the subject of this tool, I'd like to make it so that the
> >>>parameters get set in the URL, so that people can copy-paste the URL
> >>>of the settings they've got into IRC for discussion. However, that
> >>>would involve doing more JavaScript, which is possibly even lower down
> >>>my list of things to do than starting doing C again...
> >
> >>Is the core logic posted somewhere?  Because if I have some time, I
> >>might write up a quick Python script to do this locally (it may not
> >>be as tightly integrated with the regular tools, but I can count on
> >>half a hand how many distros don't include Python by default).
> >
> >If it's going to be done in python, I might as well do it myself --
> >I can do python with my eyes closed. It's just C and JS I'm rusty with.
> Same here ironically :)
> >
> >There is a write-up of the usable-space algorithm somewhere. I
> >wrote it up in detail (with pseudocode) in a mail on this list. I've
> >also got several pages of LaTeX somewhere where I tried and failed to
> >prove the correctness of the formula. I'll see if I can dig them out
> >this evening.
> It looks like the Message-ID for the one on the mailing list is
> <20160311221703.gj17...@carfax.org.uk>
> I had forgotten that I'd archived that with the intent of actually
> doing something with it eventually...

   Here's the write-up of my attempted proof of the optimality of the
current allocator algorithm:

http://carfax.org.uk/files/temp/btrfs-allocator-draft.pdf

   Section 1 is a general (allocator-agnostic) description of the
process. Section 2 finds a bound on how well _any_ allocator can
do. That's the formula (eq 9) used in the online btrfs-usage
tool. Section 3 describes the current allocator. Section 4 is a failed
attempt at proving that the algorithm achieves the bound from section
2. I wasn't able to complete the proof.

   Hugo.

-- 
Hugo Mills | Great films about cricket: Interview with the Umpire
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 12:27:20PM +, Hugo Mills wrote:
> On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> > On 2017-07-25 17:45, Hugo Mills wrote:
> > >On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> > >>
> > >>
> > >>Hugo Mills wrote:
> > >>>
> > >>>>>You can see about the disk usage in different scenarios with the
> > >>>>>online tool at:
> > >>>>>
> > >>>>>http://carfax.org.uk/btrfs-usage/
> > >>>>>
> > >>>>>Hugo.
> > >>>>>
> > >>As a side note, have you ever considered making this online tool
> > >>(that should never go away just for the record) part of btrfs-progs
> > >>e.g. a proper tool? I use it quite often (at least several times
> > >>per month) and I would love for this to be a visual tool
> > >>'btrfs-space-calculator' would be a great name for it I think.
> > >>
> > >>Imagine how nice it would be to run
> > >>
> > >>btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> > >>/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> > >>something similar to my example below (no accuracy intended)
> > >
> > >It's certainly a thought. I've already got the algorithm written
> > >up. I'd have to resurrect my C skills, though, and it's a long way
> > >down my list of things to do. :/
> > >
> > >Also on the subject of this tool, I'd like to make it so that the
> > >parameters get set in the URL, so that people can copy-paste the URL
> > >of the settings they've got into IRC for discussion. However, that
> > >would involve doing more JavaScript, which is possibly even lower down
> > >my list of things to do than starting doing C again...
> 
> > Is the core logic posted somewhere?  Because if I have some time, I
> > might write up a quick Python script to do this locally (it may not
> > be as tightly integrated with the regular tools, but I can count on
> > half a hand how many distros don't include Python by default).
> 
>If it's going to be done in python, I might as well do it myself --
> I can do python with my eyes closed. It's just C and JS I'm rusty with.
> 
>There is a write-up of the usable-space algorithm somewhere. I
> wrote it up in detail (with pseudocode) in a mail on this list. I've
> also got several pages of LaTeX somewhere where I tried and failed to
> prove the correctness of the formula. I'll see if I can dig them out
> this evening.

   Oh, and of course there's the JS from the website... that's not
minified, and should be readable (if not particularly well-commented).

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison




Re: btrfs raid assurance

2017-07-26 Thread Hugo Mills
On Wed, Jul 26, 2017 at 08:12:19AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-25 17:45, Hugo Mills wrote:
> >On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> >>
> >>
> >>Hugo Mills wrote:
> >>>
> >>>>>You can see about the disk usage in different scenarios with the
> >>>>>online tool at:
> >>>>>
> >>>>>http://carfax.org.uk/btrfs-usage/
> >>>>>
> >>>>>Hugo.
> >>>>>
> >>As a side note, have you ever considered making this online tool
> >>(that should never go away just for the record) part of btrfs-progs
> >>e.g. a proper tool? I use it quite often (at least several times
> >>per month) and I would love for this to be a visual tool
> >>'btrfs-space-calculator' would be a great name for it I think.
> >>
> >>Imagine how nice it would be to run
> >>
> >>btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> >>/dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> >>something similar to my example below (no accuracy intended)
> >
> >It's certainly a thought. I've already got the algorithm written
> >up. I'd have to resurrect my C skills, though, and it's a long way
> >down my list of things to do. :/
> >
> >Also on the subject of this tool, I'd like to make it so that the
> >parameters get set in the URL, so that people can copy-paste the URL
> >of the settings they've got into IRC for discussion. However, that
> >would involve doing more JavaScript, which is possibly even lower down
> >my list of things to do than starting doing C again...

> Is the core logic posted somewhere?  Because if I have some time, I
> might write up a quick Python script to do this locally (it may not
> be as tightly integrated with the regular tools, but I can count on
> half a hand how many distros don't include Python by default).

   If it's going to be done in python, I might as well do it myself --
I can do python with my eyes closed. It's just C and JS I'm rusty with.

   There is a write-up of the usable-space algorithm somewhere. I
wrote it up in detail (with pseudocode) in a mail on this list. I've
also got several pages of LaTeX somewhere where I tried and failed to
prove the correctness of the formula. I'll see if I can dig them out
this evening.

   Hugo.

-- 
Hugo Mills | How do you become King? You stand in the marketplace
hugo@... carfax.org.uk | and announce you're going to tax everyone. If you
http://carfax.org.uk/  | get out alive, you're King.
PGP: E2AB1DE4  |Harry Harrison




Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 11:29:13PM +0200, waxhead wrote:
> 
> 
> Hugo Mills wrote:
> >
> >>>You can see about the disk usage in different scenarios with the
> >>>online tool at:
> >>>
> >>>http://carfax.org.uk/btrfs-usage/
> >>>
> >>>Hugo.
> >>>
> As a side note, have you ever considered making this online tool
> (that should never go away just for the record) part of btrfs-progs
> e.g. a proper tool? I use it quite often (at least several times
> per month) and I would love for this to be a visual tool
> 'btrfs-space-calculator' would be a great name for it I think.
> 
> Imagine how nice it would be to run
> 
> btrfs-space-calculator -mraid1 -draid10 /dev/sda1 /dev/sdb1
> /dev/sdc2 /dev/sdd2 /dev/sde3 for example and instantly get
> something similar to my example below (no accuracy intended)

   It's certainly a thought. I've already got the algorithm written
up. I'd have to resurrect my C skills, though, and it's a long way
down my list of things to do. :/

   Also on the subject of this tool, I'd like to make it so that the
parameters get set in the URL, so that people can copy-paste the URL
of the settings they've got into IRC for discussion. However, that
would involve doing more JavaScript, which is possibly even lower down
my list of things to do than starting doing C again...

   Hugo.

> d=data
> m=metadata
> .=unusable
> 
> {  500mb} [|d|] /dev/sda1
> { 3000mb} [|d|m|m|m|m|mm...|] /dev/sdb1
> { 3000mb} [|d|m|m|m|m|mmm..|] /dev/sdc2
> { 5000mb}
> [|d|m|m|m|m|m|m|m|m|m|]
> /dev/sdb1
> 
> {11500mb} Total space
> 
> usable for data (raid10): 1000mb / 2000mb
> usable for metadata (raid1): 4500mb / 9000mb
> unusable: 500mb
> 
> Of course this would have to change one (if ever) subvolumes can
> have different raid levels etc, but I would have loved using
> something like this instead of jumping around carfax abbey (!) at
> night.

   The core algorithm for the tool actually works pretty well for
dealing with different RAID levels, as long as you know how much of
each kind of data you're going to be using. (Although it's actually
path-dependent -- write 100 GB of RAID-0 then 100 GB of RAID-1 can
have different results than if you write them in the opposite order --
but that's a kind of edge effect).

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 4:
hugo@... carfax.org.uk | Future Perfect
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 10:55:18AM -0300, Hérikz Nawarro wrote:
> And btw, my current disk conf is 1x 500GB, 2x 3TB and 1x 5TB.

   OK, so by my mental arithmetic(*), you'd get:

 -  9.5  TB usable in RAID-0
 - 11.5  TB usable in single mode
 -  5.75 TB usable in RAID-1
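
   (Spelled out, for checking: the total is 0.5 + 3 + 3 + 5 =
11.5 TB, which is what single mode can use. RAID-0 can't use the
difference between the largest and second-largest device, so
11.5 - (5 - 3) = 9.5 TB. RAID-1 keeps two copies, and since the
largest device (5 TB) is smaller than the others added together
(6.5 TB), you get 11.5 / 2 = 5.75 TB.)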

   Hugo.

(*) Which may be a bit wobbly. :)

> 2017-07-25 10:51 GMT-03:00 Hugo Mills <h...@carfax.org.uk>:
> > On Tue, Jul 25, 2017 at 01:46:56PM +, Hugo Mills wrote:
> >> On Tue, Jul 25, 2017 at 09:55:37AM -0300, Hérikz Nawarro wrote:
> >> > Hello everyone,
> >> >
> >> > I'm migrating to btrfs and I would like to know, in a btrfs filesystem
> >> > with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
> >> > drives can I lose without losing the entire array?
> >
> >Oh, and one other thing -- if you have different-sized devices,
> > RAID-0 is probably the wrong thing to be using anyway, as you won't be
> > able to use the difference between the largest and second-largest
> > device. If you want to use all the space on the available devices,
> > then "single" mode is probably better (although you still lose a lot
> > of data if a device breaks), or RAID-1 (which will cope well with the
> > different sizes as long as the largest device is smaller than the rest
> > of them added together).
> >
> >You can see about the disk usage in different scenarios with the
> > online tool at:
> >
> > http://carfax.org.uk/btrfs-usage/
> >
> >Hugo.
> >

-- 
Hugo Mills | One of these days, I'll catch that man without a
hugo@... carfax.org.uk | quotation, and he'll look undressed.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Leto Atreides, Dune




Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 01:46:56PM +, Hugo Mills wrote:
> On Tue, Jul 25, 2017 at 09:55:37AM -0300, Hérikz Nawarro wrote:
> > Hello everyone,
> > 
> > I'm migrating to btrfs and I would like to know, in a btrfs filesystem
> > with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
> > drives can I lose without losing the entire array?

   Oh, and one other thing -- if you have different-sized devices,
RAID-0 is probably the wrong thing to be using anyway, as you won't be
able to use the difference between the largest and second-largest
device. If you want to use all the space on the available devices,
then "single" mode is probably better (although you still lose a lot
of data if a device breaks), or RAID-1 (which will cope well with the
different sizes as long as the largest device is smaller than the rest
of them added together).

   You can see about the disk usage in different scenarios with the
online tool at:

http://carfax.org.uk/btrfs-usage/

   Hugo.

-- 
Hugo Mills | One of these days, I'll catch that man without a
hugo@... carfax.org.uk | quotation, and he'll look undressed.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Leto Atreides, Dune




Re: btrfs raid assurance

2017-07-25 Thread Hugo Mills
On Tue, Jul 25, 2017 at 09:55:37AM -0300, Hérikz Nawarro wrote:
> Hello everyone,
> 
> I'm migrating to btrfs and I would like to know, in a btrfs filesystem
> with 4 disks (multiple sizes) with -d raid0 & -m raid1, how many
> drives can I lose without losing the entire array?

   You can lose one device in the array, and the FS structure will be
OK -- it will still mount, and you'll be able to see all the filenames
and directory structures and so on.

   However, if you do lose one device, then you'll lose
(approximately) half of the bytes in all of your files, most likely in
alternating 64k slices in each file. Attempting to read the missing
parts will result in I/O errors being returned from the filesystem.

   So, while the FS is in theory still fine as a (probably read-only)
filesystem, it's actually going to be *completely* useless with a
missing device, because none of your file data will be usably intact.

   If you want the FS to behave well when you lose a device, you'll
need some kind of actual redundancy in the data storage part -- RAID-1
would be my recommendation (it stores two copies of each piece of
data, so you can lose up to one device and still be OK).
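
   If you've already made the filesystem, the data profile can be
converted in place with a balance filter -- a sketch, mountpoint
assumed:

btrfs balance start -dconvert=raid1 /mnt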

   Hugo.

-- 
Hugo Mills | One of these days, I'll catch that man without a
hugo@... carfax.org.uk | quotation, and he'll look undressed.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Leto Atreides, Dune




Re: Best Practice: Add new device to RAID1 pool

2017-07-24 Thread Hugo Mills
On Mon, Jul 24, 2017 at 02:55:00PM -0600, Chris Murphy wrote:
> On Mon, Jul 24, 2017 at 2:42 PM, Hugo Mills <h...@carfax.org.uk> wrote:
> 
> >
> >In my experience, it's pretty consistent at about a minute per 1
> > GiB for data on rotational drives on RAID-1. For metadata, it can go
> > up to several hours (or more) per 256 MiB chunk, depending on what
> > kind of metadata it is. With extents shared between lots of files, it
> > slows down. In my case, with a few hundred snapshots of the same
> > thing, my system was taking 4h per chunk for the chunks full of the
> > extent tree.
> 
> Egads.
> 
> Maybe Cloud Admin ought to consider using a filter to just balance the
> data chunks across the three devices, and just leave the metadata on
> the original two disks?
> 
> Maybe
> 
> sudo btrfs balance start -dusage=100 

   It's certainly a plausible approach, yes.

   Or just wait it out -- the number of slow chunks is typically very
small. Note that most of the metadata will be csums (which are fast),
and not all of the other metadata chunks are slow ones.

   It would be interesting to know in this case the times of the
chunks that have been balanced to date (grep for the lines with the
chunk IDs in system logs).
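
   Something like this should find them -- a sketch; the exact
wording of the kernel message varies between versions:

dmesg | grep 'relocating block group'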

   Hugo.

-- 
Hugo Mills | Two things came out of Berkeley in the 1960s: LSD
hugo@... carfax.org.uk | and Unix. This is not a coincidence.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: Best Practice: Add new device to RAID1 pool

2017-07-24 Thread Hugo Mills
 size: 48.00KiB
> Inline data: 0.00B
> Total seeks: 2
> Forward seeks: 0
> Backward seeks: 2
> Avg seek len: 62.95MiB
> Total clusters: 1
> Avg cluster size: 0.00B
> Min cluster size: 0.00B
> Max cluster size: 16.00KiB
> Total disk spread: 125.86MiB
> Total read time: 0 s 19675 us
> Levels: 2
> [chris@f26s ~]$
> 
> 
> I don't think the number of snapshots you have for Docker containers
> is the problem. There's this thread (admittedly on SSD) which suggests
> decent performance is possible with thousands of containers per day
> (100,000 - 200,000 per day but I don't think that's per file system,
> I'm actually not sure how many file systems are involved).
> 
> https://www.spinics.net/lists/linux-btrfs/msg67308.html
> 
> 
> 

-- 
Hugo Mills | Two things came out of Berkeley in the 1960s: LSD
hugo@... carfax.org.uk | and Unix. This is not a coincidence.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs device ready purpose

2017-07-22 Thread Hugo Mills
On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
> I just did an additional test that's pretty icky behavior.
> 
> 2x HDD device Btrfs volume. Add both devices and `btrfs device ready`
> exits with 0 as expected. Physically remove both USB devices.
> Reconnect one device. `btrfs device ready` still exits 0. That's
> definitely not good. (If I leave that one device connected and reboot,
> `btrfs device ready` exits 1).

   In a slightly less-specific way, this has been a problem pretty
much since the inception of the FS. It's not possible to do the
reverse of the "scan" operation on a device -- that is, invalidate/
remove the device's record in the kernel. So, as you've discovered
here, if you have a device which is removed (overwritten, unplugged),
the kernel still thinks it's a part of the FS.

   It's something I recall being talked about a bit, some years ago. I
don't recall now why it was going to be useful, though. I think you
have a good use-case for such a new ioctl (or extension to the
SCAN_DEV ioctl) now, though.

   Hugo.

-- 
Hugo Mills | UNIX: Italian pen maker
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: interrupt btrfs filesystem defragment -r /

2017-07-08 Thread Hugo Mills
On Sat, Jul 08, 2017 at 01:34:44PM +0200, David Arendt wrote:
> Hi,
> 
> Is it safe to interrupt a btrfs filesystem defrag -r / by using ctrl-c
> or should it be avoided ?

   Yes, it's safe.

   Hugo.

-- 
Hugo Mills | Klytus, I'm bored. What plaything can you offer me
hugo@... carfax.org.uk | today?
http://carfax.org.uk/  |
PGP: E2AB1DE4  |  Ming the Merciless, Flash Gordon




Re: Exactly what is wrong with RAID5/6

2017-06-20 Thread Hugo Mills
can create a RAID10 profile that requires a minimum of
> four disks?

   Yes. RAID-10 will work on any number of devices (>=4), not just an
even number. Obviously, if you have a 6-device array and lose one,
you'll need to deal with the loss of redundancy -- either add a new
device and rebalance, or replace the missing device with a new one, or
(space permitting) rebalance with existing devices.
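
   The replace form looks something like this -- a sketch, with the
devid of the missing device (from btrfs filesystem show) and the
mountpoint assumed:

btrfs replace start 3 /dev/sdf /mnt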

   Hugo.

-- 
Hugo Mills | Let me past! There's been a major scientific
hugo@... carfax.org.uk | break-in!
http://carfax.org.uk/  | Through! Break-through!
PGP: E2AB1DE4  |  Ford Prefect




Re: 4.11.3: BTRFS critical (device dm-1): unable to add free space :-17 => btrfs check --repair runs clean

2017-06-20 Thread Hugo Mills
On Tue, Jun 20, 2017 at 08:26:48AM -0700, Marc MERLIN wrote:
> On Tue, Jun 20, 2017 at 03:23:54PM +0000, Hugo Mills wrote:
> > On Tue, Jun 20, 2017 at 07:39:16AM -0700, Marc MERLIN wrote:
> > > My filesystem got remounted read only, and yet after a lengthy
> > > btrfs check --repair, it ran clean.
> > > 
> > > Any idea what went wrong?
> > > [846332.992285] WARNING: CPU: 4 PID: 4095 at 
> > > fs/btrfs/free-space-cache.c:1476 tree_insert_offset+0x78/0xb1
> > > [846333.744721] BTRFS critical (device dm-1): unable to add free space 
> > > :-17
> > > [847312.529660] BTRFS: Transaction aborted (error -17)
> > > [847313.218391] BTRFS: error (device dm-1) in 
> > > btrfs_run_delayed_refs:2961: errno=-17 Object already exists
> > 
> >Error 17 is EEXIST, so I'd guess (and it is a guess) that it's
> > trying to add a free space cache record for some space that already
> > has such a record. This might also match with:
>  
> Thanks for having a look. Is it a bug, or is it a problem with my storage
> subsystem?

   Well, I'd say it's probably a problem with some inconsistent data
on the disk. How that data got there is another matter -- it may be
due to a bug which wrote the inconsistent data some time ago, and has
only now been found out.

> > [...]
> > > gargamel:~# btrfs check --repair /dev/mapper/dshelf2
> > [...]
> > > cache and super generation don't match, space cache will be invalidated
> > [...]
> > 
> >I'd try clearing the cache (mount with -o clear_cache, once), and
> > then letting it rebuild.
> 
> "space cache will be invalidated " => doesn't that mean that my cache was
> already cleared by check --repair, or are you saying I need to clear it
> again?

   I'm never quite sure about that one. :)

   It can't hurt to clear it manually as well.
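
   That is, something like this, once -- a sketch, with the
mountpoint assumed:

mount -o clear_cache /dev/mapper/dshelf2 /mnt

and then remount without the option afterwards.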

   Hugo.

-- 
Hugo Mills | I believe that it's closely correlated with the
hugo@... carfax.org.uk | aeroswine coefficient
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Adrian Bridgett




Re: 4.11.3: BTRFS critical (device dm-1): unable to add free space :-17 => btrfs check --repair runs clean

2017-06-20 Thread Hugo Mills
] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> [863279.888096] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> [863279.918393] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> [863279.948740] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> [863279.979033] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> [863280.009362] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> [863280.040438] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> [863280.070966] BTRFS error (device dm-1): parent transid verify failed on 
> 1932065177600 wanted 37959 found 3634
> 

-- 
Hugo Mills | I believe that it's closely correlated with the
hugo@... carfax.org.uk | aeroswine coefficient
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Adrian Bridgett




Re: does using different uid/gid/forceuid/... mount options for different subvolumes work / does fuse.bindfs play nice with btrfs?

2017-06-20 Thread Hugo Mills
On Tue, Jun 20, 2017 at 04:35:48PM +0200, Alexander Peganz wrote:
> Hello everyone,
> 
> I intend to provide different "views" of the data stored on btrfs subvolumes.
> e.g. mount a subvolume in location A rw; and ro in location B while
> also overwriting uids, gids, and permissions.
> In the past I have been using fuse.bindfs for this. Now I'm trying to
> find out if there is a 'native' way to do this in btrfs, and if it
> works by design or by accident (I wouldn't want to rely on something
> that might go away in a newer kernel version).
> 
> So here goes:
> Do different uid/gid/... mount options for different subvolumes work?

   No. uid= and gid= mount options don't work at all for btrfs (or on
most other filesystems with their own concept of file ownership). You
can't arbitrarily change the ownership of files like this -- except
for filesystems like FAT, which don't have the concept at all.

> Or does the first mounted subvolume "win"?

   See above. The options don't do anything.

> Can a subvolume be mounted more than once (I guess not, but btrfs
> might surprise me)?

   Yes.

> Is there some completely different way I don't know about to do this
> in btrfs that might not even have anything to do with mount options?

   If you want to do it, it'll _have_ to be done without mount
options, because the options you're proposing to use don't work.

   As far as I know, there's no mount option for this in btrfs (or any
other POSIX filesystem), and there are no plans to implement such a
feature.

> If there is no magical btrfs way, does fuse.bindfs play nice with
> btrfs or should I be worried?

   We haven't had any complaints about it that I'm aware of.
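
   (For reference, the kind of bindfs view you describe would look
something like this -- a sketch, with the paths, user and permissions
assumed:

bindfs -u media -g media -p 0550 /mnt/pool/subvol /srv/view-ro

-- and bindfs shouldn't care whether the underlying FS is btrfs.)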

   Hugo.

-- 
Hugo Mills | I believe that it's closely correlated with the
hugo@... carfax.org.uk | aeroswine coefficient
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Adrian Bridgett




Re: Filesystem won't mount (open_ctree failed) or repair (BUG_ON)

2017-06-09 Thread Hugo Mills
On Fri, Jun 09, 2017 at 09:12:16PM +0200, Koen Kooi wrote:
> Hi,
> 
> Today the kernel got wedged during shutdown (4.11.x tends to do that, haven't
> debugged) and I pressed the reset button. The next boot btrfs won't mount:
> 
> [Fri Jun  9 20:46:07 2017] BTRFS error (device md0): parent transid verify 
> failed on 5840011722752 wanted 170755 found 170832
> [Fri Jun  9 20:46:07 2017] BTRFS error (device md0): parent transid verify 
> failed on 5840011722752 wanted 170755 found 170832
> [Fri Jun  9 20:46:07 2017] BTRFS error (device md0): failed to read block 
> groups: -5
> [Fri Jun  9 20:46:08 2017] BTRFS error (device md0): open_ctree failed

   With a transid failure on mount, about the only thing that's likely
to work is mounting with -o usebackuproot. If that doesn't work, then
a rebuild of the FS is almost certainly needed.
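
   Concretely -- a sketch, with the mountpoint assumed; read-only
first is the cautious way to test it:

mount -o ro,usebackuproot /dev/md0 /mnt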

   Hugo.

> I tried repair, but that didn't work either:
> 
> # btrfsck --repair /dev/md0
> enabling repair mode
> couldn't open RDWR because of unsupported option features (3).
> ERROR: cannot open file system
> enabling repair mode
> 
> Googling around, it was suggested to clear the v2 space cache:
> 
> # btrfsck --mode=lowmem --clear-space-cache v2 /dev/md0
> parent transid verify failed on 5840011722752 wanted 170755 found 170832
> parent transid verify failed on 5840011722752 wanted 170755 found 170832
> parent transid verify failed on 5840011722752 wanted 170755 found 170832
> parent transid verify failed on 5840011722752 wanted 170755 found 170832
> Ignoring transid failure
> leaf parent key incorrect 5840011722752
> parent transid verify failed on 5367057465344 wanted 170755 found 170828
> parent transid verify failed on 5367057465344 wanted 170755 found 170828
> parent transid verify failed on 5367057465344 wanted 170755 found 170828
> parent transid verify failed on 5367057465344 wanted 170755 found 170828
> Ignoring transid failure
> leaf parent key incorrect 72105984
> btrfs unable to find ref byte nr 4628577484800 parent 0 root 10  owner 0 
> offset 1
> parent transid verify failed on 5366993256448 wanted 170755 found 170827
> parent transid verify failed on 5366993256448 wanted 170755 found 170827
> parent transid verify failed on 5366993256448 wanted 170755 found 170827
> parent transid verify failed on 5366993256448 wanted 170755 found 170827
> Ignoring transid failure
> leaf parent key incorrect 41287680
> ERROR: failed to clear free space cache v2: -1
> transaction.h:41: btrfs_start_transaction: BUG_ON `root->commit_root` 
> triggered, value 22938400
> btrfs check[0x411674]
> btrfs check(close_ctree_fs_info+0x125)[0x41368c]
> btrfs check(cmd_check+0x36d8)[0x45e8e8]
> btrfs check(main+0x15d)[0x40ac5c]
> /lib/libc.so.6(__libc_start_main+0xf0)[0x7f9b4cb060d0]
> btrfs check[0x40a729]
> Clear free space cache v2
> 
> The underlying md0 (raid6) doesn't report any errors, trying different 
> kernels makes no difference, 4.10.17, 4.11.4 and 4.12.0-rc4 all give the same 
> errors. Everything above was
> done with btrfs-progs 4.11.
> 
> Any hints on how I can fix the errors in the filesystem? I don't mind losing 
> today's changes, but I would like to keep all the older data :)
> 
> regards,
> 
> Koen
> 

-- 
Hugo Mills | Close enough for government work.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs native encryption

2017-06-09 Thread Hugo Mills
On Fri, Jun 09, 2017 at 08:50:12AM -0700, Filip Bystricky wrote:
> Dear btrfs maintainers,
> Google is evaluating btrfs for its potential use in android, but
> currently the lack of native file-based encryption unfortunately makes
> it a nonstarter. According to the FAQ (specifically the answer to
> "Does btrfs support encryption"), nobody is currently working on this.
> How up-to-date is that answer, and are there any new plans to add
> native FBE in the future?

   There were initial patches from Anand Jain back in September, but
they weren't well-received in terms of the (lack of) cryptography
design. IIRC, the patches provided file-level data encryption without
encrypting metadata. I haven't seen anything since then (although
Anand was planning on doing a session on btrfs encryption at LSF/MM in
March -- I don't know if that happened, or what the outcome was).

   So, there's some interest in a fairly minimal implementation, but
progress doesn't seem to be particularly fast.

   Hugo.

-- 
Hugo Mills | "Are you the man who rules the Universe?" "Well, I
hugo@... carfax.org.uk | try not to."
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Life, the Universe and Everything.




Re: BTRFS converted from EXT4 becomes read-only after reboot

2017-05-23 Thread Hugo Mills
On Tue, May 23, 2017 at 02:49:43PM -0700, Marc MERLIN wrote:
> On Tue, May 23, 2017 at 03:38:01PM -0600, Chris Murphy wrote:
> > > I've tried an ext4 to btrfs conversion 3 times in the last 3 years, it
> > > never worked properly any of those times, sadly.
> > 
> > Since the 4.6 total rewrite? There are also recent bug fixes related
> > to convert in the changelog, so it should be working now, and if there
> > are problems Qu is probably interested in getting them fixed.
> 
> It was a 4.9 kernel from debian.

   It's the userspace tools that make the difference here (and what
Chris was referring to). Conversion has nothing to do with the kernel.
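
   (The conversion itself is done offline by btrfs-convert from
btrfs-progs, run against the unmounted ext4 device -- a sketch, with
the device assumed:

btrfs-convert /dev/sdX1

-- so it's the btrfs-progs version that matters here.)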

   Hugo.

> The conversion looked like it worked, I rebooted ok, and then it got
> corrupted quickly after I deleted the subvolumes that had the old ext4 data.
> I've since wiped that disk and done a fresh btrfs install on it, because I
> had to get some work done :)
> 
> Marc

-- 
Hugo Mills | Essex: a branch of philothophy.
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Hugo Mills
On Fri, May 19, 2017 at 06:25:22PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:57:09AM +0000, Hugo Mills wrote:
> >I think from the POV of removing these BUG_ONs, it doesn't matter
> > which FS causes them. "All" you need to know is where the error
> > happened. From there, you can (in theory) work out what was wrong and
> > handle it more elegantly than simply stopping.
>  
> Sorry, "you" being the code author, or the user?

   Author.

> If code author, I'd rather this be worked out without the extra steps of
> having to guess or spend more time to see which FS.

   As I understand it, it doesn't really matter which FS it comes
from. The question is: The kernel has hit this BUG_ON. What do you
actually want to do when this happens? You can't just bring the kernel
to a grinding halt (which is what BUG_ON does), so how do you handle
this more elegantly?

   It actually doesn't matter about the state of any specific FS that
caused this particular problem. The fact is, someone decided to check
on the FS's state, and punted the problem of handling the check's
failure to someone later (the BUG_ON). You(*)'ve got to pick up that punt
and deal with it more cleanly.

(*) You == some kernel developer.

> My FS takes up to a day to scrub and btrfs check, clearly making me do this
> over 3 of them is not a good use of time and a loss of up to 3 days of wall
> clock time.
> Not counting that during that time, I have loss of service on all my
> filesystems because I don't want to mount them read-write.
> 
> >Obviously it would be nice, from the POV of the sysadmin, to know
> > which FS was complaining, but as an FS developer it's secondary to
> > identifying a BUG_ON which happens in real life, which offers an
> > opportunity to make the error path more elegant.
> 
> If the FS is remounted R/O, further damage is averted and it's obvious to
> the admin which FS has a problem.
> 
> Is there a reason why all errors that are serious enough do not cause the
> FS to remount R/O instead of having any BUG/BUG_ON at all?

   Simply that it's easier to write a BUG_ON than to write the code to
bubble up a failure to the point that the FS can be made RO. This is a
clean-up kind of process: BUG_ONs should mostly be changed into a
proper error-handling path leading to remount-RO (in the worst
cases). As I understand it, it's not massively difficult, but it's
probably non-trivial effort to get right in each case.

> WARN_ON is also fine obviously if the error is not serious enough, or doing
> a WARN_ON + a remount R/O

   Sure, but everything should really be turned into either a proper
error-handling path (most likely remount RO), or explicitly defined as
BUG_ON (i.e. "this must never happen -- if it does, then the hardware
is fucked up, and we're not responsible for the consequences"). It's
that latter definition that's part of the hard decision-making process
for the kernel dev.

   Hugo.

-- 
Hugo Mills | Great oxymorons of the world, no. 7:
hugo@... carfax.org.uk | The Simple Truth
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: 4.11.0: kernel BUG at fs/btrfs/ctree.h:1779!

2017-05-19 Thread Hugo Mills
On Fri, May 19, 2017 at 05:47:48PM -0700, Marc MERLIN wrote:
> On Sat, May 20, 2017 at 12:37:47AM +0000, Hugo Mills wrote:
> > > Can I make another plea for just removing all those BUG/BUG_ON?
> > > They really have no place in production code, there is no excuse for a
> > > filesystem to bring down the entire and in the process not even tell you
> > > which of your filesystems had the issue to start with.
> > > 
> > > Could this be made part of a cleanup for this build to remove them all?
> > 
> >The removal of these has been an ongoing process for at least the
> > last 5 years.
>  
> That's great news, thanks. I guess I'm a bit edgy because I've hit too many
> of them already :) but glad to hear that there are a lot fewer now.
> 
> >I don't understand the specifics of the kernel code in question(*),
> > but compared to 5 years ago, btrfs has got rid of most of the
> > BUG_ONs(**) a few years ago. The remaining ones are probably
> > complicated to deal with in any way more elegant than just stopping.
> 
> The biggest problem is that those BUG* do not even tell you where the
> problem is.
> The assumption that you'd only ever have a single btrfs filesystem mounted
> is flawed to say the least :)
> (I have 5 different ones on my server)

   I think from the POV of removing these BUG_ONs, it doesn't matter
which FS causes them. "All" you need to know is where the error
happened. From there, you can (in theory) work out what was wrong and
handle it more elegantly than simply stopping.

   Obviously it would be nice, from the POV of the sysadmin, to know
which FS was complaining, but as an FS developer it's secondary to
identifying a BUG_ON which happens in real life, which offers an
opportunity to make the error path more elegant.

> >I recall seeing someone's stats on BUG_ON locations a couple of
> > years ago, and btrfs had managed to get the number of locations down
> > below XFS (but no other FS). It's a kind of success, at least...
> 
> Good to know, thanks, and thanks to anyone who has worked on removing those.

   I don't know what the current state is. Maybe someone on IRC will
be able to do the greps/stats to give proper numbers to it.

   Hugo.

-- 
Hugo Mills | IMPROVE YOUR ORGANISMS!!
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Subject line of spam email




  1   2   3   4   5   6   7   8   9   10   >