[PATCH v2] fstests: test regression of -EEXIST on creating new file after log replay

2018-03-10 Thread Liu Bo
The regression was introduced into btrfs in Linux v4.4: after log replay, the
filesystem refuses to create new files, returning -EEXIST.

Although the problem is btrfs-specific, the test itself contains nothing
btrfs-specific, so it is made generic.
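
To run just this test from an fstests checkout (assuming SCRATCH_DEV and
friends are configured in local.config):

./check generic/481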

The kernel fix is
  Btrfs: fix unexpected -EEXIST when creating new inode

Reviewed-by: Filipe Manana 
Signed-off-by: Liu Bo 
---
v2: - Remove failed message from 481.out
    - Drop the unnecessary write in creating a file

 tests/generic/481 | 75 +++
 tests/generic/481.out |  2 ++
 tests/generic/group   |  1 +
 3 files changed, 78 insertions(+)
 create mode 100755 tests/generic/481
 create mode 100644 tests/generic/481.out

diff --git a/tests/generic/481 b/tests/generic/481
new file mode 100755
index 000..6a7e9dd
--- /dev/null
+++ b/tests/generic/481
@@ -0,0 +1,75 @@
+#! /bin/bash
+# FSQA Test No. 481
+#
+# Reproduce a regression of btrfs that leads to -EEXIST on creating new files
+# after log replay.
+#
+# The kernel fix is
+#   Btrfs: fix unexpected -EEXIST when creating new inode
+#
+#-----------------------------------------------------------------------
+#
+# Copyright (C) 2018 Oracle. All Rights Reserved.
+# Author: Bo Liu 
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write to the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   _cleanup_flakey
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_dm_target flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >>$seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# create a file and fsync it so it is kept in the write-ahead log
+$XFS_IO_PROG -f -c "fsync" $SCRATCH_MNT/foo
+
+# fail this filesystem so that remount can replay the write-ahead log
+_flakey_drop_and_remount
+
+# see if we can create a new file successfully
+touch $SCRATCH_MNT/bar
+
+_unmount_flakey
+
+echo "Silence is golden"
+
+status=0
+exit
diff --git a/tests/generic/481.out b/tests/generic/481.out
new file mode 100644
index 000..206e116
--- /dev/null
+++ b/tests/generic/481.out
@@ -0,0 +1,2 @@
+QA output created by 481
+Silence is golden
diff --git a/tests/generic/group b/tests/generic/group
index ea2056b..05f60f2 100644
--- a/tests/generic/group
+++ b/tests/generic/group
@@ -483,3 +483,4 @@
 478 auto quick
 479 auto quick metadata
 480 auto quick metadata
+481 auto quick
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-10 Thread Duncan
Andrei Borzenkov posted on Sat, 10 Mar 2018 13:27:03 +0300 as excerpted:


> And "missing" is not the answer because I obviously may have more than
> one missing device.

"missing" is indeed the answer when using btrfs device remove.  See the 
btrfs-device manpage, which explains that if there's more than one device 
missing, either just the first one described by the metadata will be 
removed (if missing is only specified once), or missing can be specified 
multiple times.
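
For example (a minimal sketch; the device name and mount point are 
hypothetical), for a raid6 with two dead members and a surviving device 
at /dev/sdb1:

mount -o degraded /dev/sdb1 /mnt
btrfs device remove missing missing /mnt   # one "missing" per absent device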

raid6 with two devices missing is the only normal candidate for that at 
present, though on-list we've also seen aborted-add cases where it still 
worked: while the metadata listed the new device, it didn't actually hold 
any data yet when it became apparent it was bad and thus needed to be 
removed again.

Note that because btrfs raid1 and raid10 only do two-way mirroring 
regardless of the number of devices, and because of the per-chunk (as 
opposed to per-device) nature of btrfs raid10, those modes can only expect 
successful recovery with a single missing device. That said, as mentioned 
above, we've seen on-list at least one case where an aborted device-add of 
a device found to be bad after the add didn't actually have anything on 
it, so it could still be removed along with the device it was originally 
intended to replace.

Of course the N-way-mirroring mode, whenever it eventually gets 
implemented, will allow up to N-1 missing devices, and an N-way-parity 
mode, if it's ever implemented, similar. But N-way-mirroring was scheduled 
for after raid56 mode so it could make use of some of the same code, and 
raid56 has of course taken years on years to get merged and stabilize. 
There's no sign yet of N-way-mirroring patches, which, based on the raid56 
case, could take years to stabilize and debug after the original merge, so 
the still somewhat iffy raid6 mode is likely to remain the only normal 
usage of multiple "missing" for years yet.

For btrfs replace, the manpage says the device ID is the only way to 
handle a missing device, but getting that ID, as you've indicated, could 
be difficult.  For 
filesystems with only a few devices that haven't had any or many device 
config changes, it should be pretty easy to guess (a two device 
filesystem with no changes should have IDs 1 and 2, so if only one is 
listed, the other is obvious, and a 3-4 device fs with only one or two 
previous device changes, likely well remembered by the admin, should 
still be reasonably easy to guess), but as the number of devices and the 
number of device adds/removes/replaces increases, finding/guessing the 
missing one becomes far more difficult.
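
As a sketch of that workflow (the device names and the guessed devid 2 are 
hypothetical):

mount -o degraded /dev/sdb1 /mnt
btrfs filesystem show /mnt            # surviving devids are listed; the gap is the missing one
btrfs replace start 2 /dev/sdd /mnt   # assuming the missing disk had devid 2
btrfs replace status /mnt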

Of course the sysadmin's first rule of backups states, in simple form, 
that not having one == defining the value of the data as trivial, not 
worth the trouble of a backup. That in turn means that at some point, 
before there are /too/ many device-change events, it's likely going to be 
less trouble (particularly after factoring in reliability) to restore from 
backups to a fresh filesystem than it is to do yet another device change. 
Together with the current practical limits btrfs imposes on the number of 
missing devices, that tends to impose /some/ limit on the possibilities 
for missing device IDs. So the situation, while not ideal, isn't yet 
/entirely/ out of hand either, because a successful guess based on 
available information should be possible without /too/ many attempts.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Miguel Ojeda
On Sat, Mar 10, 2018 at 6:51 PM, Linus Torvalds
 wrote:
>
> So in *historical* context - when a compiler didn't do variable length
> arrays at all - the original semantics of C "constant expressions"
> actually make a ton of sense.
>
> You can basically think of a constant expression as something that can
> be (historically) handled by the front-end without any real
> complexity, and no optimization phases - just evaluating a simple
> parse tree with purely compile-time constants.
>
> So there's a good and perfectly valid reason for why C limits certain
> expressions to just a very particular subset. It's not just array
> sizes, it's  case statements etc too. And those are very much part of
> the C standard.
>
> So an error message like
>
>warning: ISO C90 requires array sizes to be constant-expressions
>
> would be technically correct and useful from a portability angle. It
> tells you when you're doing something non-portable, and should be
> automatically enabled with "-ansi -pedantic", for example.
>
> So what's misleading is actually the name of the warning and the
> message, not that it happens. The warning isn't about "variable
> length", it's literally about the rules for what a
> "constant-expression" is.
>
> And in C, the expression (2,3) has a constant _value_ (namely 3), but
> it isn't a constant-expression as specified by the language.
>
> Now, the thing is that once you actually do variable length arrays,
> those old front-end rules make no sense any more (outside of the "give
> portability warnings" thing).
>
> Because once you do variable length arrays, you obviously _parse_
> everything just fine, and you're going to evaluate much more complex
> expressions than some limited constant-expression rule.
>
> And at that point it would make a whole lot more sense to add a *new*
> warning that basically says "I have to generate code for a
> variable-sized array", if you actually talk about VLA's.
>
> But that's clearly not what gcc actually did.
>
> So the problem really is that -Wvla doesn't actually warn about VLA's,
> but about something technically completely different.
>

I *think* I followed your reasoning. For gcc, -Wvla is the "I have to
generate code for a variable-sized array" one; but in this case, the
array size is the actual issue that you would have liked to be warned
about; since people writing:

int a[(2,3)];

did not really mean to declare a VLA. Therefore, you say warning them
about the "warning: ISO C90 requires array sizes to be
constant-expressions" (let's call it -Wpedantic-array-sizes) would be
more helpful here instead of saying stuff about VLAs.

In my case, I was just expecting gcc to give us both warnings and
that's it, instead of trying to be smart and giving only one of them
(the -Wpedantic-array-sizes one being the warning I was wondering about
in my previous email, i.e. why it was missing). I think it would be
clear enough if both warnings were shown at the same time. And it makes
sense, since if you write that line in ISO C90 mode there really are
2 things going wrong in the end (fishy syntax while in ISO C90 mode
and, due to that, VLA code generated as well), no?

Thanks for taking the time to write about the historical context, by the way!

Miguel

> And that's why those stupid syntactic issues with min/max matter. It's
> not whether the end result is a compile-time constant or not, it's
> about completely different issues, like whether there is a
> comma-expression in it.
>
>   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: zerofree btrfs support?

2018-03-10 Thread Christoph Anton Mitterer
On Sat, 2018-03-10 at 23:31 +0500, Roman Mamedov wrote:
> QCOW2 would add a second layer of COW
> on top of
> Btrfs, which sounds like a nightmare.

I've just seen there is even a nocow option "specifically" for btrfs...
it seems however that it doesn't disable the CoW of qcow, but rather
that of btrfs... (thus silently also the checksumming).
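
For reference, that creation option looks roughly like this -- a sketch,
with image name and size made up; on btrfs the result should show up as
the file's 'C' (nodatacow) attribute:

qemu-img create -f qcow2 -o nocow=on disk.qcow2 20G
lsattr disk.qcow2   # expect the 'C' attribute when the file lives on btrfs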


Does plain qcow2 really CoW on every write? I've always assumed it
would only CoW when one makes snapshots or so...


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: zerofree btrfs support?

2018-03-10 Thread Roman Mamedov
On Sat, 10 Mar 2018 16:50:22 +0100
Adam Borowski  wrote:

> Since we're on a btrfs mailing list, if you use qemu, you really want
> sparse format:raw instead of qcow2 or preallocated raw.  This also works
> great with TRIM.

Agreed, that's why I use RAW. QCOW2 would add a second layer of COW on top of
Btrfs, which sounds like a nightmare. Even if you would run those files as
NOCOW in Btrfs, somehow I feel FS-native COW is more efficient than emulating
it in userspace with special format files.

> > It works, just not with some of the QEMU virtualized disk device drivers.
> > You don't need to use qemu-img to manually dig holes either, it's all
> > automatic.
> 
> It works only with scsi and virtio-scsi drivers.  Most qemu setups use
> either ide (ouch!) or virtio-blk.

It works with IDE as well.

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to change/fix 'Received UUID'

2018-03-10 Thread Marc MERLIN
Thanks all for the help again.
I just wrote a blog post to explain the process to others should anyone
need this later.

http://marc.merlins.org/perso/btrfs/post_2018-03-09_Btrfs-Tips_-Rescuing-A-Btrfs-Send-Receive-Relationship.html

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/   | PGP 7F55D5F27AAF9D08
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 9:34 AM, Miguel Ojeda
 wrote:
>
> So the warning is probably implemented to just trigger whenever VLAs
> are used but the given standard does not allow them, for all
> languages. The problem is why the ISO C90 frontend is not giving an
> error for using invalid syntax for array sizes to begin with?

So in *historical* context - when a compiler didn't do variable length
arrays at all - the original semantics of C "constant expressions"
actually make a ton of sense.

You can basically think of a constant expression as something that can
be (historically) handled by the front-end without any real
complexity, and no optimization phases - just evaluating a simple
parse tree with purely compile-time constants.

So there's a good and perfectly valid reason for why C limits certain
expressions to just a very particular subset. It's not just array
sizes, it's  case statements etc too. And those are very much part of
the C standard.

So an error message like

   warning: ISO C90 requires array sizes to be constant-expressions

would be technically correct and useful from a portability angle. It
tells you when you're doing something non-portable, and should be
automatically enabled with "-ansi -pedantic", for example.

So what's misleading is actually the name of the warning and the
message, not that it happens. The warning isn't about "variable
length", it's literally about the rules for what a
"constant-expression" is.

And in C, the expression (2,3) has a constant _value_ (namely 3), but
it isn't a constant-expression as specified by the language.

Now, the thing is that once you actually do variable length arrays,
those old front-end rules make no sense any more (outside of the "give
portability warnings" thing).

Because once you do variable length arrays, you obviously _parse_
everything just fine, and you're going to evaluate much more complex
expressions than some limited constant-expression rule.

And at that point it would make a whole lot more sense to add a *new*
warning that basically says "I have to generate code for a
variable-sized array", if you actually talk about VLA's.

But that's clearly not what gcc actually did.

So the problem really is that -Wvla doesn't actually warn about VLA's,
but about something technically completely different.

And that's why those stupid syntactic issues with min/max matter. It's
not whether the end result is a compile-time constant or not, it's
about completely different issues, like whether there is a
comma-expression in it.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Miguel Ojeda
On Sat, Mar 10, 2018 at 5:30 PM, Linus Torvalds
 wrote:
> On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook  wrote:
>>
>> Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
>> some other name for the simple macro. Bleh.
>
> Oh, and I'm starting to see the real problem.
>
> It's not that our current "min/max()" are broken. It's that "-Wvla" is 
> garbage.
>
> Lookie here:
>
> int array[(1,2)];
>
> results in gcc saying
>
>  warning: ISO C90 forbids variable length array ‘array’ [-Wvla]
>int array[(1,2)];
>^~~
>
> and that error message - and the name of the flag - is obviously pure garbage.
>
> What is *actually* going on is that ISO C90 requires an array size to
> be not a constant value, but a constant *expression*. Those are two
> different things.
>
> A constant expression has little to do with "compile-time constant".
> It's a more restricted form of it, and has actual syntax requirements.
> A comma expression is not a constant expression, for example, which
> was why I tested this.
>
> So "-Wvla" is garbage, with a misleading name, and a misleading
> warning string. It has nothing to do with "variable length" and
> whether the compiler can figure it out at build time, and everything
> to do with a _syntax_ rule.

The warning string is basically the same as the one used for C++, i.e.:

int size2 = 2;
constexpr int size3 = 2;

int array1[(2,2)];
int array2[(size2, size2)];
int array3[(size3, size3)];

only warns for array2 with:

warning: ISO C++ forbids variable length array 'array2' [-Wvla]
 int array2[(size2, size2)];

So the warning is probably implemented to just trigger whenever VLAs
are used but the given standard does not allow them, for all
languages. The problem is why the ISO C90 frontend is not giving an
error for using invalid syntax for array sizes to begin with?

Miguel
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: zerofree btrfs support?

2018-03-10 Thread Christoph Anton Mitterer
On Sat, 2018-03-10 at 16:50 +0100, Adam Borowski wrote:
> Since we're on a btrfs mailing list
Well... my original question was whether someone could make zerofree
support for btrfs (which I think would be best if someone who knows how
btrfs really works)... thus I directed the question to this list and
not to some of qemu :-)


> It works only with scsi and virtio-scsi drivers.  Most qemu setups
> use
> either ide (ouch!) or virtio-blk.
Seems my libvirt-created VMs use "sata" by default... and in the meantime
it does seem to work with that, too.


Thanks :-)

Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: zerofree btrfs support?

2018-03-10 Thread Christoph Anton Mitterer
On Sat, 2018-03-10 at 19:37 +0500, Roman Mamedov wrote:
> Note you can use it on HDDs too, even without QEMU and the like: via
> using LVM
> "thin" volumes. I use that on a number of machines, the benefit is
> that since
> TRIMed areas are "stored nowhere", those partitions allow for
> incredibly fast
> block-level backups, as it doesn't have to physically read in all the
> free
> space, let alone any stale data in there. LVM snapshots are also way
> more
> efficient with thin volumes, which helps during backup.
I was thinking about using those... but then I'd have to use loop
device files I guess... which I also want to avoid.



> > dm-crypt per default blocks discard.
> 
> Out of misguided paranoia. If your crypto is any good (and last I
> checked AES
> was good enough), there's really not a lot to gain for the "attacker"
> knowing
> which areas of the disk are used and which are not.
I'm not an expert here... but a) I think it would be independent of AES;
it's rather the encryption mode (e.g. XTS) which protects here or not...
and b) we've seen too many attacks on crypto based on smart statistics,
and knowing which blocks on a medium are actually data and which are just
"random crypto noise" (and you do know that when using TRIM) can already
tell a lot.
At least it could tell an attacker how much data there is on a fs.

 
> It works, just not with some of the QEMU virtualized disk device
> drivers.
> You don't need to use qemu-img to manually dig holes either, it's all
> automatic.
You're right... it seems that in older versions one needed to set virtio-
scsi as the device driver (which I possibly missed), but nowadays it even
seems to work with sata.



> QEMU deallocates parts of its raw images for those areas which have
> been
> TRIM'ed by the guest. In fact I never use qcow2, always raw images
> only.
> Yet, boot a guest, issue fstrim, and see the raw file while still
> having the
> same size, show much lower actual disk usage in "du".
Works with qcow2 as well... heck, even Windows can do it (though it has
no fstrim; it seems one needs to run defrag instead, which, besides
defragmentation, probably also does what fstrim does).


Fine for me... though non-qemu users may still be interested in having
zerofree.


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook  wrote:
>
> Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
> some other name for the simple macro. Bleh.

Oh, and I'm starting to see the real problem.

It's not that our current "min/max()" are broken. It's that "-Wvla" is garbage.

Lookie here:

int array[(1,2)];

results in gcc saying

 warning: ISO C90 forbids variable length array ‘array’ [-Wvla]
   int array[(1,2)];
   ^~~

and that error message - and the name of the flag - is obviously pure garbage.

What is *actually* going on is that ISO C90 requires an array size to
be not a constant value, but a constant *expression*. Those are two
different things.

A constant expression has little to do with "compile-time constant".
It's a more restricted form of it, and has actual syntax requirements.
A comma expression is not a constant expression, for example, which
was why I tested this.

So "-Wvla" is garbage, with a misleading name, and a misleading
warning string. It has nothing to do with "variable length" and
whether the compiler can figure it out at build time, and everything
to do with a _syntax_ rule.
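
A quick way to reproduce what's being described, as a sketch (the file
name is arbitrary and the exact diagnostic text varies by gcc version):

cat > vla-test.c <<'EOF'
int f(void)
{
	/* constant value 2, but not a C90 constant-expression */
	int array[(1,2)];
	return sizeof(array);
}
EOF
gcc -Wvla -c vla-test.c
# warning: ISO C90 forbids variable length array 'array' [-Wvla]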

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Sat, Mar 10, 2018 at 7:33 AM, Kees Cook  wrote:
>
> And sparse freaks out too:
>
>    drivers/net/ethernet/via/via-velocity.c:97:26: sparse: incorrect type
> in initializer (different address spaces) @@ expected void *addr @@
> got struct mac_regs [noderef] *mac_regs


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Linus Torvalds
On Fri, Mar 9, 2018 at 11:03 PM, Miguel Ojeda
 wrote:
>
> Just compiled 4.9.0 and it seems to work -- so that would be the
> minimum required.
>
> Sigh...
>
> Some enterprise distros are either already shipping gcc >= 5 or will
> probably be shipping it soon (e.g. RHEL 8), so how much does it hurt
> to ask for a newer gcc? Are there many users/companies out there using
> enterprise distributions' gcc to compile and run the very latest
> kernels?

I wouldn't mind upping the compiler requirements, and we have other
reasons to go to 4.6.

But _this_ particular issue doesn't seem worth it to then go even
further. Annoying.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: zerofree btrfs support?

2018-03-10 Thread Adam Borowski
On Sat, Mar 10, 2018 at 07:37:22PM +0500, Roman Mamedov wrote:
> Note you can use it on HDDs too, even without QEMU and the like: via using LVM
> "thin" volumes. I use that on a number of machines, the benefit is that since
> TRIMed areas are "stored nowhere", those partitions allow for incredibly fast
> block-level backups, as it doesn't have to physically read in all the free
> space, let alone any stale data in there. LVM snapshots are also way more
> efficient with thin volumes, which helps during backup.

Since we're on a btrfs mailing list, if you use qemu, you really want
sparse format:raw instead of qcow2 or preallocated raw.  This also works
great with TRIM.

> > Back then it didn't seem to work.
> 
> It works, just not with some of the QEMU virtualized disk device drivers.
> You don't need to use qemu-img to manually dig holes either, it's all
> automatic.

It works only with scsi and virtio-scsi drivers.  Most qemu setups use
either ide (ouch!) or virtio-blk.

You'd obviously want virtio-scsi; note that defconfig enables virtio-blk but
not virtio-scsi; I assume most distribution kernels have both.  It's a bit
tedious to switch between the two as -blk is visible as /dev/vda while -scsi
as /dev/sda.
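
A rough sketch of a qemu invocation using virtio-scsi with discard passed
through (image path, memory size and IDs are placeholders):

qemu-system-x86_64 -enable-kvm -m 2048 \
  -device virtio-scsi-pci,id=scsi0 \
  -drive if=none,id=hd0,file=guest.img,format=raw,discard=unmap \
  -device scsi-hd,drive=hd0,bus=scsi0.0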


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3] kernel.h: Skip single-eval logic on literals in min()/max()

2018-03-10 Thread Kees Cook
On Fri, Mar 9, 2018 at 10:10 PM, Miguel Ojeda
 wrote:
> On Sat, Mar 10, 2018 at 4:11 AM, Randy Dunlap  wrote:
>> On 03/09/2018 04:07 PM, Andrew Morton wrote:
>>> On Fri, 9 Mar 2018 12:05:36 -0800 Kees Cook  wrote:
>>>
 When max() is used in stack array size calculations from literal values
 (e.g. "char foo[max(sizeof(struct1), sizeof(struct2))]", the compiler
 thinks this is a dynamic calculation due to the single-eval logic, which
 is not needed in the literal case. This change removes several accidental
 stack VLAs from an x86 allmodconfig build:

 $ diff -u before.txt after.txt | grep ^-
 -drivers/input/touchscreen/cyttsp4_core.c:871:2: warning: ISO C90 forbids variable length array ‘ids’ [-Wvla]
 -fs/btrfs/tree-checker.c:344:4: warning: ISO C90 forbids variable length array ‘namebuf’ [-Wvla]
 -lib/vsprintf.c:747:2: warning: ISO C90 forbids variable length array ‘sym’ [-Wvla]
 -net/ipv4/proc.c:403:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla]
 -net/ipv6/proc.c:198:2: warning: ISO C90 forbids variable length array ‘buff’ [-Wvla]
 -net/ipv6/proc.c:218:2: warning: ISO C90 forbids variable length array ‘buff64’ [-Wvla]

 Based on an earlier patch from Josh Poimboeuf.
>>>
>>> v1, v2 and v3 of this patch all fail with gcc-4.4.4:
>>>
>>> ./include/linux/jiffies.h: In function 'jiffies_delta_to_clock_t':
>>> ./include/linux/jiffies.h:444: error: first argument to 
>>> '__builtin_choose_expr' not a constant
>>
>>
>> I'm seeing that problem with
>>> gcc --version
>> gcc (SUSE Linux) 4.8.5
>
> Same here, 4.8.5 fails. gcc 5.4.1 seems to work. I compiled a minimal
> 5.1.0 and it seems to work as well.

And sparse freaks out too:

   drivers/net/ethernet/via/via-velocity.c:97:26: sparse: incorrect type
   in initializer (different address spaces) @@ expected void *addr @@
   got struct mac_regs [noderef] *mac_regs
   drivers/net/ethernet/via/via-velocity.c:100:49: sparse: incorrect type
   in argument 2 (different base types) @@ expected restricted pci_power_t
   [usertype] state @@ got _t [usertype] state @@

Alright, I'm giving up on fixing max(). I'll go back to STACK_MAX() or
some other name for the simple macro. Bleh.

-Kees

-- 
Kees Cook
Pixel Security
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Change of Ownership of the filesystem content when cloning a volume

2018-03-10 Thread Saravanan Shanmugham (sarvi)
I am 100% sure NetApp FlexClone can change the ownership of the clone content.
We are using that functionality right now.

https://docs.netapp.com/ontap-9/index.jsp?topic=%2Fcom.netapp.doc.dot-cm-cmpr-900%2Fvolume__clone__create.html

When you create a clone in FlexClone, you can specify a uid/gid. 
When the clone is created, all the files/directories/content in that clone are 
instantly owned by the uid/gid.

What I have been looking for is a similar functionality in Btrfs or ZFS?

I would like to put a disk image on NFS, and mount it as a disk on my machine. 
And I want such snapshot and cloning (change ownership) capabilities in that 
disk image.

So I was considering BTRFS or ZFS, and was wondering if they might have that 
feature.

Thanks
Sarvi
-
Occam's Razor Rules

On 3/9/18, 9:48 PM, "Andrei Borzenkov"  wrote:

10.03.2018 02:13, Saravanan Shanmugham (sarvi) wrote:
> 
> Netapp’s storage system, has the concept of snapshot/clones.
> And when I create a clone from a snapshot, I can give/change ownership of 
entire tree in the volume to a different userid.

You are probably mistaken. NetApp FlexClone (which you probably mean)
does not have any ways to change volume content. Of course you can now
mount this clone and do whatever you like from host, but that is
completely unrelated to NetApp itself and can just as well be done using
btrfs subvolume.

> 
> Is something like that possible in BTRFS?
> 
> We are looking to use CopyOnWrite to snapshot nightly build workspace and 
clone as developer workspaces to avoid building from scratch for developers,
> And move directly for incremental builds.
> For this we would like the clone workspace/volume to be instantly owned 
by the developer cloning the workspace.
> 
> Thanks,
> Sarvi
> -
> Occam's Razor Rules
> 
> 
> 
> 





Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-10 Thread Martin Svec
On 10.3.2018 at 13:13, Nikolay Borisov wrote:
>
> 
>
 And then report back on the output of the extra debug 
 statements. 

 Your global rsv is essentially unused, this means 
 in the worst case the code should fallback to using the global rsv
 for satisfying the memory allocation for delayed refs. So we should
 figure out why this isn't' happening. 
>>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as 
>>> we hit ENOSPC again.
>> There is the output:
>>
>> [24672.573075] BTRFS info (device sdb): space_info 4 has 
>> 18446744072971649024 free, is not full
>> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, 
>> used=304593289216, pinned=2321940480, reserved=174800896, 
>> may_use=1811644416, readonly=131072
>> [24672.573079] use_block_rsv: Not using global blockrsv! Current 
>> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 
>> global_rsv->space_info = 999a57db7000
>> [24672.573083] BTRFS: Transaction aborted (error -28)
> Bummer, so you are indeed running out of global space reservations in
> context which can't really use any other reservation type, thus the
> ENOSPC. Was the stacktrace again during processing of running delayed refs?

Yes, the stacktrace is below.

[24672.573132] WARNING: CPU: 3 PID: 808 at fs/btrfs/extent-tree.c:3089 
btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573132] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
ip6table_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat 
nf_conntrack ip6table_mangle ip6table_raw ip6_tables iptable_mangle 
intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel pcbc aesni_intel snd_pcm aes_x86_64 snd_timer crypto_simd 
glue_helper snd cryptd soundcore iTCO_wdt intel_cstate joydev 
iTCO_vendor_support pcspkr dcdbas intel_uncore sg serio_raw evdev lpc_ich 
mgag200 ttm drm_kms_helper drm i2c_algo_bit shpchp mfd_core i7core_edac ipmi_si 
ipmi_devintf acpi_power_meter ipmi_msghandler button acpi_cpufreq ip_tables 
x_tables autofs4 xfs libcrc32c crc32c_generic btrfs xor zstd_decompress 
zstd_compress
[24672.573161]  xxhash hid_generic usbhid hid raid6_pq sd_mod crc32c_intel 
psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas usbcore scsi_mod bnx2
[24672.573170] CPU: 3 PID: 808 Comm: btrfs-transacti Tainted: GW I 
4.14.23-znr8+ #73
[24672.573171] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.6.3 
02/01/2011
[24672.573172] task: 999a23229140 task.stack: a85642094000
[24672.573186] RIP: 0010:btrfs_run_delayed_refs+0x259/0x270 [btrfs]
[24672.573187] RSP: 0018:a85642097de0 EFLAGS: 00010282
[24672.573188] RAX: 0026 RBX: 99975c75c3c0 RCX: 0006
[24672.573189] RDX:  RSI: 0082 RDI: 999a6fcd66f0
[24672.573190] RBP: 95c24d68 R08: 0001 R09: 0479
[24672.573190] R10: 99974b1960e0 R11: 0479 R12: 999a5a65
[24672.573191] R13: 999a5a6511f0 R14:  R15: 
[24672.573192] FS:  () GS:999a6fcc() 
knlGS:
[24672.573193] CS:  0010 DS:  ES:  CR0: 80050033
[24672.573194] CR2: 558bfd56dfd0 CR3: 00030a60a005 CR4: 000206e0
[24672.573195] Call Trace:
[24672.573215]  btrfs_commit_transaction+0x3e1/0x950 [btrfs]
[24672.573231]  ? start_transaction+0x89/0x410 [btrfs]
[24672.573246]  transaction_kthread+0x195/0x1b0 [btrfs]
[24672.573249]  kthread+0xfc/0x130
[24672.573265]  ? btrfs_cleanup_transaction+0x580/0x580 [btrfs]
[24672.573266]  ? kthread_create_on_node+0x70/0x70
[24672.573269]  ret_from_fork+0x35/0x40
[24672.573270] Code: c7 c6 20 e8 37 c0 48 89 df 44 89 04 24 e8 59 bc 09 00 44 
8b 04 24 eb 86 44 89 c6 48 c7 c7 30 58 38 c0 44 89 04 24 e8 82 30 3f cf <0f> 0b 
44 8b 04 24 eb c4 0f 1f 44 00 00 66 2e 0f 1f 84 00 00 00
[24672.573292] ---[ end trace b17d927a946cb02e ]---


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: zerofree btrfs support?

2018-03-10 Thread Roman Mamedov
On Sat, 10 Mar 2018 15:19:05 +0100
Christoph Anton Mitterer  wrote:

> TRIM/discard... not sure how far this is really a solution.

It is the solution in a great many of usage scenarios, don't know enough about
your particular one, though.

Note you can use it on HDDs too, even without QEMU and the like: via using LVM
"thin" volumes. I use that on a number of machines, the benefit is that since
TRIMed areas are "stored nowhere", those partitions allow for incredibly fast
block-level backups, as it doesn't have to physically read in all the free
space, let alone any stale data in there. LVM snapshots are also way more
efficient with thin volumes, which helps during backup.
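
A minimal sketch of such a setup (the PV device, VG/LV names and sizes are
arbitrary):

pvcreate /dev/sdb
vgcreate vg0 /dev/sdb
lvcreate --type thin-pool -L 100G -n pool0 vg0
lvcreate --thin -V 500G -n vm1 vg0/pool0    # thin volume, may over-provision
mkfs.btrfs /dev/vg0/vm1
mount -o discard /dev/vg0/vm1 /srv/vm1      # TRIMed areas are "stored nowhere"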

> dm-crypt per default blocks discard.

Out of misguided paranoia. If your crypto is any good (and last I checked AES
was good enough), there's really not a lot to gain for the "attacker" knowing
which areas of the disk are used and which are not.

> Some longer time ago I had a look at whether qemu would support that on
> it's own,... i.e. the guest and it's btrfs would normally use discard,
> but the image file below would mark the block as discarded and later on
> e can use some qemu-img command to dig holes into exactly those
> locations.
> Back then it didn't seem to work.

It works, just not with some of the QEMU virtualized disk device drivers.
You don't need to use qemu-img to manually dig holes either, it's all
automatic.

> But even if it would in the meantime, a proper zerofree implementation
> would be beneficial for all non-qemu/qcow2 users (e.g. if one uses raw
> images in qemu, the whole thing couldn't work but with really zeroing
> the blocks inside the guest.

QEMU deallocates parts of its raw images for those areas which have been
TRIM'ed by the guest. In fact I never use qcow2, always raw images only.
Yet, boot a guest, issue fstrim, and see the raw file while still having the
same size, show much lower actual disk usage in "du".
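
Roughly, that check looks like this (guest.img is a placeholder name):

# inside the guest:
fstrim -v /
# on the host:
du -h --apparent-size guest.img   # nominal size unchanged
du -h guest.img                   # actual allocation drops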

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ongoing Btrfs stability issues

2018-03-10 Thread Christoph Anton Mitterer
On Sat, 2018-03-10 at 14:04 +0200, Nikolay Borisov wrote:
> So for OLTP workloads you definitely want nodatacow enabled, bear in
> mind this also disables crc checksumming, but your db engine should
> already have such functionality implemented in it.

Unlike repeated claims made here on the list and in other places... I
wouldn't know of *any* DB system which actually does this per default,
or in a way that would be comparable to filesystem-level checksumming.


Look back in the archives... when I asked several times for
checksumming support *with* nodatacow, I evaluated the existing status
for the big ones (postgres, mysql, sqlite, bdb)... and all of them had
this either not enabled per default, not available at all, or requiring
special support from the program using the DB.


Similarly, btw: not a single VM image type I evaluated back then had any
form of checksumming integrated.


Still, this remains one of the major deficiencies of btrfs (not in
comparison to other fs, but in comparison to how it should be),
unfortunately :-(


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: zerofree btrfs support?

2018-03-10 Thread Christoph Anton Mitterer
On Sat, 2018-03-10 at 09:16 +0100, Adam Borowski wrote:
> Do you want zerofree for thin storage optimization, or for security?
I don't think one can really use it for security (neither on SSD or
HDD).
On both, zeroed blocks may still be readable by forensic measures.

So optimisation, i.e. digging holes in VM image files and make them
sparse.


> For the former, you can use fstrim; this is enough on any modern SSD;
> on HDD
> you can rig the block device to simulate TRIM by writing zeroes.  I'm
> sure
> one of dm-* can do this, if not -- should be easy to add, there's
> also
> qemu-nbd which allows control over discard, but incurs a performance
> penalty
> compared to playing with the block layer.

Writing zeros if of course possible... but rather ugly... one really
needs to write *everything* while a smart tool could just zero those
block groups that have been used (while everything else is still zero
from the original image file).

TRIM/discard... not sure how far this is really a solution.

The first thing that comes to my mind is, that *if* the discard would
propagate down below a dm-crypt layer (e.g. in my case there is:
SSD->partitions->dmcrypt->LUKS->btrfs->image-files-I-want-to-zero)
it has effects on security, which is why dm-crypt per default blocks
discard.
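
For reference, one can opt in explicitly if the trade-off is acceptable
(a sketch; the device and mapping names are placeholders):

cryptsetup open --allow-discards /dev/sdb2 cryptroot
# or persistently via /etc/crypttab:
# cryptroot  /dev/sdb2  none  luks,discard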

Some longer time ago I had a look at whether qemu would support that on
it's own,... i.e. the guest and it's btrfs would normally use discard,
but the image file below would mark the block as discarded and later on
e can use some qemu-img command to dig holes into exactly those
locations.
Back then it didn't seem to work.

But even if it did in the meantime, a proper zerofree implementation
would be beneficial for all non-qemu/qcow2 users (e.g. if one uses raw
images in qemu, the whole thing couldn't work except by really zeroing
the blocks inside the guest).
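
In the meantime, a rough manual approximation for raw images (paths are
placeholders; note it writes all of the free space once):

# inside the guest: fill free space with zeroes, then delete
dd if=/dev/zero of=/mnt/zerofill bs=1M; rm /mnt/zerofill; sync
# on the host: punch holes wherever the image is all-zero
fallocate --dig-holes guest.img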


Cheers,
Chris.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2] btrfs-progs: dump-tree: add degraded option

2018-03-10 Thread Anand Jain
The btrfs inspect dump-tree cli picks the disk with the largest generation
to read the root tree, even when not all the devices are provided on the
cli. But in a 2-disk RAID1 you may need to know what's in each disk
individually, so this option -x | --noscan indicates that only the given
disk should be used for the dump.

Signed-off-by: Anand Jain 
---
v1->v2: rename --degraded to --noscan

 cmds-inspect-dump-tree.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index df44bb635c9c..d2676ce55af7 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -198,6 +198,7 @@ const char * const cmd_inspect_dump_tree_usage[] = {
"-u|--uuid  print only the uuid tree",
"-b|--block  print info from the specified block only",
"-t|--tree print only tree with the given id (string or 
number)",
+   "-x|--noscanuse the disk in the arg, do not scan for the 
disks (for raid1)",
NULL
 };
 
@@ -234,10 +235,11 @@ int cmd_inspect_dump_tree(int argc, char **argv)
{ "uuid", no_argument, NULL, 'u'},
{ "block", required_argument, NULL, 'b'},
{ "tree", required_argument, NULL, 't'},
+   { "noscan", no_argument, NULL, 'x'},
{ NULL, 0, NULL, 0 }
};
 
-   c = getopt_long(argc, argv, "deb:rRut:", long_options, NULL);
+   c = getopt_long(argc, argv, "deb:rRut:x", long_options, NULL);
if (c < 0)
break;
switch (c) {
@@ -286,6 +288,9 @@ int cmd_inspect_dump_tree(int argc, char **argv)
}
break;
}
+   case 'x':
+   open_ctree_flags |= OPEN_CTREE_NO_DEVICES;
+   break;
default:
usage(cmd_inspect_dump_tree_usage);
}
-- 
2.15.0
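
With this applied, usage would look something like the following (the disk
path is an example):

btrfs inspect-internal dump-tree --noscan /dev/sdb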

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-10 Thread Nikolay Borisov




>>> And then report back on the output of the extra debug 
>>> statements. 
>>>
>>> Your global rsv is essentially unused, this means 
>>> in the worst case the code should fallback to using the global rsv
>>> for satisfying the memory allocation for delayed refs. So we should
>>> figure out why this isn't' happening. 
>> Patch applied. Thank you very much, Nikolay. I'll let you know as soon as we 
>> hit ENOSPC again.
> 
> There is the output:
> 
> [24672.573075] BTRFS info (device sdb): space_info 4 has 18446744072971649024 
> free, is not full
> [24672.573077] BTRFS info (device sdb): space_info total=308163903488, 
> used=304593289216, pinned=2321940480, reserved=174800896, may_use=1811644416, 
> readonly=131072
> [24672.573079] use_block_rsv: Not using global blockrsv! Current 
> blockrsv->type = 1 blockrsv->space_info = 999a57db7000 
> global_rsv->space_info = 999a57db7000
> [24672.573083] BTRFS: Transaction aborted (error -28)

Bummer, so you are indeed running out of global space reservations in
context which can't really use any other reservation type, thus the
ENOSPC. Was the stacktrace again during processing of running delayed refs?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ongoing Btrfs stability issues

2018-03-10 Thread Nikolay Borisov


On  9.03.2018 21:05, Alex Adriaanse wrote:
> Am I correct to understand that nodatacow doesn't really avoid CoW when 
> you're using snapshots? In a filesystem that's snapshotted 

Yes, so nodatacow won't interfere with how snapshots operate. For more
information on that topic check the following mailing list thread:
https://www.spinics.net/lists/linux-btrfs/msg62715.html

> every 15 minutes, is there a difference between normal CoW and nodatacow
> when (in the case of Postgres) you update a small portion of a 1GB file
> many times per minute? Do you anticipate us seeing a benefit in
> stability and performance if we set nodatacow for the
So regarding this, you can check:
https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

Essentially every small, random postgres update in the db file will
cause a CoW operation + checksum IO, which causes, and I quote,
"thrashing on HDDs and excessive multi-second spikes of CPU load on
systems with an SSD or large amount of RAM."

So for OLTP workloads you definitely want nodatacow enabled, bear in
mind this also disables crc checksumming, but your db engine should
already have such functionality implemented in it.
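
One common way to get nodatacow for just the database directory, as a
sketch (the path is an example; note that +C only takes effect for files
created after it is set on an empty directory):

mkdir -p /var/lib/postgresql
chattr +C /var/lib/postgresql
lsattr -d /var/lib/postgresql   # shows the 'C' (nodatacow) attribute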

> entire FS while retaining snapshots? Does nodatacow increase the chance
> of corruption in a database like Postgres, i.e. are writes still
> properly ordered/sync'ed when flushed to disk?

Well, most modern DBs already implement some sort of WAL, so the
reliability responsibility is shifted onto the db engine.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs remounted read-only due to ENOSPC in btrfs_run_delayed_refs

2018-03-10 Thread Martin Svec
On 9.3.2018 at 20:03, Martin Svec wrote:
> On 9.3.2018 at 17:36, Nikolay Borisov wrote:
>> On 23.02.2018 16:28, Martin Svec wrote:
>>> Hello,
>>>
>>> we have a btrfs-based backup system using btrfs snapshots and rsync. 
>>> Sometimes,
>>> we hit ENOSPC bug and the filesystem is remounted read-only. However, 
>>> there's 
>>> still plenty of unallocated space according to "btrfs fi usage". So I think 
>>> this
>>> isn't another edge condition when btrfs runs out of space due to fragmented 
>>> chunks,
>>> but a bug in disk space allocation code. It suffices to umount the 
>>> filesystem and
>>> remount it back and it works fine again. The frequency of ENOSPC seems to be
>>> dependent on metadata chunks usage. When there's a lot of free space in 
>>> existing
>>> metadata chunks, the bug doesn't happen for months. If most metadata chunks 
>>> are
>>> above ~98%, we hit the bug every few days. Below are details regarding the 
>>> backup
>>> server and btrfs.
>>>
>>> The backup works as follows: 
>>>
>>>   * Every night, we create a btrfs snapshot on the backup server and rsync 
>>> data
>>> from a production server into it. This snapshot is then marked 
>>> read-only and
>>> will be used as a base subvolume for the next backup snapshot.
>>>   * Every day, expired snapshots are removed and their space is freed. 
>>> Cleanup
>>> is scheduled in such a way that it doesn't interfere with the backup 
>>> window.
>>>   * Multiple production servers are backed up in parallel to one backup 
>>> server.
>>>   * The backed up servers are mostly webhosting servers and mail servers, 
>>> i.e.
>>> hundreds of billions of small files. (Yes, we push btrfs to the limits 
>>> :-))
>>>   * Backup server contains ~1080 snapshots, Zlib compression is enabled.
>>>   * Rsync is configured to use whole file copying.
>>>
>>> System configuration:
>>>
>>> Debian Stretch, vanilla stable 4.14.20 kernel with one custom btrfs patch 
>>> (see below) and
>>> Nikolay's patch 1b816c23e9 (btrfs: Add enospc_debug printing in 
>>> metadata_reserve_bytes)
>>>
>>> btrfs mount options: 
>>> noatime,compress=zlib,enospc_debug,space_cache=v2,commit=15
>>>
>>> $ btrfs fi df /backup:
>>>
>>> Data, single: total=28.05TiB, used=26.37TiB
>>> System, single: total=32.00MiB, used=3.53MiB
>>> Metadata, single: total=255.00GiB, used=250.73GiB
>>> GlobalReserve, single: total=512.00MiB, used=0.00B
>>>
>>> $ btrfs fi show /backup:
>>>
>>> Label: none  uuid: a52501a9-651c-4712-a76b-7b4238cfff63
>>> Total devices 2 FS bytes used 26.62TiB
>>> devid1 size 416.62GiB used 255.03GiB path /dev/sdb
>>> devid2 size 36.38TiB used 28.05TiB path /dev/sdc
>>>
>>> $ btrfs fi usage /backup:
>>>
>>> Overall:
>>> Device size:  36.79TiB
>>> Device allocated: 28.30TiB
>>> Device unallocated:8.49TiB
>>> Device missing:  0.00B
>>> Used: 26.62TiB
>>> Free (estimated): 10.17TiB  (min: 10.17TiB)
>>> Data ratio:   1.00
>>> Metadata ratio:   1.00
>>> Global reserve:  512.00MiB  (used: 0.00B)
>>>
>>> Data,single: Size:28.05TiB, Used:26.37TiB
>>>/dev/sdc   28.05TiB
>>>
>>> Metadata,single: Size:255.00GiB, Used:250.73GiB
>>>/dev/sdb  255.00GiB
>>>
>>> System,single: Size:32.00MiB, Used:3.53MiB
>>>/dev/sdb   32.00MiB
>>>
>>> Unallocated:
>>>/dev/sdb  161.59GiB
>>>/dev/sdc8.33TiB
>>>
>>> Btrfs filesystem uses two logical drives in single mode, backed by
>>> hardware RAID controller PERC H710; /dev/sdb is HW RAID1 consisting
>>> of two SATA SSDs and /dev/sdc is HW RAID6 SATA volume.
>>>
>>> Please note that we have a simple custom patch in btrfs which ensures
>>> that metadata chunks are allocated preferably on SSD volume and data
>>> chunks are allocated only on SATA volume. The patch slightly modifies
>>> __btrfs_alloc_chunk() so that its loop over devices ignores rotating
>>> devices when a metadata chunk is requested and vice versa. However, 
>>> I'm quite sure that this patch doesn't cause the reported bug because
>>> we log every call of the modified code and there're no __btrfs_alloc_chunk()
>>> calls when ENOSPC is triggered. Moreover, we observed the same bug before
>>> we developed the patch. (IIRC, Chris Mason mentioned that they work on
>>> a similar feature in facebook, but I've found no official patches yet.)
>>>
>>> Dmesg dump:
>>>
>>> [285167.750763] use_block_rsv: 62468 callbacks suppressed
>>> [285167.750764] BTRFS: block rsv returned -28
>>> [285167.750789] [ cut here ]
>>> [285167.750822] WARNING: CPU: 5 PID: 443 at fs/btrfs/extent-tree.c:8463 
>>> btrfs_alloc_tree_block+0x39b/0x4c0 [btrfs]
>>> [285167.750823] Modules linked in: binfmt_misc xt_comment xt_tcpudp 
>>> iptable_filter nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack iptable_raw 
>>> ip6table_filter iptable_nat 

Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-10 Thread Andrei Borzenkov
09.03.2018 19:43, Austin S. Hemmelgarn wrote:
> 
> If the answer to either one or two is no but the answer to three is yes,
> pull out the failed disk, put in a new one, mount the volume degraded,
> and use `btrfs replace` as well (you will need to specify the device ID
> for the now missing failed disk, which you can find by calling `btrfs
> filesystem show` on the volume).

I do not see it and I do not remember ever seeing device ID of missing
devices.

10:/home/bor # blkid
/dev/sda1: UUID="ce0caa57-7140-4374-8534-3443d21f3edc" TYPE="swap"
PARTUUID="d2714b67-01"
/dev/sda2: UUID="cc072e56-f671-4388-a4a0-2ffee7c98fdb"
UUID_SUB="eaeb4c78-da94-43b3-acc7-c3e963f1108d" TYPE="btrfs"
PTTYPE="dos" PARTUUID="d2714b67-02"
/dev/sdb1: UUID="e4af8f3c-8307-4397-90e3-97b90989cf5d"
UUID_SUB="f421f1e7-2bb0-4a67-a18e-cfcbd63560a8" TYPE="btrfs"
PARTUUID="875525bf-01"
10:/home/bor # mount /dev/sdb1 /mnt
mount: /mnt: wrong fs type, bad option, bad superblock on /dev/sdb1,
missing codepage or helper program, or other error.
10:/home/bor # mount -o degraded /dev/sdb1 /mnt
10:/home/bor # btrfs fi sh /mnt
Label: none  uuid: e4af8f3c-8307-4397-90e3-97b90989cf5d
Total devices 2 FS bytes used 256.00KiB
devid2 size 1023.00MiB used 212.50MiB path /dev/sdb1
*** Some devices missing

10:/home/bor # btrfs fi us /mnt
Overall:
Device size:   2.00GiB
Device allocated:425.00MiB
Device unallocated:1.58GiB
Device missing: 1023.00MiB
Used:512.00KiB
Free (estimated):912.62MiB  (min: 912.62MiB)
Data ratio:   2.00
Metadata ratio:   2.00
Global reserve:   16.00MiB  (used: 0.00B)

Data,RAID1: Size:102.25MiB, Used:128.00KiB
   /dev/sdb1 102.25MiB
   missing   102.25MiB

Metadata,RAID1: Size:102.25MiB, Used:112.00KiB
   /dev/sdb1 102.25MiB
   missing   102.25MiB

System,RAID1: Size:8.00MiB, Used:16.00KiB
   /dev/sdb1   8.00MiB
   missing 8.00MiB

Unallocated:
   /dev/sdb1 810.50MiB
   missing   810.50MiB
10:/home/bor # rpm -q btrfsprogs
btrfsprogs-4.15-2.1.x86_64
10:/home/bor # uname -a
Linux 10 4.15.7-1-default #1 SMP PREEMPT Wed Feb 28 12:40:23 UTC 2018
(a36e160) x86_64 x86_64 x86_64 GNU/Linux
10:/home/bor #



And "missing" is not the answer because I obviously may have more than
one missing device.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How to replace a failed drive in btrfs RAID 1 filesystem

2018-03-10 Thread waxhead

Austin S. Hemmelgarn wrote:

On 2018-03-09 11:02, Paul Richards wrote:

Hello there,

I have a 3 disk btrfs RAID 1 filesystem, with a single failed drive.
Before I attempt any recovery I’d like to ask what is the recommended
approach?  (The wiki docs suggest consulting here before attempting
recovery[1].)

The system is powered down currently and a replacement drive is being
delivered soon.

Should I use “replace”, or “add” and “delete”?

Once replaced should I rebalance and/or scrub?

I believe that the recovery may involve mounting in degraded mode.  If
I do this, how do I later get out of degraded mode, or if it’s
automatic how do i determine when I’m out of degraded mode?

It won't automatically mount degraded, you either have to explicitly ask 
it to, or you have to have an option to do so in your default mount 
options for the volume in /etc/fstab (which is dangerous for multiple 
reasons).


Now, as to what the best way to go about this is, there are three things 
to consider:


1. Is the failed disk still usable enough that you can get good data off 
of it in a reasonable amount of time?  If you're replacing the disk 
because of a lot of failed sectors, you can still probably get data off 
of it, while something like a head crash isn't worth trying to get data 
back.
2. Do you have enough room in the system itself to add another disk 
without removing one?

3. Is the replacement disk at least as big as the failed disk?

If the answer to all three is yes, then just put in the new disk, mount 
the volume normally (you don't need to mount it degraded if the failed 
disk is working this well), and use `btrfs replace` to move the data. 
This is the most efficient option in terms of both time and is also 
generally the safest (and I personally always over-spec drive-bays in 
systems we build where I work specifically so that this approach can be 
used).


If the answer to the third question is no, put in the new disk (removing 
the failed one first if the answer to the second question is no), mount 
the volume (mount it degraded if one of the first two questions is no, 
normally otherwise), then add the new disk to the volume with `btrfs 
device add` and remove the old one with `btrfs device delete` (using the 
'missing' option if you had to remove the failed disk).  This is needed 
because the replace operation requires the new device to be at least as 
big as the old one.


If the answer to either one or two is no but the answer to three is yes, 
pull out the failed disk, put in a new one, mount the volume degraded, 
and use `btrfs replace` as well (you will need to specify the device ID 
for the now missing failed disk, which you can find by calling `btrfs 
filesystem show` on the volume).  In the event that the replace 
operation refuses to run in this case, instead add the new disk to the 
volume with `btrfs device add` and then run `btrfs device delete 
missing` on the volume.


If you follow any of the above procedures, you don't need to balance 
(the replace operation is equivalent to a block level copy and will 
result in data being distributed exactly the same as it was before, 
while the delete operation is a special type of balance), and you 
generally don't need to scrub the volume either (though it may still be 
a good idea).  As far as getting back from degraded mode, you can just 
remount the volume to do so, though I would generally suggest rebooting.


Note that there are three other possible approaches to consider as well:

1. If you can't immediately get a new disk _and_ all the data will fit 
on the other two disks, use `btrfs device delete` to remove the failed 
disk anyway, and run with just the two until you can get a new disk. 
This is exponentially safer than running the volume degraded until you 
get a new disk, and is the only case you realistically should delete a 
device before adding the new one.  Make sure to balance the volume after 
adding the new device.
2. Depending on the situation, it may be faster to just recreate the 
whole volume from scratch using a backup than it is to try to repair it. 
  This is actually the absolute safest method of handling this 
situation, as it makes sure that nothing from the old volume with the 
failed disk causes problems in the future.
3. If you don't have a backup, but have some temporary storage space 
that will fit all the data from the volume, you could also use `btrfs 
restore` to extract files from the old volume to temporary storage, 
recreate the volume, and copy the data back in from the temporary storage.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


I did a quick scan of the wiki just to see, but I did not find any good 
information about how to recover a "RAID"-like set if degraded. Information 
about how to recover, and what profiles can be recovered from, would be 
good to have 

Re: zerofree btrfs support?

2018-03-10 Thread Adam Borowski
On Sat, Mar 10, 2018 at 03:55:25AM +0100, Christoph Anton Mitterer wrote:
> Just wondered... was it ever planned (or is there some equivalent) to
> get support for btrfs in zerofree?

Do you want zerofree for thin storage optimization, or for security?

For the former, you can use fstrim; this is enough on any modern SSD; on HDD
you can rig the block device to simulate TRIM by writing zeroes.  I'm sure
one of dm-* can do this, if not -- should be easy to add, there's also
qemu-nbd which allows control over discard, but incurs a performance penalty
compared to playing with the block layer.

For zerofree for security, you'd need defrag (to dislodge partial pinned
extents) first, and do a full balance to avoid data left in metadata nodes
and in blocks beyond file ends (note that zerofree doesn't do this on
traditional filesystems either).
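
A rough sketch of that sequence (the mount point and fill-file path are
placeholders):

btrfs filesystem defragment -r /mnt
btrfs balance start --full-balance /mnt
# then zero the remaining free space, e.g.:
dd if=/dev/zero of=/mnt/zerofill bs=1M; rm /mnt/zerofill; sync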


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html