Re: [PATCH 01/10] btrfs: create a mount option for dax

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 02:43:03PM +0200, Nikolay Borisov wrote:
> One question below though.
> 
> > +++ b/fs/btrfs/super.c
> > @@ -739,6 +741,17 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
> > char *options,
> > case Opt_user_subvol_rm_allowed:
> > btrfs_set_opt(info->mount_opt, USER_SUBVOL_RM_ALLOWED);
> > break;
> > +#ifdef CONFIG_FS_DAX
> > +   case Opt_dax:
> > +   if (btrfs_super_num_devices(info->super_copy) > 1) {
> > +   btrfs_info(info,
> > +  "dax not supported for multi-device 
> > btrfs partition\n");
> 
> What prevents supporting dax for multiple devices so long as all devices
> are dax?

As I mentioned in a separate mail, most profiles are either pointless
(RAID0 -- hardware interleave already does it better), require hardware
support (RAID1, DUP), or are impossible (RAID5, RAID6).

But, "single" profile multi-device would be useful and actually provide
something other dax-supporting filesystems don't have: combining multiple
devices into one logical piece.

On the other hand, DUP profiles need to be banned.  In particular, the
filesystem you mount might have existing DUP block groups.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 00/10] btrfs: Support for DAX devices

2018-12-05 Thread Adam Borowski
On Wed, Dec 05, 2018 at 06:28:25AM -0600, Goldwyn Rodrigues wrote:
> This is a support for DAX in btrfs.

Yay!

> I understand there have been previous attempts at it.  However, I wanted
> to make sure copy-on-write (COW) works on dax as well.

btrfs' usual use of CoW and DAX are thoroughly in conflict.

The very point of DAX is to have writes not go through the kernel: you
mmap the file, then do all writes right to the pmem, flushing when needed
(without hitting the kernel) and letting the processor+memory persist what
you wrote.
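
A minimal sketch of that write path (illustrative only; msync() here is a
stand-in, as a real pmem-aware program would flush from userspace with
CLWB+SFENCE -- e.g. libpmem's pmem_persist() -- and map with MAP_SYNC):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int dax_write(const char *path, const char *buf, size_t len)
{
	int fd = open(path, O_RDWR);	/* file assumed to be >= len bytes */
	if (fd < 0)
		return -1;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		close(fd);
		return -1;
	}
	memcpy(p, buf, len);	/* CPU stores land directly in pmem */
	msync(p, len, MS_SYNC);	/* persist; the no-kernel path flushes cachelines */
	munmap(p, len);
	close(fd);
	return 0;
}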

CoW via page faults is fine -- pmem is closer to memory than disk, and this
means the kernel will ask the filesystem for an extent to place the new page
in, copy the contents and let the process play with it.  But real btrfs CoW
would mean we'd need to page fault on ᴇᴠᴇʀʏ ꜱɪɴɢʟᴇ ᴡʀɪᴛᴇ.

Delaying CoW until the next commit doesn't help -- you'd need to store the
dirty page in DRAM then write it, which goes against the whole concept of
DAX.

The only way I see would be to CoW once then pretend the page is nodatacow
until the next commit, when we checksum it, add it to the metadata trees, and
mark it for CoWing on the next write.  Lots of complexity, and you still need
to copy the whole thing every commit (so no gain).

Ie, we're in nodatacow land.  CoW for metadata is fine.

> Before I present this to the FS folks I wanted to run this through the
> btrfs list. Even though I wish, I cannot get it correct the first time
> around :/.. Here are some questions for which I need suggestions:
> 
> Questions:
> 1. I have been unable to do checksumming for DAX devices. While
> checksumming can be done for reads and writes, it is a problem when mmap
> is involved because btrfs kernel module does not get back control after
> an mmap() writes. Any ideas are appreciated, or we would have to set
> nodatasum when dax is enabled.

Per the above, it sounds like nodatacow (ie, "cow once") would be needed.

> 2. Currently, a user can continue writing on "old" extents of an mmaped file
> after a snapshot has been created. How can we enforce writes to be directed
> to new extents after snapshots have been created? Do we keep a list of
> all mmap()s, and re-mmap them after a snapshot?

Same as for any other memory that's shared: when a new instance of sharing
is added (a snapshot/reflink in our case), you deny writes, causing a page
fault on the next attempt.  "pmem" is named "ᴘersistent ᴍᴇᴍory" for a
reason...
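
A userspace analogy of that trick (a sketch, not how the kernel implements
it: write-protect on share, then make the page writable again in the fault
handler, which is where the copy to a new extent would happen):

#include <signal.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesz;

static void on_fault(int sig, siginfo_t *si, void *ctx)
{
	/* here the filesystem would allocate a new extent and copy */
	mprotect((void *)((long)si->si_addr & ~(pagesz - 1)), pagesz,
		 PROT_READ | PROT_WRITE);
}

int main(void)
{
	pagesz = sysconf(_SC_PAGESIZE);
	char *page = mmap(NULL, pagesz, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct sigaction sa = { 0 };

	sa.sa_sigaction = on_fault;
	sa.sa_flags = SA_SIGINFO;
	sigemptyset(&sa.sa_mask);
	sigaction(SIGSEGV, &sa, NULL);

	mprotect(page, pagesz, PROT_READ);	/* "a snapshot was taken" */
	page[0] = 42;		/* faults once; handler re-enables writes */
	printf("%d\n", page[0]);
	return 0;
}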

> Tested by creating a pmem device in RAM with "memmap=2G!4G" kernel
> command line parameter.

Might be more useful to use a bigger piece of the "disk" than 2G, though
it's not in the danger area.

Also note that it's utterly pointless to use any RAID modes; multi-dev
single is fine, DUP counts as RAID here.
* RAID0 is already done better in hardware (interleave)
* RAID1 would require hardware support, replication isn't easy
* RAID5/6 is impossible

What would make sense is disabling dax for any files that are not marked as
nodatacow.  This way, unrelated files can still use checksums or
compression, while only files meant as a pmempool or otherwise by a
pmem-aware program would have dax writes (you can still give read-only pages
that CoW to DRAM).  This way we can have write dax for only a subset of
files, and full set of btrfs features for the rest.  Write dax is dangerous
for programs that have no specific support: the vast majority of
database-like programs rely on page-level atomicity while pmem gives you
cacheline/word atomicity only; torn writes mean data loss.
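
(Marking files nodatacow is already user-visible, so no new interface would
be needed; for instance, with /mnt as a hypothetical btrfs mount:

touch /mnt/pmempool        # the attribute can only be set on an empty file
chattr +C /mnt/pmempool
lsattr /mnt/pmempool       # shows 'C' -- the file is now nodatacow

and only such files would get write dax.)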


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


Re: [PATCH 7/9] btrfs-progs: Fix Wmaybe-uninitialized warning

2018-12-04 Thread Adam Borowski
On Tue, Dec 04, 2018 at 01:17:04PM +0100, David Sterba wrote:
> On Fri, Nov 16, 2018 at 03:54:24PM +0800, Qu Wenruo wrote:
> > The only location is the following code:
> > 
> > int level = path->lowest_level + 1;
> > BUG_ON(path->lowest_level + 1 >= BTRFS_MAX_LEVEL);
> > while(level < BTRFS_MAX_LEVEL) {
> > slot = path->slots[level] + 1;
> > ...
> > }
> > path->slots[level] = slot;
> > 
> > Again, it's the stupid compiler that needs some hint for the fact that
> > we will always enter the while loop at least once, thus @slot should
> > always be initialized.
> 
> Harsh words for the compiler, and I say not deserved. The same code
> pasted to kernel and built with the same version does not report the
> warning, so it's apparently a missing annotation of BUG_ON in
> btrfs-progs that does not give the right hint.

It'd be nice if the C language provided a kind of while loop that executes
at least once...
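
It does, of course -- that's the joke.  Rewriting the snippet as do/while
gives the compiler its hint; a standalone sketch of the pattern (the while
form draws -Wmaybe-uninitialized from some compilers, the do/while form
provably assigns slot at least once):

#include <assert.h>
#include <stdio.h>

#define MAX_LEVEL 8

int main(void)
{
	int level = 1, slot;
	int slots[MAX_LEVEL + 1] = { 0 };

	assert(level < MAX_LEVEL);	/* what the BUG_ON guarantees */
	do {		/* runs at least once, so slot is always set */
		slot = slots[level] + 1;
		level++;
	} while (level < MAX_LEVEL);
	slots[level] = slot;
	printf("%d\n", slots[level]);
	return 0;
}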

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ Ivan was a worldly man: born in St. Petersburg, raised in
⢿⡄⠘⠷⠚⠋⠀ Petrograd, lived most of his life in Leningrad, then returned
⠈⠳⣄ to the city of his birth to die.


[PATCH RESEND 1/2] btrfs-progs: fix kernel version parsing on some versions past 3.0

2018-11-21 Thread Adam Borowski
The code fails if the third section is missing (like "4.18") or is followed
by anything but "." or "-".  This happens for example if we're not exactly
at a tag and CONFIG_LOCALVERSION_AUTO=n (which results in "4.18.5+").

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fsfeatures.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f..66111bf4 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -216,11 +216,8 @@ u32 get_running_kernel_version(void)
return (u32)-1;
version |= atoi(tmp) << 8;
tmp = strtok_r(NULL, ".", &saveptr);
-   if (tmp) {
-   if (!string_is_numerical(tmp))
-   return (u32)-1;
+   if (tmp && string_is_numerical(tmp))
version |= atoi(tmp);
-   }
 
return version;
 }
-- 
2.19.1



[PATCH RESEND-v3 2/2] btrfs-progs: defrag: open files RO on new enough kernels

2018-11-21 Thread Adam Borowski
Defragging an executable conflicts both way with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro (commit 616d374efa23).

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
v2: more eloquent description; root can't defrag RO on old kernels (unlike
dedupe)
v3: more eloquentier description; s/defrag_ro/defrag_open_mode/

 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index d1af21ee..c67bf5da 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include <dirent.h>
 #include <mntent.h>
 #include <linux/limits.h>
+#include <linux/version.h>
 #include <getopt.h>
 
 #include <btrfsutil.h>
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_open_mode = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -878,7 +881,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_open_mode);
if (fd < 0) {
goto error;
}
@@ -915,6 +918,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_open_mode = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1015,7 +1021,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_open_mode);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.1



Re: Filesystem mounts fine but hangs on access

2018-11-04 Thread Adam Borowski
On Sun, Nov 04, 2018 at 06:29:06PM +, Duncan wrote:
> So do consider adding noatime to your mount options if you haven't done 
> so already.  AFAIK, the only /semi-common/ app that actually uses atimes 
> these days is mutt (for read-message tracking), and then not for mbox, so 
> you should be safe to at least test turning it off.

To the contrary, mutt uses atimes only for mbox.
 
> And YMMV, but if you do use mutt or something else that uses atimes, I'd 
> go so far as to recommend finding an alternative, replacing either btrfs 
> (because as I said, relatime is arguably enough on a traditional non-COW 
> filesystem) or whatever it is that uses atimes, your call, because IMO it 
> really is that big a deal.

Fortunately, mutt's use could be fixed by teaching it to touch atimes
manually.  And that's already done, for both forks (vanilla and neomutt).
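
(The fix is essentially a one-liner; a sketch, with fd being the open mbox:

#include <sys/stat.h>

static int touch_atime(int fd)
{
	/* set atime to "now", leave mtime untouched */
	struct timespec ts[2] = { { 0, UTIME_NOW }, { 0, UTIME_OMIT } };

	return futimens(fd, ts);
}

-- called after mutt reads the mbox, so new-mail detection keeps working.)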


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Have you heard of the Amber Road?  For thousands of years, the
⣾⠁⢰⠒⠀⣿⡁ Romans and co valued amber, hauled through the Europe over the
⢿⡄⠘⠷⠚⠋⠀ mountains and along the Vistula, from Gdańsk.  To where it came
⠈⠳⣄ together with silk (judging by today's amber stalls).


Re: python-btrfs v10 preview... detailed usage reporting and a tutorial

2018-10-07 Thread Adam Borowski
On Mon, Oct 08, 2018 at 02:03:44AM +0200, Hans van Kranenburg wrote:
> And yes, when promoting things like the new show_usage example to
> programs that are easily available, users will probably start parsing
> the output of them with sed and awk which is a total abomination and the
> absolute opposite of the purpose of the library. So be it. Let it go. :D
> "The code never bothered me any way".

It's not like some deranged person would parse the output of, say, show_file
in Perl...
 
> The interesting question that remains is where the result should go.
> 
> btrfs-heatmap is a thing of its own now, but it's a bit of the "show
> case" example using the lib, with its own collection of documentation
> and even possibility to script it again.
> 
> Shipping the 'binaries' in the python3-btrfs package wouldn't be the
> right thing, so where should they go? apt-get install btrfs-moar-utils-yolo?

At least in Debian, moving executables between packages is a matter of
versioned Replaces (+Conflicts: old), so if at any point you decide
differently it's not a problem.  So btrfs-moar-utils-yolo should work well.
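
(I.e., something like this in the new package's control stanza, where the
version number is hypothetical -- whatever release last shipped the files:

Package: btrfs-moar-utils-yolo
Replaces: python3-btrfs (<< 11)
Conflicts: python3-btrfs (<< 11)

-- and dpkg handles the file handover cleanly on upgrade.)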

> Or should btrfs-progs start to use this to accelerate improvement for
> providing a richer collection of useful progs for things that are not on
> essential level (like, you won't need them inside initramfs, so they can
> use python)?

You might want your own package for what's agile, and btrfs-progs for things
declared to be rock stable (WRT command-line API, not necessarily stability
of code).

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 10 people enter a bar: 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ 1 who doesn't, D who prefer to write it as hex,
⠈⠳⣄ and 1 who narrowly avoided an off-by-one error.


Re: python-btrfs v10 preview... detailed usage reporting and a tutorial

2018-09-23 Thread Adam Borowski
On Sun, Sep 23, 2018 at 11:54:12PM +0200, Hans van Kranenburg wrote:
> Two examples have been added, which use the new code. I would appreciate
> extra testing. Please try them and see if the reported numbers make sense:
> 
> space_calculator.py
> ---
> Best to be initially described as a CLI version of the well-known
> webbased btrfs space calculator by Hugo. ;] Throw a few disk sizes at
> it, choose data and metadata profile and see how much space you would
> get to store actual data.
> 
> See commit message "Add example to calculate usable and wasted space"
> for example output.
> 
> show_usage.py
> -
> The contents of the old show_usage.py example that simply showed a list
> of block groups are replaced with a detailed usage report of an existing
> filesystem.

I wonder, perhaps at least some of the examples could be elevated to
commands meant to be run by end-user?  Ie, installing them to /usr/bin/,
dropping the extension?  They'd probably need less generic names, though.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 10 people enter a bar:
⣾⠁⢰⠒⠀⣿⡁ • 1 who understands binary,
⢿⡄⠘⠷⠚⠋⠀ • 1 who doesn't,
⠈⠳⣄ • and E who prefer to write it as hex.


Re: Transactional btrfs

2018-09-08 Thread Adam Borowski
On Sat, Sep 08, 2018 at 08:45:47PM +, Martin Raiber wrote:
> Am 08.09.2018 um 18:24 schrieb Adam Borowski:
> > On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
> >> On 2018-09-06 03:23, Nathan Dehnel wrote:
> >>> So I guess my question is, does btrfs support atomic writes across
> >>> multiple files? Or is anyone interested in such a feature?
> >>>
> >> I'm fairly certain that it does not currently, but in theory it would not 
> >> be
> >> hard to add.

> >> However, if this were extended to include rename, unlink, touch, and a
> >> handful of other VFS operations, then I can easily think of a few dozen use
> >> cases.  Package managers in particular would likely be very interested in
> >> being able to atomically rename a group of files as a single transaction, 
> >> as
> >> it would make their job _much_ easier.

> > I wonder, what about:
> > sync; mount -o remount,commit=999,flushoncommit
> > eatmydata apt dist-upgrade
> > sync; mount -o remount,commit=30,noflushoncommit
> >
> > Obviously, this gets fooled by fsyncs, and makes the transaction affect the
> > whole system (if you have unrelated writes they won't get committed until
> > the end of transaction).  Then there are nocow files, but you already made
> > the decision to disable most features of btrfs for them.

> Now combine this with snapshot root, then on success rename exchange to
> root and you are there.

No need: no unsuccessful transactions ever get written to the disk.
(Not counting unreachable stuff.)

> Btrfs had in the past TRANS_START and TRANS_END ioctls (for ceph, I
> think), but no rollback (and therefore no error handling incl. ENOSPC).
> 
> If you want to look at a working file system transaction mechanism, you
> should look at transactional NTFS (TxF). They write that they are
> deprecating it, so it's perhaps not very widely used. Windows uses it
> for updates, I think.

You're talking about multiple simultaneous transactions; those have a massive
complexity cost.  And btrfs is already ridiculously complex.  I don't really
see a good way to tie this to the POSIX API without some serious
rethinking.

dpkg can already recover from a properly returned error (although not as
nicely as a full rollback); what is fatal for it is having its status
database corrupted/out of sync.  That's why it does a multiple fsync dance
and keeps fully rewriting its files over and over and over.

Atomic operations are pretty useful even without a rollback: you still need
to be able to handle failure, but not a crash.

> Specifically for btrfs, the problem would be that it really needs to
> support multiple simultaneous writers, otherwise one transaction can
> block the whole system.

My dirty hack above doesn't suffer from such a block: it only suffers from
compromising durability of concurrent writers.  During that userspace
transaction, there are no commits until it finishes; this means that if
there's unrelated activity it may suffer from losing writes that were done
between transaction start and crash.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: Transactional btrfs

2018-09-08 Thread Adam Borowski
On Thu, Sep 06, 2018 at 06:08:33AM -0400, Austin S. Hemmelgarn wrote:
> On 2018-09-06 03:23, Nathan Dehnel wrote:
> > So I guess my question is, does btrfs support atomic writes across
> > multiple files? Or is anyone interested in such a feature?
> > 
> I'm fairly certain that it does not currently, but in theory it would not be
> hard to add.
> 
> Realistically, the only cases I can think of where cross-file atomic
> _writes_ would be of any benefit are database systems.
> 
> However, if this were extended to include rename, unlink, touch, and a
> handful of other VFS operations, then I can easily think of a few dozen use
> cases.  Package managers in particular would likely be very interested in
> being able to atomically rename a group of files as a single transaction, as
> it would make their job _much_ easier.

I wonder, what about:
sync; mount -o remount,commit=999,flushoncommit
eatmydata apt dist-upgrade
sync; mount -o remount,commit=30,noflushoncommit

Obviously, this gets fooled by fsyncs, and makes the transaction affect the
whole system (if you have unrelated writes they won't get committed until
the end of transaction).  Then there are nocow files, but you already made
the decision to disable most features of btrfs for them.

So unless something forces a commit, this should already work, giving
cross-file atomic writes, renames and so on.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: dduper - Offline btrfs deduplication tool

2018-09-07 Thread Adam Borowski
On Fri, Sep 07, 2018 at 09:27:28AM +0530, Lakshmipathi.G wrote:
> > One question:
> > Why not ioctl_fideduperange?
> > i.e. you kill most of benefits from that ioctl - atomicity.
> > 
> I plan to add fideduperange as an option too. User can
> choose between fideduperange and ficlonerange call.
> 
> If I'm not wrong, with fideduperange, the kernel performs a
> comparison check before dedupe. And it will increase
> time to dedupe files.

You already read the files to md5sum them, so you have no speed gain.
You get nasty data-losing races, and risk collisions as well.  md5sum is
safe against random occurrences (compared eg. to the chance of lightning
hitting you today), but is exploitable by a hostile user.  On the other
hand, full bit-to-bit comparison is faster and 100% safe.

You can't skip verification -- the checksums are only 32-bit.  Two extents
have a 1:4G chance of colliding by accident, which means you can expect one
false positive with 64K extents, rising quadratically as the number of
extents grows.
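
Back-of-envelope for that, as a sketch -- expected colliding pairs by the
birthday bound:

#include <stdio.h>

int main(void)
{
	double n = 65536;		/* 64K extents */
	double space = 4294967296.0;	/* 2^32 possible crc32 values */

	/* expected colliding pairs: n*(n-1)/2 out of 2^32 */
	printf("%f\n", n * (n - 1) / 2 / space);	/* ~0.5, order 1 */
	return 0;
}

Since the pair count grows as n^2, so does the number of false positives.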


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ Collisions shmolisions, let's see them find a collision or second
⠈⠳⣄ preimage for double rot13!


[PATCH v3] btrfs-progs: defrag: open files RO on new enough kernels

2018-09-03 Thread Adam Borowski
Defragging an executable conflicts both way with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro (commit 616d374efa23).

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
v2: more eloquent description; root can't defrag RO on old kernels (unlike
dedupe)
v3: more eloquentier description; s/defrag_ro/defrag_open_mode/

 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311b..99e2aec0 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include <dirent.h>
 #include <mntent.h>
 #include <linux/limits.h>
+#include <linux/version.h>
 #include <getopt.h>
 
 #include <btrfsutil.h>
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_open_mode = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -877,7 +880,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_open_mode);
if (fd < 0) {
goto error;
}
@@ -914,6 +917,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_open_mode = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1014,7 +1020,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_open_mode);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.0.rc1



[PATCH v2] btrfs-progs: defrag: open files RO on new enough kernels

2018-09-03 Thread Adam Borowski
Defragging an executable conflicts both way with it being run, resulting in
ETXTBSY.  This either makes defrag fail or prevents the program from being
executed.

Kernels 4.19-rc1 and later allow defragging files you could have possibly
opened rw, even if the passed descriptor is ro.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
v2: more eloquent description; root can't defrag RO on old kernels (unlike
dedupe)


 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311b..17e992a3 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include <dirent.h>
 #include <mntent.h>
 #include <linux/limits.h>
+#include <linux/version.h>
 #include <getopt.h>
 
 #include <btrfsutil.h>
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_ro = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -877,7 +880,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_ro);
if (fd < 0) {
goto error;
}
@@ -914,6 +917,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0))
+   defrag_ro = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1014,7 +1020,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_ro);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.0.rc1



Re: [PATCH 2/2] btrfs-progs: defrag: open files RO on new enough kernels or if root

2018-09-03 Thread Adam Borowski
On Mon, Sep 03, 2018 at 02:04:23PM +0300, Nikolay Borisov wrote:
> On  3.09.2018 13:14, Adam Borowski wrote:
> > -   fd = open(fpath, O_RDWR);
> > +   fd = open(fpath, defrag_ro);
> 
> Looking at the kernel code I think this is in fact incorrect, because in
> ioctl.c we have:
> 
> if (!(file->f_mode & FMODE_WRITE)) {
>         ret = -EINVAL;
>         goto out;
> }
> 
> So it seems a hard requirement to have opened a file for RW when you
> want to defragment it.

Oif!  I confused this with dedup, which does allow root to dedup RO even on
old kernels.  Good catch.


-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: [PATCH 2/2] btrfs-progs: defrag: open files RO on new enough kernels or if root

2018-09-03 Thread Adam Borowski
On Mon, Sep 03, 2018 at 02:01:21PM +0300, Nikolay Borisov wrote:
> On  3.09.2018 13:14, Adam Borowski wrote:
> > Fixes ETXTBSY races.
> 
> You have to be more eloquent than that and explain at least one race
> condition.

If you try to defrag an executable that's currently running:

ERROR: cannot open XXX: Text file busy
total 1 failures

If you try to run an executable that's being defragged:

-bash: XXX: Text file busy

The former tends to be a long-lasting condition but has only benign fallout
(executables almost never get fragmented, and not recompressing a single file
is not the end of the world); the latter is only a brief window of time but
has potential for data loss.

> > +static mode_t defrag_ro = O_RDONLY;
> 
> This brings no value whatsoever, just use O_RDONLY directly

On old kernels it gets overwritten with:

> > +   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0) && getuid())
> > +   defrag_ro = O_RDWR;


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: IO errors when building RAID1.... ?

2018-09-03 Thread Adam Borowski
On Sun, Sep 02, 2018 at 09:15:25PM -0600, Chris Murphy wrote:
> For > 10 years drive firmware handles bad sector remapping internally.
> It remaps the sector logical address to a reserve physical sector.
> 
> NTFS and ext[234] have a means of accepting a list of bad sectors, and
> will avoid using them. Btrfs doesn't. But also ZFS, XFS, APFS, HFS+
> and I think even FAT, lack this capability.


FAT entry FF7 (FAT12)/FFF7 (FAT16)/... marks a bad cluster.

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


[PATCH 1/2] btrfs-progs: fix kernel version parsing on some versions past 3.0

2018-09-03 Thread Adam Borowski
The code fails if the third section is missing (like "4.18") or is followed
by anything but "." or "-".  This happens for example if we're not exactly
at a tag and CONFIG_LOCALVERSION_AUTO=n (which results in "4.18.5+").

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fsfeatures.c | 5 +
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/fsfeatures.c b/fsfeatures.c
index 7d85d60f..66111bf4 100644
--- a/fsfeatures.c
+++ b/fsfeatures.c
@@ -216,11 +216,8 @@ u32 get_running_kernel_version(void)
return (u32)-1;
version |= atoi(tmp) << 8;
tmp = strtok_r(NULL, ".", &saveptr);
-   if (tmp) {
-   if (!string_is_numerical(tmp))
-   return (u32)-1;
+   if (tmp && string_is_numerical(tmp))
version |= atoi(tmp);
-   }
 
return version;
 }
-- 
2.19.0.rc1



[PATCH 2/2] btrfs-progs: defrag: open files RO on new enough kernels or if root

2018-09-03 Thread Adam Borowski
Fixes ETXTBSY races.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 cmds-filesystem.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 06c8311b..4c9df69f 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -26,6 +26,7 @@
 #include <dirent.h>
 #include <mntent.h>
 #include <linux/limits.h>
+#include <linux/version.h>
 #include <getopt.h>
 
 #include <btrfsutil.h>
@@ -39,12 +40,14 @@
 #include "list_sort.h"
 #include "disk-io.h"
 #include "help.h"
+#include "fsfeatures.h"
 
 /*
  * for btrfs fi show, we maintain a hash of fsids we've already printed.
  * This way we don't print dups if a given FS is mounted more than once.
  */
 static struct seen_fsid *seen_fsid_hash[SEEN_FSID_HASH_SIZE] = {NULL,};
+static mode_t defrag_ro = O_RDONLY;
 
 static const char * const filesystem_cmd_group_usage[] = {
"btrfs filesystem []  []",
@@ -877,7 +880,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, defrag_ro);
if (fd < 0) {
goto error;
}
@@ -914,6 +917,9 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int compress_type = BTRFS_COMPRESS_NONE;
DIR *dirstream;
 
+   if (get_running_kernel_version() < KERNEL_VERSION(4,19,0) && getuid())
+   defrag_ro = O_RDWR;
+
/*
 * Kernel has a different default (256K) that is supposed to be safe,
 * but it does not defragment very well. The 32M will likely lead to
@@ -1014,7 +1020,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, defrag_ro);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.19.0.rc1



Re: lazytime mount option—no support in Btrfs

2018-08-21 Thread Adam Borowski
On Mon, Aug 20, 2018 at 08:16:16AM -0400, Austin S. Hemmelgarn wrote:
> Also, slightly OT, but atimes are not where the real benefit is here for
> most people.  No sane software other than mutt uses atimes (and mutt's use
> of them is not sane, but that's a different argument)

Right.  There are two competing forks of mutt: neomutt and vanilla:
https://github.com/neomutt/neomutt/commit/816095bfdb72caafd8845e8fb28cbc8c6afc114f
https://gitlab.com/dops/mutt/commit/489a1c394c29e4b12b705b62da413f322406326f

So this has already been taken care of.

> so pretty much everyone who wants to avoid the overhead from them can just
> use the `noatime` mount option.

atime updates (including relatime) are bad not only for performance; they
also explode disk space used by snapshots (btrfs, LVM, ...) -- to the tune of
~5% per snapshot for some non-crafted loads.  And they are bad for media with
low write endurance (SD cards, as used by most SoCs).

Thus, atime needs to die.

> The real benefit for most people is with mtimes, for which there is no
> other way to limit the impact they have on performance.

With btrfs, any write already triggers a metadata update (except nocow), thus
there's little benefit in lazytime for mtimes.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal [Mt3:16-17]
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs [Mt14:17-20, Mt15:34-37]
⠈⠳⣄ • use glitches to walk on water [Mt14:25-26]


Re: [RESEND][PATCH v5 0/2] vfs: better dedupe permission check

2018-08-07 Thread Adam Borowski
On Tue, Aug 07, 2018 at 02:49:47PM -0700, Mark Fasheh wrote:
> Hi Andrew,
> 
> Could I please have these patches upstreamed or at least put in a tree for
> more public testing? They've hit fsdevel a few times now, I have links to
> the discussions in the change log below.

> The first patch expands our check to allow dedupe of a file if the
> user owns it or otherwise would be allowed to write to it.
[...]
> The other problem we have is also related to forcing the user to open
> target files for write - A process trying to exec a file currently
> being deduped gets ETXTBUSY. The answer (as above) is to allow them to
> open the targets ro - root can already do this. There was a patch from
> Adam Borowski to fix this back in 2016

> The 2nd patch fixes our return code for permission denied to be
> EPERM. For some reason we're returning EINVAL - I think that's
> probably my fault. At any rate, we need to be returning something
> descriptive of the actual problem, otherwise callers see EINVAL and
> can't really make a valid determination of what's gone wrong.

Note that the counterpart of these two patches for BTRFS_IOC_DEFRAG, which
fixes the same issues, is included in btrfs' for-next, slated for 4.19. 
While technically dedupe and defrag are independent, there would be somewhat
less confusion if both behave the same in the same kernel version.

Thus, it'd be nice if you would consider taking this.  Should be safe:
even the permission check is paranoid.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ So a Hungarian gypsy mountainman, lumberjack by day job,
⣾⠁⢰⠒⠀⣿⡁ brigand by, uhm, hobby, invented a dish: goulash on potato
⢿⡄⠘⠷⠚⠋⠀ pancakes.  Then the Polish couldn't decide which of his
⠈⠳⣄ adjectives to use for the dish's name.


Re: BTRFS and databases

2018-08-01 Thread Adam Borowski
On Wed, Aug 01, 2018 at 05:45:15AM +0200, MegaBrutal wrote:
> But there is still one question that I can't get over: if you store a
> database (e.g. MySQL), would you prefer having a BTRFS volume mounted
> with nodatacow, or would you just simply use ext4?
> 
> I know that with nodatacow, I take away most of the benefits of BTRFS
> (those are actually hurting database performance – the exact CoW
> nature that is elsewhere a blessing, with databases it's a drawback).
> But are there any advantages of still sticking to BTRFS for a database
> albeit CoW is disabled, or should I just return to the old and
> reliable ext4 for those applications?

Is this database performance-critical?

If yes, you'd want ext4 -- nocow is a crappy ext4 lookalike, with no
benefits of btrfs.  Or, if you snapshot it, you get bad fragmentation yet no
checksums/etc.

If no, regular cow (especially with autodefrag) will be enough.  Sure, this
particular load won't be as performant (mysql really loves fsync, which is
an anathema to btrfs), but you get all the data safety improvements,
frequent cheap backups, and so on.

Thus: if the server's primary purpose is that database, you don't want
btrfs.  If the database is merely incidental, not microoptimizing it will
save a lot of your time.

In neither case is nocow a good idea.  Especially if raid (!= 0) is
involved.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


Re: [PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-07-24 Thread Adam Borowski
On Mon, Jul 23, 2018 at 05:26:24PM +0200, David Sterba wrote:
> On Wed, Jul 18, 2018 at 12:08:59AM +0200, Adam Borowski wrote:
(Quoting the combined, as-folded commit message:)

| | btrfs: allow defrag on a file opened read-only that has rw permissions
| |
> > Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
> > whenever you try to defrag a program that's currently being run, or
> > causing intermittent exec failures on a live system being defragged.
> > 
> > As defrag doesn't change the file's contents in any way, there's no reason
> > to consider it a rw operation.  Thus, let's check only whether the file
> > could have been opened rw.  Such access control is still needed as
> > currently defrag can use extra disk space, and might trigger bugs.
<-
| | We give EINVAL when the request is invalid; here it's ok but merely the
| | user has insufficient privileges.  Thus, this return value reflects the
| | error better -- as discussed in the identical case for dedupe.
| |
| | According to codesearch.debian.net, no userspace program distinguishes
| | these values beyond strerror().
| |
| | Signed-off-by: Adam Borowski 
| | Reviewed-by: David Sterba 
| | [ fold the EPERM patch from Adam ]
| | Signed-off-by: David Sterba 

[...]
> So, I'll add the patch to 4.19 queue. It's small and isolated change so
> a revert would be easy in case we find something bad. The 2nd patch
> should be IMHO part of this change as it's logical to return the error
> code in the patch that modifies the user visible behaviour.

A nitpick: the new commit message has a dangling pointer "this" to the title
of the commit that was squashed.  It was:

| btrfs: defrag: return EPERM not EINVAL when only permissions fail

It'd be nice if it could be inserted in some form in the place I marked with
an arrow.

But then, commit messages are not vital.  The actual functionality patch has
been applied correctly.  And thanks for adding the comment.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


[PATCH resend 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-07-17 Thread Adam Borowski
Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
whenever you try to defrag a program that's currently being run, or
causing intermittent exec failures on a live system being defragged.

As defrag doesn't change the file's contents in any way, there's no reason
to consider it a rw operation.  Thus, let's check only whether the file
could have been opened rw.  Such access control is still needed as
currently defrag can use extra disk space, and might trigger bugs.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 43ecbe620dea..01c150b6ab62 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2941,7 +2941,8 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
ret = btrfs_defrag_root(root);
break;
case S_IFREG:
-   if (!(file->f_mode & FMODE_WRITE)) {
+   if (!capable(CAP_SYS_ADMIN) &&
+   inode_permission(inode, MAY_WRITE)) {
ret = -EINVAL;
goto out;
}
-- 
2.18.0



[PATCH resend 2/2] btrfs: defrag: return EPERM not EINVAL when only permissions fail

2018-07-17 Thread Adam Borowski
We give EINVAL when the request is invalid; here it's ok but merely the
user has insufficient privileges.  Thus, this return value reflects the
error better -- as discussed in the identical case for dedupe.

According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 01c150b6ab62..e96e3c3caca1 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2943,7 +2943,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
case S_IFREG:
if (!capable(CAP_SYS_ADMIN) &&
inode_permission(inode, MAY_WRITE)) {
-   ret = -EINVAL;
+   ret = -EPERM;
goto out;
}
 
-- 
2.18.0



[PATCH resend 0/2] btrfs: fix races between exec and defrag

2018-07-17 Thread Adam Borowski
Hi!
Here's a ping for a patch to fix ETXTBSY races between defrag and exec, just
like the dedupe counterpart.  Unlike that one which is shared to multiple
filesystems and thus lives in Al Viro's land, it is btrfs only.

Attached: a simple tool to fragment a file, by ten O_SYNC rewrites of length
1 at random positions; racey vs concurrent writes or execs but shouldn't
damage the file otherwise.

Also attached: a preliminary patch for -progs; it yet lacks a check for the
kernel version, but to add such a check we'd need to know which kernels
actually permit ro defrag for non-root.

No man page patch -- there's no man page to be patched...


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static void die(const char *txt, ...) __attribute__((format (printf, 1, 2)));
static void die(const char *txt, ...)
{
fprintf(stderr, "fragme: ");

va_list ap;
va_start(ap, txt);
vfprintf(stderr, txt, ap);
va_end(ap);

exit(1);
}

static uint64_t rnd(uint64_t max)
{
__uint128_t r;
if (syscall(SYS_getrandom, &r, sizeof(r), 0)==-1)
die("getrandom(): %m\n");
return r%max;
}

int main(int argc, char **argv)
{
if (argc!=2)
die("Usage: fragme \n");

int fd = open(argv[1], O_RDWR|O_SYNC);
if (fd == -1)
die("open(\"%s\"): %m\n", argv[1]);
off_t size = lseek(fd, 0, SEEK_END);
if (size == -1)
die("lseek(SEEK_END): %m\n");

for (int i=0; i<10; ++i)
{
off_t off = rnd(size);
char b;
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for read: %m\n");
if (read(fd, &b, 1) != 1)
die("read(%lu): %m\n", off);
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for write: %m\n");
if (write(fd, &b, 1) != 1)
die("write: %m\n");
    }

return 0;
}
From d040af09adb03daadbba4336700f40425a860320 Mon Sep 17 00:00:00 2001
From: Adam Borowski <kilob...@angband.pl>
Date: Tue, 28 Nov 2017 01:00:21 +0100
Subject: [PATCH] defrag: open files RO

NOT FOR MERGING -- requires kernel versioning

Fixes EXTXBSY races.

Signed-off-by: Adam Borowski 
---
 cmds-filesystem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 30a50bf5..7eb6b7bb 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -876,7 +876,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
 	if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
 		if (defrag_global_verbose)
 			printf("%s\n", fpath);
-		fd = open(fpath, O_RDWR);
+		fd = open(fpath, O_RDONLY);
 		if (fd < 0) {
 			goto error;
 		}
@@ -1012,7 +1012,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
 		int defrag_err = 0;
 
 		dirstream = NULL;
-		fd = open_file_or_dir(argv[i], &dirstream);
+		fd = open_file_or_dir3(argv[i], &dirstream, O_RDONLY);
 		if (fd < 0) {
 			error("cannot open %s: %m", argv[i]);
 			ret = -errno;
-- 
2.18.0



Re: unsolvable technical issues?

2018-06-28 Thread Adam Borowski
On Wed, Jun 27, 2018 at 08:50:11PM +0200, waxhead wrote:
> Chris Murphy wrote:
> > On Thu, Jun 21, 2018 at 5:13 PM, waxhead  wrote:
> > > According to this:
> > > 
> > > https://stratis-storage.github.io/StratisSoftwareDesign.pdf
> > > Page 4 , section 1.2
> > > 
> > > It claims that BTRFS still have significant technical issues that may 
> > > never
> > > be resolved.
> > > Could someone shed some light on exactly what these technical issues might
> > > be?! What are BTRFS biggest technical problems?
> > 
> > 
> > I think it's appropriate to file an issue and ask what they're
> > referring to. It very well might be use case specific to Red Hat.
> > https://github.com/stratis-storage/stratis-storage.github.io/issues

> https://github.com/stratis-storage/stratis-storage.github.io/issues/1
> 
> Apparently the author has toned down the wording a bit; this confirms that
> the claim was without basis and probably based on "popular myth".
> The document the PDF links to is not yet updated.

It's a company whose profits rely on users choosing it over anything that
competes.  Adding propaganda to a public document is a natural thing for
them to do.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious.  It's also interesting to see OSes
⠈⠳⣄ go back and forth wrt their intended target.


Re: [PATCH RFC] btrfs: Do extra device generation check at mount time

2018-06-28 Thread Adam Borowski
On Thu, Jun 28, 2018 at 03:04:43PM +0800, Qu Wenruo wrote:
> There is a reporter considering btrfs raid1 has a major design flaw
> which can't handle nodatasum files.
> 
> Despite his incorrect expectation, btrfs indeed doesn't handle device
> generation mismatch well.
> 
> This means if one device went missing and re-appeared, even if its generation
> no longer matches that of the rest of the device pool, btrfs does nothing to
> it, but treats it as a normal good device.
> 
> At least let's detect such generation mismatch and avoid mounting the
> fs.

Uhm, that'd be a nasty regression for the regular (no-nodatacow) case.
The vast majority of data is fine, and extents that were written to while a
device was missing will either be placed elsewhere (if the filesystem knew it
was degraded), or a read of the stale copy will notice a wrong checksum and
automatically recover (if the device was still falsely believed to be good at
write time).

We currently don't have selective scrub yet so resyncing such single-copy
extents is costly, but 1. all will be fine if the data is read, 2. it's
possible to add such a smart resync in the future, far better than a
write-intent bitmap can do.

To do the latter, we can note the last generation the filesystem was known
to be fully coherent (ie, all devices were successfully flushed with no
mysterious write failures), then run selective scrub (perhaps even
automatically) when the filesystem is no longer degraded.  There's some
extra complexity with 3- or 4-way RAID (multiple levels of degradation) but
a single number would help even there.

But even currently, without the above not-yet-written recovery, it's
reasonably safe to continue without scrub -- it's a case of running
partially degraded when the bad copy is already known to be suspicious.

For no-nodatacow data and metadata, that is.

> Currently there is no automatic rebuild yet, which means if users find
> device generation mismatch error message, they can only mount the fs
> using "device" and "degraded" mount option (if possible), then replace
> the offending device to manually "rebuild" the fs.

As nodatacow already means "I don't care about this data, or have another
way of recovering it", I don't quite get why we would drop existing
auto-recovery for a common transient failure case.

If you're paranoid, perhaps some bit "this filesystem has some nodatacow
data on it" could warrant such a block, but it would still need to be
overridable _without_ a need for replace.  There's also the problem that
systemd marks its journal nodatacow (despite it having infamously bad
handling of failures!), and too many distributions infect their default
installs with systemd, meaning such a bit would be on in most cases.

But why would I put all my other data at risk, just because there's a
nodatacow file?  There's a big difference between scrubbing when only a few
transactions worth of data is suspicious and completely throwing away a
mostly-good replica to replace it from the now fully degraded copy.

> I totally understand that, generation based solution can't handle
> split-brain case (where 2 RAID1 devices get mounted degraded separately)
> at all, but at least let's handle what we can do.

Generation checks do well at least unless both devices were mounted elsewhere
and got the exact same number of transactions; the problem is that nodatacow
doesn't bump the generation number.

> The best way to solve the problem is to make btrfs treat such lower gen
> devices as some kind of missing device, and queue an automatic scrub for
> that device.
> But that's a lot of extra work, at least let's start from detecting such
> problem first.

I wonder if there's some way to treat problematic nodatacow files as
degraded only?

Nodatacow misses most of btrfs mechanisms, thus to get it done right you'd
need to pretty much copy all of md's logic, with a write-intent bitmap or an
equivalent.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ There's an easy way to tell toy operating systems from real ones.
⣾⠁⢰⠒⠀⣿⡁ Just look at how their shipped fonts display U+1F52B, this makes
⢿⡄⠘⠷⠚⠋⠀ the intended audience obvious.  It's also interesting to see OSes
⠈⠳⣄ go back and forth wrt their intended target.


[PATCH] defrag: open files RO

2018-05-21 Thread Adam Borowski
NOT FOR MERGING -- requires kernel versioning

Fixes ETXTBSY races.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 cmds-filesystem.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 30a50bf5..7eb6b7bb 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -876,7 +876,7 @@ static int defrag_callback(const char *fpath, const struct stat *sb,
if ((typeflag == FTW_F) && S_ISREG(sb->st_mode)) {
if (defrag_global_verbose)
printf("%s\n", fpath);
-   fd = open(fpath, O_RDWR);
+   fd = open(fpath, O_RDONLY);
if (fd < 0) {
goto error;
}
@@ -1012,7 +1012,7 @@ static int cmd_filesystem_defrag(int argc, char **argv)
int defrag_err = 0;
 
dirstream = NULL;
-   fd = open_file_or_dir(argv[i], &dirstream);
+   fd = open_file_or_dir3(argv[i], &dirstream, O_RDONLY);
if (fd < 0) {
error("cannot open %s: %m", argv[i]);
ret = -errno;
-- 
2.17.0



[PATCH 2/2] btrfs: defrag: return EPERM not EINVAL when only permissions fail

2018-05-21 Thread Adam Borowski
We give EINVAL when the request is invalid; here it's ok but merely the
user has insufficient privileges.  Thus, this return value reflects the
error better -- as discussed in the identical case for dedupe.

According to codesearch.debian.net, no userspace program distinguishes
these values beyond strerror().

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fs/btrfs/ioctl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index b75db9d72106..ae6a110987a7 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2563,7 +2563,7 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
case S_IFREG:
if (!capable(CAP_SYS_ADMIN) &&
inode_permission(inode, MAY_WRITE)) {
-   ret = -EINVAL;
+   ret = -EPERM;
goto out;
}
 
-- 
2.17.0



[PATCH 1/2] btrfs: allow defrag on a file opened ro that has rw permissions

2018-05-21 Thread Adam Borowski
Requiring a rw descriptor conflicts both ways with exec, returning ETXTBSY
whenever you try to defrag a program that's currently being run, or
causing intermittent exec failures on a live system being defragged.

As defrag doesn't change the file's contents in any way, there's no reason
to consider it a rw operation.  Thus, let's check only whether the file
could have been opened rw.  Such access control is still needed as
currently defrag can use extra disk space, and might trigger bugs.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 632e26d6f7ce..b75db9d72106 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2561,7 +2561,8 @@ static int btrfs_ioctl_defrag(struct file *file, void __user *argp)
ret = btrfs_defrag_root(root);
break;
case S_IFREG:
-   if (!(file->f_mode & FMODE_WRITE)) {
+   if (!capable(CAP_SYS_ADMIN) &&
+   inode_permission(inode, MAY_WRITE)) {
ret = -EINVAL;
goto out;
}
-- 
2.17.0



[PATCH 0/2] btrfs: fix races between exec and defrag

2018-05-21 Thread Adam Borowski
Hi!
Here's a patch to fix ETXTBSY races between defrag and exec -- similar to
what was just submitted for dedupe, even to the point of being followed by
a second patch that replaces EINVAL with EPERM.

As defrag is not something you're going to do on files you don't write, I
skipped complex rules and I'm sending the original version of the patch
as-is.  It has stewed in my tree for two years (long story...), tested on
multiple machines.

Attached: a simple tool to fragment a file, by ten O_SYNC rewrites of length
1 at random positions; racey vs concurrent writes or execs but shouldn't
damage the file otherwise.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static void die(const char *txt, ...) __attribute__((format (printf, 1, 2)));
static void die(const char *txt, ...)
{
fprintf(stderr, "fragme: ");

va_list ap;
va_start(ap, txt);
vfprintf(stderr, txt, ap);
va_end(ap);

exit(1);
}

static uint64_t rnd(uint64_t max)
{
__uint128_t r;
if (syscall(SYS_getrandom, &r, sizeof(r), 0)==-1)
die("getrandom(): %m\n");
return r%max;
}

int main(int argc, char **argv)
{
if (argc!=2)
die("Usage: fragme \n");

int fd = open(argv[1], O_RDWR|O_SYNC);
if (fd == -1)
die("open(\"%s\"): %m\n", argv[1]);
off_t size = lseek(fd, 0, SEEK_END);
if (size == -1)
die("lseek(SEEK_END): %m\n");

for (int i=0; i<10; ++i)
{
off_t off = rnd(size);
char b;
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for read: %m\n");
if (read(fd, &b, 1) != 1)
die("read(%lu): %m\n", off);
if (lseek(fd, off, SEEK_SET) != off)
die("lseek for write: %m\n");
if (write(fd, &b, 1) != 1)
die("write: %m\n");
}

return 0;
}


Re: [PATCH v2 2/3] btrfs: balance: add args info during start and resume

2018-05-16 Thread Adam Borowski
On Wed, May 16, 2018 at 10:57:57AM +0300, Nikolay Borisov wrote:
> On 16.05.2018 05:51, Anand Jain wrote:
> > Balance args info is an important information to be reviewed for the
> > system audit. So this patch adds it to the kernel log.
> > 
> > Example:
> > 
> > -> btrfs bal start -dprofiles='raid1|single',convert=raid5 
> > -mprofiles='raid1|single',convert=raid5 /btrfs
> > 
> >  kernel: BTRFS info (device sdb): balance: start data profiles=raid1|single 
> > convert=raid5 metadata profiles=raid1|single convert=raid5 system 
> > profiles=raid1|single convert=raid5
> > 
> > -> btrfs bal start -dprofiles=raid5,convert=single 
> > -mprofiles='raid1|single',convert=raid5 --background /btrfs
> > 
> >  kernel: BTRFS info (device sdb): balance: start data profiles=raid5 
> > convert=single metadata profiles=raid1|single convert=raid5 system 
> > profiles=raid1|single convert=raid5
> > 
> > Signed-off-by: Anand Jain 
> 
> Why can't this code be part of progs, the bctl which you are parsing is
> constructed from the arguments passed from users space? I think you are
> adding way too much string parsing code to the kernel and this is never
> a good sign, since it's very easy to trip.

progs are not the only way to start a balance, they're merely the most
widespread one.  For example, Hans van Kranenburg has some smarter scripts
among his tools -- currently only of "example" quality, but quite useful
already.  "balance_least_used" works in greedy order (least used first) with
nice verbose output.  It's not unlikely that he or someone else will improve
this further.  Thus, I really think the logging should be kernel-side.

On the other hand, the string producing (not parsing) code is so repetitive
that it indeed could use some refactoring.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 1/2] vfs: allow dedupe of user owned read-only files

2018-05-13 Thread Adam Borowski
On Sun, May 13, 2018 at 06:16:53PM +, Mark Fasheh wrote:
> On Sat, May 12, 2018 at 04:49:20AM +0200, Adam Borowski wrote:
> > On Fri, May 11, 2018 at 12:26:50PM -0700, Mark Fasheh wrote:
> > > The permission check in vfs_dedupe_file_range() is too coarse - We
> > > only allow dedupe of the destination file if the user is root, or
> > > they have the file open for write.
> > > 
> > > This effectively limits a non-root user from deduping their own
> > > read-only files. As file data during a dedupe does not change,
> > > this is unexpected behavior and this has caused a number of issue
> > > reports.
[...]
> > > So change the check so we allow dedupe on the target if:
> > > 
> > > - the root or admin is asking for it
> > > - the owner of the file is asking for the dedupe
> > > - the process has write access
> > 
> > I submitted a similar patch in May 2016, yet it has never been applied
> > despite multiple pings, with no NAK.  My version allowed dedupe if:
> > - the root or admin is asking for it
> > - the file has w permission (on the inode -- ie, could have been opened rw)
> 
> Ahh, yes I see that now. I did wind up acking it too :)
> > 
> > I like this new version better than mine: "root or owner or w" is more
> > Unixy than "could have been opened w".
> 
> I agree, IMHO the behavior in this patch is intuitive. What we had before
> would surprise users.

Actually, there's one reason to still consider "could have been opened w":
with it, deduplication programs can simply open the file r and not care
about ETXTBSY at all.  Otherwise, every program needs to stat() and have
logic to pick the proper argument to the open() call (r if owner/root,
rw or w if not).
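
To make that concrete, here's roughly the dance every tool would need under
"root or owner or w" -- a sketch with a made-up helper name, not anything
from duperemove:

/* pick_dedupe_fd() is hypothetical; error handling kept minimal.
 * (The kernel compares against fsuid; geteuid is close enough here.) */
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

int pick_dedupe_fd(const char *path)
{
	struct stat st;
	if (stat(path, &st))
		return -1;
	/* root or owner: read-only is enough, and can't hit ETXTBSY */
	if (!geteuid() || geteuid() == st.st_uid)
		return open(path, O_RDONLY);
	/* others must hold the file open for write to pass the check */
	return open(path, O_RDWR);
}

With "could have been opened w", the whole function collapses to a single
open(path, O_RDONLY).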

I also have a sister patch: btrfs_ioctl_defrag wants the same change, for
the very same reason.  But, let's discuss dedupe first to avoid unnecessary
round trips.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 2/2] vfs: dedupe should return EPERM if permission is not granted

2018-05-13 Thread Adam Borowski
On Fri, May 11, 2018 at 05:06:34PM -0700, Darrick J. Wong wrote:
> On Fri, May 11, 2018 at 12:26:51PM -0700, Mark Fasheh wrote:
> > Right now we return EINVAL if a process does not have permission to dedupe a
> > file. This was an oversight on my part. EPERM gives a true description of
> > the nature of our error, and EINVAL is already used for the case that the
> > filesystem does not support dedupe.

> > -   info->status = -EINVAL;
> > +   info->status = -EPERM;
> 
> Hmm, are we allowed to change this aspect of the kabi after the fact?
> 
> Granted, we're only trading one error code for another, but will the
> existing users of this care?  xfs_io won't and I assume duperemove won't
> either, but what about bees? :)

There's more:
https://codesearch.debian.net/search?q=FILE_EXTENT_SAME

This includes only software that has been packaged for Debian (notably, not
bees), but that gives enough interesting coverage.  And none of these cases
discriminate between error codes -- they merely report them to the user.

Thus, I can't think of a downside of making the error code more accurate.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 1/2] vfs: allow dedupe of user owned read-only files

2018-05-11 Thread Adam Borowski
On Fri, May 11, 2018 at 12:26:50PM -0700, Mark Fasheh wrote:
> The permission check in vfs_dedupe_file_range() is too coarse - We
> only allow dedupe of the destination file if the user is root, or
> they have the file open for write.
> 
> This effectively limits a non-root user from deduping their own
> read-only files. As file data during a dedupe does not change,
> this is unexpected behavior and this has caused a number of issue
> reports. For an example, see:
> 
> https://github.com/markfasheh/duperemove/issues/129
> 
> So change the check so we allow dedupe on the target if:
> 
> - the root or admin is asking for it
> - the owner of the file is asking for the dedupe
> - the process has write access

I submitted a similar patch in May 2016, yet it has never been applied
despite multiple pings, with no NAK.  My version allowed dedupe if:
- the root or admin is asking for it
- the file has w permission (on the inode -- ie, could have been opened rw)

There was a request to include in xfstests a test case for the ETXTBSY race
this patch fixes, but there's no reasonable way to make such a test case:
the race condition is not a bug, it's write-xor-exec working as designed.

Another idea discussed was about possibly just allowing everyone who can
open the file to deduplicate it, as the file contents are not modified in
any way.  Zygo Blaxell expressed a concern that it could be used by an
unprivileged user who can trigger a crash to abuse writeout bugs.

I like this new version better than mine: "root or owner or w" is more
Unixy than "could have been opened w".

> Signed-off-by: Mark Fasheh 
> ---
>  fs/read_write.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/read_write.c b/fs/read_write.c
> index c4eabbfc90df..77986a2e2a3b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -2036,7 +2036,8 @@ int vfs_dedupe_file_range(struct file *file, struct 
> file_dedupe_range *same)
>  
>   if (info->reserved) {
>   info->status = -EINVAL;
> - } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE))) {
> + } else if (!(is_admin || (dst_file->f_mode & FMODE_WRITE) ||
> +  uid_eq(current_fsuid(), dst->i_uid))) {
I had:
  + } else if (!(is_admin || !inode_permission(dst, MAY_WRITE))) {
>   info->status = -EINVAL;
>   } else if (file->f_path.mnt != dst_file->f_path.mnt) {
>   info->status = -EXDEV;
> -- 

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [PATCH 2/2] btrfs: add balance args info during start and resume

2018-04-26 Thread Adam Borowski
On Thu, Apr 26, 2018 at 04:01:29PM +0800, Anand Jain wrote:
> Balance args info is an important information to be reviewed on the
> system under audit. So this patch adds that.

This kept annoying me.  Thanks a lot!

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ Certified airhead; got the CT scan to prove that!
⠈⠳⣄ 


Re: [wiki] Please clarify how to check whether barriers are properly implemented in hardware

2018-04-02 Thread Adam Borowski
On Mon, Apr 02, 2018 at 10:07:01PM +, Hugo Mills wrote:
> On Mon, Apr 02, 2018 at 06:03:00PM -0400, Fedja Beader wrote:
> > Is there some testing utility for this? Is there a way to extract this/tell 
> > with a high enough certainty from datasheets/other material before purchase?
> 
>Given that not implementing barriers is basically a bug in the
> hardware [for SATA or SAS], I don't think anyone's going to specify
> anything other than "fully suppors barriers" in their datasheets.
> 
>I don't know of a testing tool. It may not be obvious that barriers
> aren't being honoured without doing things like power-failure testing.

And you'd need to do a lot of power-cycling during writes, with various
write patterns -- as unless you have a blatant case of "let's lie about
barriers to make benchmarks better than the competition", where barriers are
consistently absent, it might be a genuine bug in a well-meaning controller
that at least tries but sometimes fails to honour them.  The intentional
case is usually easy to detect -- but just wait to get volkswagenized. :/
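
If you do want to test, the classic diskchecker.pl approach fits in a few
lines of C -- a rough sketch, assuming you ship the acknowledged numbers
somewhere that survives the power cut (serial console, ssh from another box):

/* fsync/barrier probe sketch: run "probe write f" on the victim, cut the
 * power mid-run, then "probe check f" after reboot.  The value on disk may
 * never be lower than the last number printed before the cut. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc != 3)
		return fprintf(stderr, "Usage: probe write|check <file>\n"), 1;
	int fd = open(argv[2], O_RDWR|O_CREAT, 0666);
	if (fd == -1)
		return perror("open"), 1;
	uint64_t seq = 0;
	if (!strcmp(argv[1], "write")) {
		while (1) {
			seq++;
			if (pwrite(fd, &seq, sizeof(seq), 0) != sizeof(seq)
			    || fsync(fd))
				return perror("pwrite/fsync"), 1;
			/* durability is claimed only after fsync returns */
			printf("%" PRIu64 "\n", seq);
			fflush(stdout);
		}
	}
	if (pread(fd, &seq, sizeof(seq), 0) != sizeof(seq))
		return perror("pread"), 1;
	printf("on disk: %" PRIu64 "\n", seq);
	return 0;
}

A single run proves little, of course -- per the above, you'd need many
cycles and write patterns to catch the "sometimes" bugs.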


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ 
⢿⡄⠘⠷⠚⠋⠀ ... what's the frequency of that 5V DC?
⠈⠳⣄


Re: Question, will ls -l eventually be able to show subvolumes?

2018-03-30 Thread Adam Borowski
On Fri, Mar 30, 2018 at 10:42:10AM +0100, Pete wrote:
> I've just notice work going on to make rmdir be able to delete
> subvolumes.  Is there an intent to allow ls -l to display directories as
> subvolumes?

That's entirely up to coreutils guys.


Meow!
-- 
ᛊᚨᚾᛁᛏᚣ᛫ᛁᛊ᛫ᚠᛟᚱ᛫ᚦᛖ᛫ᚹᛖᚨᚲ


Re: Raid1 volume stuck as read-only: How to dump, recreate and restore its content?

2018-03-11 Thread Adam Borowski
On Sun, Mar 11, 2018 at 11:28:08PM +0700, Andreas Hild wrote:
> Following a physical disk failure of a RAID1 array, I tried to mount
> the remaining volume of a root partition with "-o degraded". For some
> reason it ended up as read-only as described here:
> https://btrfs.wiki.kernel.org/index.php/Gotchas#raid1_volumes_only_mountable_once_RW_if_degraded
> 
> 
> How to precisely do this: dump, recreate and restore its contents?
> Could someone please provided more details how to recover this volume
> safely?

> Linux debian 4.9.0-4-amd64 #1 SMP Debian 4.9.65-3 (2017-12-03) x86_64 
> GNU/Linux

> [ 1313.279140] BTRFS warning (device sdb2): missing devices (1)
> exceeds the limit (0), writeable mount is not allowed

I'd recommend instead going with kernel 4.14 or newer (available in
stretch-backports), which handles this case well without the need to
restore.  If there's no actual data loss (there shouldn't be, it's RAID1
with only a single device missing), you can mount degraded normally, then
balance the data onto the new disk.
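
Roughly (a sketch -- device names assumed, <devid> being the id of the
missing disk as printed by "btrfs fi show"):

    mount -o degraded /dev/sdb2 /mnt
    btrfs replace start <devid> /dev/sdc2 /mnt

or the older route: "btrfs device add" plus "btrfs device delete missing",
followed by a -dconvert=raid1,soft -mconvert=raid1,soft balance to mop up
any single chunks written while degraded.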

Recovery with 4.9 is unpleasant.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: zerofree btrfs support?

2018-03-10 Thread Adam Borowski
On Sat, Mar 10, 2018 at 07:37:22PM +0500, Roman Mamedov wrote:
> Note you can use it on HDDs too, even without QEMU and the like: via using LVM
> "thin" volumes. I use that on a number of machines, the benefit is that since
> TRIMed areas are "stored nowhere", those partitions allow for incredibly fast
> block-level backups, as it doesn't have to physically read in all the free
> space, let alone any stale data in there. LVM snapshots are also way more
> efficient with thin volumes, which helps during backup.

Since we're on a btrfs mailing list, if you use qemu, you really want
sparse format:raw instead of qcow2 or preallocated raw.  This also works
great with TRIM.

> > Back then it didn't seem to work.
> 
> It works, just not with some of the QEMU virtualized disk device drivers.
> You don't need to use qemu-img to manually dig holes either, it's all
> automatic.

It works only with scsi and virtio-scsi drivers.  Most qemu setups use
either ide (ouch!) or virtio-blk.

You'd obviously want virtio-scsi; note that defconfig enables virtio-blk but
not virtio-scsi; I assume most distribution kernels have both.  It's a bit
tedious to switch between the two as -blk is visible as /dev/vda while -scsi
as /dev/sda.
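
For reference, wiring up virtio-scsi with working discard looks roughly
like this (qemu option names quoted from memory -- double-check against
your version):

    qemu-system-x86_64 ... \
        -device virtio-scsi-pci,id=scsi0 \
        -drive file=disk.raw,format=raw,if=none,id=hd0,discard=unmap \
        -device scsi-hd,bus=scsi0.0,drive=hd0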


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: zerofree btrfs support?

2018-03-10 Thread Adam Borowski
On Sat, Mar 10, 2018 at 03:55:25AM +0100, Christoph Anton Mitterer wrote:
> Just wondered... was it ever planned (or is there some equivalent) to
> get support for btrfs in zerofree?

Do you want zerofree for thin storage optimization, or for security?

For the former, you can use fstrim; this is enough on any modern SSD.  On a
HDD you can rig the block device to simulate TRIM by turning discards into
writes of zeroes.  I'm sure one of the dm-* targets can do this (if not, it
should be easy to add); there's also qemu-nbd, which allows control over
discard but incurs a performance penalty compared to playing with the block
layer.
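
For example, assuming your qemu-nbd is recent enough to have these switches:

    qemu-nbd --connect=/dev/nbd0 --discard=unmap --detect-zeroes=unmap disk.raw

after which discards (and writes of zeroes) against /dev/nbd0 punch holes
in disk.raw.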

For zerofree-for-security, you'd first need to defrag (to dislodge partially
pinned extents), then do a full balance to avoid data left over in metadata
nodes and in blocks beyond file ends (note that zerofree doesn't handle
those on traditional filesystems either).
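
That is, something along these lines (--full-balance being the spelling
newer progs want for a filterless balance):

    btrfs filesystem defragment -r /mnt
    btrfs balance start --full-balance /mnt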


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: Metadata / Data on Heterogeneous Media

2018-02-15 Thread Adam Borowski
On Thu, Feb 15, 2018 at 12:15:49PM -0500, Ellis H. Wilson III wrote:
> In discussing the performance of various metadata operations over the past
> few days I've had this idea in the back of my head, and wanted to see if
> anybody had already thought about it before (likely, I would guess).
>
> It appears based on this page:
> https://btrfs.wiki.kernel.org/index.php/Btrfs_design
> that data and metadata in BTRFS are fairly well isolated from one another,
> particularly in the case of large files.  This appears reinforced by a
> recent comment from Qu ("...btrfs strictly
> split metadata and data usage...").
> 
> Yet, while there are plenty of options to RAID0/1/10/etc across generally
> homogeneous media types, there doesn't appear to be any functionality (at
> least that I can find) to segment different BTRFS internals to different
> types of devices.  E.G., place metadata trees and extent block groups on
> SSD, and data trees and extent block groups on HDD(s).
> 
> Is this something that has already been considered (and if so, implemented,
> which would make me extremely happy)?  Is it feasible it is hasn't been
> approached yet?  I admit my internal knowledge of BTRFS is fleeting, though
> I'm trying to work on that daily at this time, so forgive me if this is
> unapproachable for obvious architectural reasons.

Considered: many times.  It's an obvious improvement, and one that shouldn't
even be that hard to implement.  What remains is SMoC then SMoR (a Simple
Matter of Coding, then a Simple Matter of Review), but both of those are in
short supply.

After the maximum size of inline extents has been lowered, there's no real
point in putting different types of metadata or not-really-metadata on
different media: thus, existing split of data -vs- metadata block groups is
fine.

What you'd want is an ability to tell the block allocator that metadata
block groups should prefer device[s] A, while data ones, device[s] B.

Right now, the allocator's algorithm is: any new allocation is placed on the
device that has the most available space, the 2nd/etc stripe of a RAID chunk
obviously excluding the device the 1st one has already been placed on.  This
is optimal wrt not wasting space, but doesn't always provide the best
performance,
especially when devices' speed varies.  There are also other downsides, like
usual RAID10 having 2/3 chance for tolerating two missing devices, while
btrfs RAID10 almost guarantees massive data loss with two missing devices.
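
To make the current rule concrete, a toy userspace model -- not kernel
code, and the prefer_meta field is the hypothetical policy knob under
discussion, not something btrfs has today:

#include <stdio.h>
#include <stdlib.h>

struct dev { const char *name; long long free; int prefer_meta; };

/* sort descending by free space, like the chunk allocator does */
static int by_free_desc(const void *a, const void *b)
{
	long long d = ((const struct dev *)b)->free
		    - ((const struct dev *)a)->free;
	return (d > 0) - (d < 0);
}

int main(void)
{
	struct dev devs[] = {
		{ "nvme0", 120LL << 30, 1 },
		{ "hdd0", 2800LL << 30, 0 },
		{ "hdd1", 2300LL << 30, 0 },
	};
	int n = sizeof(devs) / sizeof(*devs);

	qsort(devs, n, sizeof(*devs), by_free_desc);
	/* today: both chunk types land on the most-free device first */
	printf("data and metadata both go to %s\n", devs[0].name);
	/* with a policy: a metadata chunk would scan for preferred
	 * devices first, falling back to the plain order */
	for (int i = 0; i < n; i++)
		if (devs[i].prefer_meta) {
			printf("policy: metadata goes to %s\n", devs[i].name);
			break;
		}
	return 0;
}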

Thus, allowing to specify an allocation policy that alters this algorithm
would be the way to go.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Adam Borowski
On Sun, Feb 11, 2018 at 12:31:42PM +0300, Andrei Borzenkov wrote:
> 11.02.2018 04:02, Hans van Kranenburg wrote:
> >> - /dev/sda6 / btrfs
> >> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
> >> 0 0
> > 
> > Note that changes on atime cause writes to metadata, which means cowing
> > metadata blocks and unsharing them from a previous snapshot, only when
> > using the filesystem, not even when changing things (!).
> 
> With relatime atime is updated only once after file was changed. So your
> description is not entirely accurate and things should not be that
> dramatic unless files are continuously being changed.

Alas, that's untrue.  relatime updates happen if:
* the file has been written after it was last read, or
* previous atime was older than 24 hours

Thus, you get at least one unshare per inode per day, which is also the most
widespread frequency of both snapshotting and cronjobs.
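
In code, the rule is roughly this (a userspace paraphrase of the kernel's
relatime_need_update() in fs/inode.c):

#include <time.h>

struct itimes { time_t atime, mtime, ctime; };

static int relatime_need_update(const struct itimes *t, time_t now)
{
	if (t->mtime >= t->atime)	/* written since last atime update */
		return 1;
	if (t->ctime >= t->atime)	/* inode changed since last update */
		return 1;
	if (now - t->atime >= 24*60*60)	/* previous atime older than a day */
		return 1;
	return 0;
}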

Fortunately, most uses of atime are gone, thus it's generally safe to
disable it completely.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: degraded permanent mount option

2018-01-29 Thread Adam Borowski
On Mon, Jan 29, 2018 at 09:54:04AM +0100, Tomasz Pala wrote:
> On Sun, Jan 28, 2018 at 17:00:46 -0700, Chris Murphy wrote:
> 
> > systemd can't possibly need to know more information than a person
> > does in the exact same situation in order to do the right thing. No
> > human would wait 10 minutes, let alone literally the heat death of the
> > planet for "all devices have appeared" but systemd will. And it does
> 
> We're already repeating - systemd waits for THE btrfs-compound-device,
> not ALL the block-devices.

Because there is NO compound device.  You can't wait for something that
doesn't exist.  The user wants a filesystem, not some mythical compound
device, and since knowing whether we have enough requires doing most of the
mount work, we may as well complete the mount instead of backing off and
reporting, only for userspace to racily repeat the work.

> Just like it 'waits' for someone to plug USB pendrive in.

Plugging in a USB pendrive is an event -- there's no user request.  On the
other hand, we already know we want to mount -- the user requested so either
by booting ("please mount everything in fstab") or by an explicit mount
command.

So the event (the user's request) has already happened.  An rc system, of
which systemd is one, knows whether we reached the "want root filesystem" or
"want secondary filesystems" stage.  Once you're there, you can issue the
mount() call and let the kernel do the work.

> It is a btrfs choice to not expose compound device as separate one (like
> every other device manager does)

Btrfs is not a device manager, it's a filesystem.

> it is a btrfs drawback that doesn't provice anything else except for this
> IOCTL with it's logic

How can it provide you with something it doesn't yet have?  If you want the
information, call mount().  And as others in this thread have mentioned,
what, pray tell, would you want to know "would a mount succeed?" for if you
don't want to mount?

> it is a btrfs drawback that there is nothing to push assembling into "OK,
> going degraded" state

The way to do so is to timeout, then retry with -o degraded.
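
That's the entire policy an rc system needs, a dozen lines around mount(2)
-- a sketch, not systemd code:

#include <sys/mount.h>
#include <unistd.h>

int mount_with_fallback(const char *dev, const char *dir, int timeout_s)
{
	for (int waited = 0; waited < timeout_s; waited += 5) {
		if (!mount(dev, dir, "btrfs", 0, ""))
			return 0;	/* the kernel said yes: done */
		sleep(5);		/* more devices may yet show up */
	}
	/* timed out; falling back is the user's policy, not a daemon's */
	return mount(dev, dir, "btrfs", 0, "degraded");
}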

> I've told already - pretend the /dev/sda1 device doesn't
> exist until assembled.

It does... you're confusing a block device (a _part_ of the filesystem) with
the filesystem itself.  MD takes a bunch of such block devices and provides
you with another block device; btrfs takes a bunch of block devices and
provides you with a filesystem.

> If this overlapping usage was designed with 'easier mounting' on mind,
> this is simply bad design.

No other rc system but systemd has a problem.

> > that by its own choice, its own policy. That's the complaint. It's
> > choosing to do something a person wouldn't do, given identical
> > available information.
> 
> You are expecting systemd to mix in functions of kernel and udev.
> There is NO concept of 'assembled stuff' in systemd AT ALL.
> There is NO concept of 'waiting' in udev AT ALL.
> If you want to do some crazy interlayer shortcuts just implement btrfsd.

No, I don't want systemd, or any userspace daemon, to try knowing kernel
stuff better than the kernel.  Just call mount(), and that's it.

Let me explain via a car analogy.  There is a flood that covers many roads,
the phone network is unreliable, and you want to drive to help relatives at
place X.

You can ask someone who was there yesterday how to get there (ie, ask a
device; it can tell you "when I was a part of the filesystem last time, its
layout was such and such").  Usually, this is reliable (you don't reshape an
array every day), but if there's flooding (you're contemplating a degraded
mount), yesterday's data being stale shouldn't be a surprise.

So, you climb into the car and drive.  It's possible that the road you
wanted to take has changed, it's also possible some other roads you didn't
even know about are now driveable.  Once you have X in sight, do you retrace
all the way home, tell your mom (systemd) who's worrying but has no way to
help, that the road is clear, and only then get to X?  Or do you stop,
search for a spot with working phone coverage to phone mom asking for
advice, despite her having no information you don't have?  The reasonable
thing to do (and what all other rc systems do) is to get to X, help the
relatives, and only then tell mom that all is ok.

But with mom wanting to control everything, things can go worse.  If you,
without mom's prior knowledge (the user typed "mount" by hand) manage to
find a side road to X, she shouldn't tell you "I hear you telling me you're
at X -- as the road is flooded, that's impossible, so get home this instant"
(ie, systemd deciding the filesystem cannot be complete, despite it being
already mounted).

> > There's nothing the kernel is doing that's
> > telling systemd to wait for goddamn ever.
> 
> There's nothing the kernel is doing that's
> telling udev there IS a degraded device assembled to be used.

Because there is no device.

> There's nothing a userspace-thing is doing that's
> 

Re: degraded permanent mount option

2018-01-27 Thread Adam Borowski
On Sat, Jan 27, 2018 at 03:36:48PM +0100, Goffredo Baroncelli wrote:
> I think that the real problem relies that the mounting a btrfs filesystem
> cannot be a responsibility of systemd (or whichever rc-system). 
> Unfortunately in the past it was thought that it would be sufficient to
> assemble a devices list in the kernel, then issue a simple mount...

Yeah... every device that comes online may have its own idea of which devices
are part of the filesystem.  There's also a quite separate question whether
we have enough chunks for a degraded mount (implemented by Qu), which
requires reading the chunk tree.

> In the past[*] I proposed a mount helper, which would perform all the
> device registering and mounting in degraded mode (depending by the
> option).  My idea is that all the policies should be placed only in one
> place.  Now some policies are in the kernel, some in udev, some in
> systemd...  It is a mess.  And if something goes wrong, you have to look
> to several logs to understand which/where is the problem..

Since most of the logic needs to be in the kernel anyway, I believe it'd be
best to keep as much as possible in the kernel, and let the userspace
request at most "try regular/degraded mount, block/don't block".  Anything
else would be duplicating functionality.

> I have to point out that there is not a sane default for mounting in
> degraded mode or not.  May be that now RAID1/10 are "mount-degraded"
> friendly, so it would be a sane default; but for other (raid5/6) I think
> that this is not mature enough.  And it is possible to exist hybrid
> filesystem (both RAID1/10 and RAID5/6)

Not yet: if one of the devices comes a bit late, btrfs won't let it into the
filesystem yet (patches to do so have been proposed), and if you run
degraded for even a moment, a very lengthy action is required.  That lengthy
action could be improved -- we can note the last generation when the raid
was complete[1], and scrub/balance only extents newer than that[2] -- but
that's a SMOC then SMOR, and I don't see volunteers yet.

Thus, auto-degrading without a hearty timeout first is currently sitting
strongly in the "do not want" land.

> Mounting in degraded mode would be better for a root filesystem, than a
> non-root one (think about remote machine)

I for one use ext4-on-md for root, and btrfs raid for the actual data.  It's
not like production servers see much / churn anyway.


Meow!

[1]. Extra fun for raid6 (or possible future raid1×N where N>2 modes):
there's "fully complete", "degraded missing A", "degraded missing B",
"degraded missing A and B".

[2]. NOCOW extents would require an artificial generation bump upon writing
to whenever the level of degradeness changes.
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: degraded permanent mount option

2018-01-27 Thread Adam Borowski
On Sat, Jan 27, 2018 at 12:06:19PM +0100, Tomasz Pala wrote:
> On Sat, Jan 27, 2018 at 13:26:13 +0300, Andrei Borzenkov wrote:
> 
> >> I just tested to boot with a single drive (raid1 degraded), even with
> >> degraded option in fstab and grub, unable to boot !  The boot process
> >> stop on initramfs.
> >> 
> >> Is there a solution to boot with systemd and degraded array ?
> > 
> > No. It is finger pointing. Both btrfs and systemd developers say
> > everything is fine from their point of view.

It's quite obvious who's the culprit: every single remaining rc system
manages to mount degraded btrfs without problems.  They just don't try to
outsmart the kernel.

> Treating btrfs volume as ready by systemd would open a window of
> opportunity when volume would be mounted degraded _despite_ all the
> components are (meaning: "would soon") be ready - just like Chris Murphy
> wrote; provided there is -o degraded somewhere.

For this reason, hardcoding -o degraded currently isn't a wise choice.  This
might change once autoresync and devices coming back at runtime are
implemented.

> This is not a systemd issue, but apparently btrfs design choice to allow
> using any single component device name also as volume name itself.

And what other user interface would you propose?  The only alternative I see
is inventing a device manager (like you're implying below that btrfs does),
which would needlessly complicate the usual, single-device, case.
 
> If btrfs pretends to be device manager it should expose more states,

But it doesn't pretend to.

> especially "ready to be mounted, but not fully populated" (i.e.
> "degraded mount possible"). Then systemd could _fallback_ after timing
> out to degraded mount automatically according to some systemd-level
> option.

You're assuming that btrfs somehow knows this itself.  Contrary to systemd's
bogus assumption that by counting devices you can tell whether a degraded or
non-degraded mount is possible, it is in general not possible to know
whether a mount attempt will succeed without actually trying.

Compare with the 4.14 chunk check patchset by Qu -- in the past, btrfs did
naive counting of this kind; it had to be replaced by actually checking
whether at least one copy of every block group is actually present.

An example scenario: you have a 3-device filesystem, sda sdb sdc.  Suddenly,
sda goes offline due to a loose cable, controller hiccup, evil fairies, or
something of this kind.  The sysadmin notices this, rushes in with an
USB-attached disk (sdd), rebalances.  After reboot, sda works well (or got
its cable reseated, etc), while sdd either got accidentally removed or is
just slow to initialize (USB...).  So, systemd asks sda how many devices
there are, answer is "3" (sdb and sdc would answer the same, BTW).  It can
even ask for UUIDs -- all devices are present.  So, mount will succeed,
right?
 
> Unless there is *some* signalling from btrfs, there is really not much
> systemd can *safely* do.

Btrfs already tells everything it knows.  To learn more, you need to do most
of the mount process (whether you continue or abort is another matter). 
This can't be done sanely from outside the kernel.  Adding finer control
would be reasonable ("wait and block" vs "try and return immediately") but
that's about all.  It'd also be wrong to have a different interface for
daemon X than for humans.

Ie, the thing systemd can safely do, is to stop trying to rule everything,
and refrain from telling the user whether he can mount something or not.
And especially, to refrain from unmounting after the user mounts manually...


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: hang in btrfs_async_reclaim_metadata_space

2018-01-10 Thread Adam Borowski
On Sun, Jan 07, 2018 at 01:17:19PM +0200, Nikolay Borisov wrote:
> On  6.01.2018 07:10, Adam Borowski wrote:
> > Hi!
> > I got a reproducible infinite hang, reliably triggered by the testsuite of
> > "flatpak"; fails on at least 4.15-rc6, 4.9.75, and on another machine with
> > Debian's 4.14.2-1.
> > 
> > [580632.355107] INFO: task kworker/u8:2:11105 blocked for more than 120 
> > seconds.
> > [580632.355120]   Not tainted 4.14.0-1-amd64 #1 Debian 4.14.2-1
> > [580632.355124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [580632.355129] kworker/u8:2D0 11105  2 0x8000
> > [580632.355176] Workqueue: events_unbound 
> > btrfs_async_reclaim_metadata_space [btrfs]
> > [580632.355179] Call Trace:
> > [580632.355192]  __schedule+0x28e/0x880
> > [580632.355196]  schedule+0x2c/0x80
> > [580632.355200]  wb_wait_for_completion+0x64/0x90
> > [580632.355205]  ? finish_wait+0x80/0x80
> > [580632.355207]  __writeback_inodes_sb_nr+0xa1/0xd0
> > [580632.355210]  writeback_inodes_sb_nr+0x10/0x20
> > [580632.355235]  flush_space+0x3ed/0x520 [btrfs]
> > [580632.355238]  ? pick_next_task_fair+0x158/0x590
> > [580632.355242]  ? __switch_to+0x1f3/0x460
> > [580632.355267]  btrfs_async_reclaim_metadata_space+0xf6/0x4a0 [btrfs]
> > [580632.355278]  process_one_work+0x198/0x390
> > [580632.355281]  worker_thread+0x35/0x3c0
> > [580632.355284]  kthread+0x125/0x140
> > [580632.355287]  ? process_one_work+0x390/0x390
> > [580632.355289]  ? kthread_create_on_node+0x70/0x70
> > [580632.355292]  ? SyS_exit_group+0x14/0x20
> > [580632.355295]  ret_from_fork+0x25/0x30
> > 
> > The machines are distinct enough that this probably should happen
> > everywhere:
> > 
> > AMD Phenom2, SSD, noatime,compress=lzo,space_cache=v2
> > Intel Braswell, rust, noatime,autodefrag,space_cache=v2

It does cause data loss: while everything seems to work ok, files that are
written to while there's this stuck worker become size 0 after rebooting.
Only after a longish time do other processes start getting stuck as well.

For this reason, I was reluctant to try on a real system -- but somehow I
don't seem to be able to reproduce on minimal VMs.

> Provide output of echo w > /proc/sysrq-trigger when the hang happens.

[ 5679.403833] INFO: task kworker/u12:2:9904 blocked for more than 120 seconds.
[ 5679.413938]   Not tainted 4.15.0-rc7-debug-00137-g13f8e1b5cc83 #1
[ 5679.423336] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 5679.434136] kworker/u12:2   D0  9904  2 0x8000
[ 5679.442528] Workqueue: writeback wb_workfn (flush-btrfs-1)
[ 5679.450809] Call Trace:
[ 5679.455920]  ? __schedule+0x1a5/0x6e0
[ 5679.462251]  ? check_preempt_curr+0x74/0x80
[ 5679.468950]  ? atomic_t_wait+0x50/0x50
[ 5679.475134]  schedule+0x23/0x80
[ 5679.480670]  bit_wait+0x8/0x50
[ 5679.485985]  __wait_on_bit+0x3d/0x80
[ 5679.491828]  ? atomic_t_wait+0x50/0x50
[ 5679.497726]  __inode_wait_for_writeback+0x9e/0xc0
[ 5679.504531]  ? bit_waitqueue+0x30/0x30
[ 5679.510292]  inode_wait_for_writeback+0x18/0x30
[ 5679.516782]  evict+0xa4/0x180
[ 5679.521596]  btrfs_run_delayed_iputs+0x61/0xb0
[ 5679.527842]  btrfs_commit_transaction+0x7b0/0x8c0
[ 5679.534310]  ? start_transaction+0xa0/0x390
[ 5679.540151]  __writeback_single_inode+0x168/0x1b0
[ 5679.546477]  writeback_sb_inodes+0x1be/0x420
[ 5679.552264]  wb_writeback+0xe0/0x1d0
[ 5679.557292]  wb_workfn+0x7d/0x2c0
[ 5679.561984]  ? __switch_to+0x17c/0x370
[ 5679.567045]  process_one_work+0x1a7/0x340
[ 5679.572286]  worker_thread+0x26/0x3f0
[ 5679.577043]  ? create_worker+0x190/0x190
[ 5679.582007]  kthread+0x107/0x120
[ 5679.586198]  ? kthread_create_worker_on_cpu+0x40/0x40
[ 5679.592200]  ? kthread_create_worker_on_cpu+0x40/0x40
[ 5679.598122]  ret_from_fork+0x1f/0x30
[ 5679.602549] INFO: task testlibrary:12647 blocked for more than 120 seconds.
[ 5679.610434]   Not tainted 4.15.0-rc7-debug-00137-g13f8e1b5cc83 #1
[ 5679.617758] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this 
message.
[ 5679.626453] testlibrary D0 12647  12606 0x
[ 5679.632812] Call Trace:
[ 5679.636038]  ? __schedule+0x1a5/0x6e0
[ 5679.640483]  ? bdi_split_work_to_wbs+0x159/0x290
[ 5679.645881]  schedule+0x23/0x80
[ 5679.649790]  wb_wait_for_completion+0x39/0x70
[ 5679.654901]  ? wait_woken+0x80/0x80
[ 5679.659092]  __writeback_inodes_sb_nr+0x95/0xa0
[ 5679.664325]  sync_filesystem+0x21/0x80
[ 5679.668797]  SyS_syncfs+0x44/0x90
[ 5679.672809]  entry_SYSCALL_64_fastpath+0x17/0x70
[ 5679.678136] RIP: 0033:0x7fe82599e057
[ 5679.682405] RSP: 002b:7ffdfda6cf18 EFLAGS: 0202
[ 5679.682431] INFO: task pool:14436 blocked for more than 120 seconds.
[ 5679.69539

hang in btrfs_async_reclaim_metadata_space

2018-01-05 Thread Adam Borowski
Hi!
I got a reproducible infinite hang, reliably triggered by the testsuite of
"flatpak"; fails on at least 4.15-rc6, 4.9.75, and on another machine with
Debian's 4.14.2-1.

[580632.355107] INFO: task kworker/u8:2:11105 blocked for more than 120 seconds.
[580632.355120]   Not tainted 4.14.0-1-amd64 #1 Debian 4.14.2-1
[580632.355124] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
this message.
[580632.355129] kworker/u8:2D0 11105  2 0x8000
[580632.355176] Workqueue: events_unbound btrfs_async_reclaim_metadata_space 
[btrfs]
[580632.355179] Call Trace:
[580632.355192]  __schedule+0x28e/0x880
[580632.355196]  schedule+0x2c/0x80
[580632.355200]  wb_wait_for_completion+0x64/0x90
[580632.355205]  ? finish_wait+0x80/0x80
[580632.355207]  __writeback_inodes_sb_nr+0xa1/0xd0
[580632.355210]  writeback_inodes_sb_nr+0x10/0x20
[580632.355235]  flush_space+0x3ed/0x520 [btrfs]
[580632.355238]  ? pick_next_task_fair+0x158/0x590
[580632.355242]  ? __switch_to+0x1f3/0x460
[580632.355267]  btrfs_async_reclaim_metadata_space+0xf6/0x4a0 [btrfs]
[580632.355278]  process_one_work+0x198/0x390
[580632.355281]  worker_thread+0x35/0x3c0
[580632.355284]  kthread+0x125/0x140
[580632.355287]  ? process_one_work+0x390/0x390
[580632.355289]  ? kthread_create_on_node+0x70/0x70
[580632.355292]  ? SyS_exit_group+0x14/0x20
[580632.355295]  ret_from_fork+0x25/0x30

The machines are distinct enough that this probably should happen
everywhere:

AMD Phenom2, SSD, noatime,compress=lzo,space_cache=v2
Intel Braswell, rust, noatime,autodefrag,space_cache=v2


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


Re: Unexpected raid1 behaviour

2017-12-19 Thread Adam Borowski
On Mon, Dec 18, 2017 at 03:28:14PM -0700, Chris Murphy wrote:
> On Mon, Dec 18, 2017 at 1:49 AM, Anand Jain  wrote:
> >  Agreed. IMO degraded-raid1-single-chunk is an accidental feature
> >  caused by [1], which we should revert back, since..
> >- balance (to raid1 chunk) may fail if FS is near full
> >- recovery (to raid1 chunk) will take more writes as compared
> >  to recovery under degraded raid1 chunks
> 
> The advantage of writing single chunks when degraded, is in the case
> where a missing device returns (is readded, intact). Catching up that
> device with the first drive, is a manual but simple invocation of
> 'btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft'   The
> alternative is a full balance or full scrub. It's pretty tedious for
> big arrays.
> 
> mdadm uses bitmap=internal for any array larger than 100GB for this
> reason, avoiding full resync.
> 
> 'btrfs sub find' will list all *added* files since an arbitrarily
> specified generation; but not deletions.

This is fine, as scrub cares about extents, not files.  The newer generation
of metadata doesn't have a reference to the deleted extent anymore.

Selective scrub hasn't been implemented, but it should be pretty
straightforward -- unless nocow is involved.  Correct me if I'm wrong, but I
believe there's no way to tell which copy of a nocow extent is the good one.


Meow!
-- 
// If you believe in so-called "intellectual property", please immediately
// cease using counterfeit alphabets.  Instead, contact the nearest temple
// of Amon, whose priests will provide you with scribal services for all
// your writing needs, for Reasonable And Non-Discriminatory prices.


[PATCH] fs/*/Kconfig: drop links to 404-compliant http://acl.bestbits.at

2017-12-12 Thread Adam Borowski
This link is replicated in most filesystems' config stanzas.  Referring
to an archived version of that site is pointless as it mostly deals with
patches; user documentation is available elsewhere.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
Sending this as one piece; if you guys would instead prefer this chopped
into tiny per-filesystem bits, please say so.


 Documentation/filesystems/ext2.txt |  2 --
 Documentation/filesystems/ext4.txt |  7 +++
 fs/9p/Kconfig  |  3 ---
 fs/Kconfig |  6 +-
 fs/btrfs/Kconfig   |  3 ---
 fs/ceph/Kconfig|  3 ---
 fs/cifs/Kconfig| 15 +++
 fs/ext2/Kconfig|  6 +-
 fs/ext4/Kconfig|  3 ---
 fs/f2fs/Kconfig|  6 +-
 fs/hfsplus/Kconfig |  3 ---
 fs/jffs2/Kconfig   |  6 +-
 fs/jfs/Kconfig |  3 ---
 fs/reiserfs/Kconfig|  6 +-
 fs/xfs/Kconfig |  3 ---
 15 files changed, 15 insertions(+), 60 deletions(-)

diff --git a/Documentation/filesystems/ext2.txt 
b/Documentation/filesystems/ext2.txt
index 55755395d3dc..81c0becab225 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -49,12 +49,10 @@ sb=nUse alternate 
superblock at this location.
 
 user_xattr Enable "user." POSIX Extended Attributes
(requires CONFIG_EXT2_FS_XATTR).
-   See also http://acl.bestbits.at
 nouser_xattr   Don't support "user." extended attributes.
 
 aclEnable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
-   See also http://acl.bestbits.at
 noacl  Don't support POSIX ACLs.
 
 nobh   Do not attach buffer_heads to file pagecache.
diff --git a/Documentation/filesystems/ext4.txt 
b/Documentation/filesystems/ext4.txt
index 75236c0c2ac2..8cd63e16f171 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -202,15 +202,14 @@ inode_readahead_blks=nThis tuning parameter controls 
the maximum
the buffer cache.  The default value is 32 blocks.
 
 nouser_xattr   Disables Extended User Attributes.  See the
-   attr(5) manual page and http://acl.bestbits.at/
-   for more information about extended attributes.
+   attr(5) manual page for more information about
+   extended attributes.
 
 noacl  This option disables POSIX Access Control List
support. If ACL support is enabled in the kernel
configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
enabled by default on mount. See the acl(5) manual
-   page and http://acl.bestbits.at/ for more information
-   about acl.
+   page for more information about acl.
 
 bsddf  (*) Make 'df' act like BSD.
 minixdfMake 'df' act like Minix.
diff --git a/fs/9p/Kconfig b/fs/9p/Kconfig
index 6489e1fc1afd..11045d8e356a 100644
--- a/fs/9p/Kconfig
+++ b/fs/9p/Kconfig
@@ -25,9 +25,6 @@ config 9P_FS_POSIX_ACL
  POSIX Access Control Lists (ACLs) support permissions for users and
  groups beyond the owner/group/world scheme.
 
- To learn more about Access Control Lists, visit the POSIX ACLs for
- Linux website <http://acl.bestbits.at/>.
-
  If you don't know what Access Control Lists are, say N
 
 endif
diff --git a/fs/Kconfig b/fs/Kconfig
index 7aee6d699fd6..0ed56752f208 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -167,17 +167,13 @@ config TMPFS_POSIX_ACL
  files for sound to work properly.  In short, if you're not sure,
  say Y.
 
- To learn more about Access Control Lists, visit the POSIX ACLs for
- Linux website <http://acl.bestbits.at/>.
-
 config TMPFS_XATTR
bool "Tmpfs extended attributes"
depends on TMPFS
default n
help
  Extended attributes are name:value pairs associated with inodes by
- the kernel or by users (see the attr(5) manual page, or visit
- <http://acl.bestbits.at/> for details).
+ the kernel or by users (see the attr(5) manual page for details).
 
  Currently this enables support for the trusted.* and
  security.* namespaces.
diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index 2e558227931a..273351ee4c46 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -38,9 +38,6 @@ config BTRFS_FS_POSIX_ACL
  POSIX Access Control Lists (AC

Re: exclusive subvolume space missing

2017-12-03 Thread Adam Borowski
On Sun, Dec 03, 2017 at 01:45:45AM +, Duncan wrote:
> Tomasz Pala posted on Sat, 02 Dec 2017 18:18:19 +0100 as excerpted:
> >> I got ~500 small files (100-500 kB) updated partially in regular
> >> intervals:
> >> 
> >> # du -Lc **/*.rrd | tail -n1
> >> 105Mtotal
> 
> FWIW, I've no idea what rrd files, or rrdcached (from the grandparent post)
> are (other than that a quick google suggests that it's...
> round-robin-database...

Basically: preallocate a file whose size never changes afterwards.  Every
few minutes, write several bytes into the file, slowly advancing.

This is indeed the worst possible case for btrfs, and nocow doesn't help the
slightest as the database doesn't wrap around before a typical snapshot
interval.
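
A minimal model of that write pattern (sizes assumed; real rrdtool is more
involved):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int fd = open("fake.rrd", O_RDWR|O_CREAT, 0644);
	if (fd == -1)
		return perror("open"), 1;
	if (ftruncate(fd, 200*1024))	/* "preallocate" 200 kB */
		return perror("ftruncate"), 1;
	for (off_t off = 0; ; off = (off + 8) % (200*1024)) {
		char rec[8] = "sample";
		/* each tiny write CoWs a whole block, unsharing it from
		 * any snapshot -- hence the exclusive space growth */
		if (pwrite(fd, rec, sizeof(rec), off) != sizeof(rec))
			return perror("pwrite"), 1;
		fsync(fd);
		sleep(300);		/* "every few minutes" */
	}
}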

> Meanwhile, /because/ nocow has these complexities along with others (nocow
> automatically turns off data checksumming and compression for the files
> too), and the fact that they nullify some of the big reasons people might
> choose btrfs in the first place, I actually don't recommend setting
> nocow in the first place -- if usage is such than a file needs nocow,
> my thinking is that btrfs isn't a particularly good hosting choice for
> that file in the first place, a more traditional rewrite-in-place
> filesystem is likely to be a better fit.

I'd say that the only good use for nocow is "I wish I had placed this file
on a non-btrfs filesystem, but it'd be too much hassle to repartition".

If you snapshot nocow at all, you get the worst of both worlds.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritic Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820...;
⠈⠳⣄ (same for all links)


Re: splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Adam Borowski
On Tue, Nov 28, 2017 at 08:51:07AM +0800, Qu Wenruo wrote:
> On 2017年11月27日 22:22, David Sterba wrote:
> > On Mon, Nov 27, 2017 at 02:23:49PM +0100, Adam Borowski wrote:
> >> On 4.15-rc1, I get the following failure:
> >>
> >> BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
> >> slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
> >> extent, have 134 expect 281474976710677
> > 
> > By a quick look at suspiciously large number
> > 
> > hex(281474976710677) = 0x10015
> > 
> > may be a bitflip, but 0x15 does not match 134, so there could be
> > something else involved in the corruption.
> 
> That's a known bug, fixed by that patch which is not merged yet.
> 
> https://patchwork.kernel.org/patch/10047489/

This helped, thanks!


喵!
-- 
⢀⣴⠾⠻⢶⣦⠀ Mozilla's Hippocritical Oath: "Keep trackers off your trail"
⣾⠁⢰⠒⠀⣿⡁ blah blah evading "tracking technology" blah blah
⢿⡄⠘⠷⠚⠋⠀ "https://click.e.mozilla.org/?qs=e7bb0dcf14b1013fca3820...;
⠈⠳⣄ (same for all links)


splat on 4.15-rc1: invalid ram_bytes for uncompressed inline extent

2017-11-27 Thread Adam Borowski
Hi!
On 4.15-rc1, I get the following failure:

BTRFS critical (device sda1): corrupt leaf: root=1 block=3820662898688
slot=43 ino=35691 file_offset=0, invalid ram_bytes for uncompressed inline
extent, have 134 expect 281474976710677

Repeatable every boot attempt.  4.14 and earlier boot fine; btrfs check
(progs 4.13.3) doesn't find any badness either.

[   11.347451] BTRFS info (device sda1): use lzo compression
[   11.352914] BTRFS info (device sda1): using free space tree
[] Activating lvm and md swap...[ ok done.
[] Checking file systems...fsck from util-l
[   11.660352] BTRFS critical (device sda1): corrupt leaf: root=1 
block=3820662898688 slot=43 ino=35691 file_offset=0, invalid ram_bytes for 
uncompressed inline extent, have 134 expect 281474976710677
inux 2.30.2
[   11.678550] BTRFS info (device sda1): leaf 3820662898688 total ptrs 103 free 
space 4350
[   11.687767]  item 0 key (35663 12 32909) itemoff 16263 itemsize 20
[   11.695021]  item 1 key (35663 108 0) itemoff 15751 itemsize 512
[   11.701274]  inline extent data size 509
 ok
[   11.705704]  item 2 key (35664 1 0) itemoff 15591 itemsize 160
[   11.713034]  inode generation 1292 size 509 mode 100644
[   11.718811]  item 3 key (35664 12 32909) itemoff 15571 itemsize 20
[   11.725113]  item 4 key (35664 108 0) itemoff 15059 itemsize 512
[   11.732168]  inline extent data size 509
done.
[   11.736280]  item 5 key (35665 1 0) itemoff 14899 itemsize 160
[   11.742681]  inode generation 1292 size 457 mode 100644
[   11.748275]  item 6 key (35665 12 32909) itemoff 14879 itemsize 20
[   11.754584]  item 7 key (35665 108 0) itemoff 14411 itemsize 468
[   11.760674]  inline extent data size 457
[   11.764780]  item 8 key (35666 1 0) itemoff 14251 itemsize 160
[   11.770711]  inode generation 1292 size 533 mode 100644
[]
[   11.776145]  item 9 key (35666 12 32909) itemoff 14231 itemsize 20
Cleaning up temp
[   11.783069]  item 10 key (35666 108 0) itemoff 13697 itemsize 534
orary files...
[   11.790666]  inline extent data size 533
[   11.795980]  item 11 key (35668 1 0) itemoff 13537 itemsize 160
[   11.801989]  inode generation 1292 size 319 mode 100644
 /tmp
[   11.807413]  item 12 key (35668 12 32909) itemoff 13517 itemsize 20
[   11.814250]  item 13 key (35668 108 0) itemoff 13247 itemsize 270
[   11.820512]  inline extent data size 319
[   11.825539]  item 14 key (35669 1 0) itemoff 13087 itemsize 160
[   11.831577]  inode generation 1292 size 375 mode 100644
[   11.837149]  item 15 key (35669 12 32909) itemoff 13067 itemsize 20
[   11.843873]  item 16 key (35669 108 0) itemoff 12783 itemsize 284
 ok 
[   11.850098]  inline extent data size 375
[   11.855579]  item 17 key (35670 1 0) itemoff 12623 itemsize 160
[   11.862861]  inode generation 1292 size 168 mode 100644
.
[   11.869194]  item 18 key (35670 12 33512) itemoff 12588 itemsize 35
[   11.876467]  item 19 key (35670 108 0) itemoff 12399 itemsize 189
[   11.883564]  inline extent data size 168
[   11.888551]  item 20 key (35676 1 0) itemoff 12239 itemsize 160
[   11.895421]  inode generation 1292 size 512 mode 100600
[   11.901719]  item 21 key (35676 12 32911) itemoff 12218 itemsize 21
[   11.909045]  item 22 key (35676 108 0) itemoff 11685 itemsize 533
[   11.916136]  inline extent data size 512
[   11.921125]  item 23 key (35685 1 0) itemoff 11525 itemsize 160
[   11.928047]  inode generation 1292 size 32128 mode 100644
[   11.934553]  item 24 key (35685 12 32783) itemoff 11508 itemsize 17
[] Mounting 
[   11.941874]  item 25 key (35685 108 0) itemoff 11455 itemsize 53
local filesystem
[   11.949377]  extent data disk bytenr 3757990555648 nr 4096
s...
[   11.956471]  extent data offset 0 nr 4096 ram 4096
[   11.962383]  item 26 key (35685 108 4096) itemoff 11402 itemsize 53
[   11.969704]  extent data disk bytenr 3755041128448 nr 4096
[   11.976324]  extent data offset 4096 nr 24576 ram 32768
[   11.982686]  item 27 key (35685 108 28672) itemoff 11349 itemsize 53
[   11.990140]  extent data disk bytenr 3749090922496 nr 4096
[   11.996786]  extent data offset 0 nr 4096 ram 4096
[   12.002732]  item 28 key (35686 1 0) itemoff 11189 itemsize 160
[   12.009755]  inode generation 1292 size 5023 mode 100644
[   12.016204]  item 29 key (35686 12 32783) itemoff 11165 itemsize 24
[   12.023576]  item 30 key (35686 108 0) itemoff 2 itemsize 53
[   12.030665]  extent data disk bytenr 3651995594752 nr 4096
[   12.037298]  extent data offset 0 nr 8192 ram 8192
[   12.043184]  item 31 key (35687 1 0) itemoff 10952 itemsize 160
[   12.050181]  inode generation 1292 size 293168 mode 100664
[   12.056793]  item 32 key (35687 12 32783) itemoff 10935 itemsize 17
[   12.064082]  item 33 key (35687 108 0) itemoff 10882 itemsize 53
[   12.071154]  extent data disk bytenr 

Re: Unrecoverable scrub errors

2017-11-17 Thread Adam Borowski
On Fri, Nov 17, 2017 at 08:19:11PM -0700, Chris Murphy wrote:
> On Fri, Nov 17, 2017 at 8:41 AM, Nazar Mokrynskyi  
> wrote:
> 
> >> [551049.038718] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038720] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238048: metadata leaf (level 0) in tree 985
> >> [551049.038723] BTRFS error (device dm-2): bdev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
> >> 0, flush 0, corrupt 1, gen 0
> >> [551049.039634] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069526528 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039635] BTRFS warning (device dm-2): checksum error at logical 
> >> 470069526528 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1, sector 
> >> 942238176: metadata leaf (level 0) in tree 985
> >> [551049.039637] BTRFS error (device dm-2): bdev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1 errs: wr 0, rd 
> >> 0, flush 0, corrupt 2, gen 0
> >> [551049.413114] BTRFS error (device dm-2): unable to fixup (regular) error 
> >> at logical 470069460992 on dev 
> >> /dev/mapper/luks-bd5dd3e7-ad80-405f-8dfd-752f2b870f93-part1
> 
> These are metadata errors. Are there any other storage stack related
> errors in the previous 2-5 minutes, such as read errors (UNC) or SATA
> link reset messages?
> 
> >Maybe I can find snapshot that contains file with wrong checksum and
> > remove corresponding snapshot or something like that?
> 
> It's not a file. It's metadata leaf.

Just for the record: had this been a data block (ie, a non-inline file
extent), the dmesg message would include one of the filenames that refer to
that extent.  To clear the error, you'd need to remove all such files.

> >> nazar-pc@nazar-pc ~> sudo btrfs filesystem df /media/Backup
> >> Data, single: total=879.01GiB, used=877.24GiB
> >> System, DUP: total=40.00MiB, used=128.00KiB
> >> Metadata, DUP: total=20.50GiB, used=18.96GiB
> >> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Metadata is DUP, but both copies have corruption. Kinda strange. But I
> don't know how close the DUP copies are to each other, if possibly a
> big enough media defect can explain this.

The original post mentioned SSD (but was unclear if _this_ filesystem is
backed by one).  If so, DUP is nearly worthless as both copies will be
written to physical cells next to each other, no matter what positions the
FTL shows them at.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


Re: A partially failing disk in raid0 needs replacement

2017-11-14 Thread Adam Borowski
On Tue, Nov 14, 2017 at 10:36:22AM +0200, Klaus Agnoletti wrote:
> I used to have 3x2TB in a btrfs in raid0. A few weeks ago, one of the
 ^
> 2TB disks started giving me I/O errors in dmesg like this:
> 
> [388659.188988] Add. Sense: Unrecovered read error - auto reallocate failed

Alas, chances to recover anything are pretty slim.  That's RAID0 metadata
for you.

On the other hand, losing any non-trivial file while being able to gape at
intact metadata isn't that much better, thus -mraid0 isn't completely
unreasonable.

> To fix it, it ended up with me adding a new 6TB disk and trying to
> delete the failing 2TB disks.
> 
> That didn't go so well; apparently, the delete command aborts when
> ever it encounters I/O errors. So now my raid0 looks like this:
> 
> klaus@box:~$ sudo btrfs fi show
> [sudo] password for klaus:
> Label: none  uuid: 5db5f82c-2571-4e62-a6da-50da0867888a
> Total devices 4 FS bytes used 5.14TiB
> devid  1 size 1.82TiB used 1.78TiB path /dev/sde
> devid  2 size 1.82TiB used 1.78TiB path /dev/sdf
> devid  3 size 0.00B used 1.49TiB path /dev/sdd
> devid  4 size 5.46TiB used 305.21GiB path /dev/sdb

> Obviously, I want /dev/sdd emptied and deleted from the raid.
> 
> So how do I do that?
> 
> I thought of three possibilities myself. I am sure there are more,
> given that I am in no way a btrfs expert:
> 
> 1) Try to force a deletion of /dev/sdd where btrfs copies all intact
> data to the other disks
> 2) Somehow re-balances the raid so that sdd is emptied, and then deleted
> 3) converting into a raid1, physically removing the failing disk,
> simulating a hard error, starting the raid degraded, and converting it
> back to raid0 again.

There's hardly any intact data: roughly 2/3 of chunks have half of their
blocks on the failed disk, densely interspersed.  Even worse, metadata
required to map those blocks to files is gone, too: if we naively assume
there's only a single tree, a tree node is intact only if it and every
single node on the path to the root is intact.  In practice, this means
it's a total filesystem loss.

> How do you guys think I should go about this? Given that it's a raid0
> for a reason, it's not the end of the world losing all data, but I'd
> really prefer losing as little as possible, obviously.

As the disk isn't _completely_ gone, there's a slim chance of some stuff
requiring only still-readable sectors.  Probably a waste of time to try
to recover, though.
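
If you do want to try, btrfs restore is the tool for scraping files off an
unmountable filesystem; a sketch, with device and target paths assumed:

btrfs restore -D -v /dev/sde /tmp/x     # dry run: list what looks salvageable
btrfs restore -v /dev/sde /mnt/salvage  # then copy whatever is reachable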


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: updatedb does not index /home when /home is Btrfs

2017-11-04 Thread Adam Borowski
On Sat, Nov 04, 2017 at 09:26:36AM +0300, Andrei Borzenkov wrote:
> 04.11.2017 07:49, Adam Borowski пишет:
> > On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
> >> Ancient bug, still seems to be a bug.
> >> https://bugzilla.redhat.com/show_bug.cgi?id=906591
> >>
> >> The issue is that updatedb by default will not index bind mounts, but
> >> by default on Fedora and probably other distros, put /home on a
> >> subvolume and then mount that subvolume which is in effect a bind
> >> mount.
> >>
> >> There's a lot of early discussion in 2013 about it, but then it's
> >> dropped off the radar as nobody has any ideas how to fix this in
> >> mlocate.
> > 
> > I don't see how this would be a bug in btrfs.  The same happens if you
> > bind-mount /home (or individual homes), which is a valid and non-rare setup.
> 
> It is the problem *on* btrfs because - as opposed to normal bind mount -
> those mount points do *not* refer to the same content.

Neither do they refer to the same content in a "normal" bind mount.

> As was commented in mentioned bug report:
> 
> mount -o subvol=root /dev/sdb1 /root
> mount -o subvol=foo /dev/sdb1 /root/foo
> mount -o subvol=bar /dev/sdb1 /root/bar
> 
> Both /root/foo and /root/bar, will be skipped even though they are not
> accessible via any other path (on mounted filesystem)

losetup -D                      # detach stale loop devices
truncate -s 4G junk             # sparse backing file
losetup -f junk
mkfs.ext4 /dev/loop0
mkdir -p foo bar
mount /dev/loop0 foo
mkdir foo/bar
touch foo/fileA foo/bar/fileB
mount --bind foo/bar bar        # bind-mount a subdirectory elsewhere
umount foo                      # bar/fileB is now reachable via no other path

> It is a problem *of* btrfs because it does not offer any easy way to
> distinguish between subvolume mount and bind mount. If you are aware of
> one, please comment on mentioned bug report.

Well, subvolume mounts are indistinguishable from bind mounts because they
_are_ bind mounts.  You merely don't need to mount the "master" first.

The only way such a "master" mount is special is that, on most filesystems,
its root was accessible at least at some point (but it might no longer be,
thanks to chroot, pivot_root, etc).
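
You can see this for yourself: the mount's root within the filesystem is a
plain path in both cases (illustrative output, columns abbreviated):

findmnt -o TARGET,SOURCE,FSROOT /home
# TARGET  SOURCE             FSROOT
# /home   /dev/sdb1[/home]   /home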


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: updatedb does not index /home when /home is Btrfs

2017-11-03 Thread Adam Borowski
On Fri, Nov 03, 2017 at 06:15:53PM -0600, Chris Murphy wrote:
> Ancient bug, still seems to be a bug.
> https://bugzilla.redhat.com/show_bug.cgi?id=906591
> 
> The issue is that updatedb by default will not index bind mounts, but
> by default on Fedora and probably other distros, put /home on a
> subvolume and then mount that subvolume which is in effect a bind
> mount.
> 
> There's a lot of early discussion in 2013 about it, but then it's
> dropped off the radar as nobody has any ideas how to fix this in
> mlocate.

I don't see how this would be a bug in btrfs.  The same happens if you
bind-mount /home (or individual homes), which is a valid and non-rare setup.
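
FWIW, the workaround on the mlocate side is a one-liner, assuming the usual
config location:

# /etc/updatedb.conf
PRUNE_BIND_MOUNTS = "no"    # index bind mounts, and thus btrfs subvolumes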


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: Problem with file system

2017-11-03 Thread Adam Borowski
On Fri, Nov 03, 2017 at 04:03:44PM -0600, Chris Murphy wrote:
> On Tue, Oct 31, 2017 at 5:28 AM, Austin S. Hemmelgarn
>  wrote:
> 
> > If you're running on an SSD (or thinly provisioned storage, or something
> > else which supports discards) and have the 'discard' mount option enabled,
> > then there is no backup metadata tree (this issue was mentioned on the list
> > a while ago, but nobody ever replied),
> 
> 
> This is a really good point. I've been running discard mount option
> for some time now without problems, in a laptop with Samsung
> Electronics Co Ltd NVMe SSD Controller SM951/PM951.
> 
> However, just trying btrfs-debug-tree -b on a specific block address
> for any of the backup root trees listed in the super, only the current
> one returns a valid result.  All others fail with checksum errors. And
> even the good one fails with checksum errors within seconds as a new
> tree is created, the super updated, and Btrfs considers the old root
> tree disposable and subject to discard.
> 
> So absolutely if I were to have a problem, probably no rollback for
> me. This seems to totally obviate a fundamental part of Btrfs design.

How is this an issue?  Discard is issued only once we're positive there's no
reference to the freed blocks anywhere.  At that point, they're also open
for reuse, thus they can be arbitrarily scribbled upon.

Unless your hardware is seriously broken (such as lying about barriers,
which is nearly-guaranteed data loss on btrfs anyway), there's no way the
filesystem will ever reference such blocks.  The corpses of old trees that
are left lying around with no discard can at most be used for manual
forensics, but whether a given block will have been overwritten or not is
a matter of pure luck.

For rollbacks, there are snapshots.  Once a transaction has been fully
committed, the old version is considered gone.
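
A minimal sketch of such a rollback, assuming / lives in a subvolume named
"root" and the filesystem's top level is mounted at /mnt/pool:

btrfs subvolume snapshot -r /mnt/pool/root /mnt/pool/root-good  # before the change
# ...the change turns out to be unwanted...
mv /mnt/pool/root /mnt/pool/root-broken
btrfs subvolume snapshot /mnt/pool/root-good /mnt/pool/root     # writable copy back
btrfs subvolume delete /mnt/pool/root-broken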

>  because it's already been discarded.
> > This is ideally something which should be addressed (we need some sort of
> > discard queue for handling in-line discards), but it's not easy to address.
> 
> Discard data extents, don't discard metadata extents? Or put them on a
> substantial delay.

Why would you special-case metadata?  Metadata that points to overwritten or
discarded blocks is of no use either.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: [PATCH] btrfs: avoid misleading talk about "compression level 0"

2017-10-25 Thread Adam Borowski
On Wed, Oct 25, 2017 at 03:23:11PM +0200, David Sterba wrote:
> On Sat, Oct 21, 2017 at 06:49:01PM +0200, Adam Borowski wrote:
> > Many compressors do assign a meaning to level 0: either null compression or
> > the lowest possible level.  This differs from our "unset thus default".
> > Thus, let's not unnecessarily confuse users.
> 
> I agree 'level 0' is confusing, however I'd like to keep the level
> mentioned in the message.
> 
> We could add
> 
> #define   BTRFS_COMPRESSION_ZLIB_DEFAULT  3
> 
> and use it in btrfs_compress_str2level.

I considered this but every algorithm has a different default, thus we'd
need separate cases for zlib vs zstd, while lzo has no settable level at
all.  Still, this is just some extra lines of code, thus doable.

> > Signed-off-by: Adam Borowski <kilob...@angband.pl>
> > ---
> >  fs/btrfs/super.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index f9d4522336db..144fabfbd246 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -551,7 +551,9 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
> >   compress_force != saved_compress_force)) ||
> > (!btrfs_test_opt(info, COMPRESS) &&
> >  no_compress == 1)) {
> > -   btrfs_info(info, "%s %s compression, level %d",
> > +   btrfs_printk(info, info->compress_level ?
> > +  KERN_INFO"%s %s compression, level %d" :
> > +  KERN_INFO"%s %s compression",
> 
> Please keep using btrfs_info, the KERN_INFO prefix would not work here.
> btrfs_printk prepends the filesystem description and the message level
> must be at the beginning.

Seems to work for me:
[   14.072575] BTRFS info (device sda1): use lzo compression
with identical colors as other info messages next to it.

But if we're to expand this code, ternary operators would get too hairy,
thus this can go at least for clarity.

> >(compress_force) ? "force" : "use",
> >compress_type, info->compress_level);
> > }
> 

-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


Re: SLES 11 SP4: can't mount btrfs

2017-10-21 Thread Adam Borowski
On Sat, Oct 21, 2017 at 01:46:06PM +0200, Lentes, Bernd wrote:
> - Am 21. Okt 2017 um 4:31 schrieb Duncan 1i5t5.dun...@cox.net:
> > Lentes, Bernd posted on Fri, 20 Oct 2017 20:40:15 +0200 as excerpted:
> > 
> >> Is it generally possible to restore a btrfs partition from a tape backup
> >> ?
> >> I'm just starting, and I'm asking myself. What is about the subvolumes ?
> >> This information isn't stored in files, but in the fs ? This is not on a
> >> file-based backup on a tape.
> > 
> > Yes it's possible to restore a btrfs partition from tape backup, /if/ you
> > backed up the partition itself, not just the files on top of it.

Which is usually a quite bad idea: unless you shut down (or remount ro) the
filesystem in question, the data _will_ be corrupted, and in the case of
btrfs, this kind of corruption tends to be fatal.  You also back up all the
unused space (trim greatly recommended), and the backup process takes ages
as it needs to read everything.

An efficient block-level backup of btrfs _would_ be possible as it can
nicely enumerate blocks touched since generation X, but AFAIK no one wrote
such a program yet.  It'd also be corruption-free if done in two passes:
first a racey copy, then fsfreeze(), then a copy of just the newest updates.
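
The skeleton such a tool would hang off; the incremental second pass is
exactly the part nobody has written:

dd if=/dev/sdX of=backup.img bs=1M conv=sparse   # pass 1: racey bulk copy
fsfreeze -f /mnt/fs                              # block writes, flush everything
# pass 2 (hypothetical): copy only blocks with a generation newer than pass 1
fsfreeze -u /mnt/fs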

> > Otherwise, as you deduce, you get the files, but not the snapshot history
> > or relationship, nor the subvolumes, which will look to normal file-level
> > backup software (that is, backup software not designed with btrfs-
> > specifics like subvolumes, which if it did, would likely use btrfs send/
> > receive at least optionally) like normal directories.

If the backup software does incrementals well, this is not as bad as it
sounds.  While rsync takes half an hour just to stat() a typical small piece
of spinning rust (obviously depending on # of files), that's still in the
acceptable range.  That backup software can then be told to back up every
snapshot in turn.  You still lose reflinks between unrelated subvolumes but
those tend to be quite rare -- and you can re-dedupe.
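
E.g. with rsync itself, hardlinking each snapshot's copy against the
previous one so unchanged files cost no extra space (paths assumed):

prev=
for snap in /mnt/pool/.snapshots/*; do
    name=$(basename "$snap")
    rsync -aHAX ${prev:+--link-dest="../$prev"} "$snap/" "nas:/backup/$name/"
    prev=$name
done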

> i apprehend that i have just a file based backup.  We use EMC Networker
> (version 8.1 or 8.2), and from what i read in the net i think it does not
> support BTRFS.  So i have to reinstall, which is maybe not the worst,
> because i'm thinking about using SLES 11 SP3.
> 
> What i know now is that i can't rely on our EMC backup.
> What would you propose to backup a complete btrfs partition
> (https://btrfs.wiki.kernel.org/index.php/Incremental_Backup) ?
> We have a NAS with propable enough space, and the servers aren't used
> heavily over night.  So using one of the mentioned tools in a cronjob over
> night is possible.

> Which tool do you recommend ?

It depends on what you use subvolumes for.

While a simple file-based backup may be inadequate for the general case, for
most actual uses it works well or at least well enough.  Only if you're
doing something special is bothering with the extra complexity worth it.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ Laws we want back: Poland, Dz.U. 1921 nr.30 poz.177 (also Dz.U. 
⣾⠁⢰⠒⠀⣿⡁ 1920 nr.11 poz.61): Art.2: An official, guilty of accepting a gift
⢿⡄⠘⠷⠚⠋⠀ or another material benefit, or a promise thereof, [in matters
⠈⠳⣄ relevant to duties], shall be punished by death by shooting.


[PATCH] btrfs: avoid misleading talk about "compression level 0"

2017-10-21 Thread Adam Borowski
Many compressors do assign a meaning to level 0: either null compression or
the lowest possible level.  This differs from our "unset thus default".
Thus, let's not unnecessarily confuse users.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fs/btrfs/super.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index f9d4522336db..144fabfbd246 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -551,7 +551,9 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
  compress_force != saved_compress_force)) ||
(!btrfs_test_opt(info, COMPRESS) &&
 no_compress == 1)) {
-   btrfs_info(info, "%s %s compression, level %d",
+   btrfs_printk(info, info->compress_level ?
+  KERN_INFO"%s %s compression, level %d" :
+  KERN_INFO"%s %s compression",
   (compress_force) ? "force" : "use",
   compress_type, info->compress_level);
}
-- 
2.15.0.rc1



Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Adam Borowski
On Wed, Oct 18, 2017 at 07:30:55AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 16:21, Adam Borowski wrote:
> > > > It's a single-device filesystem, thus disconnects are obviously fatal.  
> > > > But,
> > > > they never caused even a single bit of damage (as scrub goes), thus 
> > > > proving
> > > > btrfs handles this kind of disconnects well.  Unlike times past, the 
> > > > kernel
> > > > doesn't get confused thus no reboot is needed, merely an unmount, 
> > > > "service
> > > > nbd-client restart", mount, restart the rebuild jobs.
> > > That's expected behavior though.  _Single_ device BTRFS has nothing to get
> > > out of sync most of the time, the only time there's any possibility of an
> > > issue is when you die after writing the first copy of a block that's in a
> > > dup profile chunk, but even that is not very likely to cause problems
> > > (you'll just lose at most the last  worth of data).
> > 
> > How come?  In a DUP profile, the writes are: chunk 1, chunk 2, barrier,
> > superblock.  The two prior writes may be arbitrarily reordered -- both
> > between each other or even individual sectors inside the chunks, but unless
> > the disk lies about barriers, there's no way to have any corruption, thus
> > running scrub is not needed.
> If the device dies after writing chunk 1 but before the barrier, you end up
> needing scrub.  How much of a failure window is present is largely a
> function of how fast the device is, but there is a failure window there.

CoW is there to ensure there is _no_ failure window.  The new content
doesn't matter until there are live pointers to it -- from the filesystem's
point of view we merely scribbled something on an unused part of the block
device.  Only after all pieces are in place (as ensured by the barrier), the
superblock is updated with a reference to the new metadata->data chain.

Thus, no matter when a disconnect happens, after a crash you get either
uncorrupted old version or uncorrupted new version.

No scrub is ever needed for this reason on single device or on RAID1 that
didn't run degraded.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Adam Borowski
On Tue, Oct 17, 2017 at 03:19:09PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 13:06, Adam Borowski wrote:
> > On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-10-17 07:42, Zoltan wrote:
> > > > On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> > > > <ahferro...@gmail.com> wrote:
> > > > 
> > > > > I forget sometimes that people insist on storing large volumes of 
> > > > > data on
> > > > > unreliable storage...
> > > > 
> > > > In my opinion the unreliability of the storage is the exact reason for
> > > > wanting to use raid1. And I think any problem one encounters with an
> > > > unreliable disk can likely happen with more reliable ones as well,
> > > > only less frequently, so if I don't feel comfortable using raid1 on an
> > > > unreliable medium then I wouldn't trust it on a more reliable one
> > > > either.
> > 
> > > The thing is that you need some minimum degree of reliability in the other
> > > components in the storage stack for it to be viable to use any given 
> > > storage
> > > technology.  If you don't meet that minimum degree of reliability, then 
> > > you
> > > can't count on the reliability guarantees of the storage technology.
> > 
> > The thing is, reliability guarantees required vary WILDLY depending on your
> > particular use cases.  On one hand, there's "even an one-minute downtime
> > would cost us mucho $$$s, can't have that!" -- on the other, "it died?
> > Okay, we got backups, lemme restore it after the weekend".
> Yes, but if you are in the second case, you arguably don't need replication,
> and would be better served by improving the reliability of your underlying
> storage stack than trying to work around its problems. Even in that case,
> your overall reliability is still constrained by the least reliable
> component (in more idiomatic terms 'a chain is only as strong as its
> weakest link').

MD can handle this case well; there's no reason btrfs shouldn't do that too.
A RAID is not akin to a serially connected chain, it's a parallel one: while
pieces of the broken second chain hanging down from the first don't make it
strictly more resilient than having just a single chain, in the general case
it _is_ more reliable even if the other chain is weaker.

Don't we have a patchset that deals with marking a device as failed at
runtime floating on the mailing list?  I did not look at those patches yet,
but they are a step in this direction.

> Using replication with a reliable device and a questionable device is
> essentially the same as trying to add redundancy to a machine by adding an
> extra linkage that doesn't always work and can get in the way of the main
> linkage it's supposed to be protecting from failure.  Yes, it will work most
> of the time, but the system is going to be less reliable than it is without
> the 'redundancy'.

That's the current state of btrfs, but the design is sound, and reaching
more than parity with MD is a matter of implementation.

> > Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
> > the network driver allocates memory with GFP_NOIO which causes NBD
> > disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
> > be obvious but on regular filesystem where throwing out userspace memory is
> > safe).  The disconnects happen around once per week.
> Somewhat off-topic, but you might try looking at ATAoE as an alternative,
> it's more reliable in my experience (if you've got a reliable network),
> gives better performance (there's less protocol overhead than NBD, and it
> runs on top of layer 2 instead of layer 4)

I've tested it -- not on the Odroid-U2 but on Pine64 (fully working GbE). 
NBD delivers 108MB/sec in a linear transfer, ATAoE is lucky to break
40MB/sec, same target (Qnap-253a, spinning rust), both in default
configuration without further tuning.  NBD is over IPv6 for that extra 20
bytes per packet overhead.

Also, NBD can be encrypted or arbitrarily routed.

> > It's a single-device filesystem, thus disconnects are obviously fatal.  But,
> > they never caused even a single bit of damage (as scrub goes), thus proving
> > btrfs handles this kind of disconnects well.  Unlike times past, the kernel
> > doesn't get confused thus no reboot is needed, merely an unmount, "service
> > nbd-client restart", mount, restart the rebuild jobs.
> That's expected behavior though.  _Single_ device BTRFS has nothing to get
> out of sync most of the time, the only time there's any possibility of an
> issue is when you die after writing the first copy of a block that's in a
> dup profile chunk, but even that is not very likely to cause problems
> (you'll just lose at most the last  worth of data).

Re: Is it safe to use btrfs on top of different types of devices?

2017-10-17 Thread Adam Borowski
On Tue, Oct 17, 2017 at 08:40:20AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-17 07:42, Zoltan wrote:
> > On Tue, Oct 17, 2017 at 1:26 PM, Austin S. Hemmelgarn
> >  wrote:
> > 
> > > I forget sometimes that people insist on storing large volumes of data on
> > > unreliable storage...
> > 
> > In my opinion the unreliability of the storage is the exact reason for
> > wanting to use raid1. And I think any problem one encounters with an
> > unreliable disk can likely happen with more reliable ones as well,
> > only less frequently, so if I don't feel comfortable using raid1 on an
> > unreliable medium then I wouldn't trust it on a more reliable one
> > either.

> The thing is that you need some minimum degree of reliability in the other
> components in the storage stack for it to be viable to use any given storage
> technology.  If you don't meet that minimum degree of reliability, then you
> can't count on the reliability guarantees of the storage technology.

The thing is, reliability guarantees required vary WILDLY depending on your
particular use cases.  On one hand, there's "even an one-minute downtime
would cost us mucho $$$s, can't have that!" -- on the other, "it died? 
Okay, we got backups, lemme restore it after the weekend".

Lemme tell you a btrfs blockdev disconnects story.
I have an Odroid-U2, a cheap ARM SoC that, despite being 5 years old and
costing mere $79 (+$89 eMMC...) still beats the performance of much newer
SoCs that have far better theoretical specs, including subsequent Odroids.
After ~1.5 year of CPU-bound stress tests for one program, I switched this
machine to doing Debian package rebuilds, 24/7/365¼, for QA purposes.
Being a moron, I did not realize until pretty late that high parallelism to
keep all cores utilized is still a net performance loss when a memory-hungry
package goes into a swappeathon, even despite the latter being fairly rare.
Thus, I can say disk utilization was pretty much 100%, with almost as much
writing as reading.  The eMMC card endured all of this until very recently
(nowadays it sadly throws errors from time to time).

Thus, I switched the machine to NBD (albeit it sucks on 100Mbit eth).  Alas,
the network driver allocates memory with GFP_NOIO which causes NBD
disconnects (somehow, this doesn't ever happen on swap where GFP_NOIO would
be obvious but on regular filesystem where throwing out userspace memory is
safe).  The disconnects happen around once per week.

It's a single-device filesystem, thus disconnects are obviously fatal.  But,
they never caused even a single bit of damage (as scrub goes), thus proving
btrfs handles this kind of disconnects well.  Unlike times past, the kernel
doesn't get confused thus no reboot is needed, merely an unmount, "service
nbd-client restart", mount, restart the rebuild jobs.

I also can recreate this filesystem and the build environment on it with
just a few commands, thus, unlike /, there's no need for backups.  But I
had no need to recreate it yet.

This is single-device not RAID5, but it's a good example for an use case
where an unreliable storage medium is acceptable (even if the GFP_NOIO issue
is still worth fixing).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Imagine there are bandits in your house, your kid is bleeding out,
⢿⡄⠘⠷⠚⠋⠀ the house is on fire, and seven big-ass trumpets are playing in the
⠈⠳⣄ sky.  Your cat demands food.  The priority should be obvious...


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-16 Thread Adam Borowski
On Mon, Oct 16, 2017 at 01:27:40PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-10-16 12:57, Zoltan wrote:
> > On Mon, Oct 16, 2017 at 1:53 PM, Austin S. Hemmelgarn wrote:
> In an ideal situation, scrubbing should not be an 'only if needed' thing,
> even for a regular array that isn't dealing with USB issues. From a
> practical perspective, there's no way to know for certain if a scrub is
> needed short of reading every single file in the filesystem in it's
> entirety, at which point, you're just better off running a scrub (because if
> you _do_ need to scrub, you'll end up reading everything twice).

> [...]  There are three things to deal with here:
> 1. Latent data corruption caused either by bit rot, or by a half-write (that
> is, one copy got written successfully, then the other device disappeared
> _before_ the other copy got written).
> 2. Single chunks generated when the array is degraded.
> 3. Half-raid1 chunks generated by newer kernels when the array is degraded.

Note that any of the above other than bit rot affects only very recent data.
If we keep a record of the last known-good generation, all of that can be
enumerated, allowing us to make a selective scrub that checks only a small
part of the disk.  A linear read of an 8TB disk takes 14 hours...

If we ever get auto-recovery, this is a fine candidate.

> Scrub will fix problem 1 because that's what it's designed to fix.  it will
> also fix problem 3, since that behaves just like problem 1 from a
> higher-level perspective.  It won't fix problem 2 though, as it doesn't look
> at chunk types (only if the data in the chunk doesn't have the correct
> number of valid copies).

Here not even tracking generations is required: a soft convert balance
touches only bad chunks.  Again, would work well for auto-recovery, as it's
a no-op if all is well.
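
In command form, once the array is whole again (mount point assumed):

btrfs balance start -dconvert=raid1,soft -mconvert=raid1,soft /mnt
        # 'soft' skips chunks already in the target profile,
        # so this is nearly a no-op on a healthy filesystem
btrfs scrub start -B /mnt   # then repair stale copies by checksum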

> In contrast, the balance command you quoted won't fix issue 1 (because it
> doesn't validate checksums or check that data has the right number of
> copies), or issue 3 (because it's been told to only operate on non-raid1
> chunks), but it will fix issue 2.
> 
> In comparison to both of the above, a full balance without filters will fix
> all three issues, although it will do so less efficiently (in terms of both
> time and disk usage) than running a soft-conversion balance followed by a
> scrub.

"less efficiently" is an understatement.  Scrub gets a good part of
theoretical linear speed, while I just had a single metadata block take
14428 seconds to balance.

> In the case of normal usage, device disconnects are rare, so you should
> generally be more worried about latent data corruption.

Yeah, but certain setups (like anything USB) get disconnects quite often.
It would be nice to get them right.  MD, thanks to its write-intent bitmap,
can recover almost instantly; btrfs could do it better -- the code to do so
isn't written yet.

> monitor the kernel log to watch for device disconnects, remount the
> filesystem when the device reconnects, and then run the balance command
> followed by a scrub.  With most hardware I've seen, USB disconnects tend to
> be relatively frequent unless you're using very high quality cabling and
> peripheral devices.  If, however, they happen less than once a day most of
> the time, just set up the log monitor to remount, and set the balance and
> scrub commands on the schedule I suggested above for normal usage.

A day-long recovery for an event that happens daily isn't a particularly
enticing prospect.

-- 
⢀⣴⠾⠻⢶⣦⠀ Meow!
⣾⠁⢰⠒⠀⣿⡁
⢿⡄⠘⠷⠚⠋⠀ I was born a dumb, ugly and work-loving kid, then I got swapped on
⠈⠳⣄ the maternity ward.


Re: Give up on bcache?

2017-09-26 Thread Adam Borowski
On Tue, Sep 26, 2017 at 11:33:19PM +0500, Roman Mamedov wrote:
> On Tue, 26 Sep 2017 16:50:00 + (UTC)
> Ferry Toth  wrote:
> 
> > https://www.phoronix.com/scan.php?page=article=linux414-bcache-raid=2
> > 
> > I think it might be idle hopes to think bcache can be used as a ssd cache 
> > for btrfs to significantly improve performance..
> 
> My personal real-world experience shows that SSD caching -- with lvmcache --
> does indeed significantly improve performance of a large Btrfs filesystem with
> slowish base storage.
> 
> And that article, sadly, only demonstrates once again the general mediocre
> quality of Phoronix content: it is an astonishing oversight to not check out
> lvmcache in the same setup, to at least try to draw some useful conclusion, is
> it Bcache that is strangely deficient, or SSD caching as a general concept
> does not work well in the hardware setup utilized.

Also, it looks as if Phoronix' tests don't stress metadata at all.  Btrfs is
all about metadata, speeding it up greatly helps most workloads.

A pipe-dream wishlist would be:
* store and access master copy of metadata on SSD only
* pin all data blocks referenced by generations not yet mirrored
* slowly copy over metadata to HDD

-- 
⢀⣴⠾⠻⢶⣦⠀ We domesticated dogs 36000 years ago; together we chased
⣾⠁⢰⠒⠀⣿⡁ animals, hung out and licked or scratched our private parts.
⢿⡄⠘⠷⠚⠋⠀ Cats domesticated us 9500 years ago, and immediately we got
⠈⠳⣄ agriculture, towns then cities. -- whitroth on /.


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-18 Thread Adam Borowski
On Wed, Sep 13, 2017 at 08:21:01AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 17:13, Adam Borowski wrote:
> > On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-09-12 16:00, Adam Borowski wrote:
> > > > Noted.  Both Marat's and my use cases, though, involve VMs that are off 
> > > > most
> > > > of the time, and at least for me, turned on only to test something.
> > > > Touching mtime makes rsync run again, and it's freaking _slow_: worse 
> > > > than
> > > > 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
> > > 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
> > > you're going direct to a hard drive.  I get better performance than that 
> > > on
> > > my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s 
> > > there,
> > > but it's for archival storage so I don't really care).  I'm actually 
> > > curious
> > > what the exact rsync command you are using is (you can obviously redact
> > > paths as you see fit), as the only way I can think of that it should be 
> > > that
> > > slow is if you're using both --checksum (but if you're using this, you can
> > > tell rsync to skip the mtime check, and that issue goes away) and 
> > > --inplace,
> > > _and_ your HDD is slow to begin with.
> >
> > rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ 
> > mordor:$BASE/qemu
> > The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> > with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
> compress=zlib is probably your biggest culprit.  As odd as this sounds, I'd
> suggest switching that to lzo (seriously, the performance difference is
> ludicrous), and then setting up a cron job (or systemd timer) to run defrag
> over things to switch to zlib.  As a general point of comparison, we do
> archival backups to a file server running BTRFS where I work, and the
> archiving process runs about four to ten times faster if we take this
> approach (LZO for initial compression, then recompress using defrag once the
> initial transfer is done) than just using zlib directly.
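
In commands, that two-pass recipe would be roughly (paths assumed):

mount -o remount,compress=lzo /backup
rsync -aX /mnt/btr1/qemu/ /backup/qemu/              # fast initial transfer
btrfs filesystem defragment -r -czlib /backup/qemu   # recompress later, e.g. from cron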

Turns out that lzo is actually the slowest, but only by a bit.

I tried a different disk, in the same Qnap; also an old disk but 7200 rpm
rather than 5400.  Mostly empty, only a handful of subvolumes, not much
reflinking.  I made three separate copies, ran fallocate -d, upgraded Windows
inside the VM, then:

[/mnt/btr1/qemu]$ for x in none lzo zlib;do time rsync -axX --delete --inplace --numeric-ids win10.img mordor:/SOME/DIR/$x/win10.img;done

real31m37.459s
user27m21.587s
sys 2m16.210s

real33m28.258s
user27m19.745s
sys 2m17.642s

real32m57.058s
user27m24.297s
sys 2m17.640s

Note the "user" values.  So rsync does something bad on the source side.

Despite fragmentation, reads on the source are not a problem:

[/mnt/btr1/qemu]$ time cat win10.img > /dev/null

real1m28.815s
user0m0.061s
sys 0m48.094s
[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img 
win10.img: 63682 extents found
[/mnt/btr1/qemu]$ btrfs fi def win10.img
[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img 
win10.img: 18015 extents found
[/mnt/btr1/qemu]$ time cat win10.img > /dev/null

real1m17.879s
user0m0.076s
sys 0m37.757s

> `--inplace` is probably not helping (especially if most of the file changed,
> on BTRFS, it actually is marginally more efficient to just write out a whole
> new file and then replace the old one with a rename if you're rewriting most
> of the file), but is probably not as much of an issue as compress=zlib.

Yeah, scp + dedupe would run faster.  For deduplication, instead of
duperemove it'd be better to call file_extent_same on the first 128K, then
the second, ... -- without even hashing the blocks beforehand.
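
A hedged sketch of that idea, leaning on xfs_io's dedupe command (it drives
the same ioctl and verifies contents in the kernel, so no pre-hashing):

src=old/win10.img dst=new/win10.img
size=$(stat -c %s "$dst")
off=0
while [ "$off" -lt "$size" ]; do
        xfs_io -c "dedupe $src $off $off 131072" "$dst" >/dev/null
        off=$((off + 131072))
done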

Not that this particular VM takes enough backup space to make spending too
much time worthwhile, but it's a good test case for performance issues like
this.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


[RFC PATCH 1/3] btrfs: allow to set compression level for zlib

2017-09-15 Thread Adam Borowski
From: David Sterba 

Preliminary support for setting compression level for zlib, the
following works:

$ mount -o compress=zlib            # default
$ mount -o compress=zlib0           # same
$ mount -o compress=zlib9           # level 9, slower sync, less data
$ mount -o compress=zlib1           # level 1, faster sync, more data
$ mount -o remount,compress=zlib3   # level set by remount

The level is visible in the same format in /proc/mounts. Level set via
file property does not work yet.

Required patch: "btrfs: prepare for extensions in compression options"

Signed-off-by: David Sterba 
---
 fs/btrfs/compression.c | 20 +++-
 fs/btrfs/compression.h |  6 +-
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/inode.c   |  5 -
 fs/btrfs/lzo.c |  5 +
 fs/btrfs/super.c   |  7 +--
 fs/btrfs/zlib.c| 12 +++-
 7 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index b51d23f5cafa..70a50194fcf5 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -867,6 +867,11 @@ static void free_workspaces(void)
  * Given an address space and start and length, compress the bytes into @pages
  * that are allocated on demand.
  *
+ * @type_level is encoded algorithm and level, where level 0 means whatever
+ * default the algorithm chooses and is opaque here;
+ * - compression algos are 0-3
+ * - the level is in bits 4-7
+ *
  * @out_pages is an in/out parameter, holds maximum number of pages to allocate
  * and returns number of actually allocated pages
  *
@@ -881,7 +886,7 @@ static void free_workspaces(void)
  * @max_out tells us the max number of bytes that we're allowed to
  * stuff into pages
  */
-int btrfs_compress_pages(int type, struct address_space *mapping,
+int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
 u64 start, struct page **pages,
 unsigned long *out_pages,
 unsigned long *total_in,
@@ -889,9 +894,11 @@ int btrfs_compress_pages(int type, struct address_space *mapping,
 {
struct list_head *workspace;
int ret;
+   int type = type_level & 0xF;
 
workspace = find_workspace(type);
 
+   btrfs_compress_op[type - 1]->set_level(workspace, type_level);
ret = btrfs_compress_op[type-1]->compress_pages(workspace, mapping,
  start, pages,
  out_pages,
@@ -1081,3 +1088,14 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
 
return ret;
 }
+
+unsigned int btrfs_compress_str2level(const char *str)
+{
+   if (strncmp(str, "zlib", 4) != 0)
+   return 0;
+
+   if ('1' <= str[4] && str[4] <= '9' )
+   return str[4] - '0';
+
+   return 0;
+}
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index d2781ff8f994..da20755ebf21 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -76,7 +76,7 @@ struct compressed_bio {
 void btrfs_init_compress(void);
 void btrfs_exit_compress(void);
 
-int btrfs_compress_pages(int type, struct address_space *mapping,
+int btrfs_compress_pages(unsigned int type_level, struct address_space *mapping,
 u64 start, struct page **pages,
 unsigned long *out_pages,
 unsigned long *total_in,
@@ -95,6 +95,8 @@ blk_status_t btrfs_submit_compressed_write(struct inode *inode, u64 start,
 blk_status_t btrfs_submit_compressed_read(struct inode *inode, struct bio *bio,
 int mirror_num, unsigned long bio_flags);
 
+unsigned btrfs_compress_str2level(const char *str);
+
 enum btrfs_compression_type {
BTRFS_COMPRESS_NONE  = 0,
BTRFS_COMPRESS_ZLIB  = 1,
@@ -124,6 +126,8 @@ struct btrfs_compress_op {
  struct page *dest_page,
  unsigned long start_byte,
  size_t srclen, size_t destlen);
+
+   void (*set_level)(struct list_head *ws, unsigned int type);
 };
 
 extern const struct btrfs_compress_op btrfs_zlib_compress;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 5a8933da39a7..dd07a7ef234c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -791,6 +791,7 @@ struct btrfs_fs_info {
 */
unsigned long pending_changes;
unsigned long compress_type:4;
+   unsigned int compress_level;
int commit_interval;
/*
 * It is a suggestive number, the read side is safe even it gets a
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 128f3e58634f..28201b924575 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -530,7 +530,10 @@ static noinline void compress_file_range(struct inode *inode,
 */
extent_range_clear_dirty_for_io(inode, 

[RFC PATCH 2/3] btrfs: allow setting zlib compression level via :9

2017-09-15 Thread Adam Borowski
This is bikeshedding, but it seems people are drastically more likely to
understand "zlib:9" as compression level rather than an algorithm version
compared to "zlib9".

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fs/btrfs/compression.c | 2 ++
 fs/btrfs/super.c   | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 70a50194fcf5..71782ec976c7 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1096,6 +1096,8 @@ unsigned int btrfs_compress_str2level(const char *str)
 
if ('1' <= str[4] && str[4] <= '9' )
return str[4] - '0';
+   if (str[4] == ':' && '1' <= str[5] && str[5] <= '9')
+   return str[5] - '0';
 
return 0;
 }
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 5467187701ef..537e04120457 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1248,7 +1248,7 @@ static int btrfs_show_options(struct seq_file *seq, struct dentry *dentry)
else
seq_printf(seq, ",compress=%s", compress_type);
if (info->compress_level)
-   seq_printf(seq, "%d", info->compress_level);
+   seq_printf(seq, ":%d", info->compress_level);
}
if (btrfs_test_opt(info, NOSSD))
seq_puts(seq, ",nossd");
-- 
2.14.1



[RFC PATCH 3/3] btrfs: allow setting zstd level

2017-09-15 Thread Adam Borowski
Capped at 15 because of currently used encoding, which is also a reasonable
limit because highest levels shine only on blocks much bigger than btrfs'
128KB.

Memory is allocated for the biggest supported level rather than for
what is actually used.

Signed-off-by: Adam Borowski <kilob...@angband.pl>
---
 fs/btrfs/compression.c | 21 +++--
 fs/btrfs/props.c   |  2 +-
 fs/btrfs/super.c   |  3 ++-
 fs/btrfs/zstd.c| 24 +++-
 4 files changed, 37 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 71782ec976c7..2d4337756fef 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1091,13 +1091,22 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
 
 unsigned int btrfs_compress_str2level(const char *str)
 {
-   if (strncmp(str, "zlib", 4) != 0)
+   long level;
+   int max;
+
+   if (strncmp(str, "zlib", 4) == 0)
+   max = 9;
+   else if (strncmp(str, "zstd", 4) == 0)
+   max = 15; // encoded on 4 bits, real max is 22
+   else
return 0;
 
-   if ('1' <= str[4] && str[4] <= '9' )
-   return str[4] - '0';
-   if (str[4] == ':' && '1' <= str[5] && str[5] <= '9')
-   return str[5] - '0';
+   str += 4;
+   if (*str == ':')
+   str++;
 
-   return 0;
+   if (kstrtoul(str, 10, &level))
+   return 0;
+
+   return (level > max) ? 0 : level;
 }
diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index f6a05f836629..2e35aa2b2d79 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -414,7 +414,7 @@ static int prop_compression_apply(struct inode *inode,
type = BTRFS_COMPRESS_LZO;
else if (!strncmp("zlib", value, 4))
type = BTRFS_COMPRESS_ZLIB;
-   else if (!strncmp("zstd", value, len))
+   else if (!strncmp("zstd", value, 4))
type = BTRFS_COMPRESS_ZSTD;
else
return -EINVAL;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 537e04120457..f9d4522336db 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -515,9 +515,10 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options,
btrfs_clear_opt(info->mount_opt, NODATASUM);
btrfs_set_fs_incompat(info, COMPRESS_LZO);
no_compress = 0;
-   } else if (strcmp(args[0].from, "zstd") == 0) {
+   } else if (strncmp(args[0].from, "zstd", 4) == 0) {
compress_type = "zstd";
info->compress_type = BTRFS_COMPRESS_ZSTD;
+   info->compress_level = btrfs_compress_str2level(args[0].from);
btrfs_set_opt(info->mount_opt, COMPRESS);
btrfs_clear_opt(info->mount_opt, NODATACOW);
btrfs_clear_opt(info->mount_opt, NODATASUM);
diff --git a/fs/btrfs/zstd.c b/fs/btrfs/zstd.c
index 607ce47b483a..99e11cb2d60e 100644
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -26,11 +26,14 @@
 #define ZSTD_BTRFS_MAX_WINDOWLOG 17
 #define ZSTD_BTRFS_MAX_INPUT (1 << ZSTD_BTRFS_MAX_WINDOWLOG)
 #define ZSTD_BTRFS_DEFAULT_LEVEL 3
+// Max supported by the algorithm is 22, but gains for small blocks (128KB)
+// are limited, thus we cap at 15.
+#define ZSTD_BTRFS_MAX_LEVEL 15
 
-static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len)
+static ZSTD_parameters zstd_get_btrfs_parameters(size_t src_len, int level)
 {
-   ZSTD_parameters params = ZSTD_getParams(ZSTD_BTRFS_DEFAULT_LEVEL,
-   src_len, 0);
+   ZSTD_parameters params = ZSTD_getParams(level, src_len, 0);
+   BUG_ON(level > ZSTD_maxCLevel());
 
if (params.cParams.windowLog > ZSTD_BTRFS_MAX_WINDOWLOG)
params.cParams.windowLog = ZSTD_BTRFS_MAX_WINDOWLOG;
@@ -43,6 +46,7 @@ struct workspace {
size_t size;
char *buf;
struct list_head list;
+   int level;
 };
 
 static void zstd_free_workspace(struct list_head *ws)
@@ -57,7 +61,8 @@ static void zstd_free_workspace(struct list_head *ws)
 static struct list_head *zstd_alloc_workspace(void)
 {
ZSTD_parameters params =
-   zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT);
+   zstd_get_btrfs_parameters(ZSTD_BTRFS_MAX_INPUT,
+ ZSTD_BTRFS_MAX_LEVEL);
struct workspace *workspace;
 
workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
@@ -101,7 +106,7 @@ static int zstd_compress_pages(struct list_head *ws,
unsigned long len = *total_out;
cons

[RFC 0/3]: settable compression level for zstd

2017-09-15 Thread Adam Borowski
Hi!
Here's a patch set that allows changing the compression level for zstd,
currently at mount time only.  I've played with it for a month, so despite
being a quick hack, it's reasonably well tested.  Tested on 4.13 +
btrfs-for-4.14 only, though -- I've booted 11th-day-of-merge-window only an
hour ago on one machine, no explosions yet.

As a quick hack, it doesn't conserve memory as it should: all workspace
allocations assume level 15 and waste space otherwise.

Because of an (easily changeable) quirk of compression level encoding, the
max is set at 15, but I guess higher levels are pointless for 128KB blocks. 
Nick and co can tell us more -- for me zstd is mostly a black box so it's
you who knows anything about tuning it.

There are three patches:
* [David Sterba] btrfs: allow to set compression level for zlib
  Unmodified version of the patch from Jul 24, I'm re-sending it for
  convenience.
* btrfs: allow setting zlib compression level via :9
  Some bikeshedding: it looks like Chris Mason also favours zlib:9 over
  zlib9 as the former is more readable.  If you disagree... well, it's up
  to you to decide anyway.  This patch accepts both syntaxes.
* btrfs: allow setting zstd level


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Adam Borowski
On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 16:00, Adam Borowski wrote:
> > Noted.  Both Marat's and my use cases, though, involve VMs that are off most
> > of the time, and at least for me, turned on only to test something.
> > Touching mtime makes rsync run again, and it's freaking _slow_: worse than
> > 40 minutes for a 40GB VM (source:SSD target:deduped HDD).
> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
> you're going direct to a hard drive.  I get better performance than that on
> my somewhat pathetic NUC based storage cluster (I get roughly 20 MB/s there,
> but it's for archival storage so I don't really care).  I'm actually curious
> what the exact rsync command you are using is (you can obviously redact
> paths as you see fit), as the only way I can think of that it should be that
> slow is if you're using both --checksum (but if you're using this, you can
> tell rsync to skip the mtime check, and that issue goes away) and --inplace,
> _and_ your HDD is slow to begin with.

rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
with nothing notable on SMART, in a Qnap 253a, kernel 4.9.

Both source and target are btrfs, but here switching to send|receive
wouldn't give much as this particular guest is Win10 Insider Edition --
a thingy that shows what the folks from Redmond have cooked up, with roughly
weekly updates to the tune of ~10GB writes and 10GB deletions (even if they do
incremental transfers, installation still rewrites the whole system).

Lemme look a bit more, rsync performance is indeed really abysmal compared
to what it should be.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Adam Borowski
On Tue, Sep 12, 2017 at 03:11:52PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 14:43, Adam Borowski wrote:
> > On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-09-12 13:21, Adam Borowski wrote:
> > > > There's fallocate -d, but that for some reason touches mtime which makes
> > > > rsync go again.  This can be handled manually but is still not nice.
> > 
> > Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
> > calling it on regions that are already zero.  The ioctl doesn't know that,
> > so fallocate would have to restore the mtime by itself.
> > 
> > There's also another problem: such a check + ioctl are racey.  Unlike defrag
> > or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
> > suddenly become in use).  Fixing this would need kernel support, either as
> > FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.
> A new fallocate mode would be more likely.  Adding special code to the
> EXTENT_SAME ioctl and then requiring implementation on filesystems that
> don't otherwise support it is not likely to get anywhere.  A new fallocate
> mode though would be easy, especially considering that a naive
> implementation is easy

Sounds like a good idea.  If we go this way, there's a question about
interface: there's a choice between:
A) check if the whole range is zero, if even a single bit is one, abort
B) dig many holes, with a given granulation (perhaps left to the
   filesystem's choice)
or even both.  The former is more consistent with FILE_EXTENT_SAME, the
latter can be smarter (like, digging a 4k hole is bad for fragmentation but
replacing a whole extent, no matter how small, is always a win).

> That said, I'm not 100% certain if it's necessary.  Intentionally calling
> fallocate on a file in use is not something most people are going to do
> normally anyway, since there is already a TOCTOU race in the fallocate -d
> implementation as things are right now.

_Current_ fallocate -d suffers from races, the whole gain from doing this
kernel-side would be eliminating those races.  Use cases about the same as
FILE_EXTENT_SAME: you don't need to stop the world.  Heck, as I mentioned
before, it conceptually _is_ FILE_EXTENT_SAME with /dev/zero, other than
your (good) point about non-btrfs non-xfs.

> > For now, though, I wonder -- should we send fine folks at util-linux a patch
> > to make fallocate -d restore mtime, either always or on an option?
> It would need to be an option, because it also suffers from a TOCTOU race
> (other things might have changed the mtime while you were punching holes),
> and it breaks from existing behavior.  I think such an option would be
> useful, but not universally (for example, I don't care if the mtime on my VM
> images changes, as it typically matches the current date and time since the
> VM's are running constantly other than when doing maintenance like punching
> holes in the images).

Noted.  Both Marat's and my use cases, though, involve VMs that are off most
of the time, and at least for me, turned on only to test something. 
Touching mtime makes rsync run again, and it's freaking _slow_: worse than
40 minutes for a 40GB VM (source:SSD target:deduped HDD).


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Adam Borowski
On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 13:21, Adam Borowski wrote:
> > There's fallocate -d, but that for some reason touches mtime which makes
> > rsync go again.  This can be handled manually but is still not nice.

> It touches mtime because it updates the block allocations, which in turn
> touch ctime, which on most (possibly all, not sure though) POSIX systems
> implies an mtime update.  It's essentially the same as truncate updating the
> mtime when you extend the file, the only difference is that the
> FALLOCATE_PUNCH_HOLES ioctl doesn't change the file size.

Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
calling it on regions that are already zero.  The ioctl doesn't know that,
so fallocate would have to restore the mtime by itself.

There's also another problem: such a check + ioctl are racey.  Unlike defrag
or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
suddenly become in use).  Fixing this would need kernel support, either as
FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.

For now, though, I wonder -- should we send fine folks at util-linux a patch
to make fallocate -d restore mtime, either always or on an option?
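
Until then, userspace can do it by hand, races and all (a minimal sketch):

ref=$(mktemp)
touch -r win10.img "$ref"   # stash the original timestamps
fallocate -d win10.img      # dig holes; this bumps mtime
touch -r "$ref" win10.img   # put the old mtime back so rsync stays quiet
rm -f "$ref"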


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Adam Borowski
On Tue, Sep 12, 2017 at 02:26:39PM +0300, Marat Khalili wrote:
> On 12/09/17 14:12, Adam Borowski wrote:
> > Why would you need support in the hypervisor if cp --reflink=always is
> > enough?
> +1 :)
> 
> But I've already found one problem: I use rsync snapshots for backups, and
> although rsync does have --sparse argument, apparently it conflicts with
> --inplace. You cannot have all nice things :(

There's fallocate -d, but that for some reason touches mtime which makes
rsync go again.  This can be handled manually but is still not nice.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Adam Borowski
On Tue, Sep 12, 2017 at 02:01:53PM +0300, Timofey Titovets wrote:
> > On 12/09/17 13:32, Adam Borowski wrote:
> >> Just use raw -- btrfs already has every feature that qcow2 has, and
> >> does it better.  This doesn't mean btrfs is the best choice for hosting
> >> VM files, just that raw-over-btrfs is strictly better than
> >> qcow2-over-btrfs.
> >
> > Thanks for advice, I wasn't sure I won't lose features, and was too lazy to
> > investigate/ask. Now it looks simple.
> 
> The main problem with Raw over Btrfs is that (IIRC) no one supports
> btrfs features.
> 
>  - Patches for libvirt are not merged and obsolete
>  - Patches for Proxmox are also not merged
>  - Other VM hypervisors like VirtualBox and VMware just ignore btrfs features.
> 
> So with raw you will have problems like: no snapshot support

Why would you need support in the hypervisor if cp --reflink=always is
enough?  Likewise, I wouldn't expect hypervisors to implement support for
every dedup tool -- it'd be a layering violation[1].  It's not emacs or
systemd, you really can use an external tool instead of adding a lawnmower
to the kitchen sink.
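
For example (sketch; image name made up, and it assumes the guest is shut down
or quiesced first so the image is consistent):

  cp --reflink=always vm.raw vm.raw.$(date +%Y%m%dT%H%M)

The copy is instant and shares all extents until either side gets written to.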


Meow!

[1] Yeah, talking about layering violations in btrfs context is a bit weird,
but it's better to at least try.
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: qemu-kvm VM died during partial raid1 problems of btrfs

2017-09-12 Thread Adam Borowski
On Tue, Sep 12, 2017 at 10:01:07AM +, Duncan wrote:
> BTW, I am most definitely /not/ a VM expert, and won't pretend to 
> understand the details or be able to explain further, but IIRC from what 
> I've read on-list, qcow2 isn't the best alternative for hosting VMs on 
> top of btrfs.  Something about it being cow-based as well, which means cow
> (qcow2)-on-cow(btrfs), which tends to lead to /extreme/ fragmentation, 
> leading to low performance.
> 
> I don't know enough about it to know what the alternatives to qcow2 are, 
> but something that is not itself cow when it's on cow-based btrfs would
> presumably be a better alternative.

Just use raw -- btrfs already has every feature that qcow2 has, and does it
better.  This doesn't mean btrfs is the best choice for hosting VM files,
just that raw-over-btrfs is strictly better than qcow2-over-btrfs.

And like qcow2, with raw over btrfs you have the choice between a fully
pre-written nocow file and a sparse file.  For the latter, you want discard
in the guest (not supported over ide and virtio, supported over scsi and
virtio-scsi), and you get the full list of btrfs goodies like snapshots or
dedup.
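
A rough sketch of both setups (file names and sizes made up):

  # fully pre-written nocow image:
  touch vm-a.raw
  chattr +C vm-a.raw            # nocow has to be set while the file is empty
  fallocate -l 40G vm-a.raw
  # sparse image, relying on in-guest discard:
  truncate -s 40G vm-b.raw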


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity.  You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so.  I recommend Skepticism
⠈⠳⣄ (funeral doom metal).


Re: \o/ compsize

2017-09-05 Thread Adam Borowski
On Mon, Sep 04, 2017 at 10:33:40PM +0200, A L wrote:
> On 9/4/2017 5:11 PM, Adam Borowski wrote:
> > Hi!
> > Here's a utility to measure used compression type + ratio on a set of files
> > or directories: https://github.com/kilobyte/compsize
> 
> Great tool. Just tried it on some of my backup snapshots.
> 
>    # compsize portage.20170904T2200
>    142432 files.
>    all   78%  329M/ 422M
>    none 100%  227M/ 227M
>    zlib  52%  102M/ 195M
> 
>    # du -sh  portage.20170904T2200
>    787M    portage.20170904T2200
> 
>    # btrfs fi du -s  portage.20170904T2200
>     Total   Exclusive  Set shared  Filename
>     271.61MiB 6.34MiB   245.51MiB portage.20170904T2200
> 
> Interesting results. How do I interpret them?

I've added some documentation; especially in the man page.

(Sorry for not pushing this earlier, Timofey went wild on this tool and I
wanted to avoid conflicts.)

> Compsize also doesn't seem to like some non-standard files and throws an
> error (even though they should be ignored?):
> 
> # compsize usb-backup/volumes/root/root.20170727T2321/
> open("usb-backup/volumes/root/root.20170727T2321//tmp/screen/S-root/2757.pts-1.e350"):
> No such device or address
> 
> # dir
> usb-backup/volumes/root/root.20170727T2321//tmp/screen/S-root/2757.pts-1.e350
> srwx-- 1 root root 0 Dec 31  2015 
> usb-backup/volumes/root/root.20170727T2321//tmp/screen/S-root/2757.pts-1.e350=

Fixed.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: \o/ compsize

2017-09-04 Thread Adam Borowski
On Mon, Sep 04, 2017 at 07:07:25PM +0300, Timofey Titovets wrote:
> 2017-09-04 18:11 GMT+03:00 Adam Borowski <kilob...@angband.pl>:
> > Here's a utility to measure used compression type + ratio on a set of files
> > or directories: https://github.com/kilobyte/compsize
> >
> > It should be of great help for users, and also if you:
> > * muck with compression levels
> > * add new compression types
> > * add heuristics that could err on withholding compression too much
> 
> Packaged to AUR:
> https://aur.archlinux.org/packages/compsize-git/

Cool!  I'd wait until people say the code is sane (I don't really know these
ioctls) but if you want to make poor AUR folks our beta testers, that's ok.

However, one issue: I did not set a license; your packaging says GPL3. 
It would be better to have something compatible with btrfs-progs which are
GPL2-only.  What about GPL2-or-higher?

After adding some related info (like wasted space in pinned extents, reuse
of extents), it'd be nice to have this tool inside btrfs-progs, either as a
part of "fi du" or another command.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


\o/ compsize

2017-09-04 Thread Adam Borowski
Hi!
Here's a utility to measure used compression type + ratio on a set of files
or directories: https://github.com/kilobyte/compsize

It should be of great help for users, and also if you:
* muck with compression levels
* add new compression types
* add heuristics that could err on withholding compression too much

(Thanks to Knorrie and his python-btrfs project that made figuring out the
ioctls much easier.)

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: How to disable/revoke 'compression'?

2017-09-04 Thread Adam Borowski
On Sun, Sep 03, 2017 at 08:30:59PM +0200, Hans van Kranenburg wrote:
> On 09/03/2017 08:06 PM, Adam Borowski wrote:
> > On Sun, Sep 03, 2017 at 07:32:01PM +0200, Cloud Admin wrote:
> >> Besides that, is it possible to find out the real and compressed size
> >> of a file, for example, or the ratio?
> > 
> > Currently not.
> > 
> > I've once written a tool which does this, but 1. it's extremely slow, 2.
> > insane, 3. so insane a certain member of this list would kill me had I
> > distributed the tool.  Thus, I'd need to rewrite it first...
> 
> Heh, I wouldn't do that, since I need you to do my debian uploads. :D
> 
> But it would certainly help to be a bit less stubborn only wanting to
> code in the language that matches your country code. :O

It's even funnier that I prefer the C.UTF-8 locale over pl_PL.UTF-8, and
the C code works several orders of magnitude faster than the Perl+Python
version.

Ok, ok, in this case it's C++ but only because I couldn't be arsed to
implement a  the hard way.

> Or maybe I can help a bit, since it sounds like a nice one for the
> coding examples in the lib. ;]

The lib is a great help -- the ioctls are totally undocumented, your code is
a lot easier than RTFKing.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: How to disable/revoke 'compression'?

2017-09-03 Thread Adam Borowski
On Mon, Sep 04, 2017 at 07:55:27AM +0800, Qu Wenruo wrote:
> On 2017年09月04日 02:06, Adam Borowski wrote:
> > I've once written a tool which does this, but 1. it's extremely slow, 2.
> > insane, 3. so insane a certain member of this list would kill me had I
> > distributed the tool.  Thus, I'd need to rewrite it first...
> 
> AFAIK the only method to determine the compression ratio is to check the
> EXTENT_DATA key and its corresponding file_extent_item structure.
> (Which I assume Adam is doing this way)
> 
> In that structure it records its on-disk data size and in-memory data size.
> (All rounded up to sectorsize, which is 4K in most cases.)
> So in theory it's possible to determine the compression ratio.
> 
> The only method I can think of (maybe I forgot some methods?) is to use
> offline tool (btrfs-debug-tree) to check that.
> FS APIs like fiemap don't even support reporting on-disk data size, so we
> can't use them.

BTRFS_IOC_TREE_SEARCH_V2 returns all we want to know; its only downside is
being root only.

> But the problem is more complicated, especially when compressed CoW is
> involved.
> 
> For example, there is an extent (A) which represents the data for inode 258,
> range [0,128k).
> On disk size its just 4K.
> 
> And when we write the range [32K, 64K), which get CoWed and compressed,
> resulting a new file extent (B) for inode 258, range [32K, 64K), and on disk
> size is 4K as an example.
> 
> Then file extent layout for 258 will be:
> [0,32k):  range [0,32K) of uncompressed Extent A
> [32k, 64k): range [0,32k) of uncompressed Extent B
> [64k, 128k): range [64k, 128K) of uncompressed Extent A.
> 
> And on disk extent size is 4K (compressed Extent A) + 4K (compressed Extent
> B) = 8K.
> 
> Before the write, the compression ratio is 4K/128K = 3.125%
> While after write, the compression ratio is 8K/128K = 6.25%

There's no real meaningful way to speak about compression ratio of a partial
extent.  Thus, I decided to, for every extent, take compressed:uncompressed
sizes of the whole extent, no matter whether the file uses only a few bytes
of that extent or references it a thousand times.

> Not to mention that it's possible to have uncompressed file extent.

Yeah, the tool gives a report like:
all   74%  9.2M/  13M
lzo   68%  7.1M/  11M
none 100%  2.1M/ 2.1M
as you typically have a mix of compressible and uncompressible data.


喵!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: How to disable/revoke 'compression'?

2017-09-03 Thread Adam Borowski
On Sun, Sep 03, 2017 at 07:32:01PM +0200, Cloud Admin wrote:
> Hi,
> I used the mount option 'compression' on some mounted sub volumes. How
> can I revoke the compression? That is, delete the option and get all
> data uncompressed on this volume.
> Is it enough to remount the sub volume without this option? Or is it
> necessary to do some additional step (balancing?) to get all stored data
> uncompressed.

If you set it via mount option, removing the option is enough to disable
compression for _new_ files.  Other ways are chattr +c and btrfs-property,
but if you haven't heard about those you almost surely don't have such
attributes set.

After remounting, you may uncompress existing files.  Balancing won't do
this as it moves extents around without looking inside; defrag on the other
hand rewrites extents thus as a side effect it applies new [non]compression
settings.  Thus: 「btrfs fi defrag -r /path/to/filesystem」.

> Besides that, is it possible to find out the real and compressed size
> of a file, for example, or the ratio?

Currently not.

I've once written a tool which does this, but 1. it's extremely slow, 2.
insane, 3. so insane a certain member of this list would kill me had I
distributed the tool.  Thus, I'd need to rewrite it first...


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: status of inline deduplication in btrfs

2017-08-28 Thread Adam Borowski
On Mon, Aug 28, 2017 at 12:49:10PM +0530, shally verma wrote:
> I'm a bit confused over here: is your description based on offline dedupe
> here, or is it with inline deduplication?

It doesn't matter _how_ you get to excessive reflinking, the resulting
slowdown is the same.

By the way, you can try "bees", it does nearline-dedupe which is for
practical purposes as good as fully online, and unlike the latter, has no
way to damage your data in case of bugs (mistaken userland dedupe can at
most make the kernel pointlessly read and compare data).

I haven't tried it myself, but what it does is dedupe using FILE_EXTENT_SAME
asynchronously right after a write gets put into the page cache, which in
most cases is quick enough to avoid writeout.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: status of inline deduplication in btrfs

2017-08-26 Thread Adam Borowski
On Sat, Aug 26, 2017 at 01:36:35AM +, Duncan wrote:
> The second has to do with btrfs scaling issues due to reflinking, which 
> of course is the operational mechanism for both snapshotting and dedup.  
> Snapshotting of course reflinks the entire subvolume, so it's reflinking 
> on a /massive/ scale.  While normal file operations aren't affected much, 
> btrfs maintenance operations such as balance and check scale badly enough 
> with snapshotting (due to the reflinking) that keeping the number of 
> snapshots per subvolume under 250 or so is strongly recommended, and 
> keeping them to double-digits or even single-digits is recommended if 
> possible.
> 
> Dedup works by reflinking as well, but its effect on btrfs maintenance 
> will be far more variable, depending of course on how effective the 
> deduping, and thus the reflinking, is.  But considering that snapshotting 
> is effectively 100% effective deduping of the entire subvolume (until the 
> snapshot and active copy begin to diverge, at least), that tends to be 
> the worst case, so figuring a full two-copy dedup as equivalent to one 
> snapshot is a reasonable estimate of effect.  If dedup only catches 10%, 
> only once, than it would be 10% of a snapshot's effect.  If it's 10% but 
> there's 10 duplicated instances, that's the effect of a single snapshot.  
> Assuming of course that the dedup domain is the same as the subvolume 
> that's being snapshotted.

Nope, snapshotting is not anywhere near the worst case of dedup:

[/]$ find /bin /sbin /lib /usr /var -type f -exec md5sum '{}' +|
cut -d' ' -f1|sort|uniq -c|sort -nr|head

Even on the system parts (ie, ignoring my data) of my desktop, top files
have the following dup counts: 532 384 373 164 123 122 101.  On this small
SSD, the system parts are reflinked by snapshots with 10 dailies, and by
deduping with 10 regular chroots, 11 sbuild chroots and 3 full-system lxc
containers (chroots are mostly a zoo of different architectures).

This is nothing compared to the backup server, which stores backups of 46
machines (only system/user and small data, bulky stuff is backed up
elsewhere), 24 snapshots each (a mix of dailies, 1/11/21, monthlies and
yearly).  This worked well enough until I made the mistake of deduping the
whole thing.

But, this is still not the worst horror imaginable.  I'd recommend using
whole-file dedup only as this avoids this pitfall: take two VM images, run
block dedup on them.  Identical blocks in them will be cross-reflinked.  And
there's _many_.  The vast majority of duplicate blocks are all-zero: I just
ran fallocate -d on a 40G win10 VM and it shrank to 19G.  AFAIK
file_extent_same is not yet smart enough to dedupe them to a hole instead.
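
If you want to stay in whole-file land, something like jdupes can do it for
you (sketch; path made up, and it assumes a jdupes build with btrfs dedupe
support enabled):

  jdupes -rB /srv/chroots       # -B: dedupe whole duplicate files via the ioctl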


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: Moving btrfs fs disk between computers

2017-08-18 Thread Adam Borowski
On Fri, Aug 18, 2017 at 11:09:22PM -0300, Hérikz Nawarro wrote:
> Hello everyone,
> 
> Can I create a btrfs fs in a machine and move the disk to another
> machine, like ext4 or NTFS?

Yeah, no problem whatsoever, even for multi-device filesystems.
Btrfs doesn't care about what devices it is on.

There have been no new incompat features in quite a while, so old kernels
shouldn't be an issue either.

I think the only way for valid btrfs to be unmountable on another machine
with a modern kernel is to mess with page size, which is AFAIK not even
possible on x86.
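
So the whole move boils down to (sketch; device name made up):

  btrfs device scan             # register all btrfs member devices
  mount /dev/sdb /mnt           # naming any one member is enough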


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢰⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
⢿⡄⠘⠷⠚⠋⠀ -- Genghis Ht'rok'din
⠈⠳⣄ 


Re: [PATCH v5 3/5] btrfs: Add zstd support

2017-08-10 Thread Adam Borowski
On Wed, Aug 09, 2017 at 07:39:02PM -0700, Nick Terrell wrote:
> Add zstd compression and decompression support to BtrFS.

Re-tested on arm64, amd64 and i386, this time everything seems fine so far.

As I'm too lazy to have a separate test setup for the zlib level patch,
I'm using a dummy implementation:

--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -423,10 +423,16 @@ static int zstd_decompress(struct list_head *ws, unsigned char *data_in,
return ret;
 }
 
+static void zstd_set_level(struct list_head *ws, unsigned int type)
+{
+   // TODO
+}
+
 const struct btrfs_compress_op btrfs_zstd_compress = {
.alloc_workspace = zstd_alloc_workspace,
.free_workspace = zstd_free_workspace,
.compress_pages = zstd_compress_pages,
.decompress_bio = zstd_decompress_bio,
.decompress = zstd_decompress,
+   .set_level = zstd_set_level,
 };


It might be worthwhile to do some early testing using different levels,
though.


喵!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: Power down tests...

2017-08-06 Thread Adam Borowski
On Sun, Aug 06, 2017 at 08:15:45PM -0600, Chris Murphy wrote:
> On Thu, Aug 3, 2017 at 11:51 PM, Shyam Prasad N  
> wrote:
> > We're running a couple of experiments on our servers with btrfs
> > (kernel version 4.4).
> > And we're running some abrupt power-off tests for a couple of scenarios:
> >
> > 1. We have a filesystem on top of two different btrfs filesystems
> > (distributed across N disks). i.e. Our filesystem lays out data and
> > metadata on top of these two filesystems.
> 
> This is astronomically more complicated than the already complicated
> scenario with one file system on a single normal partition of a well
> behaved (non-lying) single drive.
> 
> You have multiple devices, so any one or all of them could drop data
> during the power failure and in different amounts. In the best case
> scenario, at next mount the supers are checked on all the devices, and
> the lowest common denominator generation is found, and therefore the
> lowest common denominator root tree. No matter what it means some data
> is going to be lost.

That's exactly why we have CoW.  Unless at least one of the disks lies,
there's no way for data from a fully committed transaction to be lost.
Any writes after that are _supposed_ to be lost.

Reordering writes between disks is no different from reordering writes on a
single disk.  Even more so with NVMe where you have multiple parallel writes
on the same device, with multiple command queues.  You know the transaction
has hit the, uhm, platters, only once every device says so, and that's when
you can start writing the new superblock.
> 
> > The issue that we're facing is that a few files have been zero-sized.
> 
> I can't tell you if that's a bug or not because I'm not sure how your
> software creates these 16M backing files, if they're fallocated or
> touched or what. It's plausible they're created as zero length files,
> and the file system successfully creates them, and then data is written
> to them, but before there is either committed metadata or an updated
> super pointing to the new root tree you get a power failure. And in
> that case, I expect a zero length file or maybe some partial amount of
> data is there.

It's the so-called O_PONIES issue.  No filesystem can know whether you want
files written immediately (abysmal performance) or held in cache until later
(sacrificing durability).  The only portable interface to do so is
f{,data}sync: any write that hasn't been synced cannot be relied upon.
Some traditional filesystems have implicitly synced things, but all such
details are filesystem specific.

Btrfs in particular has -o flushoncommit, which instead of a fsync after
every single write gathers writes from the last 30 seconds and flushes them
as one transaction.
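
To illustrate both knobs (sketch; device and mountpoint made up, commit
interval shown at its default of 30s):

  mount -o flushoncommit,commit=30 /dev/sdb /mnt
  # or, per-write durability from the application side:
  dd if=src of=/mnt/file conv=fsync     # fsync the output once at the end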

More generic interfaces have been proposed but none has been implemented
yet.  Heck, I'm playing with one such idea myself, although I'm not sure if
I know enough to ensure the semantics I have in mind.

> > As a result, there is either a data-loss, or inconsistency in the
> > stacked filesystem's metadata.
> 
> Sounds expected for any file system, but chances are there's more
> missing with a CoW file system since by nature it rolls back to the
> most recent sane checkpoint for the fs metadata without any regard to
> what data is lost to make that happen. The goal is to not lose the
> file system in such a case, as some amount of data loss is always going to
> happen

All it takes is to _somehow_ tell the filesystem you demand the same
guarantees for data as it already provides for metadata.  And a CoW
or log-based filesystem can actually deliver such a demand.

> and why power losses need to be avoided (UPS's and such).

A UPS can't protect you from a kernel crash, a motherboard running out of
smoke, a stick of memory going bad or unseated, power supply deciding it
wants a break from delivering the juice (for redundant power supplies, the
thingy mediating power will do so), etc, etc.  There's no way around crash
tolerance.

> The
> fact that you have a file system on top of a file system makes it more
> fragile because the 2nd file system's metadata *IS* data as far as the
> 1st file system is concerned. And that data is considered expendable.

Only because by default the underlying filesystem has been taught to
consider it expendable.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water


Re: [PATCH preview] btrfs: allow to set compression level for zlib

2017-08-04 Thread Adam Borowski
On Fri, Aug 04, 2017 at 09:51:44PM +, Nick Terrell wrote:
> On 07/25/2017 01:29 AM, David Sterba wrote:
> > Preliminary support for setting compression level for zlib, the
> > following works:
> 
> Thanks for working on this, I think it is a great feature.
> I have a few comments relating to how it would work with zstd.

Like, currently crashing because of ->set_level being 0? :p

> > --- a/fs/btrfs/compression.c
> > +++ b/fs/btrfs/compression.c
> > @@ -866,6 +866,11 @@ static void free_workspaces(void)
> >   * Given an address space and start and length, compress the bytes into 
> > @pages
> >   * that are allocated on demand.
> >   *
> > + * @type_level is encoded algorithm and level, where level 0 means whatever
> > + * default the algorithm chooses and is opaque here;
> > + * - compression algo are 0-3
> > + * - the level are bits 4-7
> 
> zstd has 19 levels, but we can either allow only the first 15 + default, or
> provide a mapping from zstd-level to BtrFS zstd-level.

Or give it more bits.  Issues like this are exactly why this patch is marked
"preview".

But, does zstd give any gains with high compression level but input data
capped at 128KB?  I don't see levels above 15 on your benchmark, and certain
compression algorithms give worse results at highest levels for small
blocks.

> > @@ -888,9 +893,11 @@ int btrfs_compress_pages(int type, struct 
> > address_space *mapping,
> >  {
> > struct list_head *workspace;
> > int ret;
> > +   int type = type_level & 0xF;
> >  
> > workspace = find_workspace(type);
> >  
> > +   btrfs_compress_op[type - 1]->set_level(workspace, type_level);
> 
> zlib uses the same amount of memory independently of the compression level,
> but zstd uses a different amount of memory for each level. zstd will have
> to allocate memory here if it doesn't have enough (or has way too much),
> will that be okay?

We can instead store workspaces per the encoded type+level, that'd allow
having different levels on different mounts (then props, once we get there).

Depends on whether you want highest levels, though (asked above) -- the
highest ones take drastically more memory, so if they're out, blindly
reserving space for the highest supported level might not be too wasteful.

(I have only briefly looked at memory usage and set_level(), please ignore
me if I babble incoherently -- in bed on a N900 so I can't test it right
now.)


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water


Re: Power down tests...

2017-08-04 Thread Adam Borowski
On Fri, Aug 04, 2017 at 12:15:12PM +0530, Shyam Prasad N wrote:
> Is flushoncommit not a default option on version
> 4.4? Do I need to specifically set this option?

It's not the default.

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water


Re: Power down tests...

2017-08-04 Thread Adam Borowski
On Fri, Aug 04, 2017 at 11:21:15AM +0530, Shyam Prasad N wrote:
> We're running a couple of experiments on our servers with btrfs
> (kernel version 4.4).
> And we're running some abrupt power-off tests for a couple of scenarios:
> 
> 1. We have a filesystem on top of two different btrfs filesystems
> (distributed across N disks). i.e. Our filesystem lays out data and
> metadata on top of these two filesystems. With the test workload, it
> is going to generate a good amount of 16MB files on top of the system.
> On abrupt power-off and following reboot, what is the recommended
> steps to be run. We're attempting btrfs mount, which seems to fail
> sometimes. If it fails, we run a fsck and then mount the btrfs. The
> issue that we're facing is that a few files have been zero-sized. As a
> result, there is either a data-loss, or inconsistency in the stacked
> filesystem's metadata.

Sounds like you want to mount with -o flushoncommit.

> We're mounting the btrfs with commit period of 5s. However, I do
> expect btrfs to journal the I/Os that are still dirty. Why then are we
> seeing the above behaviour.

By default, btrfs guarantees only metadata consistency, like most filesystems.
This improves performance at the cost of failing use cases like yours.

-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water


Re: Raid0 rescue

2017-07-27 Thread Adam Borowski
On Thu, Jul 27, 2017 at 08:25:19PM +, Duncan wrote:
> >Welcome to RAID-0...
> 
> As Hugo implies, RAID-0 mode, not just for btrfs but in general, is well 
> known among admins for being "garbage data not worth trying to recover" 
> mode.  Not only is there no redundancy, but with raid0 you're 
> deliberately increasing the chances of loss because now loss of any one 
> device pretty well makes garbage of the entire array, and loss of any 
> single device in a group of more than one is more likely than loss of any 
> single device by itself.

Disks don't quite die once a week, you see.  Using raid0 is actually quite
rational in a good part of setups.

* You need backups _anyway_.  No raid level removes this requirement.
* You can give a machine twice as much immediate storage with raid0 than
  with raid1.
* You get twice as many disks you can use for backup.

Redundant raid is good for two things:
* uptime
* reducing the chance for loss of data between last backup and the failure

For the second point, do you happen to know of a filesystem that gives you
cheap hourly backups that avoid taking half an hour just to stat?


Thus, you need to make a decision: would you prefer to take time trying to
recover, with a good chance of failure anyway -- or a-priori accept that
every failure means hitting the backups?  Obviously, depends on the use
case.

This said, I don't have a raid0 anywhere.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ What Would Jesus Do, MUD/MMORPG edition:
⣾⠁⢰⠒⠀⣿⡁ • multiplay with an admin char to benefit your mortal
⢿⡄⠘⠷⠚⠋⠀ • abuse item cloning bugs (the five fishes + two breads affair)
⠈⠳⣄ • use glitches to walk on water


Re: Best Practice: Add new device to RAID1 pool

2017-07-24 Thread Adam Borowski
On Mon, Jul 24, 2017 at 02:55:00PM -0600, Chris Murphy wrote:
> Egads.
> 
> Maybe Cloud Admin ought to consider using a filter to just balance the
> data chunks across the three devices, and just leave the metadata on
> the original two disks?

Balancing when adding a new disk isn't that important unless the two old
disks are almost full.

> Maybe
> sudo btrfs balance start -dusage=100 

Note that this doesn't mean "all data", merely "all data chunks < 100%
full".  It's a strictly-lesser-than comparison.

You'd want "-dusage=101" which is illegal, the right one is "-d".  I used to
believe -dusage=100 does that, myself.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: degraded raid scribbling upon wrong device

2017-07-22 Thread Adam Borowski
On Thu, Jul 13, 2017 at 08:40:12AM +0200, Adam Borowski wrote:
> Here's a set of test cases; two of them in some cases seem to scribble upon
> the wrong device:
> 
> * deg-mid-missing
> * deg-last-replaced (not on the innocent "re")
> * but never deg-last-missing
> 
> When all goes ok, there are no errors other than wrong generation on the
> re-added disk (expected).   When it goes bad, there's a lot of corruption.
> In all cases, though, the "Device missing:" field is wrong.

I have not explored this adequately yet, in good part because of ENOSPC
triggering a lot of the time for an unrelated reason that Omar just fixed
(thanks!).  So, here's what I know so far:

* copying in, say, 2.2GB /usr/share is a lot more likely to trigger than
  dd-ing 2.2GB of /dev/zero
* no "real" degrading is needed: in the original scripts, the missing device
  is empty so all blocks are doubled anyway.  It's not about degraded chunks
  but about a bogus device.
* bogus output of "btrfs f u" is a sure predictor that, with enough tries,
  you'll get corruption -- if it shows something when it should say
  "missing", shit is likely to happen


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: btrfs device ready purpose

2017-07-22 Thread Adam Borowski
On Sat, Jul 22, 2017 at 06:15:58PM +, Hugo Mills wrote:
> On Sat, Jul 22, 2017 at 12:06:17PM -0600, Chris Murphy wrote:
> > I just did an additional test that's pretty icky behavior.
> > 
> > 2x HDD device Btrfs volume. Add both devices and `btrfs devices ready`
> > exits with 0 as expected. Physically remove both USB devices.
> > Reconnect one device. `btrfs device ready` still exits 0. That's
> > definitely not good. (If I leave that one device connected and reboot,
> > `btrfs device ready` exits 1).
> 
> In a slightly less-specific way, this has been a problem pretty
> much since the inception of the FS. It's not possible to do the
> reverse of the "scan" operation on a device -- that is, invalidate/
> remove the device's record in the kernel. So, as you've discovered
> here, if you have a device which is removed (overwritten, unplugged),
> the kernel still thinks it's a part of the FS.

Alas, this needs to be fixed.  The reproducers I posted last week give data
corruption in case a device that was once a part of the FS is reconnected. 
It doesn't matter what it contains now -- be it another part of the FS or
something totally unrelated, as far as the device node (/dev/loop0,
/dev/sda1, etc) is reused, degraded mounts get confused.

It wasn't urgent before as degraded mounts were broken before Qu's chunk
check patch (that's not even merged yet) -- but once running degraded is
not an emergency, there'll be folks doing so for an extended time.

> It's something I recall being talked about a bit, some years ago. I
> don't recall now why it was going to be useful, though. I think you
> have a good use-case for such a new ioctl (or extension to the
> SCAN_DEV ioctl) now, though.

Such an ioctl would be inherently racey.  Even current udev code is --
mounting right after losetup often fails, sometimes you even need to sleep
longer than 1 second.  With the above in mind, I see no way other than
invalidating and re-checking all known devices at mount time.
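
Which is why test scripts end up doing things like (sketch):

  losetup -f ra; losetup -f rb
  udevadm settle                # wait for udev to finish the device events
  btrfs device scan             # re-register members, just in case
  mount /dev/loop0 /mnt/vol1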


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: [PATCH v3 0/4] Add xxhash and zstd modules

2017-07-22 Thread Adam Borowski
On Fri, Jul 21, 2017 at 11:56:21AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-07-20 17:27, Nick Terrell wrote:
> > This patch set adds xxhash, zstd compression, and zstd decompression
> > modules. It also adds zstd support to BtrFS and SquashFS.
> > 
> > Each patch has relevant summaries, benchmarks, and tests.
> 
> For patches 2-3, I've compile tested and had runtime testing running for
> about 18 hours now with no issues, so you can add:
> 
> Tested-by: Austin S. Hemmelgarn 

I assume you haven't tried it on arm64, right?

I had no time to get 'round to it before, and just got the following build
failure:

  CC  fs/btrfs/zstd.o
In file included from fs/btrfs/zstd.c:28:0:
fs/btrfs/compression.h:39:2: error: unknown type name ‘refcount_t’
  refcount_t pending_bios;
  ^~
scripts/Makefile.build:302: recipe for target 'fs/btrfs/zstd.o' failed

It's trivially fixable by:
--- a/fs/btrfs/zstd.c
+++ b/fs/btrfs/zstd.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include <linux/refcount.h>
 #include 
 #include "compression.h"

after which it works fine, although half an hour of testing isn't exactly
exhaustive.


Alas, the armhf machine I ran stress tests (Debian archive rebuilds) on
doesn't boot with 4.13-rc1 due to some unrelated regression; bisecting that
would be quite painful, so I did not try yet.  I guess re-testing your patch
set on 4.12, even with btrfs-for-4.13 (which it had for a while), wouldn't
be of much help.  So far, previous versions have been running for weeks,
with no issue since you fixed workspace flickering.


On amd64 all is fine.


I haven't tested SquashFS at all.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: [PATCH RESEND] Btrfs: fix early ENOSPC due to delalloc

2017-07-21 Thread Adam Borowski
On Thu, Jul 20, 2017 at 03:10:35PM -0700, Omar Sandoval wrote:
> If a lot of metadata is reserved for outstanding delayed allocations, we
> rely on shrink_delalloc() to reclaim metadata space in order to fulfill
> reservation tickets. However, shrink_delalloc() has a shortcut where if
> it determines that space can be overcommitted, it will stop early. This
> made sense before the ticketed enospc system, but now it means that
> shrink_delalloc() will often not reclaim enough space to fulfill any
> tickets, leading to an early ENOSPC.

This happens a lot (like, 1/4 to 1/3 tries) when populating a freshly made
small filesystem, which makes running tests I've been doing recently (like
those degraded raid corruptions) really unfun.  These unexplained random
ENOSPCes were driving me mad — thanks for explaining those!  Now my tests
properly corrupt data as they should :þ.

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: [PATCH v3] Btrfs: add skeleton code for compression heuristic

2017-07-21 Thread Adam Borowski
On Fri, Jul 21, 2017 at 11:37:49PM +0500, Roman Mamedov wrote:
> On Fri, 21 Jul 2017 13:00:56 +0800
> Anand Jain  wrote:
> > On 07/18/2017 02:30 AM, David Sterba wrote:
> > > This must stay 'return 1', if force-compress is on, so the change is
> > > reverted.
> > 
> >   Initially I thought 'return 1' is correct, but looking in depth,
> >   it is not correct, as shown below...
> > 
> >   The biggest beneficiary of estimating the compression ratio
> >   in advance (the heuristic) is customers who are using
> >   -o compress-force. But 'return 1' here is making them not
> >   use the heuristic. So definitely something is wrong.
> 
> man mount says for btrfs:
> 
> If compress-force is specified, all files will be compressed,  whether
> or  not they compress well.
> 
> So compress-force by definition should always compress all files no matter
> what, and not use any heuristic. In fact it has no right to, as the user has
> forced compression always on. Returning 1 up there does seem right to me.

Technically, for every compression algorithm other than identity (and its
bijections), some data will expand by at least one bit (thus byte, thus
page), therefore we need to be able to store with no compression even when
forced.  On the other hand, it sounds reasonable to take force to mean
"compression will always be attempted" -- ie, we forbid early return when
a small sample seems uncompressible.

> >   -o compress is about whether each of the compression-granular chunks
> >   (BTRFS_MAX_UNCOMPRESSED) of the inode should be tried for compression, OR
> >   whether to just give up for the whole inode by looking at the compression
> >   ratio of the current compression-granular chunk.
> >   This approach can be overridden by -o compress-force. So with
> >   -o compress-force there will be a lot more effort in _trying_
> >   to compress than with -o compress. We must use the heuristic for
> >   -o compress-force.
> 
> The semantics and user expectation of compress-force dictate always
> compressing without giving up, even if it turns out to be slower and
> to provide little benefit.

Another question is, how would "compress-force" differ from "compress"
otherwise?  Always attempting the compression is its whole purpose!


Meow.
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


Re: [PATCH] Btrfs: Do not use data_alloc_cluster in ssd mode

2017-07-21 Thread Adam Borowski
On Fri, Jul 21, 2017 at 01:47:11PM +0200, Hans van Kranenburg wrote:
> The changes in here do the following:
> 
> 1. Throw out the current ssd_spread behaviour.
> 2. Move the current ssd behaviour to the ssd_spread option.
> 3. Make ssd mode data allocation identical to tetris mode, like nossd.
> 4. Adjust and clean up filesystem mount messages so that we can easily
> identify if a kernel has this patch applied or not, when providing
> support to end users.

> Notes for whoever wants to backport this patch on their 4.9 LTS kernel:
> * First apply commit 8a83665a "btrfs: drop the nossd flag when
>   remounting with -o ssd", or fixup the differences manually.

It's 951e7966 in mainline.  I see no 8a83665a anywhere -- most likely it's
what you got after cherry-picking to 4.9.

-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.


degraded raid scribbling upon wrong device

2017-07-13 Thread Adam Borowski
Hi!
Here's a set of test cases; two of them in some cases seem to scribble upon
the wrong device:

* deg-mid-missing
* deg-last-replaced (not on the innocent "re")
* but never deg-last-missing

When all goes ok, there are no errors other than wrong generation on the
re-added disk (expected).   When it goes bad, there's a lot of corruption.
In all cases, though, the "Device missing:" field is wrong.

I'm not yet sure how to trigger this; perhaps someone would have a clue?

8:30am, hitting the sack, will try again tomorrow.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ 
⣾⠁⢠⠒⠀⣿⡁ A dumb species has no way to open a tuna can.
⢿⡄⠘⠷⠚⠋⠀ A smart species invents a can opener.
⠈⠳⣄ A master species delegates.
#!/bin/sh
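# (Presumably the deg-mid-missing case: rc, the middle device, is left out on re-mount.)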
set -e
set -x

umount /mnt/vol1 ||:
losetup -D

dd if=/dev/zero bs=1048576 count=1 seek=4095 of=ra
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rb
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rc
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rd

mkfs.btrfs -draid1 -mraid1 ra rb rc rd

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
losetup -f rd
sleep 1
mount /dev/loop0 /mnt/vol1
cp -pr /bin /mnt/vol1
btrfs fi sync /mnt/vol1
btrfs fi us /mnt/vol1
umount /mnt/vol1

losetup -D
losetup -f ra
losetup -f rb
losetup -f rd
sleep 1
mount -odegraded /dev/loop0 /mnt/vol1
btrfs fi us /mnt/vol1
dd if=/dev/zero of=/mnt/vol1/foo bs=1048576 count=
umount /mnt/vol1

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
losetup -f rd
sleep 1
mount /dev/loop0 /mnt/vol1
btrfs scrub start -B /mnt/vol1
#!/bin/sh
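# (Presumably the deg-last-missing case: rd, the last device, is left out on re-mount.)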
set -e
set -x

umount /mnt/vol1 ||:
losetup -D

dd if=/dev/zero bs=1048576 count=1 seek=4095 of=ra
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rb
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rc
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rd

mkfs.btrfs -draid1 -mraid1 ra rb rc rd

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
losetup -f rd
sleep 1
mount /dev/loop0 /mnt/vol1
cp -pr /bin /mnt/vol1
btrfs fi sync /mnt/vol1
btrfs fi us /mnt/vol1
umount /mnt/vol1

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
sleep 1
mount -odegraded /dev/loop0 /mnt/vol1
btrfs fi us /mnt/vol1
dd if=/dev/zero of=/mnt/vol1/foo bs=1048576 count=
umount /mnt/vol1

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
losetup -f rd
sleep 1
mount /dev/loop0 /mnt/vol1
btrfs scrub start -B /mnt/vol1
#!/bin/sh
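# (Presumably the deg-last-replaced case: rd is swapped for the blank re on re-mount.)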
set -e
set -x

umount /mnt/vol1 ||:
losetup -D

dd if=/dev/zero bs=1048576 count=1 seek=4095 of=ra
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rb
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rc
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=rd
dd if=/dev/zero bs=1048576 count=1 seek=4095 of=re

mkfs.btrfs -draid1 -mraid1 ra rb rc rd

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
losetup -f rd
sleep 1
mount /dev/loop0 /mnt/vol1
cp -pr /bin /mnt/vol1
btrfs fi sync /mnt/vol1
btrfs fi us /mnt/vol1
umount /mnt/vol1

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
losetup -f re
sleep 1
mount -odegraded /dev/loop0 /mnt/vol1
btrfs fi us /mnt/vol1
dd if=/dev/zero of=/mnt/vol1/foo bs=1048576 count=
umount /mnt/vol1

losetup -D
losetup -f ra
losetup -f rb
losetup -f rc
losetup -f rd
sleep 1
mount /dev/loop0 /mnt/vol1
btrfs scrub start -B /mnt/vol1

