Re: write corruption due to bio cloning on raid5/6

2017-07-29 Thread Duncan
Janos Toth F. posted on Sun, 30 Jul 2017 03:39:10 +0200 as excerpted:

[OT but related topic continues...]

> I still get shivers if I need to resize a filesystem due to the
> memories of those early tragic experiences when I never won the lottery
> on the "trial and error" runs but lost filesystems with both hands and
> learned what widespread silent corruption is and how you can refresh
> your backups with corrupted copies... Let's not take me back to those
> early days, please. I don't want to live in a cave anymore. Thank you
> modern filesystems (and their authors). :)
> 
> And on that note... Assuming I had interference problems, it was caused
> by my human mistake/negligence. I can always make similar or bigger
> human mistakes, independent of disk-level segregation. (For example, no
> amount of partitions will save any data if I accidentally wipe the
> entire drive with DD, or if I have it security-locked by the controller
> and lose the passwords, etc...)

I was glad to say goodbye to MSDOS/MBR style partitions as well, but just 
as happy to enthusiastically endorse GPT/EFI style partitions, with their 
effectively unlimited partition numbers (128 allowed at the default table 
size), no primary/logical partition stuff to worry about, partition (as 
opposed to filesystem in the partition) labels/names, integrity 
checksums, and second copy at the other end of the device. =:^)

And while all admins have their fat-finger or fat-head, aka brown-bag, 
experiences, I've never erased the wrong partition, tho I can certainly 
remember being /very/ careful the first couple times I did partitioning, 
back in the 90s on MSDOS.  Thankfully, these days even ssds are 
"reasonably" priced, and spinning rust is the trivial cost of perhaps a 
couple meals out, so as long as there's backups on other physical 
devices, getting even the device name wrong simply means losing perhaps 
your working copy instead of redoing the layout of one of your backups.

And of course you can see the existing layout on the device before you 
repartition it, and if it's not what you expected or there are any other 
problems, you just back out without doing that final writeout of the new 
partition table.

FWIW my last brown-bag was writing and running a script as root, with a 
variable-name typo that made varname empty with an rm -rf $varname/.  I 
caught and stopped it after it had emptied /bin, while it was in /etc, I 
believe.  Luckily I could boot to the (primary) backup.

But meanwhile, two experiences that set in concrete the practicality of 
separate filesystems on their own partitions, for me:

1) Back on MS, IE4-beta era.  I was running the public beta when the MSIE 
devs decided that for performance reasons they needed to write directly 
to the IE cache index on disk, bypassing the usual filesystem methods.  
What they didn't think about, however, was IE's new integration into the 
Explorer shell, meaning it was running all the time.

So along come people running the beta, running their scheduled defrag, 
which decides the index is fragmented and moves it out from under the 
(of course still running) Explorer shell.  So the next time IE 
direct-writes to what WAS the cache index, it overwrites whatever file 
defrag moved to that spot after it moved the cache file out.

The eventual fix was to set the system attribute on the cache index, so 
the defragger wouldn't touch it.

I know a number of people running that beta who lost important files to 
that, when those files got moved into the old on-disk location of the 
cache index file and were overwritten by IE when it direct-wrote to what 
it /thought/ was still the on-disk location of its index file.

But I was fine, never in any danger, because IE's "Temporary Internet 
Files" cache was on a dedicated tempfiles filesystem.  So the only files 
it overwrote for me were temporary in any case.

2) Some years ago, during a Phoenix summer, my AC went out.  I was in a 
trailer at the time, so without the AC it got hot pretty quickly, and I 
was away, with the computer left on, at the time it went out.

The high in the shade that day was about 47C/117F, and the trailer was in 
the sun, so it easily hit 55-60C/131-140F inside.  The computer was 
obviously going to be hotter than that, and the spinning disk in the 
computer hotter still, so it easily hit 70C/158F or higher.

The CPU shut down of course, and was fine when I turned it back on after 
a cooldown.

The disk... not so fine.  I'm sure it physically head-crashed and if I 
had taken it apart I'd have found grooves on the platter.

But... disks were a lot more expensive back then, and I didn't have 
another disk with backups.

What I *DID* have were backup partitions on the same disk, and because 
they weren't mounted at the time, the head didn't try seeking to them, 
and they weren't damaged (at least not beyond what could be repaired). 
When I went to assess things after everything cooled down, the damage was 
(almost) all 

[PATCH v2] btrfs: preserve i_mode if __btrfs_set_acl() fails

2017-07-29 Thread Ernesto A . Fernández
When changing a file's acl mask, btrfs_set_acl() will first set the
group bits of i_mode to the value of the mask, and only then set the
actual extended attribute representing the new acl.

If the second part fails (due to lack of space, for example) and the
file had no acl attribute to begin with, the system will from now on
assume that the mask permission bits are actual group permission bits,
potentially granting access to the wrong users.

Prevent this by starting the journal transaction before calling
__btrfs_set_acl and only changing the inode mode after the acl is set
successfully.

Signed-off-by: Ernesto A. Fernández 
---
Changes in v2:
  - Take the code that checks if we are setting a default acl on something
that is not a dir, remove it from the __btrfs_set_acl function, and
place it in btrfs_set_acl instead. This should fix the issue pointed out
by Josef Bacik, that I was sometimes updating the inode even when there
was no change.
  - Don't call BUG_ON when the inode failed to update. Also requested by
Josef Bacik. It should be noted that __btrfs_setxattr was already
calling BUG_ON before my patch; that has not been changed.

 fs/btrfs/acl.c | 31 +++
 1 file changed, 27 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/acl.c b/fs/btrfs/acl.c
index 8d8370d..f62e8ac 100644
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -27,6 +27,7 @@
 #include "ctree.h"
 #include "btrfs_inode.h"
 #include "xattr.h"
+#include "transaction.h"
 
 struct posix_acl *btrfs_get_acl(struct inode *inode, int type)
 {
@@ -80,8 +81,6 @@ static int __btrfs_set_acl(struct btrfs_trans_handle *trans,
name = XATTR_NAME_POSIX_ACL_ACCESS;
break;
case ACL_TYPE_DEFAULT:
-   if (!S_ISDIR(inode->i_mode))
-   return acl ? -EINVAL : 0;
name = XATTR_NAME_POSIX_ACL_DEFAULT;
break;
default:
@@ -113,14 +112,38 @@ static int __btrfs_set_acl(struct btrfs_trans_handle *trans,
 
 int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type)
 {
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_trans_handle *trans;
int ret;
+   umode_t mode = inode->i_mode;
 
+   if (type == ACL_TYPE_DEFAULT && !S_ISDIR(inode->i_mode))
+   return acl ? -EINVAL : 0;
if (type == ACL_TYPE_ACCESS && acl) {
-   ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
+   ret = posix_acl_update_mode(inode, &mode, &acl);
if (ret)
return ret;
}
-   return __btrfs_set_acl(NULL, inode, acl, type);
+
+   if (btrfs_root_readonly(root))
+   return -EROFS;
+
+   trans = btrfs_start_transaction(root, 2);
+   if (IS_ERR(trans))
+   return PTR_ERR(trans);
+
+   ret = __btrfs_set_acl(trans, inode, acl, type);
+   if (ret)
+   goto out;
+
+   inode->i_mode = mode;
+   inode_inc_iversion(inode);
+   inode->i_ctime = current_time(inode);
+   set_bit(BTRFS_INODE_COPY_EVERYTHING, &BTRFS_I(inode)->runtime_flags);
+   ret = btrfs_update_inode(trans, root, inode);
+out:
+   btrfs_end_transaction(trans);
+   return ret;
 }
 
 /*
-- 
2.1.4


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: write corruption due to bio cloning on raid5/6

2017-07-29 Thread Janos Toth F.
Reply to the TL;DR part, so TL;DR marker again...

Well, I live on the other extreme now. I want as few filesystems as
possible and viable. (It's obviously impossible to have a real backup
within the same fs and/or device, and with the current
size/performance/price differences between HDD and SSD it's evident
to separate the "small and fast" from the "big and slow" storage, but
other than that...) I always believed (even before I got a real grasp
on these things and could explain my view or argue about this) that
"subvolumes" (in a general sense, but let's use this word here) should
reside below filesystems (and be totally optional), that filesystems
should spread over a whole disk or (md- or hardware) RAID volume
(forget the MSDOS partitions), and that even these ZFS/Btrfs style
subvolumes should be used sparingly (only when you really have a good
enough reason to create a subvolume, although it doesn't matter nearly
as much with subvolumes as it does with partitions).

I remember the days when I thought it was important to create separate
partitions for different kinds of data (10+ years ago, when I was aware
I didn't have the experience to deviate from common general
teachings). I remember all the pain of randomly running out of space
on any and all filesystems, eventually mixing the various kinds of
data on every theoretically-segregated filesystem (wherever I found
free space), causing a nightmare of a broken sorting system (like a
library after a tornado), and then all the horror of my first
Russian-roulette-like experiences of resizing partitions and
filesystems to make the segregation decent again. And I saw much worse
on other people's machines. At one point, I decided to create as few
partitions as possible (and I really like the idea of zero partitions;
I don't miss MSDOS).
I still get shivers if I need to resize a filesystem, due to the
memories of those early tragic experiences when I never won the
lottery on the "trial and error" runs but lost filesystems with both
hands and learned what widespread silent corruption is and how you
can refresh your backups with corrupted copies... Let's not take me
back to those early days, please. I don't want to live in a cave
anymore. Thank you, modern filesystems (and their authors). :)

And on that note... Assuming I had interference problems, it was
caused by my human mistake/negligence. I can always make similar or
bigger human mistakes, independent of disk-level segregation. (For
example, no amount of partitions will save any data if I accidentally
wipe the entire drive with DD, or if I have it security-locked by the
controller and lose the passwords, etc...)


Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)

2017-07-29 Thread Duncan
Imran Geriskovan posted on Sat, 29 Jul 2017 21:29:46 +0200 as excerpted:

> On 7/9/17, Duncan <1i5t5.dun...@cox.net> wrote:
>> I have however just upgraded to new ssds then wiped and setup the old
>> ones as another backup set, so everything is on brand new filesystems on
>> fast ssds, no possibility of old undetected corruption suddenly
>> triggering problems.
>>
>> Also, all my btrfs are raid1 or dup for checksummed redundancy
> 
> Do you have any experience/advice/comment regarding
> dup data on ssds?

Very good question. =:^)

Limited.  Most of my btrfs are raid1, with dup only used on the device-
respective /boot btrfs (of which there are four, one on each of the two 
ssds that otherwise form the btrfs raid1 pairs, for each of the working 
and backup copy pairs -- I can use BIOS to select any of the four to 
boot), and those are all sub-GiB mixed-bg mode.

So all my dup experience is sub-GiB mixed-blockgroup mode.

Within that limitation, my only btrfs problem has been that at my 
initially chosen size of 256 MiB, mkfs.btrfs at least used to create an 
initial data/metadata chunk of 64 MiB.  Remember, this is dup mode, so 
there's two of them = 128 MiB.  Because there's also a system chunk, that 
means the initial chunk cannot be balanced even with an entirely empty 
filesystem, because there's not enough space to write a second 64 MiB 
chunk duped to 128 MiB.

Between that and the 256 MiB in dup mode size meaning under 128 MiB 
usable, and the fact that I routinely run and sometimes need to bisect 
pre-release kernels, I was routinely running out of space, then cleaning 
up, but not being able to do a full cleanup without a blow-away and new 
mkfs.btrfs, because I couldn't balance.
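
To make the arithmetic above concrete, here's a small user-space C sketch of the constraint (the 16 MiB system-chunk overhead is my own assumption for illustration; the post doesn't give the exact figure):

```c
#include <assert.h>

/*
 * Balancing a chunk requires allocating a NEW chunk as the destination
 * before the old one is freed.  In dup mode every chunk is stored
 * twice, so the filesystem needs room for 2 * chunk_size beyond what
 * is already allocated.  Sizes are in MiB.
 */
static int can_balance_initial_chunk(int fs_size, int chunk_size,
                                     int sys_overhead)
{
    int dup_chunk = 2 * chunk_size;          /* both dup copies */
    int free_space = fs_size - dup_chunk - sys_overhead;

    return free_space >= dup_chunk;          /* room for the new chunk? */
}
```

With the 256 MiB filesystem and its 64 MiB initial chunk, free space comes out below the 128 MiB a new dup chunk needs, matching the stuck balance; at the 512 MiB size chosen later, it fits comfortably.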

When I recently purchased the second pair of (now larger) ssds in order
to put everything, including the media and backups that were previously 
still on spinning rust, on ssd, I redid the layout and made the /boots 
512 MiB, still mixed-bg dup mode.  That seems to have solved the problem, 
and I can now rebalance the first mkfs.btrfs-created mixed-bg chunk, as 
it's now small enough that it's less than half the filesystem even when 
duped.

Because it's now 512 MiB, however, I can't say for sure whether the 
previous problem is fixed or not.  That problem: mkfs.btrfs creating an 
initial mixed-bg chunk of a quarter the 256 MiB filesystem size, so in 
dup mode it couldn't be balanced, because it was half the total 
filesystem size and, with the system chunk as well, the other half was 
partially used, leaving no space to write the balance destination 
chunks.  What I can say is that the problem doesn't affect the new 512 
MiB size, at least with btrfs-progs 4.11.x, which is what I used to 
mkfs.btrfs the new layout.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: write corruption due to bio cloning on raid5/6

2017-07-29 Thread Duncan
Janos Toth F. posted on Sat, 29 Jul 2017 05:02:48 +0200 as excerpted:

> The read-only scrub finished without errors/hangs (with kernel
> 4.12.3). So, I guess the hangs were caused by:
> 1: other bug in 4.13-RC1
> 2: crazy-random SATA/disk-controller issue
> 3: interference between various btrfs tools [*]
> 4: something in the background did DIO write with 4.13-RC1 (but all
> affected content was eventually overwritten/deleted between the scrub
> attempts)
> 
> [*] I expected scrub to finish in ~5 rather than ~40 hours (and didn't
> expect interference issues), so I didn't disable the scheduled
> maintenance script which deletes old files, recursively defrags the
> whole fs and runs a balance with usage=33 filters. I guess either of
> those (especially balance) could potentially cause scrub to hang.

That #3, interference between btrfs tools, could be it.  It seems btrfs
in general is getting stable enough now that we're beginning to see
bugs exposed from running two or more tools at once, because the devs
have apparently caught and fixed enough of the single-usage race bugs
that individual tools are working reasonably well, and it's now the
concurrent multi-usage case races that no one was thinking about when
they were writing the code that are being exposed.  At least, there
have been a number of such bugs either definitely or probability-traced
to concurrent usage, reported and traced/fixed, lately, more than I
remember seeing in the past.


(TL;DR folks can stop at that.)

Incidentally, that's one more advantage to my own strategy of multiple
independent small btrfs, keeping everything small enough that
maintenance jobs are at least tolerably short, making it actually
practical to run them.

Tho my case is surely an extreme, with everything on ssd and my largest
btrfs, even after recently switching my media filesystems to ssd and
btrfs, being 80 GiB (usable and per device, btrfs raid1 on paired
partitions, each on a different physical ssd).  I use neither quotas,
which don't scale well on btrfs and I don't need them, nor snapshots,
which have a bit of a scaling issue (tho apparently not as bad as
quotas) as well, because weekly or monthly backups are enough here, and
the filesystems are small enough (and on ssd) to do full-copy backups
in minutes each.  In fact, making backups easier was a major reason I
switched even the backups and media devices to all ssd, this time.

So scrubs are trivially short enough I can run them and wait for the
results while composing posts such as this (bscrub is my scrub script,
run here by my admin user with a stub setup so sudo isn't required):

$$ bscrub /mm 
scrub device /dev/sda11 (id 1) done
scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
total bytes scrubbed: 33.98GiB with 0 errors
scrub device /dev/sdb11 (id 2) done
scrub started at Sat Jul 29 14:50:54 2017 and finished after 00:01:08
total bytes scrubbed: 33.98GiB with 0 errors

Just over a minute for a scrub of both devices on my largest 80 gig per
device btrfs. =:^)  Tho projecting to full it might be 2 and a half minutes...

Tho of course parity-raid scrubs would be far slower, at a WAG an hour or two,
for similar size on spinning rust...

Balances are similar, but being on ssd and not needing one on any of the still
relatively freshly redone filesystems ATM, I don't feel inclined to needlessly
spend a write cycle just for demonstration.

With filesystem maintenance runtimes of a minute, definitely under five minutes,
per filesystem, and with full backups under 10, I don't /need/ to run more than
one tool at once, and backups can trivially be kept fresh enough that I don't
really feel the need to schedule maintenance and risk running more than one
that way, either, particularly when I know it'll be done in minutes if I run it
manually. =:^)

Like I said, I'm obviously an extreme case, but equally obviously, while I see
the runtime concurrency bug reports on the list, it's not something likely to
affect me personally. =:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Best Practice: Add new device to RAID1 pool (Summary)

2017-07-29 Thread Cloud Admin
On Monday, 24.07.2017 at 18:40 +0200, Cloud Admin wrote:
> On Monday, 24.07.2017 at 10:25 -0400, Austin S. Hemmelgarn wrote:
> > On 2017-07-24 10:12, Cloud Admin wrote:
> > > On Monday, 24.07.2017 at 09:46 -0400, Austin S. Hemmelgarn wrote:
> > > > On 2017-07-24 07:27, Cloud Admin wrote:
> > > > > Hi,
> > > > > I have a multi-device pool (three discs) as RAID1. Now I want
> > > > > to
> > > > > add a
> > > > > new disc to increase the pool. I followed the description on
> > > > > https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices
> > > > > and used 'btrfs add  '. After that I called a balance
> > > > > for rebalancing the RAID1 using 'btrfs balance start <mount path>'.
> > > > > Is that all, or do I also need to call a resize (for example)
> > > > > or anything else? Or do I need to specify filter/profile
> > > > > parameters for balancing?
> > > > > I am a little bit confused because the balance command has
> > > > > been running for 12 hours and only 3GB of data has been
> > > > > touched. This would mean the whole balance process (the new
> > > > > disc has 8TB) would run a long, long time... and it is using
> > > > > one cpu at 100%.
> > > > 
> > > > Based on what you're saying, it sounds like you've either run
> > > > into a
> > > > bug, or have a huge number of snapshots on this filesystem.
> > > 
> > > It depends what you define as huge. The call of 'btrfs sub list
> > > <mount path>' returns a list of 255 subvolumes.
> > 
> > OK, this isn't horrible, especially if most of them aren't
> > snapshots 
> > (it's cross-subvolume reflinks that are most of the issue when it
> > comes 
> > to snapshots, not the fact that they're subvolumes).
> > > I think this is not too huge. Most of these subvolumes were
> > > created by docker itself. I will cancel the balance (this will
> > > take a while) and will try to delete some of these
> > > subvolumes/snapshots.
> > > What more can I do?
> > 
> > As Roman mentioned in his reply, it may also be qgroup related.  If
> > you run:
> > btrfs quota disable
> 
> It seems quota was one part of it. Thanks for the tip. I disabled it
> and started the balance anew.
> Now approx. one chunk is relocated every 5 min. But if I take the
> reported 10860 chunks and calculate the time, it will take ~37 days
> to finish... So, it seems I have to invest more time into figuring
> out the subvolume / snapshot structure created by docker.
> A first deeper look shows there is a subvolume with a snapshot, which
> itself has a snapshot, and so forth.
Now the balance process has finished; after 127h the new disc is in the
pool... Not as long as expected, but in my opinion long enough. Quota
seems to be one big driver in my case. From what I could see over time,
at the beginning many extents were relocated while ignoring the new
disc. It could probably be a good idea to rebalance using a filter
(like -dusage=30, for example) before adding the new disc, to decrease
the time.
But that's only theory. I will try to keep it in mind for the next time.

Thanks all for your tips, ideas and time!
Frank



Re: [PATCH] btrfs-progs: eliminate bogus IOC_DEV_INFO call

2017-07-29 Thread Hans van Kranenburg

On 28/07/2017 11:49, Henk Slager wrote:

On Thu, Jul 27, 2017 at 9:24 PM, Hans van Kranenburg
 wrote:

Device ID numbers always start at 1, not at 0. The first IOC_DEV_INFO
call does not make sense, since it will always return ENODEV.


When there is a btrfs-replace ongoing, there is a Device ID 0.


Aha... thanks for teaching me something new today. :) Actually, I 
remember having seen it once before, yes.


So, this one goes to /dev/null!

Hans




ioctl(3, BTRFS_IOC_DEV_INFO, {devid=0}) = -1 ENODEV (No such device)

Signed-off-by: Hans van Kranenburg 
---
  cmds-fi-usage.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 101a0c4..52c4c62 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -535,7 +535,7 @@ static int load_device_info(int fd, struct device_info **device_info_ptr,
 return 1;
 }

-   for (i = 0, ndevs = 0 ; i <= fi_args.max_id ; i++) {
+   for (i = 1, ndevs = 0 ; i <= fi_args.max_id ; i++) {
 if (ndevs >= fi_args.num_devices) {
 error("unexpected number of devices: %d >= %llu", ndevs,
 (unsigned long long)fi_args.num_devices);
--
2.11.0



Re: 4.11.6 / more corruption / root 15455 has a root item with a more recent gen (33682) compared to the found root node (0)

2017-07-29 Thread Imran Geriskovan
On 7/9/17, Duncan <1i5t5.dun...@cox.net> wrote:
> I have however just upgraded to new ssds then wiped and setup the old
> ones as another backup set, so everything is on brand new filesystems on
> fast ssds, no possibility of old undetected corruption suddenly
> triggering problems.
>
> Also, all my btrfs are raid1 or dup for checksummed redundancy

Do you have any experience/advice/comment regarding
dup data on ssds?


[PATCH v3 3/3] Btrfs: heuristic add byte core set calculation

2017-07-29 Thread Timofey Titovets
Calculate the byte core set for the data sample:
Sort the bucket's numbers in decreasing order
Count how many numbers are needed to cover 90% of the sample
If the core set is low (<=25%), the data is easily compressible
If the core set is high (>=80%), the data is not compressible
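
As a user-space illustration of the steps above (my own sketch, not the kernel patch itself; the 90% cutoff is handled with a simple running sum):

```c
#include <stdint.h>
#include <stdlib.h>

/* Descending comparator for the 256-entry byte-count bucket */
static int cmp_desc(const void *lv, const void *rv)
{
    const uint32_t *l = lv, *r = rv;

    return (*r > *l) - (*r < *l);   /* avoids overflow of r - l */
}

/*
 * Byte core set size: how many distinct byte values cover 90% of all
 * sampled bytes.  A small result means a few byte values dominate
 * (likely compressible); a large one means counts are spread evenly
 * (likely incompressible).
 */
static int byte_core_set_size(uint32_t bucket[256])
{
    uint32_t total = 0, sum = 0;
    int i;

    for (i = 0; i < 256; i++)
        total += bucket[i];

    qsort(bucket, 256, sizeof(uint32_t), cmp_desc);

    for (i = 0; i < 256 && sum * 10 < total * 9; i++)
        sum += bucket[i];

    return i;
}
```

A sample of one single repeated byte yields a core set of 1, far under the low bound, while perfectly uniform counts need 231 of the 256 values, above the high bound.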

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/compression.c | 58 ++
 fs/btrfs/compression.h |  2 ++
 2 files changed, 60 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 1429b11f2c5f..a469a7c21f5a 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -33,6 +33,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "disk-io.h"
 #include "transaction.h"
@@ -1069,6 +1070,42 @@ static inline int byte_set_size(const struct heuristic_bucket_item *bucket)
return byte_set_size;
 }

+/* For bucket sorting */
+static inline int heuristic_bucket_compare(const void *lv, const void *rv)
+{
+   struct heuristic_bucket_item *l = (struct heuristic_bucket_item *)(lv);
+   struct heuristic_bucket_item *r = (struct heuristic_bucket_item *)(rv);
+
+   return r->count - l->count;
+}
+
+/*
+ * Byte Core set size
+ * How many bytes use 90% of sample
+ */
+static inline int byte_core_set_size(struct heuristic_bucket_item *bucket,
+u32 core_set_threshold)
+{
+   int a = 0;
+   u32 coreset_sum = 0;
+
+   for (; a < BTRFS_HEURISTIC_BYTE_CORE_SET_LOW; a++)
+   coreset_sum += bucket[a].count;
+
+   if (coreset_sum > core_set_threshold)
+   return a;
+
+   for (; a < BTRFS_HEURISTIC_BYTE_CORE_SET_HIGH; a++) {
+   if (bucket[a].count == 0)
+   break;
+   coreset_sum += bucket[a].count;
+   if (coreset_sum > core_set_threshold)
+   break;
+   }
+
+   return a;
+}
+
 /*
  * Compression heuristic.
  *
@@ -1092,6 +1129,8 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
struct heuristic_bucket_item *bucket;
int a, b, ret;
u8 symbol, *input_data;
+   u32 core_set_threshold;
+   u32 input_size = end - start;

ret = 1;

@@ -1123,6 +1162,25 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
goto out;
}

+   /* Sort in reverse order */
+   sort(bucket, BTRFS_HEURISTIC_BUCKET_SIZE,
+sizeof(struct heuristic_bucket_item), &heuristic_bucket_compare,
+NULL);
+
+   core_set_threshold = (input_size*90)/(BTRFS_HEURISTIC_ITER_OFFSET*100);
+   core_set_threshold *= BTRFS_HEURISTIC_READ_SIZE;
+
+   a = byte_core_set_size(bucket, core_set_threshold);
+   if (a <= BTRFS_HEURISTIC_BYTE_CORE_SET_LOW) {
+   ret = 2;
+   goto out;
+   }
+
+   if (a >= BTRFS_HEURISTIC_BYTE_CORE_SET_HIGH) {
+   ret = 0;
+   goto out;
+   }
+
 out:
kfree(bucket);
return ret;
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 03857967815a..0fcd1a485adb 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -139,6 +139,8 @@ struct heuristic_bucket_item {
 #define BTRFS_HEURISTIC_ITER_OFFSET 256
 #define BTRFS_HEURISTIC_BUCKET_SIZE 256
 #define BTRFS_HEURISTIC_BYTE_SET_THRESHOLD 64
+#define BTRFS_HEURISTIC_BYTE_CORE_SET_LOW  BTRFS_HEURISTIC_BYTE_SET_THRESHOLD
+#define BTRFS_HEURISTIC_BYTE_CORE_SET_HIGH 200 // 80%

 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);

--
2.13.3


[PATCH v3 2/3] Btrfs: heuristic add byte set calculation

2017-07-29 Thread Timofey Titovets
Calculate the byte set size for the data sample:
Calculate how many unique bytes are in the sample
by counting all bytes in the bucket with count > 0
If the byte set is low (~25%), the data is easily compressible
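
A user-space sketch of that count (my own illustration; the kernel version splits the loop so it can return early once the threshold of 64 is exceeded):

```c
#include <stdint.h>

/*
 * Byte set size: how many distinct byte values occur at least once in
 * the sampled bucket.  Text-like data typically concentrates in a few
 * dozen values; the threshold comparison is left to the caller here.
 */
static int byte_set_size(const uint32_t bucket[256])
{
    int i, n = 0;

    for (i = 0; i < 256; i++)
        if (bucket[i] > 0)
            n++;

    return n;
}
```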

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/compression.c | 27 +++
 fs/btrfs/compression.h |  1 +
 2 files changed, 28 insertions(+)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index ca7cfaad6e2f..1429b11f2c5f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1048,6 +1048,27 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
return 1;
 }

+static inline int byte_set_size(const struct heuristic_bucket_item *bucket)
+{
+   int a = 0;
+   int byte_set_size = 0;
+
+   for (; a < BTRFS_HEURISTIC_BYTE_SET_THRESHOLD; a++) {
+   if (bucket[a].count > 0)
+   byte_set_size++;
+   }
+
+   for (; a < BTRFS_HEURISTIC_BUCKET_SIZE; a++) {
+   if (bucket[a].count > 0) {
+   byte_set_size++;
+   if (byte_set_size > BTRFS_HEURISTIC_BYTE_SET_THRESHOLD)
+   return byte_set_size;
+   }
+   }
+
+   return byte_set_size;
+}
+
 /*
  * Compression heuristic.
  *
@@ -1096,6 +1117,12 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
index++;
}

+   a = byte_set_size(bucket);
+   if (a > BTRFS_HEURISTIC_BYTE_SET_THRESHOLD) {
+   ret = 1;
+   goto out;
+   }
+
 out:
kfree(bucket);
return ret;
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index e30a9df1937e..03857967815a 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -138,6 +138,7 @@ struct heuristic_bucket_item {
 #define BTRFS_HEURISTIC_READ_SIZE 16
 #define BTRFS_HEURISTIC_ITER_OFFSET 256
 #define BTRFS_HEURISTIC_BUCKET_SIZE 256
+#define BTRFS_HEURISTIC_BYTE_SET_THRESHOLD 64

 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);

--
2.13.3


[PATCH v3 0/3] Btrfs: populate heuristic with detection logic

2017-07-29 Thread Timofey Titovets
Based on kdave for-next.
As the heuristic skeleton is already merged,
populate the heuristic with basic code.

First patch: add simple sampling code.
It gets 16 byte samples with 256 byte shifts
over the input data, and collects info about how many
different bytes (symbols) have been found in the sample data.

Second patch: add code to calculate
how many unique bytes have been
found in the sample data.
That can quickly detect easily compressible data.

Third patch: add code to calculate the byte core set size,
i.e. how many unique bytes use 90% of the sample data.
That code requires that the numbers in the bucket be sorted.
That can detect easily compressible data with many repeated bytes,
and non-compressible data with evenly distributed bytes.

Changes v1 -> v2:
  - Change input data iterator shift 512 -> 256
  - Replace magic macro numbers with direct values
  - Drop useless symbol population in the bucket,
as no one cares about which symbol is stored where
in the bucket at the moment

Changes v2 -> v3 (only update #3 patch):
  - Fix u64 division problem by use u32 for input_size
  - Fix input size calculation start - end -> end - start
  - Add missing sort.h header

Timofey Titovets (3):
  Btrfs: heuristic add simple sampling logic
  Btrfs: heuristic add byte set calculation
  Btrfs: heuristic add byte core set calculation

 fs/btrfs/compression.c | 109 -
 fs/btrfs/compression.h |  13 ++
 2 files changed, 120 insertions(+), 2 deletions(-)

--
2.13.3


[PATCH v3 1/3] Btrfs: heuristic add simple sampling logic

2017-07-29 Thread Timofey Titovets
Get a small sample from the input data and calculate
byte type counts for that sample into a bucket.
The bucket stores info about which bytes, and how many
of each, have been detected in the sample.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/compression.c | 24 ++--
 fs/btrfs/compression.h | 10 ++
 2 files changed, 32 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 63f54bd2d5bb..ca7cfaad6e2f 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1068,15 +1068,35 @@ int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
u64 index = start >> PAGE_SHIFT;
u64 end_index = end >> PAGE_SHIFT;
struct page *page;
-   int ret = 1;
+   struct heuristic_bucket_item *bucket;
+   int a, b, ret;
+   u8 symbol, *input_data;
+
+   ret = 1;
+
+   bucket = kcalloc(BTRFS_HEURISTIC_BUCKET_SIZE,
+   sizeof(struct heuristic_bucket_item), GFP_NOFS);
+
+   if (!bucket)
+   goto out;

while (index <= end_index) {
page = find_get_page(inode->i_mapping, index);
-   kmap(page);
+   input_data = kmap(page);
+   a = 0;
+   while (a < PAGE_SIZE) {
+   for (b = 0; b < BTRFS_HEURISTIC_READ_SIZE; b++) {
+   symbol = input_data[a+b];
+   bucket[symbol].count++;
+   }
+   a += BTRFS_HEURISTIC_ITER_OFFSET;
+   }
kunmap(page);
put_page(page);
index++;
}

+out:
+   kfree(bucket);
return ret;
 }
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index d1f4eee2d0af..e30a9df1937e 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -129,6 +129,16 @@ struct btrfs_compress_op {
 extern const struct btrfs_compress_op btrfs_zlib_compress;
 extern const struct btrfs_compress_op btrfs_lzo_compress;

+struct heuristic_bucket_item {
+   u8  padding;
+   u8  symbol;
+   u16 count;
+};
+
+#define BTRFS_HEURISTIC_READ_SIZE 16
+#define BTRFS_HEURISTIC_ITER_OFFSET 256
+#define BTRFS_HEURISTIC_BUCKET_SIZE 256
+
 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);

 #endif
--
2.13.3
--
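The per-page loop added above can be mirrored in userspace like this. The constants match BTRFS_HEURISTIC_READ_SIZE/ITER_OFFSET from the patch; the function name is a hypothetical illustration:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SIZE 4096
#define READ_SIZE 16	/* BTRFS_HEURISTIC_READ_SIZE */
#define ITER_OFFSET 256	/* BTRFS_HEURISTIC_ITER_OFFSET */
#define BUCKET_SIZE 256

/* Mirror of the per-page loop in the patch: take a 16-byte sample
 * every 256 bytes and count byte occurrences into the bucket. */
static void sample_page(const uint8_t *input_data, uint32_t bucket[BUCKET_SIZE])
{
	size_t a = 0, b;

	while (a < PAGE_SIZE) {
		for (b = 0; b < READ_SIZE; b++)
			bucket[input_data[a + b]]++;
		a += ITER_OFFSET;
	}
}
```

Per 4 KiB page this reads 16 samples of 16 bytes, i.e. 256 bytes, so the heuristic touches only 1/16 of the input.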


Re: [PATCH v2 3/3] Btrfs: heuristic add byte core set calculation

2017-07-29 Thread Timofey Titovets
2017-07-29 14:43 GMT+03:00 kbuild test robot <l...@intel.com>:
> Hi Timofey,
>
> [auto build test ERROR on next-20170724]
> [cannot apply to btrfs/next v4.13-rc2 v4.13-rc1 v4.12 v4.13-rc2]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Timofey-Titovets/Btrfs-populate-heuristic-with-detection-logic/20170729-061208
> config: arm-arm5 (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
> wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> make.cross ARCH=arm
>
> All errors (new ones prefixed by >>):
>
>>> ERROR: "__aeabi_uldivmod" [fs/btrfs/btrfs.ko] undefined!
>
> ---
> 0-DAY kernel test infrastructure            Open Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation

I will fix the 64-bit division and resend the patch set

Thanks.
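For context: on 32-bit ARM, GCC lowers a plain u64 division to the libgcc helper `__aeabi_uldivmod`, which the kernel does not provide, hence the link error. The v3 fix keeps input_size in 32 bits so every later division stays 32-bit. A userspace illustration of that approach (the function name is illustrative, not the patch):

```c
#include <stdint.h>

/* The heuristic window is at most 128 KiB, so the byte count fits
 * comfortably in u32.  Computing it as u32 keeps later divisions
 * 32-bit and avoids pulling __aeabi_uldivmod into the module on
 * 32-bit targets. */
static uint32_t heuristic_input_size(uint64_t start, uint64_t end)
{
	/* end - start, not start - end (the v2 bug) */
	return (uint32_t)(end - start);
}
```

The alternative kernel-side fix would be `div_u64()` from linux/math64.h, which dispatches to a safe software division on 32-bit architectures.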

-- 
Have a nice day,
Timofey.
--


Re: [PATCH v2 3/3] Btrfs: heuristic add byte core set calculation

2017-07-29 Thread kbuild test robot
Hi Timofey,

[auto build test ERROR on next-20170724]
[cannot apply to btrfs/next v4.13-rc2 v4.13-rc1 v4.12 v4.13-rc2]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:
https://github.com/0day-ci/linux/commits/Timofey-Titovets/Btrfs-populate-heuristic-with-detection-logic/20170729-061208
config: arm-arm5 (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All errors (new ones prefixed by >>):

>> ERROR: "__aeabi_uldivmod" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure            Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation




[PATCH v2] btrfs: use appropriate define for the fsid

2017-07-29 Thread Anand Jain
Though BTRFS_FSID_SIZE and BTRFS_UUID_SIZE are of the same size,
for the purpose of doing it correctly, use BTRFS_FSID_SIZE instead.

Signed-off-by: Anand Jain 
---
v2: Fix this for all remaining files.
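The rationale generalizes: when two size constants merely happen to coincide, naming the semantically matching one keeps the code correct if either ever changes, and a build-time assertion can document the current equality. A minimal illustration (the define values match btrfs's on-disk format; the assert pattern and helper are illustrative, not part of the patch):

```c
#include <string.h>

#define BTRFS_FSID_SIZE 16
#define BTRFS_UUID_SIZE 16

/* Document (and enforce at build time) that the two sizes currently
 * coincide; the memcmp below still names the define that matches the
 * field's meaning, so it survives a future divergence. */
_Static_assert(BTRFS_FSID_SIZE == BTRFS_UUID_SIZE,
	       "fsid and uuid sizes diverged");

static int fsid_matches(const unsigned char *a, const unsigned char *b)
{
	return memcmp(a, b, BTRFS_FSID_SIZE) == 0;
}
```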

 fs/btrfs/check-integrity.c |  2 +-
 fs/btrfs/disk-io.c |  6 +++---
 fs/btrfs/scrub.c   |  2 +-
 fs/btrfs/volumes.c | 16 
 4 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 11d37c94ce05..0ab7f7fa1b5f 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -1732,7 +1732,7 @@ static int btrfsic_test_for_metadata(struct btrfsic_state *state,
num_pages = state->metablock_size >> PAGE_SHIFT;
h = (struct btrfs_header *)datav[0];
 
-   if (memcmp(h->fsid, fs_info->fsid, BTRFS_UUID_SIZE))
+   if (memcmp(h->fsid, fs_info->fsid, BTRFS_FSID_SIZE))
return 1;
 
for (i = 0; i < num_pages; i++) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 086dcbadce09..ed840e0cabc5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -529,7 +529,7 @@ static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb)
 {
struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
-   u8 fsid[BTRFS_UUID_SIZE];
+   u8 fsid[BTRFS_FSID_SIZE];
int ret = 1;
 
read_extent_buffer(eb, fsid, btrfs_header_fsid(), BTRFS_FSID_SIZE);
@@ -3731,7 +3731,7 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors)
btrfs_set_stack_device_io_width(dev_item, dev->io_width);
btrfs_set_stack_device_sector_size(dev_item, dev->sector_size);
memcpy(dev_item->uuid, dev->uuid, BTRFS_UUID_SIZE);
-   memcpy(dev_item->fsid, dev->fs_devices->fsid, BTRFS_UUID_SIZE);
+   memcpy(dev_item->fsid, dev->fs_devices->fsid, BTRFS_FSID_SIZE);
 
flags = btrfs_super_flags(sb);
btrfs_set_super_flags(sb, flags | BTRFS_HEADER_FLAG_WRITTEN);
@@ -4172,7 +4172,7 @@ static int btrfs_check_super_valid(struct btrfs_fs_info *fs_info)
ret = -EINVAL;
}
 
-   if (memcmp(fs_info->fsid, sb->dev_item.fsid, BTRFS_UUID_SIZE) != 0) {
+   if (memcmp(fs_info->fsid, sb->dev_item.fsid, BTRFS_FSID_SIZE) != 0) {
btrfs_err(fs_info,
   "dev_item UUID does not match fsid: %pU != %pU",
   fs_info->fsid, sb->dev_item.fsid);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 6f1e4c984b94..51a5a14f4c0b 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1769,7 +1769,7 @@ static inline int scrub_check_fsid(u8 fsid[],
struct btrfs_fs_devices *fs_devices = spage->dev->fs_devices;
int ret;
 
-   ret = memcmp(fsid, fs_devices->fsid, BTRFS_UUID_SIZE);
+   ret = memcmp(fsid, fs_devices->fsid, BTRFS_FSID_SIZE);
return !ret;
 }
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 5eb7217738ed..c705ea563c60 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1726,7 +1726,7 @@ static int btrfs_add_device(struct btrfs_trans_handle *trans,
ptr = btrfs_device_uuid(dev_item);
write_extent_buffer(leaf, device->uuid, ptr, BTRFS_UUID_SIZE);
ptr = btrfs_device_fsid(dev_item);
-   write_extent_buffer(leaf, fs_info->fsid, ptr, BTRFS_UUID_SIZE);
+   write_extent_buffer(leaf, fs_info->fsid, ptr, BTRFS_FSID_SIZE);
btrfs_mark_buffer_dirty(leaf);
 
ret = 0;
@@ -2261,7 +2261,7 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans,
struct btrfs_dev_item *dev_item;
struct btrfs_device *device;
struct btrfs_key key;
-   u8 fs_uuid[BTRFS_UUID_SIZE];
+   u8 fs_uuid[BTRFS_FSID_SIZE];
u8 dev_uuid[BTRFS_UUID_SIZE];
u64 devid;
int ret;
@@ -2304,7 +2304,7 @@ static int btrfs_finish_sprout(struct btrfs_trans_handle *trans,
read_extent_buffer(leaf, dev_uuid, btrfs_device_uuid(dev_item),
   BTRFS_UUID_SIZE);
read_extent_buffer(leaf, fs_uuid, btrfs_device_fsid(dev_item),
-  BTRFS_UUID_SIZE);
+  BTRFS_FSID_SIZE);
device = btrfs_find_device(fs_info, devid, dev_uuid, fs_uuid);
BUG_ON(!device); /* Logic error */
 
@@ -6295,7 +6295,7 @@ struct btrfs_device *btrfs_find_device(struct btrfs_fs_info *fs_info, u64 devid,
cur_devices = fs_info->fs_devices;
while (cur_devices) {
if (!fsid ||
-   !memcmp(cur_devices->fsid, fsid, BTRFS_UUID_SIZE)) {
+   !memcmp(cur_devices->fsid, fsid, BTRFS_FSID_SIZE)) {
device = __find_device(&cur_devices->devices,
   devid, uuid);