Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Marat Khalili
If we were going to reserve something, it should be a high number, not 
a low one.  Having 0 reserved makes some sense, but reserving other 
low numbers seems kind of odd when they aren't already reserved. 


I did some experiments.

Currently, assigning a higher-level qgroup to a lower-level qgroup is not
possible. Consequently, assigning anything to a level-0 qgroup is not
possible.


On the other hand, assigning qgroups while skipping levels (e.g. qgroup
2/P to 10/Q) is possible. So setting the default snapshot level high is
technically possible, but you won't be able to assign these high-level
qgroups anywhere lower later.
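For example, with made-up qgroup IDs (using the btrfs-progs syntax
"btrfs qgroup assign <src> <dst> <path>", where 0/257 stands for an
existing subvolume's automatic qgroup):

   btrfs qgroup create 2/100 /mnt
   btrfs qgroup create 10/200 /mnt
   btrfs qgroup assign 2/100 10/200 /mnt   # skipping levels: accepted
   btrfs qgroup assign 2/100 0/257 /mnt    # higher level under lower level: rejected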



although I hadn't realized that the snapshot command _does not_ have 
this argument, when it absolutely should. 

It does here in 4.4, it's just not documented :) I too found it by accident.
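E.g. (the qgroup ID below is just an example), this is accepted even though
the man page doesn't mention the -i option:

   btrfs subvolume snapshot -i 1/100 /path/to/subvol /path/to/snapshot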



Perhaps have an option
Options always suit everyone except developers who need to implement and 
support them :)


Here I'd like to wrap up since I seriously doubt any real btrfs 
developers are still reading our discussion :)


--

With Best Regards,
Marat Khalili



Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Duncan
Austin S. Hemmelgarn posted on Tue, 28 Mar 2017 07:44:56 -0400 as
excerpted:

> On 2017-03-27 21:49, Qu Wenruo wrote:

>> The problem is, how should we treat subvolume.
>>
>> Btrfs subvolume sits in the middle of directory and (logical) volume
>> used in traditional stacked solution.
>>
>> While we allow normal user to create/delete/modify dir as long as they
>> follow access control, we require privilege to create/delete/modify
>> volumes.

> No, we require privilege to do certain modifications or delete
> subvolumes.

>  Regular users can create subvolumes with no privileges whatsoever, and
> most basic directory operations (rename, chown, chmod, etc) work just
> fine within normal UNIX DAC permissions.  Unless you're running some
> specially patched kernel or some LSM (SELinux possibly) that somehow
> restricts access to the ioctl, you can always create subvolumes.

(I believe) You misread what QW was trying to say. Note the word "volume" 
as opposed to subvolume.  Like most regulars here, Qu's too deep into 
btrfs to make that sort of mistake, so he wasn't talking about subvolumes.

Rather, he's placing btrfs subvolumes in the middle between directories 
and (generic/logical) volumes as normally used, and saying that for 
(generic) directories normal users can create/modify/delete without 
special privilege (beyond write in the parent), while for (generic) 
volumes special privileges are necessary to create/modify/delete.  
Btrfs subvolumes fall in between, so there's a real decision to be made 
in terms of privileges required: do they follow directories and require 
nothing special, or volumes and require special privs?


Meanwhile, my own position as I've argued is that regardless of the 
theoretical debate, as long as we have the very real practical issue of 
hard scaling limits for subvolumes/snapshots, there's an equally real 
security issue in allowing unrestricted snapshot creation and, to a 
lesser extent, normal subvolume creation.

So while we might /like/ to make subvolumes more like directories and 
require no special privs, until the snapshot scaling and thus security 
issue is fixed or at least greatly reduced, as a purely practical 
security matter, we really need to restrict snapshot creation to 
privileged users.

OTOH, it's worth pointing out that snapshots aren't unique in this 
regard; reflinked files have exactly the /same/ scaling issues, except 
they're created one-by-one, while snapshots tend to be a whole bunch of 
reflinked files at once.

Which means the question must then be asked, if we choose the privileged 
route for snapshots/subvolumes due to the reflinking security issues, why 
aren't we making all reflinking ops privileged ops, or should we?  If we 
did, cp --reflink=always, among other things, would obviously fail as a 
normal user.

So I guess it's a rather harder question than I considered at first.  
Maybe we /should/ just treat subvolumes as directories in privilege 
terms as well.  In that case, however, we really need to have the same 
default for subvolume deletion as for creation.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [PATCH] Btrfs: fix wrong failed mirror_num of read-repair on raid56

2017-03-28 Thread Liu Bo
On Mon, Mar 27, 2017 at 07:07:15PM +0200, David Sterba wrote:
> On Fri, Mar 24, 2017 at 12:13:42PM -0700, Liu Bo wrote:
> > In the raid56 scenario, after trying parity recovery, we didn't set
> > mirror_num for the btrfs_bio with the failed mirror_num, hence
> > end_bio_extent_readpage() will report a random mirror_num in the dmesg
> > log.
> > 
> > Cc: David Sterba 
> > Signed-off-by: Liu Bo 
> > ---
> >  fs/btrfs/volumes.c | 1 +
> >  1 file changed, 1 insertion(+)
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 73d56ee..be64e4a 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6197,6 +6197,7 @@ int btrfs_map_bio(struct btrfs_fs_info *fs_info, 
> > struct bio *bio,
> > } else {
> > ret = raid56_parity_recover(fs_info, bio, bbio,
> > map_length, mirror_num, 1);
> > +   btrfs_io_bio(bio)->mirror_num = mirror_num;
> 
> Should the mirror be set inside raid56_parity_recover? There's another
> caller, scrub_submit_raid56_bio_wait, that does not update the
> bio->mirror_num. I am not sure if this is the same case though.
>

scrub_submit_raid56_bio_wait doesn't need to, but yes, we could set the
mirror inside raid56_parity_recover iff @generic_io is 1.
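Something like this inside raid56_parity_recover(), just to illustrate
the idea (untested sketch):

	if (generic_io)
		btrfs_io_bio(bio)->mirror_num = mirror_num;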

Thanks,

-liubo

> > }
> >  
> > btrfs_bio_counter_dec(fs_info);



Re: [PATCH] Btrfs: enable repair during read for raid56 profile

2017-03-28 Thread Liu Bo
On Mon, Mar 27, 2017 at 06:59:44PM +0200, David Sterba wrote:
> On Fri, Mar 24, 2017 at 12:13:35PM -0700, Liu Bo wrote:
> > Now that scrub can fix data errors with the help of parity for raid56
> > profile, repair during read is able to as well.
> > 
> > Although the mirror num in raid56 senario has different meanings, i.e.
> 
> (typo: scenario)
>

Thanks!

> > 0 or 1: read data directly
> > > 1:do recover with parity,
> > it could be fit into how we repair bad block during read.
> 
> Could we possibly add some symbolic names for the RAID56 case?
>

Yes, we could do that here, but I'm afraid it might not be helpful
because most places still use mirror or mirror_num, such as
btrfs_submit_bio_hook and btrfs_map_bio.

> > 
> > The trick is to use BTRFS_MAP_READ instead of BTRFS_MAP_WRITE to get the
> > device and position on it.
> 
> Please also document the trick in the code before the following.
>

Good point.

Thanks,

-liubo
> > +   if (btrfs_is_parity_mirror(map_tree, logical, length, mirror_num)) {
> > +   /* use BTRFS_MAP_READ to get the phy dev and sector */
> > +   ret = btrfs_map_block(fs_info, BTRFS_MAP_READ, logical,
> > +                         &map_length, &bbio, 0);
> > +   if (ret) {
> > +   btrfs_bio_counter_dec(fs_info);
> > +   bio_put(bio);
> > +   return -EIO;
> > +   }
> > +   ASSERT(bbio->mirror_num == 1);
> > +   } else {
> > +   ret = btrfs_map_block(fs_info, BTRFS_MAP_WRITE, logical,
> > +                         &map_length, &bbio, mirror_num);
> > +   if (ret) {
> > +   btrfs_bio_counter_dec(fs_info);
> > +   bio_put(bio);
> > +   return -EIO;
> > +   }
> > +   BUG_ON(mirror_num != bbio->mirror_num);
> > }
> > -   BUG_ON(mirror_num != bbio->mirror_num);
> > -   sector = bbio->stripes[mirror_num-1].physical >> 9;
> > +
> > +   sector = bbio->stripes[bbio->mirror_num - 1].physical >> 9;
> > bio->bi_iter.bi_sector = sector;
> > -   dev = bbio->stripes[mirror_num-1].dev;
> > +   dev = bbio->stripes[bbio->mirror_num - 1].dev;
> > btrfs_put_bbio(bbio);
> > if (!dev || !dev->bdev || !dev->writeable) {
> > btrfs_bio_counter_dec(fs_info);


[PATCH 2/2] btrfs: clear the RAID5/6 incompat flag once no longer needed

2017-03-28 Thread Adam Borowski
There's no known taint on a filesystem that was RAID5/6 once but has been
since converted to something non-experimental.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a4c3e6628ec1..c720a7bc1b15 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3089,6 +3089,8 @@ int open_ctree(struct super_block *sb,
BTRFS_BLOCK_GROUP_RAID56_MASK) {
btrfs_alert(fs_info,
"btrfs RAID5/6 is EXPERIMENTAL and has known data-loss bugs");
+   } else {
+   btrfs_clear_fs_incompat(fs_info, RAID56);
}
 
/*
-- 
2.11.0



[PATCH 1/2] btrfs: warn about RAID5/6 being experimental at mount time

2017-03-28 Thread Adam Borowski
Too many people come complaining about losing their data -- and indeed,
there's no warning outside a wiki and the mailing list tribal knowledge.
Message severity chosen for consistency with XFS -- "alert" makes dmesg
produce nice red background which should get the point across.

Signed-off-by: Adam Borowski 
---
 fs/btrfs/disk-io.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index affd7aada057..a4c3e6628ec1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3083,6 +3083,14 @@ int open_ctree(struct super_block *sb,
btrfs_set_opt(fs_info->mount_opt, SSD);
}
 
+   if ((fs_info->avail_data_alloc_bits |
+fs_info->avail_metadata_alloc_bits |
+fs_info->avail_system_alloc_bits) &
+   BTRFS_BLOCK_GROUP_RAID56_MASK) {
+   btrfs_alert(fs_info,
+   "btrfs RAID5/6 is EXPERIMENTAL and has known data-loss bugs");
+   }
+
/*
 * Mount does not set all options immediately, we can do it now and do
 * not have to wait for transaction commit
-- 
2.11.0



[PATCH v3 0/5] raid56: scrub related fixes

2017-03-28 Thread Qu Wenruo
This patchset can be fetched from my github repo:
https://github.com/adam900710/linux.git raid56_fixes

It's based on v4.11-rc2, the last two patches get modified according to
the advice from Liu Bo.

The patchset fixes the following bugs:

1) False alert or wrong csum error number when scrubbing RAID5/6
   The bug itself won't cause any damage to the fs; it's just a pure race.

   This can be triggered by running scrub on a 64K corrupted data stripe.
   Normally it will report 16 csum errors recovered, but sometimes it
   will report more than 16, and in rare cases even an unrecoverable
   error can be reported.

2) Corrupted data stripe rebuild corrupts P/Q
   So scrub turns one error into another, not really fixing anything.

   Since kernel scrub doesn't report parity errors, either an offline
   scrub or a manual check is needed to expose such errors.

3) Use-after-free caused by cancelling dev-replace 
   This is quite a deadly bug, since cancelling dev-replace can
   cause kernel panic.

   Can be triggered by btrfs/069.

v2:
  Use bio_counter to protect rbio against dev-replace cancel, instead of
  the original btrfs_device refcount, which is too restrictive and
  requires disabling the rbio cache, as suggested by Liu Bo.

v3:
  Add a fix for another possible use-after-free when rechecking a
  recovered full stripe.
  Squash two patches as they fix the same problem, to make bisecting
  easier.
  Use a mutex rather than a spinlock to protect the full stripe locks
  tree; this allows us to allocate memory inside the critical section
  on demand.
  Encapsulate the rb_root and mutex into btrfs_full_stripe_locks_tree.
  Rename scrub_full_stripe_lock to full_stripe_lock inside scrub.c.
  Rename related functions to have unified naming.
  Code style changes to follow the existing scrub code style.

Qu Wenruo (5):
  btrfs: scrub: Introduce full stripe lock for RAID56
  btrfs: scrub: Fix RAID56 recovery race condition
  btrfs: scrub: Don't append on-disk pages for raid56 scrub
  btrfs: Wait for in-flight bios before freeing target device for raid56
  btrfs: Prevent scrub recheck from racing with dev replace

 fs/btrfs/ctree.h   |  17 
 fs/btrfs/dev-replace.c |   2 +
 fs/btrfs/extent-tree.c |   2 +
 fs/btrfs/raid56.c  |  14 +++
 fs/btrfs/scrub.c   | 248 +++--
 5 files changed, 275 insertions(+), 8 deletions(-)

-- 
2.12.1





[PATCH v3 2/5] btrfs: scrub: Fix RAID56 recovery race condition

2017-03-28 Thread Qu Wenruo
When scrubbing a RAID5 which has recoverable data corruption (only one
data stripe is corrupted), sometimes scrub will report more csum errors
than expected. Sometimes even an unrecoverable error will be reported.

The problem can be easily reproduced by the following steps:
1) Create a btrfs with RAID5 data profile with 3 devs
2) Mount it with nospace_cache or space_cache=v2
   To avoid extra data space usage.
3) Create a 128K file and sync the fs, unmount it
   Now the 128K file lies at the beginning of the data chunk
4) Locate the physical bytenr of data chunk on dev3
   Dev3 is the 1st data stripe.
5) Corrupt the first 64K of the data chunk stripe on dev3
6) Mount the fs and scrub it

The correct csum error number should be 16 (assuming x86_64, i.e. 16
4K blocks in the corrupted 64K).
A larger csum error number is reported with roughly a 1/3 chance,
and an unrecoverable error with roughly a 1/10 chance.

The root cause of the problem is that the RAID5/6 recovery code has a
race condition, due to the fact that a full scrub is initiated per device.

For other mirror-based profiles, each mirror is independent of the
others, so the race won't cause any big problem.

For example:
Corrupted              | Correct               | Correct
Scrub dev3 (D1)        | Scrub dev2 (D2)       | Scrub dev1 (P)
-----------------------------------------------------------------------
Read out D1            | Read out D2           | Read full stripe
Check csum             | Check csum            | Check parity
Csum mismatch          | Csum match, continue  | Parity mismatch
handle_errored_block   |                       | handle_errored_block
 Read out full stripe  |                       |  Read out full stripe
 D1 csum error(err++)  |                       |  D1 csum error(err++)
 Recover D1            |                       |  Recover D1

So D1's csum error is accounted twice, just because
handle_errored_block() doesn't have enough protection, and the race can
happen.

In an even worse case, for example if D1's recovery code is re-writing
D1/D2/P while P's recovery code is just reading out the full stripe, we
can cause an unrecoverable error.

This patch uses the previously introduced lock_full_stripe() and
unlock_full_stripe() to protect the whole scrub_handle_errored_block()
function for RAID56 recovery, so no extra csum errors or unrecoverable
errors are reported.

Reported-by: Goffredo Baroncelli 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/scrub.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index ab33b9a8aac2..f92d2512f4f3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1129,6 +1129,17 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
have_csum = sblock_to_check->pagev[0]->have_csum;
dev = sblock_to_check->pagev[0]->dev;
 
+   /*
+    * For RAID5/6, a race can happen between different dev scrub threads.
+    * For data corruption, the Parity and Data threads will both try
+    * to recover the data.
+    * The race can lead to a doubly counted csum error, or even an
+    * unrecoverable error.
+    */
+   ret = lock_full_stripe(fs_info, logical);
+   if (ret < 0)
+   return ret;
+
if (sctx->is_dev_replace && !is_metadata && !have_csum) {
sblocks_for_recheck = NULL;
goto nodatasum_case;
@@ -1463,6 +1474,9 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
kfree(sblocks_for_recheck);
}
 
+   ret = unlock_full_stripe(fs_info, logical);
+   if (ret < 0)
+   return ret;
return 0;
 }
 
-- 
2.12.1





[PATCH v3 1/5] btrfs: scrub: Introduce full stripe lock for RAID56

2017-03-28 Thread Qu Wenruo
Unlike mirror-based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't protect it properly, it can easily cause a race condition.

Introduce 2 new functions for RAID5/6: lock_full_stripe() and
unlock_full_stripe().
They are backed by an rb_tree of per-full-stripe mutexes, so scrub
callers can use them to lock a full stripe and avoid the race.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  17 
 fs/btrfs/extent-tree.c |   2 +
 fs/btrfs/scrub.c   | 212 +
 3 files changed, 231 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..9fe56da21fed 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -538,6 +538,14 @@ struct btrfs_io_ctl {
unsigned check_crcs:1;
 };
 
+/*
+ * Tree to record all locked full stripes of a RAID5/6 block group
+ */
+struct btrfs_full_stripe_locks_tree {
+   struct rb_root root;
+   struct mutex lock;
+};
+
 struct btrfs_block_group_cache {
struct btrfs_key key;
struct btrfs_block_group_item item;
@@ -648,6 +656,9 @@ struct btrfs_block_group_cache {
 * Protected by free_space_lock.
 */
int needs_free_space;
+
+   /* Record locked full stripes for RAID5/6 block group */
+   struct btrfs_full_stripe_locks_tree full_stripe_locks_root;
 };
 
 /* delayed seq elem */
@@ -3647,6 +3658,12 @@ int btrfs_scrub_cancel_dev(struct btrfs_fs_info *info,
   struct btrfs_device *dev);
 int btrfs_scrub_progress(struct btrfs_fs_info *fs_info, u64 devid,
 struct btrfs_scrub_progress *progress);
+static inline void btrfs_init_full_stripe_locks_tree(
+   struct btrfs_full_stripe_locks_tree *locks_root)
+{
+   locks_root->root = RB_ROOT;
+   mutex_init(&locks_root->lock);
+}
 
 /* dev-replace.c */
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..e4d48997d927 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -131,6 +131,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache 
*cache)
if (atomic_dec_and_test(&cache->count)) {
WARN_ON(cache->pinned > 0);
WARN_ON(cache->reserved > 0);
+   WARN_ON(!RB_EMPTY_ROOT(&cache->full_stripe_locks_root.root));
kfree(cache->free_space_ctl);
kfree(cache);
}
@@ -9917,6 +9918,7 @@ btrfs_create_block_group_cache(struct btrfs_fs_info 
*fs_info,
btrfs_init_free_space_ctl(cache);
atomic_set(&cache->trimming, 0);
mutex_init(&cache->free_space_lock);
+   btrfs_init_full_stripe_locks_tree(&cache->full_stripe_locks_root);
 
return cache;
 }
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb1239f..ab33b9a8aac2 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -240,6 +240,13 @@ struct scrub_warning {
struct btrfs_device *dev;
 };
 
+struct full_stripe_lock {
+   struct rb_node node;
+   u64 logical;
+   u64 refs;
+   struct mutex mutex;
+};
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
@@ -349,6 +356,211 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info 
*fs_info)
 }
 
 /*
+ * Insert new full stripe lock into full stripe locks tree
+ *
+ * Return pointer to existing or newly inserted full_stripe_lock structure if
+ * everything works well.
+ * Return ERR_PTR(-ENOMEM) if we failed to allocate memory
+ *
+ * NOTE: caller must hold full_stripe_locks_root->lock before calling this
+ * function
+ */
+static struct full_stripe_lock *insert_full_stripe_lock(
+   struct btrfs_full_stripe_locks_tree *locks_root,
+   u64 fstripe_logical)
+{
+   struct rb_node **p;
+   struct rb_node *parent = NULL;
+   struct full_stripe_lock *entry;
+   struct full_stripe_lock *ret;
+
+   WARN_ON(!mutex_is_locked(&locks_root->lock));
+
+   p = &locks_root->root.rb_node;
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct full_stripe_lock, node);
+   if (fstripe_logical < entry->logical) {
+   p = &(*p)->rb_left;
+   } else if (fstripe_logical > entry->logical) {
+   p = &(*p)->rb_right;
+   } else {
+   entry->refs++;
+   return entry;
+   }
+   }
+
+   /* Insert new lock */
+   ret = kmalloc(sizeof(*ret), GFP_KERNEL);
+   if (!ret)
+   return ERR_PTR(-ENOMEM);
+   ret->logical = fstripe_logical;
+   ret->refs = 1;
+   mutex_init(>mutex);
+
+   rb_link_node(&ret->node, parent, p);
+   rb_insert_color(&ret->node, &locks_root->root);
+   return ret;
+}
+
+/*
+ * Search for a full stripe lock 

[PATCH v3 3/5] btrfs: scrub: Don't append on-disk pages for raid56 scrub

2017-03-28 Thread Qu Wenruo
In the following situation, scrub will calculate a wrong parity and
overwrite the correct one:

RAID5 full stripe:

Before
| Dev 1          | Dev 2          | Dev 3         |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0x0000 (Bad)   | 0xcdcd         | 0x0000        |
--------------------------------------------------- 4K
| 0xcdcd         | 0xcdcd         | 0x0000        |
...
| 0xcdcd         | 0xcdcd         | 0x0000        |
--------------------------------------------------- 64K

After scrubbing dev3 only:

| Dev 1          | Dev 2          | Dev 3         |
| Data stripe 1  | Data stripe 2  | Parity Stripe |
--------------------------------------------------- 0
| 0xcdcd (Good)  | 0xcdcd         | 0xcdcd (Bad)  |
--------------------------------------------------- 4K
| 0xcdcd         | 0xcdcd         | 0x0000        |
...
| 0xcdcd         | 0xcdcd         | 0x0000        |
--------------------------------------------------- 64K

The reason is that after the raid56 read-rebuild, rbio->stripe_pages are
all correctly recovered (0xcd for the data stripes).

However, when we check and repair parity in
scrub_parity_check_and_repair(), we append the pages in the
sparity->spages list to rbio->bio_pages[], which contain the old
on-disk data.

And when we submit parity data to disk, we calculate parity using
rbio->bio_pages[] first; only if a page is not found there do we fall
back to rbio->stripe_pages[].

The patch fixes it by not appending pages from sparity->spages.
So finish_parity_scrub() will use rbio->stripe_pages[], which is correct.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/scrub.c | 4 
 1 file changed, 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index f92d2512f4f3..2fd259dbf4db 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2991,7 +2991,6 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
struct btrfs_fs_info *fs_info = sctx->fs_info;
struct bio *bio;
struct btrfs_raid_bio *rbio;
-   struct scrub_page *spage;
struct btrfs_bio *bbio = NULL;
u64 length;
int ret;
@@ -3021,9 +3020,6 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
if (!rbio)
goto rbio_out;
 
-   list_for_each_entry(spage, &sparity->spages, list)
-   raid56_add_scrub_pages(rbio, spage->page, spage->logical);
-
scrub_pending_bio_inc(sctx);
raid56_parity_submit_scrub_rbio(rbio);
return;
-- 
2.12.1





[PATCH v3 5/5] btrfs: Prevent scrub recheck from racing with dev replace

2017-03-28 Thread Qu Wenruo
scrub_setup_recheck_block() calls btrfs_map_sblock() and then accesses
bbio without the protection of bio_counter.

This can lead to a use-after-free if racing with dev-replace cancel.

Fix it by increasing bio_counter before calling btrfs_map_sblock() and
decreasing it when the corresponding recover is finished.

Cc: Liu Bo 
Reported-by: Liu Bo 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/scrub.c | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b8c49074d1b3..84b077c993c0 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1072,9 +1072,11 @@ static inline void scrub_get_recover(struct 
scrub_recover *recover)
atomic_inc(&recover->refs);
 }
 
-static inline void scrub_put_recover(struct scrub_recover *recover)
+static inline void scrub_put_recover(struct btrfs_fs_info *fs_info,
+struct scrub_recover *recover)
 {
if (atomic_dec_and_test(&recover->refs)) {
+   btrfs_bio_counter_dec(fs_info);
btrfs_put_bbio(recover->bbio);
kfree(recover);
}
@@ -1464,7 +1466,7 @@ static int scrub_handle_errored_block(struct scrub_block 
*sblock_to_check)
sblock->pagev[page_index]->sblock = NULL;
recover = sblock->pagev[page_index]->recover;
if (recover) {
-   scrub_put_recover(recover);
+   scrub_put_recover(fs_info, recover);
sblock->pagev[page_index]->recover =
NULL;
}
@@ -1556,16 +1558,19 @@ static int scrub_setup_recheck_block(struct scrub_block 
*original_sblock,
 * with a length of PAGE_SIZE, each returned stripe
 * represents one mirror
 */
+   btrfs_bio_counter_inc_blocked(fs_info);
ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
logical, &mapped_length, &bbio, 0, 1);
if (ret || !bbio || mapped_length < sublen) {
btrfs_put_bbio(bbio);
+   btrfs_bio_counter_dec(fs_info);
return -EIO;
}
 
recover = kzalloc(sizeof(struct scrub_recover), GFP_NOFS);
if (!recover) {
btrfs_put_bbio(bbio);
+   btrfs_bio_counter_dec(fs_info);
return -ENOMEM;
}
 
@@ -1591,7 +1596,7 @@ static int scrub_setup_recheck_block(struct scrub_block 
*original_sblock,
spin_lock(&sctx->stat_lock);
sctx->stat.malloc_errors++;
spin_unlock(&sctx->stat_lock);
-   scrub_put_recover(recover);
+   scrub_put_recover(fs_info, recover);
return -ENOMEM;
}
scrub_page_get(page);
@@ -1633,7 +1638,7 @@ static int scrub_setup_recheck_block(struct scrub_block 
*original_sblock,
scrub_get_recover(recover);
page->recover = recover;
}
-   scrub_put_recover(recover);
+   scrub_put_recover(fs_info, recover);
length -= sublen;
logical += sublen;
page_index++;
-- 
2.12.1





[PATCH v3 4/5] btrfs: Wait for in-flight bios before freeing target device for raid56

2017-03-28 Thread Qu Wenruo
When a raid56 dev-replace is cancelled by a running scrub, we will free
the target device without waiting for in-flight bios, causing the
following NULL pointer dereference or general protection fault.

 BUG: unable to handle kernel NULL pointer dereference at 05e0
 IP: generic_make_request_checks+0x4d/0x610
 CPU: 1 PID: 11676 Comm: kworker/u4:14 Tainted: G  O4.11.0-rc2 #72
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.10.2-20170228_101828-anatol 04/01/2014
 Workqueue: btrfs-endio-raid56 btrfs_endio_raid56_helper [btrfs]
 task: 88002875b4c0 task.stack: c90001334000
 RIP: 0010:generic_make_request_checks+0x4d/0x610
 Call Trace:
  ? generic_make_request+0xc7/0x360
  generic_make_request+0x24/0x360
  ? generic_make_request+0xc7/0x360
  submit_bio+0x64/0x120
  ? page_in_rbio+0x4d/0x80 [btrfs]
  ? rbio_orig_end_io+0x80/0x80 [btrfs]
  finish_rmw+0x3f4/0x540 [btrfs]
  validate_rbio_for_rmw+0x36/0x40 [btrfs]
  raid_rmw_end_io+0x7a/0x90 [btrfs]
  bio_endio+0x56/0x60
  end_workqueue_fn+0x3c/0x40 [btrfs]
  btrfs_scrubparity_helper+0xef/0x620 [btrfs]
  btrfs_endio_raid56_helper+0xe/0x10 [btrfs]
  process_one_work+0x2af/0x720
  ? process_one_work+0x22b/0x720
  worker_thread+0x4b/0x4f0
  kthread+0x10f/0x150
  ? process_one_work+0x720/0x720
  ? kthread_create_on_node+0x40/0x40
  ret_from_fork+0x2e/0x40
 RIP: generic_make_request_checks+0x4d/0x610 RSP: c90001337bb8

In btrfs_dev_replace_finishing(), we call
btrfs_rm_dev_replace_blocked() to wait for bios before destroying the
target device when scrub finishes normally.

However when scrub is aborted, either due to an error or being
cancelled, we didn't wait for bios; this can lead to a use-after-free
if there are bios still holding the target device.

Furthermore, for raid56 scrub, at least 2 places call
btrfs_map_sblock() without the protection of bio_counter, leading to
the problem.

This patch fixes the problem by
1) Waiting on bio_counter before freeing the target device when
   cancelling the replace
2) Using bio_counter to protect the call when calling
   btrfs_map_sblock() for raid56

Cc: Liu Bo 
Signed-off-by: Qu Wenruo 
---
 fs/btrfs/dev-replace.c |  2 ++
 fs/btrfs/raid56.c  | 14 ++
 fs/btrfs/scrub.c   |  5 +
 3 files changed, 21 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e653921f05d9..b9d88136b5a9 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -546,8 +546,10 @@ static int btrfs_dev_replace_finishing(struct 
btrfs_fs_info *fs_info,
mutex_unlock(&fs_info->chunk_mutex);
mutex_unlock(&fs_info->fs_devices->device_list_mutex);
mutex_unlock(&uuid_mutex);
+   btrfs_rm_dev_replace_blocked(fs_info);
if (tgt_device)
btrfs_destroy_dev_replace_tgtdev(fs_info, tgt_device);
+   btrfs_rm_dev_replace_unblocked(fs_info);
mutex_unlock(&dev_replace->lock_finishing_cancel_unmount);
 
return scrub_ret;
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 1571bf26dc07..5c180fea32ab 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -2194,6 +2194,8 @@ static void read_rebuild_work(struct btrfs_work *work)
 /*
  * The following code is used to scrub/replace the parity stripe
  *
+ * Caller must have already increased bio_counter for getting @bbio.
+ *
  * Note: We need make sure all the pages that add into the scrub/replace
  * raid bio are correct and not be changed during the scrub/replace. That
  * is those pages just hold metadata or file data with checksum.
@@ -2231,6 +2233,12 @@ raid56_parity_alloc_scrub_rbio(struct btrfs_fs_info 
*fs_info, struct bio *bio,
ASSERT(rbio->stripe_npages == stripe_nsectors);
bitmap_copy(rbio->dbitmap, dbitmap, stripe_nsectors);
 
+   /*
+* We have already increased bio_counter when getting bbio, record it
+* so we can free it at rbio_orig_end_io().
+*/
+   rbio->generic_bio_cnt = 1;
+
return rbio;
 }
 
@@ -2673,6 +2681,12 @@ raid56_alloc_missing_rbio(struct btrfs_fs_info *fs_info, 
struct bio *bio,
return NULL;
}
 
+   /*
+* When we get bbio, we have already increased bio_counter, record it
+* so we can free it at rbio_orig_end_io()
+*/
+   rbio->generic_bio_cnt = 1;
+
return rbio;
 }
 
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index 2fd259dbf4db..b8c49074d1b3 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -2413,6 +2413,7 @@ static void scrub_missing_raid56_pages(struct scrub_block 
*sblock)
int ret;
int i;
 
+   btrfs_bio_counter_inc_blocked(fs_info);
ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
&length, &bbio, 0, 1);
if (ret || !bbio || !bbio->raid_map)
@@ -2457,6 +2458,7 @@ static void scrub_missing_raid56_pages(struct scrub_block 
*sblock)
 rbio_out:
bio_put(bio);
 

Re: How to test in-bound dedupe

2017-03-28 Thread Qu Wenruo



At 03/29/2017 01:52 AM, Jitendra wrote:

Hi Qu/All,

I am looking into in-memory in-bound dedup. I have cloned your git tree
from
following urls.

Linux Tree: https://github.com/adam900710/linux.git wang_dedupe_latest
btrfs-progs: https://g...@github.com:adam900710/btrfs-progs.git
dedupe_20170316

Then run the following test
https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe.


Thanks for having interest in in-band dedupe.



Is there any stuff that needs to be tested for it? If so, please
point me to it.


Please note that CONFIG_BTRFS_DEBUG must be set, or the inband dedupe 
ioctl will still be hidden.


Basically, you could try all xfstests with inband dedupe enabled.
While it's not officially supported yet, you could modify _scratch_mount 
and _test_mount to always enable dedupe just after mount.




My goal is to understand in-memory dedup and try to participate in on-disk
de-dup.


The on-disk dedupe backend is not provided yet, although it is planned
and not hard to implement.


Our plan is to add the on-disk dedupe backend after the in-memory dedupe
is merged and tested.



And as for understanding the current dedupe algorithm, it should not be
that hard; the really hard part is the kernel itself.


Tons of facilities are involved in the fs, from memory management to VFS
to the page cache to delayed allocation.


Thanks,
Qu


---
Jitendra







Re: send snapshot from snapshot incremental

2017-03-28 Thread Hugo Mills
On Tue, Mar 28, 2017 at 06:40:25PM -0400, J. Hart wrote:
> Don't be embarrassed.
> I'm a native speaker and still have trouble with most explanations.:-)

   You should try writing them. ;)

   Hugo ("darkling").

> 
> On 03/28/2017 06:01 PM, Jakob Schürz wrote:
> >Thanks for that explanation.
> >
> >I'm sure, i didn't understand the -c option... and my english is pretty
> >good enough for the most things I need to know in Linux-things... but
> >not for this. :-(
> >
> >
> 

-- 
Hugo Mills | What's a Nazgûl like you doing in a place like this?
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |Illiad




Re: send snapshot from snapshot incremental

2017-03-28 Thread J. Hart

Don't be embarrassed.
I'm a native speaker and still have trouble with most explanations.:-)

On 03/28/2017 06:01 PM, Jakob Schürz wrote:

Thanks for that explanation.

I'm sure, i didn't understand the -c option... and my english is pretty
good enough for the most things I need to know in Linux-things... but
not for this. :-(






Re: send snapshot from snapshot incremental

2017-03-28 Thread Jakob Schürz
Thanks for that explanation.

I'm sure, i didn't understand the -c option... and my english is pretty
good enough for the most things I need to know in Linux-things... but
not for this. :-(

Am 2017-03-26 um 22:07 schrieb Peter Grandi:
> [ ... ]
>> BUT if i take a snapshot from the system, and want to transfer
>> it to the external HD, i can not set a parent subvolume,
>> because there isn't any.
> 
> Questions like this are based on incomplete understanding of
> 'send' and 'receive', and on IRC user "darkling" explained it
> fairly well:
> 
>> When you use -c, you're telling the FS that it can expect to
>> find a sent copy of that subvol on the receiving side, and
>> that anything shared with it can be sent by reference. OK, so
>> with -c on its own, you're telling the FS that "all the data
>> in this subvol already exists on the remote".
> 
>> So, when you send your subvol, *all* of the subvol's metadata
>> is sent, and where that metadata refers to an extent that's
>> shared with the -c subvol, the extent data isn't sent, because
>> it's known to be on the other end already, and can be shared
>> directly from there.
> 
>> OK. So, with -p, there's a "base" subvol. The send subvol and
>> the -p reference subvol are both snapshots of that base (at
>> different times). The -p reference subvol, as with -c, is
>> assumed to be on the remote FS. However, because it's known to
>> be an earlier version of the same data, you can be more
>> efficient in the sending by saying "start from the earlier
>> version, and modify it in this way to get the new version"
> 
>> So, with -p, not all of the metadata is sent, because you know
>> you've already got most of it on the remote in the form of the
>> earlier version.
> 
>> So -p is "take this thing and apply these differences to it"
>> and -c is "build this thing from scratch, but you can share
>> some of the data with these sources"
> 
> Also here some additional details:
> 
>   http://logs.tvrrug.org.uk/logs/%23btrfs/2016-06-29.html#2016-06-29T22:39:59
> 
> The requirement for read-only is because in that way it is
> pretty sure that the same stuff is on both origin and target
> volume.
> 
> It may help to compare with RSYNC: it has to scan both the full
> origin and target trees, because it cannot be told that there is
> a parent tree that is the same on origin and target; but with
> option '--link-dest' it can do something similar to 'send -c'.

There is Subvolume A on the send- and the receive-side.
There is also Subvolume AA on the send-side from A.
The parent-ID from send-AA is the ID from A.
The received-ID from A on received-side A is the ID from A.

To send the AA, i use the command
btrfs send -p A AA|btrfs receive /path/to/receiveFS/

The received-ID from AA on received-side A is the ID from AA.


Now I take a snapshot of AA on the send side -> called AAA

If I try to send AAA to the receiving FS with
btrfs send -p AA AAA|btrfs receive /path/to/receiveFS/
no parent snapshot is found.
So I should rather use the -c option:

btrfs send -c AA AAA|btrfs receive /path/to/receiveFS/

Am I right?
(Sorry, I cannot test it now, I do not have my external drive here.)
I seem to remember that this didn't work in the past...

Best regards

Jakob



Re: [PATCH v2] Btrfs: fix unexpected file hole after disk errors

2017-03-28 Thread Liu Bo
On Tue, Mar 28, 2017 at 02:50:06PM +0200, David Sterba wrote:
> On Mon, Mar 06, 2017 at 12:23:30PM -0800, Liu Bo wrote:
> > Btrfs creates hole extents to cover any unwritten section right before
> > doing buffer writes after commit 3ac0d7b96a26 ("btrfs: Change the expanding
> > write sequence to fix snapshot related bug.").
> > 
> > However, that takes the start position of the buffered write to compare
> > against the current EOF, hole extents would be created only if (EOF <
> > start).
> > 
> > If the EOF is at the middle of the buffered write, no hole extents will be
> > created and a file hole without a hole extent is left in this file.
> > 
> > This bug was revealed by generic/019 in fstests.  'fsstress' in this test
> > may create the above situation and the test then fails all requests
> > including writes, so the buffer write which is supposed to cover the
> > hole (without the hole extent) couldn't make it on disk.  Running fsck
> > against such btrfs ends up with detecting file extent holes.
> > 
> > Things could be more serious, some stale data would be exposed to
> > userspace if files with this kind of hole are truncated to a position of
> > the hole, because the on-disk inode size is beyond the last extent in the
> > file.
> > 
> > This fixes the bug by comparing the end position against the EOF.
> 
> Is the test reliable? As I read it, it should be possible to craft the
> file extents and trigger the bug. And verify the fix.

It's not; running generic/019 10 times produces a btrfsck error about
once on my box.

Actually I think we don't need this patch any more since we're going
to remove hole extents.

I made a mistake when writing the above 'more serious' part. In fact,
truncating to a hole doesn't expose stale data, as we don't have any
file extent that points to the hole, and reading the hole part should
return all zeros; thus there's no serious problem.

It was another bug, about accidentally setting the disk isize to zero
so that data was lost, which has been fixed.

Thanks,

-liubo


How to test in-bound dedupe

2017-03-28 Thread Jitendra

Hi Qu/All,

I am looking into in-memory in-bound dedup. I have cloned your git tree from
following urls.

Linux Tree: https://github.com/adam900710/linux.git wang_dedupe_latest
btrfs-progs: https://g...@github.com:adam900710/btrfs-progs.git dedupe_20170316

Then run the following test 
https://btrfs.wiki.kernel.org/index.php/User_notes_on_dedupe.

Is there any stuff that needs to be tested for it? If so, please
point me to it.

My goal is to understand in-memory dedup and try to participate in on-disk
de-dup.

---
Jitendra


Re: [PATCH V2 4/4] btrfs: cleanup barrier_all_devices() to check dev stat flush error

2017-03-28 Thread David Sterba
On Tue, Mar 14, 2017 at 04:26:11PM +0800, Anand Jain wrote:
> The objective of this patch is to cleanup barrier_all_devices()
> so that the error checking is in a separate loop independent of
> of the loop which submits and waits on the device flush requests.

I think that by getting completely rid of the ENOMEM, as suggested under
patch 2, the split would not be necessary.

The send failures will be gone so we can simply count failures from the
waiting write_dev_flush(..., 1). There any returned error also implies
the device stat increase for FLUSH_ERRS. Then barrier_all_devices will
be a bit simpler.
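
Roughly something like this for the waiting loop (a sketch only, reusing
names from the existing barrier_all_devices(); the skip conditions and
the tolerated-failures check are assumptions based on the current code):

	struct list_head *head = &info->fs_devices->devices;
	struct btrfs_device *dev;
	int errors_wait = 0;

	list_for_each_entry(dev, head, dev_list) {
		if (!dev->bdev || !dev->in_fs_metadata || !dev->writeable)
			continue;
		if (write_dev_flush(dev, 1))
			errors_wait++;
	}

	if (errors_wait > info->num_tolerated_disk_barrier_failures)
		return -EIO;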

> By doing this it helps to further develop patches which would tune
> the error-actions as needed.
> 
> Here functions such as btrfs_dev_stats_dirty() couldn't be used
> because it doesn't monitor the flush errors BTRFS_DEV_STAT_FLUSH_ERRS.
> 
> Signed-off-by: Anand Jain 
> ---
> v2: Address Qu review comments viz..
>  Add meaningful names, like cp_list (for checkpoint_list head).
>  (And actually it does not need a new struct type just to hold
>   the head pointer, list node is already named as device_checkpoint).
>  Check return value of add_device_checkpoint()
>  Check if the device is already added at add_device_checkpoint()
>  Rename fini_devices_checkpoint() to rel_devices_checkpoint()
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 5719e036048b..d0c401884643 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3566,6 +3566,86 @@ static int write_dev_flush(struct btrfs_device 
> *device, int wait)
>   return 0;
>  }
>  
> +struct device_checkpoint {
> + struct list_head cp_node;
> + struct btrfs_device *device;
> + int stat_value_checkpoint;
> +};

I find this approach particularly bad for several reasons.

You need to store just one int per device but introduce a bloated
temporary structure. If anything, the int could be embedded in
btrfs_device, set and checked at the right points. Instead, you add
memory allocations to a critical context and more potential failure
paths.

> +static int add_device_checkpoint(struct btrfs_device *device,
> + struct list_head *cp_list)
> +{
...
> + cdev = kzalloc(sizeof(struct device_checkpoint), GFP_KERNEL);
...
> +}

The memory allocations are GFP_KERNEL, this could lead to strange
lockups if the allocator tries to free memory and asks the filesystem(s)
to flush their data.

Traversing the structures leads to quadratic complexity in
check_barrier_error(), where the checkpoint list is traversed for
each iteration of device list.

> @@ -3574,8 +3654,8 @@ static int barrier_all_devices(struct btrfs_fs_info 
> *info)
>  {
>   struct list_head *head;
>   struct btrfs_device *dev;
> - int dropouts = 0;
>   int ret;
> + static LIST_HEAD(cp_list);
^^

What is this supposed to mean? What if several filesystems end up in
barrier_all_devices modifying the list?


Re: Shrinking a device - performance?

2017-03-28 Thread Austin S. Hemmelgarn

On 2017-03-28 10:43, Peter Grandi wrote:

This is going to be long because I am writing something detailed
hoping pointlessly that someone in the future will find it by
searching the list archives while doing research before setting
up a new storage system, and they will be the kind of person
that tolerates reading messages longer than Twitter. :-).


I’m currently shrinking a device and it seems that the
performance of shrink is abysmal.


When I read this kind of statement I am reminded of all the
cases where someone left me to decatastrophize a storage system
built on "optimistic" assumptions. The usual "optimism" is what
I call the "syntactic approach", that is the axiomatic belief
that any syntactically valid combination of features not only
will "work", but very fast too and reliably despite slow cheap
hardware and "unattentive" configuration. Some people call that
the expectation that system developers provide or should provide
an "O_PONIES" option. In particular I get very saddened when
people use "performance" to mean "speed", as the difference
between the two is very great.

As a general consideration, shrinking a large filetree online
in-place is an amazingly risky, difficult, slow operation and
should be a last desperate resort (as apparently in this case),
regardless of the filesystem type, and expecting otherwise is
"optimistic".

My guess is that very complex risky slow operations like that
are provided by "clever" filesystem developers for "marketing"
purposes, to win box-ticking competitions. That applies to those
system developers who do know better; I suspect that even some
filesystem developers are "optimistic" as to what they can
actually achieve.
There are cases where there really is no other sane option.  Not 
everyone has the kind of budget needed for proper HA setups, and if you 
need maximal uptime and as a result have to reprovision the system 
online, then you pretty much need a filesystem that supports online 
shrinking.  Also, it's not really all that slow on most filesystems; 
BTRFS is just hurt by its comparatively poor performance, and the COW 
metadata updates that are needed.



I intended to shrink a ~22TiB filesystem down to 20TiB. This is
still using LVM underneath so that I can’t just remove a device
from the filesystem but have to use the resize command.


That is actually a very good idea because Btrfs multi-device is
not quite as reliable as DM/LVM2 multi-device.
This depends on how much you trust your storage hardware relative to how 
much you trust the kernel code.  For raid5/6, yes, BTRFS multi-device is 
currently crap.  For most people raid10 in BTRFS is too.  For raid1 mode 
however, it really is personal opinion.



Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
   Total devices 1 FS bytes used 18.21TiB
   devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy


Maybe 'balance' should have been used a bit more.


This has been running since last Thursday, so roughly 3.5days
now. The “used” number in devid1 has moved about 1TiB in this
time. The filesystem is seeing regular usage (read and write)
and when I’m suspending any application traffic I see about
1GiB of movement every now and then. Maybe once every 30
seconds or so. Does this sound fishy or normal to you?


With consistent "optimism" this is a request to assess whether
"performance" of some operations is adequate on a filetree
without telling us either what the filetree contents look like,
what the regular workload is, or what the storage layer looks
like.

Being one of the few system administrators crippled by lack of
psychic powers :-), I rely on guesses and inferences here, and
having read the whole thread containing some belated details.

From the ~22TB total capacity my guess is that the storage layer
involves rotating hard disks, and from later details the
filesystem contents seems to be heavily reflinked files of
several GB in size, and workload seems to be backups to those
files from several source hosts. Considering the general level
of "optimism" in the situation my wild guess is that the storage
layer is based on large slow cheap rotating disks in the 4TB-8TB
range, with very low IOPS-per-TB.


Thanks for that info. The 1min per 1GiB is what I saw too -
the “it can take longer” wasn’t really explainable to me.


A contemporary rotating disk device can do around 0.5MB/s
transfer rate with small random accesses with barriers up to
around 80-160MB/s in purely sequential access without barriers.

1GB/m of simultaneous read-write means around 16MB/s reads plus
16MB/s writes which is fairly good *performance* (even if slow
*speed*) considering that moving extents around, even across
disks, involves quite a bit of randomish same-disk updates of
metadata; because it all depends usually on how much randomish
metadata updates need to done, on any filesystem type, as those
must be done with barriers.


As I’m not using snapshots: would large files (100+gb)


Using 

Re: [PATCH 2/4] btrfs: Communicate back ENOMEM when it occurs

2017-03-28 Thread David Sterba
On Mon, Mar 13, 2017 at 03:42:12PM +0800, Anand Jain wrote:
> The only error that write dev flush (send) will fail is due
> to the ENOMEM then, as its not a device specific error and
> rather a system wide issue, we should rather stop further
> iterations and perpetuate the -ENOMEM error to the caller.

I think we should try harder, as flushing is a critical operation and a
simple yet unlikely memory allocation failure should not stop it.

The device::flush_bio is being allocated each time we start flush and
freed afterwards. This seems unnecessary. The proper fix IMO is to
preallocate the bio at the time the device is added to the list. The bio
lifetime is same as the device'.
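
Something like this (a rough sketch of the idea only, not a tested patch;
the bio setup below mirrors what write_dev_flush() does today, from
memory of that code):

	/* when the device is created / added to the list */
	device->flush_bio = bio_alloc(GFP_KERNEL, 0);
	if (!device->flush_bio)
		return -ENOMEM;

	/* in write_dev_flush(device, 0): reuse instead of allocating */
	struct bio *bio = device->flush_bio;

	bio_reset(bio);
	bio->bi_end_io = btrfs_end_empty_barrier;
	bio->bi_bdev = device->bdev;
	bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH;
	init_completion(&device->flush_wait);
	bio->bi_private = &device->flush_wait;
	btrfsic_submit_bio(bio);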


Re: Shrinking a device - performance?

2017-03-28 Thread Tomasz Kusmierz
I’ve glazed over on “Not only that …” … can you make youtube video of that :
> On 28 Mar 2017, at 16:06, Peter Grandi  wrote:
> 
>> I glazed over at “This is going to be long” … :)
>>> [ ... ]
> 
> Not only that, you also top-posted while quoting it pointlessly
> in its entirety, to the whole mailing list. Well played :-).
It’s because I’m special :* 

On a real note, thanks for giving a f to provide a detailed comment … too much 
of open source stuff is based on short comments :/



Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Austin S. Hemmelgarn

On 2017-03-28 09:53, Marat Khalili wrote:

There are a couple of reasons I'm advocating the specific behavior I
outlined:

Some of your points are valid, but some break current behaviour and
expectations or create technical difficulties.



1. It doesn't require any specific qgroup setup.  By definition, you
can be 100% certain that the destination qgroup exists, and that you
won't need to create new qgroups behind the user's back (given your
suggestion, what happens when qgroup 1/N doesn't exist?).

This is a general problem with current qgroups: you have to reference
them by some random numbers, not by user-assigned names like files. It
would need to be fixed sooner or later with syntax like L: in
place of L/N, or even some special syntax made specifically for path
snapshots.

BTW, what about reserving level 1 for qgroups describing subvolumes and
all their snapshots and forbidding manual management of qgroups at this
level?
If we were going to reserve something, it should be a high number, not a 
low one.  Having 0 reserved makes some sense, but reserving other low 
numbers seems kind of odd when they aren't already reserved.




2. Just because it's the default, doesn't mean that the subvolume
can't be reassigned to a different qgroup.  This also would not remove
the ability to assign a specific qgroup through the snapshot creation
command.  This is arguably a general point in favor of having any
default of course, but it's still worth pointing out.

Currently 0/N qgroups are special in that they are created automatically
and belong to the bottom of the hierarchy. It would be very nice to keep
it this way.

Changing qgroup assignments after snapshot creation is very inconvenient
because it requires quota rescan and thus blocks all other quota
operations.
The subvolume create command lets you set a specific qgroup on subvolume 
creation.  When I said it won't prevent assigning a specific qgroup, I 
meant using that, not some after-the-fact assignment, although I hadn't 
realized that the snapshot command _does not_ have this argument, when 
it absolutely should.
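For illustration, with a made-up qgroup ID, that creation-time option is:

   btrfs subvolume create -i 1/100 /mnt/vol/newsubvol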




3. Because BTRFS has COW semantics, the new snapshot should take up
near zero space in the qgroup of it's parent.


Indeed it works this way in my experiments if you assign snapshot to 1/N
qgroup at creation where 0/N also belongs. Moreover, it does not require
quota rescan, which is very nice.
I mean if it's placed in that qgroup itself.  If you create a snapshot 
of an idle subvolume, you only effectively double the metadata (and 
possibly not even that completely), so unless it's got a very large 
number of small files, it should take up next to zero space if it gets 
put in its parent's qgroup.




4. This correlates with the behavior most people expect based on ZFS
and LVM, which is that snapshots are tied to their parent.


I'm not against tying it to the parent. I'm against removing snapshot's
own qgroup.
Perhaps have an option (in the filesystem, not the mount options, this 
is tied to qgroup policy, which should be independent of who mounts the 
FS where) that specifies one of:

1. Current behavior.
2. Join parent's qgroup.
3. Join some other predefined qgroup.





At a minimum, it should belong to _some_ qgroup.  This could also be
covered by having a designated 'default' qgroup that all new
subvolumes created without a specified qgroup get put in, but I feel
that that is somewhat orthogonal to the issue of how snapshots are
handled.

It belongs to its own 0/N' qgroup, but this is not probably what you mean.
The 0/N groups are accounting only until some limit is set, correct?  If 
that is the case, then that really isn't what I mean.  Quotas can be 
used for accounting, but when used, they are often intended for 
restricting resource usage.  New subvolumes going into unrestricted 
qgroups by default completely eliminates any usability for resource 
restriction the moment anybody but the admin has any shell access to the 
system, and that's my biggest complaint against the current behavior. 
Putting a snapshot in its parent's qgroup would completely close this 
hole and provide semantics that most people are likely to understand 
pretty easily.




Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
> I glazed over at “This is going to be long” … :)
>> [ ... ]

Not only that, you also top-posted while quoting it pointlessly
in its entirety, to the whole mailing list. Well played :-).


Re: [PATCH 1/4] btrfs: REQ_PREFLUSH does not use btrfs_end_bio() completion callback

2017-03-28 Thread David Sterba
On Mon, Mar 13, 2017 at 03:42:11PM +0800, Anand Jain wrote:
> REQ_PREFLUSH bio to flush dev cache uses btrfs_end_empty_barrier()
> completion callback only, as of now, and there it accounts for dev
> stat flush errors BTRFS_DEV_STAT_FLUSH_ERRS, so remove it from the
> btrfs_end_bio().

Can you please be more specific? I was trying to find the code path that
does the supposedly duplicate accounting for BTRFS_DEV_STAT_FLUSH_ERRS,
but still not there after half an hour.

submit_stripe_bio
  btrfsic_submit_bio
btrfs_end_bio
  -> stats accounting

write_dev_flush
  btrfsic_submit_bio
btrfs_end_empty_barrier
  complete

now here it wakes up flush_wait, that is only waited for in
write_dev_flush, there the FLUSH_ERRS accounting happens, but ... I'm
not sure if I haven't lost in the bio handling maze.


Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
>  [ ... ] slaps together a large storage system in the cheapest
> and quickest way knowing that while it is mostly empty it will
> seem very fast regardless and therefore to have awesome
> performance, and then the "clever" sysadm disappears surrounded
> by a halo of glory before the storage system gets full workload
> and fills up; [ ... ]

Fortunately or unfortunately Btrfs is particularly suitable for
this technique, as it has an enormous number of checkbox-ticking
awesome looking features: transparent compression, dynamic
add/remove, online balance/scrub, different sized member devices,
online grow/shrink, online defrag, limitless scalability, online
dedup, arbitrary subvolumes and snapshots, COW and reflinking,
online conversion of RAID profiles, ... and one can use all of
them at the same time, and for the initial period where volume
workload is low and space used is not much, it will look absolutely
fantastic, cheap, flexible, always available, fast, the work of
genius of a very cool sysadm.


Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
> [ ... ] reminded of all the cases where someone left me to
> decatastrophize a storage system built on "optimistic"
> assumptions.

In particular when some "clever" sysadm with a "clever" (or
dumb) manager slaps together a large storage system in the
cheapest and quickest way knowing that while it is mostly empty
it will seem very fast regardless and therefore to have awesome
performance, and then the "clever" sysadm disappears surrounded
by a halo of glory before the storage system gets full workload
and fills up; when that happens usually I get to inherit it.
BTW, the same technique can also be done with HPC clusters.

>> I intended to shrink a ~22TiB filesystem down to 20TiB. This
>> is still using LVM underneath so that I can’t just remove a
>> device from the filesystem but have to use the resize
>> command.

>> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>> Total devices 1 FS bytes used 18.21TiB
>> devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Ahh it is indeed a filled up storage system now running a full
workload. At least it wasn't me who inherited it this time. :-)


Re: Shrinking a device - performance?

2017-03-28 Thread Tomasz Kusmierz
I glazed over at “This is going to be long” … :)

> On 28 Mar 2017, at 15:43, Peter Grandi  wrote:
> 
> This is going to be long because I am writing something detailed
> hoping pointlessly that someone in the future will find it by
> searching the list archives while doing research before setting
> up a new storage system, and they will be the kind of person
> that tolerates reading messages longer than Twitter. :-).
> 
>> I’m currently shrinking a device and it seems that the
>> performance of shrink is abysmal.
> 
> When I read this kind of statement I am reminded of all the
> cases where someone left me to decatastrophize a storage system
> built on "optimistic" assumptions. The usual "optimism" is what
> I call the "syntactic approach", that is the axiomatic belief
> that any syntactically valid combination of features not only
> will "work", but very fast too and reliably despite slow cheap
> hardware and "unattentive" configuration. Some people call that
> the expectation that system developers provide or should provide
> an "O_PONIES" option. In particular I get very saddened when
> people use "performance" to mean "speed", as the difference
> between the two is very great.
> 
> As a general consideration, shrinking a large filetree online
> in-place is an amazingly risky, difficult, slow operation and
> should be a last desperate resort (as apparently in this case),
> regardless of the filesystem type, and expecting otherwise is
> "optimistic".
> 
> My guess is that very complex risky slow operations like that
> are provided by "clever" filesystem developers for "marketing"
> purposes, to win box-ticking competitions. That applies to those
> system developers who do know better; I suspect that even some
> filesystem developers are "optimistic" as to what they can
> actually achieve.
> 
>> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
>> still using LVM underneath so that I can’t just remove a device
>> from the filesystem but have to use the resize command.
> 
> That is actually a very good idea because Btrfs multi-device is
> not quite as reliable as DM/LVM2 multi-device.
> 
>> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>>   Total devices 1 FS bytes used 18.21TiB
>>   devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy
> 
> Maybe 'balance' should have been used a bit more.
> 
>> This has been running since last Thursday, so roughly 3.5days
>> now. The “used” number in devid1 has moved about 1TiB in this
>> time. The filesystem is seeing regular usage (read and write)
>> and when I’m suspending any application traffic I see about
>> 1GiB of movement every now and then. Maybe once every 30
>> seconds or so. Does this sound fishy or normal to you?
> 
> With consistent "optimism" this is a request to assess whether
> "performance" of some operations is adequate on a filetree
> without telling us either what the filetree contents look like,
> what the regular workload is, or what the storage layer looks
> like.
> 
> Being one of the few system administrators crippled by lack of
> psychic powers :-), I rely on guesses and inferences here, and
> having read the whole thread containing some belated details.
> 
> From the ~22TB total capacity my guess is that the storage layer
> involves rotating hard disks, and from later details the
> filesystem contents seems to be heavily reflinked files of
> several GB in size, and workload seems to be backups to those
> files from several source hosts. Considering the general level
> of "optimism" in the situation my wild guess is that the storage
> layer is based on large slow cheap rotating disks in the 4TB-8TB
> range, with very low IOPS-per-TB.
> 
>> Thanks for that info. The 1min per 1GiB is what I saw too -
>> the “it can take longer” wasn’t really explainable to me.
> 
> A contemporary rotating disk device can do around 0.5MB/s
> transfer rate with small random accesses with barriers up to
> around 80-160MB/s in purely sequential access without barriers.
> 
> 1GB/m of simultaneous read-write means around 16MB/s reads plus
> 16MB/s writes which is fairly good *performance* (even if slow
> *speed*) considering that moving extents around, even across
> disks, involves quite a bit of randomish same-disk updates of
> metadata; because it all depends usually on how much randomish
> metadata updates need to done, on any filesystem type, as those
> must be done with barriers.
> 
>> As I’m not using snapshots: would large files (100+gb)
> 
> Using 100GB sized VM virtual disks (never mind with COW) seems
> very unwise to me to start with, but of course a lot of other
> people know better :-). Just like a lot of other people know
> better that large single pool storage systems are awesome in
> every respect :-): cost, reliability, speed, flexibility,
> maintenance, etc.
> 
>> with long chains of CoW history (specifically reflink copies)
>> also hurt?
> 
> Oh yes... They are about one 

Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
This is going to be long because I am writing something detailed
hoping pointlessly that someone in the future will find it by
searching the list archives while doing research before setting
up a new storage system, and they will be the kind of person
that tolerates reading messages longer than Twitter. :-).

> I’m currently shrinking a device and it seems that the
> performance of shrink is abysmal.

When I read this kind of statement I am reminded of all the
cases where someone left me to decatastrophize a storage system
built on "optimistic" assumptions. The usual "optimism" is what
I call the "syntactic approach", that is the axiomatic belief
that any syntactically valid combination of features not only
will "work", but very fast too and reliably despite slow cheap
hardware and "unattentive" configuration. Some people call that
the expectation that system developers provide or should provide
an "O_PONIES" option. In particular I get very saddened when
people use "performance" to mean "speed", as the difference
between the two is very great.

As a general consideration, shrinking a large filetree online
in-place is an amazingly risky, difficult, slow operation and
should be a last desperate resort (as apparently in this case),
regardless of the filesystem type, and expecting otherwise is
"optimistic".

My guess is that very complex risky slow operations like that
are provided by "clever" filesystem developers for "marketing"
purposes, to win box-ticking competitions. That applies to those
system developers who do know better; I suspect that even some
filesystem developers are "optimistic" as to what they can
actually achieve.

> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
> still using LVM underneath so that I can’t just remove a device
> from the filesystem but have to use the resize command.

That is actually a very good idea because Btrfs multi-device is
not quite as reliable as DM/LVM2 multi-device.
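
(For the record, with LVM underneath the order of operations matters:
shrink the Btrfs filesystem first, only then the LV, roughly like this,
with a made-up mountpoint:

    btrfs filesystem resize 20T /mnt/backy
    lvreduce -L 20T vgsys/backy

doing the lvreduce first would chop the tail off a filesystem that
still thinks it owns that space.)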

> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>Total devices 1 FS bytes used 18.21TiB
>devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Maybe 'balance' should have been used a bit more.

> This has been running since last Thursday, so roughly 3.5days
> now. The “used” number in devid1 has moved about 1TiB in this
> time. The filesystem is seeing regular usage (read and write)
> and when I’m suspending any application traffic I see about
> 1GiB of movement every now and then. Maybe once every 30
> seconds or so. Does this sound fishy or normal to you?

With consistent "optimism" this is a request to assess whether
"performance" of some operations is adequate on a filetree
without telling us either what the filetree contents look like,
what the regular workload is, or what the storage layer looks
like.

Being one of the few system administrators crippled by lack of
psychic powers :-), I rely on guesses and inferences here, and
having read the whole thread containing some belated details.

From the ~22TB total capacity my guess is that the storage layer
involves rotating hard disks, and from later details the
filesystem contents seems to be heavily reflinked files of
several GB in size, and workload seems to be backups to those
files from several source hosts. Considering the general level
of "optimism" in the situation my wild guess is that the storage
layer is based on large slow cheap rotating disks in the 4TB-8TB
range, with very low IOPS-per-TB.

> Thanks for that info. The 1min per 1GiB is what I saw too -
> the “it can take longer” wasn’t really explainable to me.

A contemporary rotating disk device can do around 0.5MB/s
transfer rate with small random accesses with barriers up to
around 80-160MB/s in purely sequential access without barriers.

1GB/m of simultaneous read-write means around 16MB/s reads plus
16MB/s writes which is fairly good *performance* (even if slow
*speed*) considering that moving extents around, even across
disks, involves quite a bit of randomish same-disk updates of
metadata; because it all usually depends on how many randomish
metadata updates need to be done, on any filesystem type, as those
must be done with barriers.
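
(Spelling out the arithmetic: 1GB moved per minute is about 1000MB/60s,
i.e. roughly 16-17MB/s, and since every extent is both read and
rewritten on the same spindles that is ~16MB/s of reads plus ~16MB/s of
writes, before counting the metadata updates and their barriers.)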

> As I’m not using snapshots: would large files (100+gb)

Using 100GB sized VM virtual disks (never mind with COW) seems
very unwise to me to start with, but of course a lot of other
people know better :-). Just like a lot of other people know
better that large single pool storage systems are awesome in
every respect :-): cost, reliability, speed, flexibility,
maintenance, etc.

> with long chains of CoW history (specifically reflink copies)
> also hurt?

Oh yes... They are about one of the worst cases for using
Btrfs. But also very "optimistic" to think that kind of stuff
can work awesomely on *any* filesystem type.

> Something I’d like to verify: does having traffic on the
> volume have the potential to delay this infinitely? [ ... ]
> it’s just slow and we’re looking forward to about 

Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Marat Khalili
There are a couple of reasons I'm advocating the specific behavior I 
outlined:
Some of your points are valid, but some break current behaviour and 
expectations or create technical difficulties.



1. It doesn't require any specific qgroup setup.  By definition, you 
can be 100% certain that the destination qgroup exists, and that you 
won't need to create new qgroups behind the user's back (given your 
suggestion, what happens when qgroup 1/N doesn't exist?).
This is a general problem with current qgroups: you have to reference 
them by some random numbers, not by user-assigned names like files. It 
would need to be fixed sooner or later with syntax like L:<name> in 
place of L/N, or even some special syntax made specifically for path 
snapshots.


BTW, what about reserving level 1 for qgroups describing subvolumes and 
all their snapshots and forbidding manual management of qgroups at this 
level?



2. Just because it's the default, doesn't mean that the subvolume 
can't be reassigned to a different qgroup.  This also would not remove 
the ability to assign a specific qgroup through the snapshot creation 
command.  This is arguably a general point in favor of having any 
default of course, but it's still worth pointing out.
Currently 0/N qgroups are special in that they are created automatically 
and belong to the bottom of the hierarchy. It would be very nice to keep 
it this way.


Changing qgroup assignments after snapshot creation is very inconvenient 
because it requires quota rescan and thus blocks all other quota operations.
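
After the fact it is something like this (IDs made up), with everything else 
waiting on the rescan to finish:

    btrfs qgroup assign 0/260 1/100 /mnt
    btrfs quota rescan -w /mnt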



3. Because BTRFS has COW semantics, the new snapshot should take up 
near zero space in the qgroup of its parent.


Indeed it works this way in my experiments if you assign snapshot to 1/N 
qgroup at creation where 0/N also belongs. Moreover, it does not require 
quota rescan, which is very nice.
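
In other words, something along these lines (qgroup IDs and paths made up):

    btrfs qgroup create 1/100 /mnt
    btrfs subvolume create -i 1/100 /mnt/subvol
    btrfs qgroup limit 10G 1/100 /mnt
    # later: the snapshot joins 1/100 at creation time, no rescan needed
    btrfs subvolume snapshot -i 1/100 /mnt/subvol /mnt/subvol-snap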



4. This correlates with the behavior most people expect based on ZFS 
and LVM, which is that snapshots are tied to their parent.


I'm not against tying it to the parent. I'm against removing snapshot's 
own qgroup.



At a minimum, it should belong to _some_ qgroup.  This could also be 
covered by having a designated 'default' qgroup that all new 
subvolumes created without a specified qgroup get put in, but I feel 
that that is somewhat orthogonal to the issue of how snapshots are 
handled. 

It belongs to its own 0/N' qgroup, but this is probably not what you mean.


--

With Best Regards,
Marat Khalili



[PATCH] btrfs: sink GFP flags parameter to tree_mod_log_insert_move

2017-03-28 Thread David Sterba
All (1) callers pass the same value.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 7dc8844037e0..d034d47c5470 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -567,7 +567,7 @@ tree_mod_log_insert_key(struct btrfs_fs_info *fs_info,
 static noinline int
 tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
 struct extent_buffer *eb, int dst_slot, int src_slot,
-int nr_items, gfp_t flags)
+int nr_items)
 {
struct tree_mod_elem *tm = NULL;
struct tree_mod_elem **tm_list = NULL;
@@ -578,11 +578,11 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
if (!tree_mod_need_log(fs_info, eb))
return 0;
 
-   tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), flags);
+   tm_list = kcalloc(nr_items, sizeof(struct tree_mod_elem *), GFP_NOFS);
if (!tm_list)
return -ENOMEM;
 
-   tm = kzalloc(sizeof(*tm), flags);
+   tm = kzalloc(sizeof(*tm), GFP_NOFS);
if (!tm) {
ret = -ENOMEM;
goto free_tms;
@@ -596,7 +596,7 @@ tree_mod_log_insert_move(struct btrfs_fs_info *fs_info,
 
for (i = 0; i + dst_slot < src_slot && i < nr_items; i++) {
tm_list[i] = alloc_tree_mod_elem(eb, i + dst_slot,
-   MOD_LOG_KEY_REMOVE_WHILE_MOVING, flags);
+   MOD_LOG_KEY_REMOVE_WHILE_MOVING, GFP_NOFS);
if (!tm_list[i]) {
ret = -ENOMEM;
goto free_tms;
@@ -873,7 +873,7 @@ tree_mod_log_eb_move(struct btrfs_fs_info *fs_info, struct 
extent_buffer *dst,
 {
int ret;
ret = tree_mod_log_insert_move(fs_info, dst, dst_offset, src_offset,
-  nr_items, GFP_NOFS);
+  nr_items);
BUG_ON(ret < 0);
 }
 
-- 
2.12.0



[PATCH] btrfs: sink GFP flags parameter to tree_mod_log_insert_root

2017-03-28 Thread David Sterba
All (1) callers pass the same value.

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index d034d47c5470..165e7ec12af7 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -663,7 +663,7 @@ __tree_mod_log_free_eb(struct btrfs_fs_info *fs_info,
 static noinline int
 tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
 struct extent_buffer *old_root,
-struct extent_buffer *new_root, gfp_t flags,
+struct extent_buffer *new_root,
 int log_removal)
 {
struct tree_mod_elem *tm = NULL;
@@ -678,14 +678,14 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
if (log_removal && btrfs_header_level(old_root) > 0) {
nritems = btrfs_header_nritems(old_root);
tm_list = kcalloc(nritems, sizeof(struct tree_mod_elem *),
- flags);
+ GFP_NOFS);
if (!tm_list) {
ret = -ENOMEM;
goto free_tms;
}
for (i = 0; i < nritems; i++) {
tm_list[i] = alloc_tree_mod_elem(old_root, i,
-   MOD_LOG_KEY_REMOVE_WHILE_FREEING, flags);
+   MOD_LOG_KEY_REMOVE_WHILE_FREEING, GFP_NOFS);
if (!tm_list[i]) {
ret = -ENOMEM;
goto free_tms;
@@ -693,7 +693,7 @@ tree_mod_log_insert_root(struct btrfs_fs_info *fs_info,
}
}
 
-   tm = kzalloc(sizeof(*tm), flags);
+   tm = kzalloc(sizeof(*tm), GFP_NOFS);
if (!tm) {
ret = -ENOMEM;
goto free_tms;
@@ -943,7 +943,7 @@ tree_mod_log_set_root_pointer(struct btrfs_root *root,
 {
int ret;
ret = tree_mod_log_insert_root(root->fs_info, root->node,
-  new_root_node, GFP_NOFS, log_removal);
+  new_root_node, log_removal);
BUG_ON(ret < 0);
 }
 
-- 
2.12.0



[PATCH] btrfs: track exclusive filesystem operation in flags

2017-03-28 Thread David Sterba
There are several operations, usually started from ioctls, that cannot
run concurrently. The status is tracked in
mutually_exclusive_operation_running as an atomic_t. We can easily track
the status as one of the per-filesystem flag bits with same
synchronization guarantees.

The conversion replaces:

* atomic_xchg(..., 1)->   test_and_set_bit(FLAG, ...)
* atomic_set(..., 0) ->   clear_bit(FLAG, ...)

Signed-off-by: David Sterba 
---
 fs/btrfs/ctree.h   |  7 +--
 fs/btrfs/dev-replace.c |  5 ++---
 fs/btrfs/ioctl.c   | 33 +++--
 fs/btrfs/volumes.c |  6 +++---
 4 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..2521752c4e4a 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -702,6 +702,11 @@ struct btrfs_delayed_root;
 #define BTRFS_FS_BTREE_ERR 11
 #define BTRFS_FS_LOG1_ERR  12
 #define BTRFS_FS_LOG2_ERR  13
+/*
+ * Indicate that a whole-filesystem exclusive operation is running
+ * (device replace, resize, device add/delete, balance)
+ */
+#define BTRFS_FS_EXCL_OP   14
 
 struct btrfs_fs_info {
u8 fsid[BTRFS_FSID_SIZE];
@@ -1067,8 +1072,6 @@ struct btrfs_fs_info {
/* device replace state */
struct btrfs_dev_replace dev_replace;
 
-   atomic_t mutually_exclusive_operation_running;
-
struct percpu_counter bio_counter;
wait_queue_head_t replace_wait;
 
diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e653921f05d9..de7b2c897fe0 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -784,8 +784,7 @@ int btrfs_resume_dev_replace_async(struct btrfs_fs_info 
*fs_info)
}
btrfs_dev_replace_unlock(dev_replace, 1);
 
-   WARN_ON(atomic_xchg(
-   &fs_info->mutually_exclusive_operation_running, 1));
+   WARN_ON(test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags));
task = kthread_run(btrfs_dev_replace_kthread, fs_info, "btrfs-devrepl");
return PTR_ERR_OR_ZERO(task);
 }
@@ -814,7 +813,7 @@ static int btrfs_dev_replace_kthread(void *data)
(unsigned int)progress);
}
btrfs_dev_replace_continue_on_mount(fs_info);
-   atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+   clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
 
return 0;
 }
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index dabfc7ac48a6..a29dc3fd7152 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1504,7 +1504,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
if (ret)
return ret;
 
-   if (atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1)) {
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
mnt_drop_write_file(file);
return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
}
@@ -1619,7 +1619,7 @@ static noinline int btrfs_ioctl_resize(struct file *file,
kfree(vol_args);
 out:
mutex_unlock(&fs_info->volume_mutex);
-   atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+   clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
mnt_drop_write_file(file);
return ret;
 }
@@ -2661,7 +2661,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info 
*fs_info, void __user *arg)
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
 
-   if (atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1))
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags))
return BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
 
mutex_lock(&fs_info->volume_mutex);
@@ -2680,7 +2680,7 @@ static long btrfs_ioctl_add_dev(struct btrfs_fs_info 
*fs_info, void __user *arg)
kfree(vol_args);
 out:
mutex_unlock(&fs_info->volume_mutex);
-   atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+   clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
return ret;
 }
 
@@ -2708,7 +2708,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void 
__user *arg)
if (vol_args->flags & ~BTRFS_VOL_ARG_V2_FLAGS_SUPPORTED)
return -EOPNOTSUPP;
 
-   if (atomic_xchg(&fs_info->mutually_exclusive_operation_running, 1)) {
+   if (test_and_set_bit(BTRFS_FS_EXCL_OP, &fs_info->flags)) {
ret = BTRFS_ERROR_DEV_EXCL_RUN_IN_PROGRESS;
goto out;
}
@@ -2721,7 +2721,7 @@ static long btrfs_ioctl_rm_dev_v2(struct file *file, void 
__user *arg)
ret = btrfs_rm_device(fs_info, vol_args->name, 0);
}
mutex_unlock(&fs_info->volume_mutex);
-   atomic_set(&fs_info->mutually_exclusive_operation_running, 0);
+   clear_bit(BTRFS_FS_EXCL_OP, &fs_info->flags);
 
if (!ret) {
if (vol_args->flags & BTRFS_DEVICE_SPEC_BY_ID)
@@ -2752,7 +2752,7 @@ static long btrfs_ioctl_rm_dev(struct file *file, void 
__user *arg)
if (ret)
  

Re: [PATCH v2] Btrfs: fix unexpected file hole after disk errors

2017-03-28 Thread David Sterba
On Mon, Mar 06, 2017 at 12:23:30PM -0800, Liu Bo wrote:
> Btrfs creates hole extents to cover any unwritten section right before
> doing buffer writes after commit 3ac0d7b96a26 ("btrfs: Change the expanding
> write sequence to fix snapshot related bug.").
> 
> However, that takes the start position of the buffered write to compare
> against the current EOF, hole extents would be created only if (EOF <
> start).
> 
> If the EOF is at the middle of the buffered write, no hole extents will be
> created and a file hole without a hole extent is left in this file.
> 
> This bug was revealed by generic/019 in fstests.  'fsstress' in this test
> may create the above situation and the test then fails all requests
> including writes, so the buffer write which is supposed to cover the
> hole (without the hole extent) couldn't make it on disk.  Running fsck
> against such btrfs ends up with detecting file extent holes.
> 
> Things could be more serious, some stale data would be exposed to
> userspace if files with this kind of hole are truncated to a position of
> the hole, because the on-disk inode size is beyond the last extent in the
> file.
> 
> This fixes the bug by comparing the end position against the EOF.

Is the test reliable? As I read it, it should be possible to craft the
file extents and trigger the bug. And verify the fix.


[PATCH] btrfs: drop redundant parameters from btrfs_map_sblock

2017-03-28 Thread David Sterba
All callers pass 0 for mirror_num and 1 for need_raid_map.

Signed-off-by: David Sterba 
---
 fs/btrfs/scrub.c   | 6 +++---
 fs/btrfs/volumes.c | 6 ++
 fs/btrfs/volumes.h | 3 +--
 3 files changed, 6 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb1239f..c1a1fa6e9cce 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -1331,7 +1331,7 @@ static int scrub_setup_recheck_block(struct scrub_block 
*original_sblock,
 * represents one mirror
 */
ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS,
-   logical, &mapped_length, &bbio, 0, 1);
+   logical, &mapped_length, &bbio);
if (ret || !bbio || mapped_length < sublen) {
btrfs_put_bbio(bbio);
return -EIO;
@@ -2188,7 +2188,7 @@ static void scrub_missing_raid56_pages(struct scrub_block 
*sblock)
int i;
 
ret = btrfs_map_sblock(fs_info, BTRFS_MAP_GET_READ_MIRRORS, logical,
-   &length, &bbio, 0, 1);
+   &length, &bbio);
if (ret || !bbio || !bbio->raid_map)
goto bbio_out;
 
@@ -2776,7 +2776,7 @@ static void scrub_parity_check_and_repair(struct 
scrub_parity *sparity)
 
length = sparity->logic_end - sparity->logic_start;
ret = btrfs_map_sblock(fs_info, BTRFS_MAP_WRITE, sparity->logic_start,
-  &length, &bbio, 0, 1);
+  &length, &bbio);
if (ret || !bbio || !bbio->raid_map)
goto bbio_out;
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 73d56eef5e60..e8e7b1f9b6d9 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -5886,11 +5886,9 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum 
btrfs_map_op op,
 /* For Scrub/replace */
 int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 u64 logical, u64 *length,
-struct btrfs_bio **bbio_ret, int mirror_num,
-int need_raid_map)
+struct btrfs_bio **bbio_ret)
 {
-   return __btrfs_map_block(fs_info, op, logical, length, bbio_ret,
-mirror_num, need_raid_map);
+   return __btrfs_map_block(fs_info, op, logical, length, bbio_ret, 0, 1);
 }
 
 int btrfs_rmap_block(struct btrfs_fs_info *fs_info,
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 59be81206dd7..a399d40686dc 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -400,8 +400,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, enum 
btrfs_map_op op,
struct btrfs_bio **bbio_ret, int mirror_num);
 int btrfs_map_sblock(struct btrfs_fs_info *fs_info, enum btrfs_map_op op,
 u64 logical, u64 *length,
-struct btrfs_bio **bbio_ret, int mirror_num,
-int need_raid_map);
+struct btrfs_bio **bbio_ret);
 int btrfs_rmap_block(struct btrfs_fs_info *fs_info,
 u64 chunk_start, u64 physical, u64 devid,
 u64 **logical, int *naddrs, int *stripe_len);
-- 
2.12.0



Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Austin S. Hemmelgarn

On 2017-03-28 08:00, Marat Khalili wrote:

The default should be to inherit the qgroup of the parent subvolume.

This behaviour is only good for this particular use-case. In general
case, qgroups of subvolume and snapshots should exist separately, and
both can be included in some higher level qgroup (after all, that's what
qgroup hierarchy is for).

In my system I found it convenient to include subvolume and its
snapshots in qgroup 1/N, where 0/N is qgroup of bare subvolume. I think
adopting this behaviour as default would be more sensible.
There are a couple of reasons I'm advocating the specific behavior I 
outlined:
1. It doesn't require any specific qgroup setup.  By definition, you can 
be 100% certain that the destination qgroup exists, and that you won't 
need to create new qgroups behind the user's back (given your 
suggestion, what happens when qgroup 1/N doesn't exist?).
2. Just because it's the default, doesn't mean that the subvolume can't 
be reassigned to a different qgroup.  This also would not remove the 
ability to assign a specific qgroup through the snapshot creation 
command.  This is arguably a general point in favor of having any 
default of course, but it's still worth pointing out.
3. Because BTRFS has COW semantics, the new snapshot should take up near 
zero space in the qgroup of its parent.
4. This correlates with the behavior most people expect based on ZFS and 
LVM, which is that snapshots are tied to their parent.


At a minimum, it should belong to _some_ qgroup.  This could also be 
covered by having a designated 'default' qgroup that all new subvolumes 
created without a specified qgroup get put in, but I feel that that is 
somewhat orthogonal to the issue of how snapshots are handled.



--

With Best Regards,
Marat Khalili

On 28/03/17 14:24, Austin S. Hemmelgarn wrote:

On 2017-03-27 15:32, Chris Murphy wrote:

How about if qgroups are enabled, then non-root user is prevented from
creating new subvolumes?

Or is there a way for a new nested subvolume to be included in its
parent's quota, rather than the new subvolume having a whole new quota
limit?

Tricky problem.

The default should be to inherit the qgroup of the parent subvolume.
The organization of subvolumes is hierarchical, and sane people expect
things to behave as they look.  Taking another angle, on ZFS, 'nested'
(nested in quotes because ZFS' definition of 'nested' zvols is weird)
inherit their parent's quota and reservations (essentially reverse
quota), and they're not even inherently nested in the filesystem like
subvolumes are, so we're differing from the only other widely used
system that implements things in a similar manner.

As far as the subvolume thing, there should be an option to disable
user creation of subvolumes, and ideally it should be on by default
because:
1. Users can't delete subvolumes by default.  This means they can
create but not destroy a resource by default, which means that a user
can pretty easily accidentally cause issues for the system as a whole.
2. Correlating with 1, users being able to delete subvolumes by
default is not safe on multiple levels (easy accidental data loss,
numerous other issues), and thus user subvolume removal being off by
default is significantly safer.



Re: [PATCH v5] qgroup: Retry after commit on getting EDQUOT

2017-03-28 Thread David Sterba
On Mon, Mar 27, 2017 at 01:13:46PM -0500, Goldwyn Rodrigues wrote:
> 
> 
> On 03/27/2017 12:36 PM, David Sterba wrote:
> > On Mon, Mar 27, 2017 at 12:29:57PM -0500, Goldwyn Rodrigues wrote:
> >> From: Goldwyn Rodrigues 
> >>
> >> We are facing the same problem with EDQUOT which was experienced with
> >> ENOSPC. Not sure if we require a full ticketing system such as ENOSPC, but
> >> here is a quick fix, which may be too big a hammer.
> >>
> >> Quotas are reserved during the start of an operation, incrementing
> >> qg->reserved. However, it is written to disk in a commit_transaction
> >> which could take as long as commit_interval. In the meantime there
> >> could be deletions which are not accounted for because deletions are
> >> accounted for only while committed (free_refroot). So, when we get
> >> a EDQUOT flush the data to disk and try again.
> >>
> >> This fixes fstests btrfs/139.
> >>
> >> Signed-off-by: Goldwyn Rodrigues 
> >> ---
> >> Changes since v1:
> >>  - Changed start_delalloc_roots() to start_delalloc_inode() to target
> >>the root in question only to reduce the amount of flush to be done.
> >>  - Added wait_ordered_extents().
> >>
> >> Changes since v2:
> >>   - Revised patch header
> >>   - removed comment on combining conditions
> >>   - removed test case, to be done in fstests
> >>
> >> Changes since v3:
> >>   - testcase reinstated
> >>   - return value checks
> >>
> >> Changes since v4:
> >>  - removed testcase since btrfs/139 got incorporated in fstests
> > 
> > The point was to keep the test inside the changelog as well.
> 
> 
> Yes, that was before we had btrfs/139 in fstest. I have put it in the
> patch header. However, if you really want the test script back, I can
> put it there. Let me know.

I think the test steps are worth keeping in the changelog, even if
there's a fstest.  I'll copy it from v4, patch added to 4.12 queue.


Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Marat Khalili

The default should be to inherit the qgroup of the parent subvolume.
This behaviour is only good for this particular use-case. In general 
case, qgroups of subvolume and snapshots should exist separately, and 
both can be included in some higher level qgroup (after all, that's what 
qgroup hierarchy is for).


In my system I found it convenient to include subvolume and its 
snapshots in qgroup 1/N, where 0/N is qgroup of bare subvolume. I think 
adopting this behaviour as default would be more sensible.


--

With Best Regards,
Marat Khalili

On 28/03/17 14:24, Austin S. Hemmelgarn wrote:

On 2017-03-27 15:32, Chris Murphy wrote:

How about if qgroups are enabled, then non-root user is prevented from
creating new subvolumes?

Or is there a way for a new nested subvolume to be included in its
parent's quota, rather than the new subvolume having a whole new quota
limit?

Tricky problem.
The default should be to inherit the qgroup of the parent subvolume. 
The organization of subvolumes is hierarchical, and sane people expect 
things to behave as they look.  Taking another angle, on ZFS, 'nested' 
(nested in quotes because ZFS' definition of 'nested' zvols is weird) 
inherit their parent's quota and reservations (essentially reverse 
quota), and they're not even inherently nested in the filesystem like 
subvolumes are, so we're differing from the only other widely used 
system that implements things in a similar manner.


As far as the subvolume thing, there should be an option to disable 
user creation of subvolumes, and ideally it should be on by default 
because:
1. Users can't delete subvolumes by default.  This means they can 
create but not destroy a resource by default, which means that a user 
can pretty easily accidentally cause issues for the system as a whole.
2. Correlating with 1, users being able to delete subvolumes by 
default is not safe on multiple levels (easy accidental data loss, 
numerous other issues), and thus user subvolume removal being off by 
default is significantly safer.




Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Austin S. Hemmelgarn

On 2017-03-27 21:49, Qu Wenruo wrote:



At 03/27/2017 08:01 PM, Austin S. Hemmelgarn wrote:

On 2017-03-27 07:02, Moritz Sichert wrote:

Am 27.03.2017 um 05:46 schrieb Qu Wenruo:



At 03/27/2017 11:26 AM, Andrei Borzenkov wrote:

27.03.2017 03:39, Qu Wenruo wrote:



At 03/26/2017 06:03 AM, Moritz Sichert wrote:

Hi,

I tried to configure qgroups on a btrfs filesystem but was really
surprised that when you snapshot a subvolume, the snapshot will
not be
assigned to the qgroup the subvolume was in.

As an example consider the small terminal session in the
attachment: I
create a subvol A, assign it to qgroup 1/1 and set a limit of 5M on
that qgroup. Then I write a file into A and eventually get "disk
quota
exceeded". Then I create a snapshot of A and call it B. B will
not be
assigned to 1/1 and writing a file into B confirms that no limits at
all are imposed for B.
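
Roughly, with made-up qgroup IDs, that session looks like:

    btrfs quota enable /mnt
    btrfs subvolume create /mnt/A
    btrfs qgroup create 1/1 /mnt
    btrfs qgroup assign 0/256 1/1 /mnt    # 0/256 = A's auto-created qgroup
    btrfs qgroup limit 5M 1/1 /mnt
    dd if=/dev/zero of=/mnt/A/f bs=1M count=10    # fails: Disk quota exceeded
    btrfs subvolume snapshot /mnt/A /mnt/B
    dd if=/dev/zero of=/mnt/B/f bs=1M count=10    # succeeds, B has no limit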

I feel like I must be missing something here. Considering that
creating a snapshot does not require root privileges this would mean
that any user can just circumvent any quota and therefore make them
useless.

Is there a way to enforce quotas even when a user creates snapshots?



Yes, there is always method to attach the subvolume/snapshot to
specified higher level qgroup.

Just use "btrfs subvolume snapshot -i 1/1".



This requires cooperation from whoever creates subvolume, while the
question was - is it possible to enforce it, without need for explicit
option/action when snapshot is created.

To reiterate - if user omits "-i 1/1" (s)he "escapes" from quota
enforcement.


What if the user really wants to create a subvolume assigned to another group?

You're implying a *policy* that if source subvolume belongs to a
higher level qgroup, then snapshot created should also follow that
higher level qgroup.

However the kernel should only provide *mechanism*, not *policy*.
And btrfs does that: it provides the method to do it; whether to do so or not
is the user's responsibility.

If you want to implement that policy, please do it in a higher level,
something like SUSE snapper, not in kernel.


The problem is, I can't enforce the policy because *every user* can
create snapshots. Even if I would restrict the btrfs executable so
that only root can execute it, this doesn't help. As using the ioctl
for btrfs is allowed for any user, they could just get the executable
from somewhere else.

To reiterate and reinforce this:
If it is not possible to enforce new subvolumes counting for their
parent quota, and there is no option to prevent non-root (or
non-CAP_SYS_ADMIN) users from creating new subvolumes, then BTRFS
qgroups are useless on any system with shell access because a user can
trivially escape their quota restrictions (or hide from accounting) by
creating a new subvolume which is outside of their qgroup and storing
data there.
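
For example, an unprivileged user only needs something like (paths made up):

    $ btrfs subvolume create ~/escape
    $ cp -r bigdata/ ~/escape/

and the new subvolume gets its own fresh 0/N qgroup with no limit set, so
nothing written there counts against whatever qgroup the admin put the
user's home directory in.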

Ideally, there should be an option to disable user subvolume creation
(it arguably should be the default, because of resource exhaustion
issues, but that's a separate argument), and there should be an option
in the kernel to force specific behavior.  Both cases are policy, but
they are policy that can only be concretely enforced _by the kernel_.



The problem is, how should we treat subvolume.

Btrfs subvolume sits in the middle of directory and (logical) volume
used in traditional stacked solution.

While we allow normal user to create/delete/modify dir as long as they
follow access control, we require privilege to create/delete/modify
volumes.
No, we require privilege to do certain modifications or delete 
subvolumes.  Regular users can create subvolumes with no privileges 
whatsoever, and most basic directory operations (rename, chown, chmod, 
etc) work just fine within normal UNIX DAC permissions.  Unless you're 
running some specially patched kernel or some LSM (SELinux possibly) 
that somehow restricts access to the ioctl, you can always create 
subvolumes.


This is part of the reason that I'm personally hesitant to use BTRFS on 
systems where end users have shell access; it's a DoS waiting to happen.


Developers chose to treat the btrfs subvolume as a dir, which makes it quite
easy to operate for the normal use case, sacrificing the qgroup limit, which
was not a major function (or did not even exist) at that time.

IIRC at the beginning of btrfs, we didn't have a full idea of what the use
cases could be.
This is common; a lot of problems (even bad design) can only be found
after enough feedback from end users.

Personally speaking, I prefer to restrict subvolume creation/deletion to
privileged users only, and to use a daemon as a proxy to do such privileged
operations.
So we can do better accounting/access control without bothering the kernel.
I will agree that a daemon for this can be useful, but even if we add a 
mount option to restrict operations on subvolumes by normal users, we 
can still provide the option of a daemon.  On something like a single 
user system, there is not much advantage to having some complex access 
control in place.  I'm not saying we should 

Re: [PATCH v2 0/2] Cleanup for some hardcoded constants

2017-03-28 Thread David Sterba
On Thu, Mar 16, 2017 at 10:04:32AM -0600, ednadol...@gmail.com wrote:
> From: Edmund Nadolski 
> 
> This series replaces several hard-coded values with descriptive
> symbols.
> 
> ---
> v2:
>  + rename SEQ_NONE to SEQ_LAST and move definition to ctree.h
>  + clarify comment at __merge_refs()
> 
> Edmund Nadolski (2):
>   btrfs: provide enumeration for __merge_refs mode argument
>   btrfs: replace hardcoded value with SEQ_LAST macro

Added to 4.12 queue, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Qgroups are not applied when snapshotting a subvol?

2017-03-28 Thread Austin S. Hemmelgarn

On 2017-03-27 15:32, Chris Murphy wrote:

How about if qgroups are enabled, then non-root user is prevented from
creating new subvolumes?

Or is there a way for a new nested subvolume to be included in its
parent's quota, rather than the new subvolume having a whole new quota
limit?

Tricky problem.
The default should be to inherit the qgroup of the parent subvolume. 
The organization of subvolumes is hierarchical, and sane people expect 
things to behave as they look.  Taking another angle, on ZFS, 'nested' 
(nested in quotes because ZFS' definition of 'nested' zvols is weird) 
inherit their parent's quota and reservations (essentially reverse 
quota), and they're not even inherently nested in the filesystem like 
subvolumes are, so we're differing from the only other widely used 
system that implements things in a similar manner.


As far as the subvolume thing, there should be an option to disable user 
creation of subvolumes, and ideally it should be on by default because:
1. Users can't delete subvolumes by default.  This means they can create 
but not destroy a resource by default, which means that a user can 
pretty easily accidentally cause issues for the system as a whole.
2. Correlating with 1, users being able to delete subvolumes by default 
is not safe on multiple levels (easy accidental data loss, numerous 
other issues), and thus user subvolume removal being off by default is 
significantly safer.




Re: [PATCH] btrfs-progs: fix missing __error symbol in libbtrfs.so.0

2017-03-28 Thread David Sterba
On Mon, Mar 27, 2017 at 10:07:20PM +0100, sly...@gmail.com wrote:
> From: Sergei Trofimovich 
> 
> The easiest way to reproduce the error is to try to build
> btrfs-progs with
> $ make LDFLAGS=-Wl,--no-undefined
> 
> btrfs-list.o: In function `lookup_ino_path':
> btrfs-list.c:(.text+0x7d2): undefined reference to `__error'
> 
> Noticed by Denis Descheneaux when snapper tool
> stopped working after upgrade to btrfs-progs-4.10.
> 
> As soname didn't change in 4.9 -> 4.10 release
> I assume it's just an object file omission
> in library depends and not the API/ABI change
> of the library error printing.
> 
> Cc: linux-btrfs@vger.kernel.org
> Cc: Mike Gilbert 
> Reported-by: Denis Descheneaux
> Bug: https://bugs.gentoo.org/show_bug.cgi?id=613890
> 
> Signed-off-by: Sergei Trofimovich 

I've replaced my patch with yours, thanks.


Re: __link_block_group uses GFP_KERNEL

2017-03-28 Thread Denis Kirjanov
On 3/27/17, David Sterba  wrote:
> On Sat, Mar 25, 2017 at 09:48:28AM +0300, Denis Kirjanov wrote:
>> On 3/25/17, Jeff Mahoney  wrote:
>> > On 3/24/17 5:02 AM, Denis Kirjanov wrote:
>> >> Hi guys,
>> >>
>> >> Looks like the current code does a GFP_KERNEL allocation inside
>> >> __link_block_group.
>> >> the function invokes kobject_add and internally creates sysfs files
>> >> with the GFP_KERNEL flag set.
>> >
>> > Yep, that's a bug.
>> >
>> >> But since do_chunk_alloc executes inside the btrfs transaction it's
>> >> not allowed to sleep.
>> >
>> > It's allowed to sleep but isn't allowed to do reclaim that involves
>> > file
>> > system writeback.  Michal Hocko's allocation context idea would fix
>> > this, but it's not there yet, so we'll need to defer the kobject_add
>> > until we can use GFP_KERNEL.
>>
>> Ok, I see. Can you point out to the initial patchset?
>
> https://lwn.net/Articles/716323/
>
> Fixing this properly is a lot of work so we might need to add a
> temporary workaround, as Jeff suggests, to move calling into sysfs to a
> later time.
>

Care to send a patch?
Or I can dig a bit.

Thanks!


Re: [PATCH v2 1/5] btrfs: scrub: Introduce full stripe lock for RAID56

2017-03-28 Thread Qu Wenruo

Thanks for the review first.

At 03/28/2017 12:38 AM, David Sterba wrote:

On Fri, Mar 24, 2017 at 10:00:23AM +0800, Qu Wenruo wrote:

Unlike mirror based profiles, RAID5/6 recovery needs to read out the
whole full stripe.

And if we don't do proper protection, it can easily cause race conditions.

Introduce 2 new functions: lock_full_stripe() and unlock_full_stripe()
for RAID5/6.
They store an rb_tree of mutexes for full stripes, so scrub callers can
use them to lock a full stripe and avoid races.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |   4 ++
 fs/btrfs/extent-tree.c |   3 +
 fs/btrfs/scrub.c   | 177 +
 3 files changed, 184 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 29b7fc28c607..4d570e1040e6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -648,6 +648,10 @@ struct btrfs_block_group_cache {
 * Protected by free_space_lock.
 */
int needs_free_space;
+
+   /* Scrub full stripe lock tree for RAID5/6 scrub */
+   struct rb_root scrub_lock_root;


The name scrub_lock_root is a bit confusing, it should better describe
what object the tree stores.


I'll encapsulate this rb_root and spinlock into a new structure called 
full_stripe_locks_tree.


I tried to rename them to other names, but "lock" and "root/tree" make 
it quite hard to name the spinlock.

(name like "full_stripe_locks_root_lock" is just super stupid)




+   spinlock_t scrub_lock;


And the lock could be named accordingly.


 };

 /* delayed seq elem */
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index be5477676cc8..db5d9f84535e 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -131,6 +131,7 @@ void btrfs_put_block_group(struct btrfs_block_group_cache 
*cache)
if (atomic_dec_and_test(&cache->count)) {
WARN_ON(cache->pinned > 0);
WARN_ON(cache->reserved > 0);
+   WARN_ON(!RB_EMPTY_ROOT(&cache->scrub_lock_root));
kfree(cache->free_space_ctl);
kfree(cache);
}
@@ -9907,6 +9908,8 @@ btrfs_create_block_group_cache(struct btrfs_fs_info 
*fs_info,

atomic_set(&cache->count, 1);
spin_lock_init(&cache->lock);
+   spin_lock_init(&cache->scrub_lock);
+   cache->scrub_lock_root = RB_ROOT;
init_rwsem(&cache->data_rwsem);
INIT_LIST_HEAD(&cache->list);
INIT_LIST_HEAD(&cache->cluster_list);
diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index b0251eb1239f..38b300ac4567 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -240,6 +240,13 @@ struct scrub_warning {
struct btrfs_device *dev;
 };

+struct scrub_full_stripe_lock {
+   struct rb_node node;
+   u64 logical;
+   u64 refs;
+   struct mutex mutex;
+};
+
 static void scrub_pending_bio_inc(struct scrub_ctx *sctx);
 static void scrub_pending_bio_dec(struct scrub_ctx *sctx);
 static void scrub_pending_trans_workers_inc(struct scrub_ctx *sctx);
@@ -349,6 +356,176 @@ static void scrub_blocked_if_needed(struct btrfs_fs_info 
*fs_info)
 }

 /*
+ * Caller must hold cache->scrub_root_lock.
+ *
+ * Return existing full stripe and increase it refs
+ * Or return NULL, and insert @fstripe_lock into the bg cache
+ */
+static struct scrub_full_stripe_lock *
+add_scrub_lock(struct btrfs_block_group_cache *cache,
+  struct scrub_full_stripe_lock *fstripe_lock)


Please stick to the function prototype common in that file, ie. type and
name on the same line. The arguments can go below it.


Understood.



The function name is also quite confusing.


What about insert_full_stripe_lock() ?




+{
+   struct rb_node **p;
+   struct rb_node *parent = NULL;
+   struct scrub_full_stripe_lock *entry;
+
+   p = &cache->scrub_lock_root.rb_node;
+   while (*p) {
+   parent = *p;
+   entry = rb_entry(parent, struct scrub_full_stripe_lock, node);
+   if (fstripe_lock->logical < entry->logical) {
+   p = &(*p)->rb_left;
+   } else if (fstripe_lock->logical > entry->logical) {
+   p = &(*p)->rb_right;
+   } else {
+   entry->refs++;
+   return entry;
+   }
+   }
+   /* Insert new one */
+   rb_link_node(&fstripe_lock->node, parent, p);
+   rb_insert_color(&fstripe_lock->node, &cache->scrub_lock_root);
+
+   return NULL
+}
+
+static struct scrub_full_stripe_lock *
+search_scrub_lock(struct btrfs_block_group_cache *cache, u64 bytenr)


Same here.


What about search_full_stripe_lock() ?




+{
+   struct rb_node *node;
+   struct scrub_full_stripe_lock *entry;
+
+   node = cache->scrub_lock_root.rb_node;
+   while (node) {
+   entry = rb_entry(node, struct scrub_full_stripe_lock, node);
+   if (bytenr < entry->logical)
+   node = node->rb_left;
+   else if (bytenr > entry->logical)
+