[PATCH] btrfs-progs: restore: fix off-by-one len check

2015-10-08 Thread Vincent Stehlé
Fix a check of len versus PATH_MAX in function copy_symlink(), to
account for the terminating null byte.
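
For illustration only (a standalone example with made-up names, not the
btrfs-progs code itself): a buffer of PATH_MAX bytes can hold at most
PATH_MAX - 1 characters of target plus the terminating NUL, so a target
length equal to PATH_MAX already has to be rejected, which is what the
'>=' comparison does.

#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy a symlink target into a PATH_MAX buffer. */
static int store_target(char *buf, const char *target, size_t len)
{
        if (len >= PATH_MAX)    /* '>' alone would let len == PATH_MAX through */
                return -1;
        memcpy(buf, target, len);
        buf[len] = '\0';        /* needs the one extra byte the check preserves */
        return 0;
}

int main(void)
{
        char buf[PATH_MAX];
        const char *target = "/some/link/target";

        if (store_target(buf, target, strlen(target)) == 0)
                printf("stored: %s\n", buf);
        return 0;
}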

Resolves-Coverity-CID: 1296749
Signed-off-by: Vincent Stehlé 
---
 cmds-restore.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/cmds-restore.c b/cmds-restore.c
index 8fc8b2a..a1445d4 100644
--- a/cmds-restore.c
+++ b/cmds-restore.c
@@ -863,7 +863,7 @@ static int copy_symlink(struct btrfs_root *root, struct btrfs_key *key,
 
len = btrfs_file_extent_inline_item_len(leaf,
btrfs_item_nr(path->slots[0]));
-   if (len > PATH_MAX) {
+   if (len >= PATH_MAX) {
fprintf(stderr, "Symlink '%s' target length %d is longer than PATH_MAX\n",
fs_name, len);
ret = -1;
-- 
2.5.3



Re: [PATCH] btrfs: test unmount during quota rescan

2015-10-08 Thread Filipe Manana
On Wed, Oct 7, 2015 at 4:08 PM, Justin Maggard  wrote:
> This test case tests if we are able to unmount a filesystem while
> a quota rescan is running.  Up to now (4.3-rc4) this would result
> in a kernel NULL pointer dereference.

Please mention here the title of the patch that fixes this problem
("btrfs: qgroup: exit the rescan worker during umount").

>
> Signed-off-by: Justin Maggard 
> ---
>  tests/btrfs/104 | 85 
> +
>  tests/btrfs/104.out |  1 +
>  tests/btrfs/group   |  1 +
>  3 files changed, 87 insertions(+)
>  create mode 100755 tests/btrfs/104
>  create mode 100644 tests/btrfs/104.out
>
> diff --git a/tests/btrfs/104 b/tests/btrfs/104
> new file mode 100755
> index 000..7c51298
> --- /dev/null
> +++ b/tests/btrfs/104
> @@ -0,0 +1,85 @@
> +#! /bin/bash
> +# FS QA Test No. btrfs/200

104 (should match the file name)

> +#
> +# btrfs quota scan/unmount sanity test
> +#
> +#
> +#---
> +# Copyright (c) 2015 NETGEAR, Inc.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +
> +status=1   # failure is the default!
> +
> +_cleanup()
> +{
> +   cd /
> +   rm -f $tmp.*
> +   $UMOUNT_PROG $loop_mnt >/dev/null 2>&1
> +   _destroy_loop_device $loop_dev1
> +   rm -rf $loop_mnt
> +   rm -f $fs_img1
> +}
> +
> +trap "_cleanup ; exit \$status" 0 1 2 3 15
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# real QA test starts here
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_test
> +
> +rm -f $seqres.full
> +
> +loop_mnt=$TEST_DIR/$seq.$$.mnt
> +fs_img1=$TEST_DIR/$seq.$$.img1
> +mkdir $loop_mnt
> +$XFS_IO_PROG -f -c "truncate 1G" $fs_img1 >>$seqres.full 2>&1

Don't redirect stdout/stderr here. If the truncate fails, xfs_io
prints something to stderr, resulting in a test failure due to
mismatch with the golden output.

> +
> +loop_dev1=`_create_loop_device $fs_img1`
> +
> +_mkfs_dev $loop_dev1 >>$seqres.full 2>&1
> +_mount $loop_dev1 $loop_mnt

Any reason to use a loop device and not the scratch device as most
other tests do?

> +for i in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p
> +do
> +  for j in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p

Coding style in xfstests is:

for ... ; do
done

Also use 8 spaces tabs for indentation.

> +  do
> +for k in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p
> +do
> +  for l in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p
> +  do
> +   touch $loop_mnt/${i}${j}${k}${l}
> +  done
> +done
> +  done
> +done >>$seqres.full 2>&1
> +echo 3 > /proc/sys/vm/drop_caches
> +$BTRFS_UTIL_PROG quota enable $loop_mnt
> +$UMOUNT_PROG $loop_mnt
> +
> +status=0
> +exit
> diff --git a/tests/btrfs/104.out b/tests/btrfs/104.out
> new file mode 100644
> index 000..1ed84bc
> --- /dev/null
> +++ b/tests/btrfs/104.out
> @@ -0,0 +1 @@
> +QA output created by 104
> diff --git a/tests/btrfs/group b/tests/btrfs/group
> index e92a65a..6218adf 100644
> --- a/tests/btrfs/group
> +++ b/tests/btrfs/group
> @@ -106,3 +106,4 @@
>  101 auto quick replace
>  102 auto quick metadata enospc
>  103 auto quick clone compress
> +104 auto qgroup
> --
> 2.6.1
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: BTRFS RAID1 behavior after one drive's temporary disconnection

2015-10-08 Thread Pavel Pisa
Hello everybody,

On Monday 05 of October 2015 22:26:46 Pavel Pisa wrote:
> Hello everybody,
...
> BTRFS has recognized the appearance of its partition (even though it
> changed from sdb5 to sde5 when the disk was "hotplugged" again).
> But it seems that RAID1 components are not in sync and BTRFS
> continues to report
>
> BTRFS: lost page write due to I/O error on /dev/sde5
> BTRFS: bdev /dev/sde5 errs: wr 11021805, rd 8526080, flush 29099, corrupt
> 0, gen
>
> I have tried to find the best way to resync RAID1 BTRFS partitions.
> But the problem is that the filesystem is the system's root filesystem.
> So a reboot to some rescue media is required to run btrfsck --repair,
> which is intended for unmounted devices.
>
> What is the behavior of BTRFS in this situation?
> Is BTRFS able to use data from the out-of-date partition in cases
> where the data in the respective files has not been modified?
> The main reason for the question is whether such (stable) data can be
> backed up from the out-of-sync partition in case some random block is
> worn out on the other device. Or is this situation equivalent to
> running with only one disk?
>
> Is there some parameter/solution to run some command
> (scrub, balance) which makes the devices be in sync again
> without an unmount or reboot?
>
> I believe that attaching one more drive and running "btrfs replace"
> would solve the described situation. But is there some equivalent to
> run the operation "in place"?

It seems that the SATA controller is not able to activate a link which
was not connected at BIOS POST time. This means that I cannot add a new
drive without a reboot.

Before the reboot, the server was flooded with messages

BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt 0, 
gen 0
BTRFS: lost page write due to I/O error on /dev/sde5
BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt 0, 
gen 0
BTRFS: lost page write due to I/O error on /dev/sde5

which changed to the following messages after the reboot:

Btrfs loaded
BTRFS: device label riki-pool devid 1 transid 282383 /dev/sda3
BTRFS: device label riki-pool devid 2 transid 249562 /dev/sdb5
BTRFS info (device sda3): disk space caching is enabled
BTRFS (device sda3): parent transid verify failed on 44623216640 wanted 263476 
found 212766
BTRFS (device sda3): parent transid verify failed on 45201899520 wanted 282383 
found 246891
BTRFS (device sda3): parent transid verify failed on 45202571264 wanted 282383 
found 246890
BTRFS (device sda3): parent transid verify failed on 45201965056 wanted 282383 
found 246889
BTRFS (device sda3): parent transid verify failed on 45202505728 wanted 282383 
found 246890
BTRFS (device sda3): parent transid verify failed on 45202866176 wanted 282383 
found 246890
BTRFS (device sda3): parent transid verify failed on 45207126016 wanted 282383 
found 246894
BTRFS (device sda3): parent transid verify failed on 45202522112 wanted 282383 
found 246890
BTRFS: bdev /dev/disk/by-uuid/1627e557-d063-40b6-9450-3694dd1fd1ba errs: wr 
11723314, rd 8526080, flush 2
BTRFS (device sda3): parent transid verify failed on 45206945792 wanted 282383 
found 67960
BTRFS (device sda3): parent transid verify failed on 45204471808 wanted 282382 
found 67960

which looks really frightening to me. The temporarily disconnected drive
has an old transid at the start (OK). But what do the rest of the lines
mean? If it means that files with an older transaction ID are used from
the temporarily disconnected drive (now /dev/sdb5) and newer versions
from /dev/sda3 are ignored and reported as invalid, then this means
severe data loss, and there may be a mismatch because all transactions
after the disk disconnect are lost (i.e. the FS root has been taken from
the misbehaving drive at an old version).

BTRFS does not even fall back to read-only/degraded mode after the
system restart.

On the other hand, from the logs (all stored on the possibly damaged root
FS) it seems that there are no missing messages from the days when the
disks were out of sync, so it looks like all the data are OK. So should I
expect that BTRFS managed the problems well and all data are consistent?

I am going to use "btrfs replace" because there has not been any reply to
my in-place correction question. But I expect that clarification of
whether/how to resync RAID1 after the temporary disappearance of one
drive is really important to many BTRFS users.

I am now at a place where all my Internet connectivity goes through the
endangered server/router/containers machine, so I hope not to lose the
connection.

Thanks for BTRFS work,

Pavel






Re: Using BtrFS and backup tools for keeping two systems in sync

2015-10-08 Thread Hugo Mills
On Thu, Oct 08, 2015 at 08:05:09AM +0530, Shriramana Sharma wrote:
> Hello. I see there are some backup tools taking advantage of BtrFS's
> incremental send/receive feature:
> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup. [BTW Ames
> Cornish's ButterSink (https://github.com/AmesCornish/buttersink) seems
> to be missing from that page.]
> 
> Now I'd like to know if anyone has evolved some good practices w.r.t
> maintaining the data of two systems in sync using this feature of
> BtrFS. What I have in mind is: I work on my desktop by default, and
> for ergonomics reasons only use my laptop when I need the mobility.
> I'd like to keep the main data (documents I create, programs I write
> etc) in sync between the two. (The profile data such as in the ~/.*
> hidden folders had better stay separate though, I guess.)
> 
> I figure with the existing tools it would not be too difficult to
> maintain a synced set of snapshots between the two systems if I only
> use the desktop vs laptop alternatingly and sync at each switchover,
> but the potential problem only would come if I modify both (something
> like having to do git merge, I guess).
> 
> Has anyone come across this situation and evolved any policies to handle it?

   You can't currently do this efficiently with send/receive. It
should be possible, but it needs a change to the send stream format.

   Hugo.

-- 
Hugo Mills | UNIX: British manufacturer of modular shelving units
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH v2] btrfs: qgroup: exit the rescan worker during umount

2015-10-08 Thread Filipe Manana
On Thu, Sep 3, 2015 at 2:05 AM, Justin Maggard  wrote:
> v2: Fix stupid error while making formatting changes...
>
> I was hitting a consistent NULL pointer dereference during shutdown that
> showed the trace running through end_workqueue_bio().  I traced it back to
> the endio_meta_workers workqueue being poked after it had already been
> destroyed.
>
> Eventually I found that the root cause was a qgroup rescan that was still
> in progress while we were stopping all the btrfs workers.
>
> Currently we explicitly pause balance and scrub operations in
> close_ctree(), but we do nothing to stop the qgroup rescan.  We should
> probably be doing the same for qgroup rescan, but that's a much larger
> change.  This small change is good enough to allow me to unmount without
> crashing.
>
> Signed-off-by: Justin Maggard 
> ---
>  fs/btrfs/qgroup.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
> index d904ee1..5bfcee9 100644
> --- a/fs/btrfs/qgroup.c
> +++ b/fs/btrfs/qgroup.c
> @@ -2278,7 +2278,7 @@ static void btrfs_qgroup_rescan_worker(struct 
> btrfs_work *work)
> goto out;
>
> err = 0;
> -   while (!err) {
> +   while (!err && !btrfs_fs_closing(fs_info)) {
> trans = btrfs_start_transaction(fs_info->fs_root, 0);
> if (IS_ERR(trans)) {
> err = PTR_ERR(trans);
> @@ -2301,7 +2301,8 @@ out:
> btrfs_free_path(path);
>
> mutex_lock(&fs_info->qgroup_rescan_lock);
> -   fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
> +   if (!btrfs_fs_closing(fs_info))
> +   fs_info->qgroup_flags &= ~BTRFS_QGROUP_STATUS_FLAG_RESCAN;
>
> if (err > 0 &&
> fs_info->qgroup_flags & BTRFS_QGROUP_STATUS_FLAG_INCONSISTENT) {
> @@ -2330,7 +2331,9 @@ out:
> }
> btrfs_end_transaction(trans, fs_info->quota_root);
>
> -   if (err >= 0) {
> +   if (btrfs_fs_closing(fs_info)) {
> +   btrfs_info(fs_info, "qgroup scan paused");
> +   } else if (err >= 0) {
> btrfs_info(fs_info, "qgroup scan completed%s",
> err > 0 ? " (inconsistency flag cleared)" : "");
> } else {

Justin, this is still racy (though much less racy than before).

Once we leave the loop because of the btrfs_fs_closing(fs_info)
condition, we start a transaction and do some write operation on the
quota btree. While or before we do such a write operation, close_ctree()
might have completed, or be at a point where such a write operation will
result in another null pointer dereference, access some dangling
pointer, or leak a transaction that never gets committed (because
close_ctree() already stopped the transaction kthread), etc, etc.

So in addition to what you did, you need to call
btrfs_qgroup_wait_for_completion(fs_info) at disk-io.c:close_ctree()
right after setting fs_info->closing to 1.

Otherwise it looks good.
Thanks.
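
For illustration, a rough sketch of where that call would sit, assuming
the 4.3-era close_ctree() layout (an illustrative kernel-context
fragment, not the actual patch):

/*
 * Sketch only: flag the filesystem as closing first, then wait for the
 * rescan worker, so it cannot start a new transaction against structures
 * that close_ctree() is about to tear down.
 */
void close_ctree(struct btrfs_root *root)
{
        struct btrfs_fs_info *fs_info = root->fs_info;

        fs_info->closing = 1;
        smp_mb();

        /* suggested addition: let a running qgroup rescan worker observe
         * btrfs_fs_closing() and exit before the workqueues are destroyed */
        btrfs_qgroup_wait_for_completion(fs_info);

        /* ... existing teardown: pause balance/scrub, stop the transaction
         * kthread, destroy the end_io workqueues, and so on ... */
}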


> --
> 2.5.1
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


[PATCH] btrfs: cleanup iterating over prop_handlers array

2015-10-08 Thread Byongho Lee
This patch eliminates the last item of the prop_handlers array, which was
only used to detect the end of the array, and instead uses the ARRAY_SIZE
macro. Though this is a very tiny optimization, using the ARRAY_SIZE
macro is good practice for iterating over an array.
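
As a plain userspace illustration of the idiom (the handler table and its
entries below are made up; the kernel's ARRAY_SIZE in <linux/kernel.h>
additionally adds a compile-time check that its argument really is an
array):

#include <stdio.h>

#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0]))

struct handler {
        const char *name;
};

/* No NULL sentinel entry needed: the loop bound follows the array size. */
static const struct handler handlers[] = {
        { .name = "compression" },
        { .name = "example" },          /* hypothetical second entry */
};

int main(void)
{
        for (size_t i = 0; i < ARRAY_SIZE(handlers); i++)
                printf("%s\n", handlers[i].name);
        return 0;
}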

Signed-off-by: Byongho Lee 
---
 fs/btrfs/props.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index dca137b04095..f9e60231f685 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -49,18 +49,16 @@ static struct prop_handler prop_handlers[] = {
.extract = prop_compression_extract,
.inheritable = 1
},
-   {
-   .xattr_name = NULL
-   }
 };
 
 void __init btrfs_props_init(void)
 {
-   struct prop_handler *p;
+   int i;
 
hash_init(prop_handlers_ht);
 
-   for (p = &prop_handlers[0]; p->xattr_name; p++) {
+   for (i = 0; i < ARRAY_SIZE(prop_handlers); i++) {
+   struct prop_handler *p = &prop_handlers[i];
u64 h = btrfs_name_hash(p->xattr_name, strlen(p->xattr_name));
 
hash_add(prop_handlers_ht, &p->node, h);
@@ -301,15 +299,16 @@ static int inherit_props(struct btrfs_trans_handle *trans,
 struct inode *inode,
 struct inode *parent)
 {
-   const struct prop_handler *h;
struct btrfs_root *root = BTRFS_I(inode)->root;
int ret;
+   int i;
 
if (!test_bit(BTRFS_INODE_HAS_PROPS,
  &BTRFS_I(parent)->runtime_flags))
return 0;
 
-   for (h = &prop_handlers[0]; h->xattr_name; h++) {
+   for (i = 0; i < ARRAY_SIZE(prop_handlers); i++) {
+   const struct prop_handler *h = &prop_handlers[i];
const char *value;
u64 num_bytes;
 
-- 
2.6.0



Re: BTRFS RAID1 behavior after one drive's temporary disconnection

2015-10-08 Thread Austin S Hemmelgarn

On 2015-10-08 04:28, Pavel Pisa wrote:

Hello everybody,

On Monday 05 of October 2015 22:26:46 Pavel Pisa wrote:

Hello everybody,

...

BTRFS has recognized appearance of its partition (even that hanged
from sdb5 to sde5 when disk "hotplugged" again).
But it seems that RAID1 components are not in sync and BTRFS
continues to report

BTRFS: lost page write due to I/O error on /dev/sde5
BTRFS: bdev /dev/sde5 errs: wr 11021805, rd 8526080, flush 29099, corrupt
0, gen

I have tried to find the best way to resync RAID1 BTRFS partitions.
But problem is that filesystem is the root one of the system.
So reboot to some rescue media is required to run btrfsck --repair
which is intended for unmounted devices.

What is behavior of BTRFS in this situation?
Is BTRFS able to use data from not up to date partition in these
cases where data in respective files have not been modified?
The main reason for question is if such (stable) data can be backuped
by out of sync partition in the case of some random block is wear
out on another device. Or is this situation equivalent to running
with only one disk?

Are there some parameters/solution to run some command
(scrub balance) which makes devices to be in the sync again
without unmount or reboot?

I believe than attaching one more drive and running "btrfs replace"
would solve described situation. But is there some equivalent to
run operation "inplace".


It seems that SATA controller is not able to activate link which
has not been connected at BIOS POST time. This means that I cannot add new drive
without reboot.
Check your BIOS options, there should be some option to set SATA ports 
as either 'Hot-Plug' or 'External', which should allow you to hot-plug 
drives without needing a reboot (unless it's a Dell system, they have 
never properly implemented the SATA standard on their desktops).


Before reboot, the server bleeds with messages

BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt 0, 
gen 0
BTRFS: lost page write due to I/O error on /dev/sde5
BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt 0, 
gen 0
BTRFS: lost page write due to I/O error on /dev/sde5
Even aside from the below mentioned issues, if your disk is showing that 
many errors, you should probably run a SMART self-test routine on it to 
determine whether this is just a transient issue or an indication of an 
impending disk failure.  The commands I'd suggest are:

smartctl -t short /dev/sde
That will tell you how long to wait for the test to complete; after
waiting that long, run:

smartctl -H /dev/sde
If that says the health check failed, replace the disk as soon as 
possible, and don't use it for storing any data you can't afford to lose.


that changed to next mesages after reboot

Btrfs loaded
BTRFS: device label riki-pool devid 1 transid 282383 /dev/sda3
BTRFS: device label riki-pool devid 2 transid 249562 /dev/sdb5
BTRFS info (device sda3): disk space caching is enabled
BTRFS (device sda3): parent transid verify failed on 44623216640 wanted 263476 
found 212766
BTRFS (device sda3): parent transid verify failed on 45201899520 wanted 282383 
found 246891
BTRFS (device sda3): parent transid verify failed on 45202571264 wanted 282383 
found 246890
BTRFS (device sda3): parent transid verify failed on 45201965056 wanted 282383 
found 246889
BTRFS (device sda3): parent transid verify failed on 45202505728 wanted 282383 
found 246890
BTRFS (device sda3): parent transid verify failed on 45202866176 wanted 282383 
found 246890
BTRFS (device sda3): parent transid verify failed on 45207126016 wanted 282383 
found 246894
BTRFS (device sda3): parent transid verify failed on 45202522112 wanted 282383 
found 246890
BTRFS: bdev /dev/disk/by-uuid/1627e557-d063-40b6-9450-3694dd1fd1ba errs: wr 
11723314, rd 8526080, flush 2
BTRFS (device sda3): parent transid verify failed on 45206945792 wanted 282383 
found 67960
BTRFS (device sda3): parent transid verify failed on 45204471808 wanted 282382 
found 67960

which looks really frightening to me. The temporarily disconnected drive
has an old transid at the start (OK). But what do the rest of the lines
mean? If it means that files with an older transaction ID are used from
the temporarily disconnected drive (now /dev/sdb5) and newer versions
from /dev/sda3 are ignored and reported as invalid, then this means
severe data loss, and there may be a mismatch because all transactions
after the disk disconnect are lost (i.e. the FS root has been taken from
the misbehaving drive at an old version).

BTRFS does not even fall back to read-only/degraded mode after the
system restart.

This actually surprises me.


On the other hand, from the logs (all stored on the possibly damaged root
FS) it seems that there are no missing messages from the days when the
disks were out of sync, so it looks like all the data are OK. So should I
expect that BTRFS managed the problems well and all data are consistent?
I would be very careful in that situation, you may still have issues; at
the very least, make a backup of the system as soon as possible.

[PATCH v4 1/3] btrfs: Fix lost-data-profile caused by auto removing bg

2015-10-08 Thread Zhao Lei
Reproduce:
 (In integration-4.3 branch)

 TEST_DEV=(/dev/vdg /dev/vdh)
 TEST_DIR=/mnt/tmp

 umount "$TEST_DEV" >/dev/null
 mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 umount "$TEST_DEV"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 btrfs filesystem usage $TEST_DIR

We can see the data chunk changed from raid1 to single:
 # btrfs filesystem usage $TEST_DIR
 Data,single: Size:8.00MiB, Used:0.00B
/dev/vdg8.00MiB
 #

Reason:
 When an empty filesystem is mounted with -o nospace_cache, the last
 data block group will be auto-removed on umount.

 Then if we mount it again, there is no data chunk in the
 filesystem, so the only available data profile is 0x0, and as a
 result all new chunks are created as the single type.

Fix:
 Don't auto-delete the last block group of a raid type.

Test:
 Test by above script, and confirmed the logic by debug output.

Changelog v3->v4:
1: Avoid down_read() in spin_lock context.
   Noticed-by: Chris Mason 

Changelog v2->v3:
1: Use list_is_singular() instead of
   block_group->list.next == block_group->list.prev
   Suggested-by: Jeff Mahoney 

Changelog v1->v2:
1: Put code of checking block_group->list into
   semaphore of space_info->groups_sem.
   Noticed-by: Filipe Manana 

Reviewed-by: Filipe Manana 
Signed-off-by: Zhao Lei 
---
 fs/btrfs/extent-tree.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 79a5bd9..00c621b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10010,8 +10010,10 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
block_group = list_first_entry(&fs_info->unused_bgs,
   struct btrfs_block_group_cache,
   bg_list);
-   space_info = block_group->space_info;
list_del_init(&block_group->bg_list);
+
+   space_info = block_group->space_info;
+
if (ret || btrfs_mixed_space_info(space_info)) {
btrfs_put_block_group(block_group);
continue;
@@ -10025,7 +10027,8 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
spin_lock(&block_group->lock);
if (block_group->reserved ||
btrfs_block_group_used(&block_group->item) ||
-   block_group->ro) {
+   block_group->ro ||
+   list_is_singular(&block_group->list)) {
/*
 * We want to bail if we made new allocations or have
 * outstanding allocations in this block group.  We do
-- 
1.8.5.1



[PATCH v4 2/3] btrfs: Fix lost-data-profile caused by balance bg

2015-10-08 Thread Zhao Lei
Reproduce:
 (In integration-4.3 branch)

 TEST_DEV=(/dev/vdg /dev/vdh)
 TEST_DIR=/mnt/tmp

 umount "$TEST_DEV" >/dev/null
 mkfs.btrfs -f -d raid1 "${TEST_DEV[@]}"

 mount -o nospace_cache "$TEST_DEV" "$TEST_DIR"
 btrfs balance start -dusage=0 $TEST_DIR
 btrfs filesystem usage $TEST_DIR

 dd if=/dev/zero of="$TEST_DIR"/file count=100
 btrfs filesystem usage $TEST_DIR

Result:
 We can see "no data chunk" in first "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
...
 Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg  122.88MiB
/dev/vdh  122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
/dev/vdg4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg8.00MiB
/dev/vdh8.00MiB
 Unallocated:
/dev/vdg1.06GiB
/dev/vdh1.07GiB

 And "data chunks changed from raid1 to single" in second
 "btrfs filesystem usage":
 # btrfs filesystem usage $TEST_DIR
 Overall:
...
 Data,single: Size:256.00MiB, Used:0.00B
/dev/vdh  256.00MiB
 Metadata,single: Size:8.00MiB, Used:0.00B
/dev/vdg8.00MiB
 Metadata,RAID1: Size:122.88MiB, Used:112.00KiB
/dev/vdg  122.88MiB
/dev/vdh  122.88MiB
 System,single: Size:4.00MiB, Used:0.00B
/dev/vdg4.00MiB
 System,RAID1: Size:8.00MiB, Used:16.00KiB
/dev/vdg8.00MiB
/dev/vdh8.00MiB
 Unallocated:
/dev/vdg1.06GiB
/dev/vdh  841.92MiB

Reason:
 btrfs balance deletes the last data chunk when there is no data in
 the filesystem, and then we see "no data chunk" in the "fi usage"
 command output.

 And when we then write to the fs, the only available data
 profile is 0x0, so all new chunks are allocated as the single type.

Fix:
 Allocate a data chunk explicitly to ensure we don't lose the
 raid profile for data.

Test:
 Test by above script, and confirmed the logic by debug output.

Changelog v1->v2:
1: Update patch description of "Fix" field
2: Use BTRFS_BLOCK_GROUP_DATA for btrfs_force_chunk_alloc instead
   of 1
3: Only reserve chunk when doing balance on data chunk.
All suggested-by: Filipe Manana 

Reviewed-by: Filipe Manana 
Signed-off-by: Zhao Lei 
---
 fs/btrfs/volumes.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 6fc73586..cd9e5bd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3277,6 +3277,7 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
u64 limit_data = bctl->data.limit;
u64 limit_meta = bctl->meta.limit;
u64 limit_sys = bctl->sys.limit;
+   int chunk_reserved = 0;
 
/* step one make some room on all the devices */
devices = &fs_info->fs_devices->devices;
@@ -3326,6 +3327,8 @@ again:
key.type = BTRFS_CHUNK_ITEM_KEY;
 
while (1) {
+   u64 chunk_type;
+
if ((!counting && atomic_read(&fs_info->balance_pause_req)) ||
atomic_read(&fs_info->balance_cancel_req)) {
ret = -ECANCELED;
@@ -3371,8 +3374,10 @@ again:
spin_unlock(&fs_info->balance_lock);
}
 
+   chunk_type = btrfs_chunk_type(leaf, chunk);
ret = should_balance_chunk(chunk_root, leaf, chunk,
   found_key.offset);
+
btrfs_release_path(path);
if (!ret) {
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
@@ -3387,6 +3392,25 @@ again:
goto loop;
}
 
+   if ((chunk_type & BTRFS_BLOCK_GROUP_DATA) && !chunk_reserved) {
+   trans = btrfs_start_transaction(chunk_root, 0);
+   if (IS_ERR(trans)) {
+   mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+   ret = PTR_ERR(trans);
+   goto error;
+   }
+
+   ret = btrfs_force_chunk_alloc(trans, chunk_root,
+ BTRFS_BLOCK_GROUP_DATA);
+   if (ret < 0) {
+   mutex_unlock(&fs_info->delete_unused_bgs_mutex);
+   goto error;
+   }
+
+   btrfs_end_transaction(trans, chunk_root);
+   chunk_reserved = 1;
+   }
+
ret = btrfs_relocate_chunk(chunk_root,
   found_key.offset);
mutex_unlock(&fs_info->delete_unused_bgs_mutex);
-- 
1.8.5.1



[PATCH v4 3/3] btrfs: Use fs_info directly in btrfs_delete_unused_bgs

2015-10-08 Thread Zhao Lei
No need to use root->fs_info in btrfs_delete_unused_bgs(),
use fs_info directly instead.

Signed-off-by: Zhao Lei 
---
 fs/btrfs/extent-tree.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 00c621b..c93a77a 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -10020,7 +10020,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
}
spin_unlock(&fs_info->unused_bgs_lock);
 
-   mutex_lock(&root->fs_info->delete_unused_bgs_mutex);
+   mutex_lock(&fs_info->delete_unused_bgs_mutex);
 
/* Don't want to race with allocators so take the groups_sem */
down_write(&space_info->groups_sem);
@@ -10144,7 +10144,7 @@ void btrfs_delete_unused_bgs(struct btrfs_fs_info 
*fs_info)
 end_trans:
btrfs_end_transaction(trans, root);
next:
-   mutex_unlock(&root->fs_info->delete_unused_bgs_mutex);
+   mutex_unlock(&fs_info->delete_unused_bgs_mutex);
btrfs_put_block_group(block_group);
spin_lock(&fs_info->unused_bgs_lock);
}
-- 
1.8.5.1



Re: Using BtrFS and backup tools for keeping two systems in sync

2015-10-08 Thread Austin S Hemmelgarn

On 2015-10-07 22:35, Shriramana Sharma wrote:

Hello. I see there are some backup tools taking advantage of BtrFS's
incremental send/receive feature:
https://btrfs.wiki.kernel.org/index.php/Incremental_Backup. [BTW Ames
Cornish's ButterSink (https://github.com/AmesCornish/buttersink) seems
to be missing from that page.]

Now I'd like to know if anyone has evolved some good practices w.r.t
maintaining the data of two systems in sync using this feature of
BtrFS. What I have in mind is: I work on my desktop by default, and
for ergonomics reasons only use my laptop when I need the mobility.
I'd like to keep the main data (documents I create, programs I write
etc) in sync between the two. (The profile data such as in the ~/.*
hidden folders had better stay separate though, I guess.)

I figure with the existing tools it would not be too difficult to
maintain a synced set of snapshots between the two systems if I only
use the desktop vs laptop alternatingly and sync at each switchover,
but the potential problem only would come if I modify both (something
like having to do git merge, I guess).

Has anyone come across this situation and evolved any policies to handle it?

Personally, while I use BTRFS on all of my systems, I usually use 
Dropbox for synchronizing data between them.  While the Linux client for 
it isn't perfect, it is significantly easier than something like a 
regularly scheduled rsync or btrfs send/receive.  It's also kind of nice 
because multiple clients bound to the same account will sync across the 
local network without needing to talk to the servers if they can avoid 
it.  Of course, it costs money if you want a decent amount of storage 
space, but it's pretty reasonable for the degree of reliability I've 
observed.






Re: [PATCH 03/12] generic/80[0-2]: support xfs in addition to btrfs

2015-10-08 Thread Ari Sundholm
On Wed, 2015-10-07 at 05:13 +, Darrick J. Wong wrote:
> Modify the reflink tests to support xfs.
> 
> Signed-off-by: Darrick J. Wong 
> ---
>  common/rc |   37 +
>  tests/generic/800 |2 +-
>  tests/generic/801 |2 +-
>  tests/generic/802 |2 +-
>  4 files changed, 40 insertions(+), 3 deletions(-)
> 
> 
> diff --git a/common/rc b/common/rc
> index 3e97060..7e2f140 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -1429,6 +1429,43 @@ _require_scratch_xfs_crc()
>   umount $SCRATCH_MNT
>  }
>  
> +# this test requires the test fs support reflink...
> +#
> +_require_test_reflink()
> +{
> +case $FSTYP in
> +xfs)
> + xfs_info "${TEST_DIR}" | grep reflink=1 -c -q || _notrun "Reflink not 
> supported by this filesystem type: $FSTYP"
> + ;;
> +btrfs)
> +true
> +;;
> +*)
> +_notrun "Reflink not supported by this filesystem type: $FSTYP"
> +;;
> +esac
> +}
> +
> +# this test requires the scratch fs support reflink...
> +#
> +_require_scratch_reflink()
> +{
> +case $FSTYP in
> +xfs)
> + _scratch_mkfs > /dev/null 2>&1
> + _scratch_mount
> + xfs_info "${TEST_DIR}" | grep reflink=1 -c -q || _notrun "$FSTYP does 
> not support reflink"

${SCRATCH_MNT}?

> + _scratch_unmount
> + ;;
> +btrfs)
> +true
> +;;
> +*)
> +_notrun "Reflink not supported by this filesystem type: $FSTYP"
> +;;
> +esac
> +}
> +
>  # this test requires the bigalloc feature to be available in mkfs.ext4
>  #
>  _require_ext4_mkfs_bigalloc()
> diff --git a/tests/generic/800 b/tests/generic/800
> index a71f11a..954f39d 100755
> --- a/tests/generic/800
> +++ b/tests/generic/800
> @@ -45,7 +45,7 @@ _cleanup()
>  . common/filter
>  
>  # real QA test starts here
> -_supported_fs btrfs
> +_require_test_reflink
>  _supported_os Linux
>  
>  _require_xfs_io_command "fiemap"
> diff --git a/tests/generic/801 b/tests/generic/801
> index b21c44b..aedb6e9 100755
> --- a/tests/generic/801
> +++ b/tests/generic/801
> @@ -45,7 +45,7 @@ _cleanup()
>  . common/filter
>  
>  # real QA test starts here
> -_supported_fs btrfs
> +_require_test_reflink
>  _supported_os Linux
>  
>  _require_xfs_io_command "fiemap"
> diff --git a/tests/generic/802 b/tests/generic/802
> index afd8513..51d3414 100755
> --- a/tests/generic/802
> +++ b/tests/generic/802
> @@ -43,7 +43,7 @@ _cleanup()
>  . ./common/filter
>  
>  # real QA test starts here
> -_supported_fs btrfs
> +_require_test_reflink
>  _supported_os Linux
>  
>  _require_xfs_io_command "fiemap"
> 




Re: [PATCH] btrfs: cleanup iterating over prop_handlers array

2015-10-08 Thread David Sterba
On Thu, Oct 08, 2015 at 08:49:34PM +0900, Byongho Lee wrote:
> This patch eliminates the last item of prop_handlers array which is used
> to check end of array and instead uses ARRAY_SIZE macro.
> Though this is a very tiny optimization, using ARRAY_SIZE macro is a
> good practice to iterate array.
> 
> Signed-off-by: Byongho Lee 
Reviewed-by: David Sterba 


Re: [PATCH] btrfs-progs: restore: fix off-by-one len check

2015-10-08 Thread David Sterba
On Thu, Oct 08, 2015 at 10:47:09AM +0200, Vincent Stehlé wrote:
> Fix a check of len versus PATH_MAX in function copy_symlink(), to
> account for the terminating null byte.
> 
> Resolves-Coverity-CID: 1296749
> Signed-off-by: Vincent Stehlé 

Applied, thanks.


[PATCH] Btrfs-progs: Do not force mixed block group creation unless '-M' option is specified

2015-10-08 Thread Chandan Rajendra
When creating small Btrfs filesystem instances (i.e. filesystem size <= 1GiB),
mkfs.btrfs fails if both sectorsize and nodesize are specified on the command
line and sectorsize != nodesize, since mixed block groups involve both data
and metadata blocks sharing the same block group. This is incorrect behavior
when the '-M' option isn't specified on the command line.

This commit makes the creation of mixed block groups optional, i.e. mixed
block groups are created only when the -M option is specified on the command
line.

While at it, this commit fixes a bug which allowed the creation of filesystem
instances with the "mixed block group" feature enabled but with differing
block sizes for data and metadata. For example:

[root@localhost btrfs-progs]# ./mkfs.btrfs -f -M  -s 4096 -n 16384 /dev/loop6
SMALL VOLUME: forcing mixed metadata/data groups
btrfs-progs v3.19-rc2-404-g71f7308
See http://btrfs.wiki.kernel.org for more information.

Performing full device TRIM (4.00GiB) ...
Label:  (null)
UUID:   c82b5720-6d88-4fa1-ac05-d0d4cb797fd5
Node size:  16384
Sector size:4096
Filesystem size:4.00GiB
Block group profiles:
  Data+Metadata:single8.00MiB
  System:   single4.00MiB
SSD detected:   no
Incompat features:  mixed-bg, extref, skinny-metadata
Number of devices:  1
Devices:
  IDSIZE  PATH
   1 4.00GiB  /dev/loop6

Signed-off-by: Chandan Rajendra 
---
 cmds-device.c  |  3 +--
 cmds-replace.c |  3 +--
 mkfs.c | 45 -
 utils.c|  5 +
 utils.h|  2 +-
 5 files changed, 24 insertions(+), 34 deletions(-)

diff --git a/cmds-device.c b/cmds-device.c
index 5f2b952..1b601cf 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -92,7 +92,6 @@ static int cmd_device_add(int argc, char **argv)
struct btrfs_ioctl_vol_args ioctl_args;
int devfd, res;
u64 dev_block_count = 0;
-   int mixed = 0;
char *path;
 
res = test_dev_for_mkfs(argv[i], force);
@@ -109,7 +108,7 @@ static int cmd_device_add(int argc, char **argv)
}
 
res = btrfs_prepare_device(devfd, argv[i], 1, &dev_block_count,
-  0, &mixed, discard);
+  0, discard);
close(devfd);
if (res) {
ret++;
diff --git a/cmds-replace.c b/cmds-replace.c
index 9ab8438..e944457 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -140,7 +140,6 @@ static int cmd_replace_start(int argc, char **argv)
int force_using_targetdev = 0;
u64 dstdev_block_count;
int do_not_background = 0;
-   int mixed = 0;
DIR *dirstream = NULL;
u64 srcdev_size;
u64 dstdev_size;
@@ -281,7 +280,7 @@ static int cmd_replace_start(int argc, char **argv)
strncpy((char *)start_args.start.tgtdev_name, dstdev,
BTRFS_DEVICE_PATH_NAME_MAX);
ret = btrfs_prepare_device(fddstdev, dstdev, 1, &dstdev_block_count, 0,
-  &mixed, 0);
+   0);
if (ret)
goto leave_with_error;
 
diff --git a/mkfs.c b/mkfs.c
index a5802f7..10016b2 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -152,7 +152,7 @@ err:
 }
 
 static int make_root_dir(struct btrfs_trans_handle *trans, struct btrfs_root 
*root,
-   int mixed, struct mkfs_allocation *allocation)
+   struct mkfs_allocation *allocation)
 {
struct btrfs_key location;
int ret;
@@ -1440,8 +1440,6 @@ int main(int ac, char **av)
break;
case 'b':
block_count = parse_size(optarg);
-   if (block_count <= BTRFS_MKFS_SMALL_VOLUME_SIZE)
-   mixed = 1;
zero_end = 0;
break;
case 'V':
@@ -1491,7 +1489,7 @@ int main(int ac, char **av)
exit(1);
}
}
-   
+
while (dev_cnt-- > 0) {
file = av[optind++];
if (is_block_device(file) == 1)
@@ -1505,10 +1503,9 @@ int main(int ac, char **av)
file = av[optind++];
ssd = is_ssd(file);
 
-   if (is_vol_small(file) || mixed) {
+   if (mixed) {
if (verbose)
-   printf("SMALL VOLUME: forcing mixed metadata/data 
groups\n");
-   mixed = 1;
+   printf("Forcing mixed metadata/data groups\n");
}
 
/*
@@ -1544,6 +1541,19 @@ int main(int ac, char **av)
if (!nodesize_forced)
nodesize = best_nodesize;
}
+
+   /*
+* FS features that can be set by other means than -O
+* just set the bit here

Re: BTRFS RAID1 behavior after one drive's temporary disconnection

2015-10-08 Thread Pavel Pisa
Hello Austin,

thanks for reply.

On Thursday 08 of October 2015 13:47:33 Austin S Hemmelgarn wrote:
> On 2015-10-08 04:28, Pavel Pisa wrote:
> > Hello everybody,
...
> > It seems that SATA controller is not able to activate link which
> > has not been connected at BIOS POST time. This means that I cannot add
> > new drive without reboot.
>
> Check your BIOS options, there should be some option to set SATA ports
> as either 'Hot-Plug' or 'External', which should allow you to hot-plug
> drives without needing a reboot (unless it's a Dell system, they have
> never properly implemented the SATA standard on their desktops).
>
> > Before reboot, the server bleeds with messages
> >
> > BTRFS: bdev /dev/sde5 errs: wr 11715459, rd 8526080, flush 29099, corrupt
> > 0, gen 0 BTRFS: lost page write due to I/O error on /dev/sde5
> > BTRFS: bdev /dev/sde5 errs: wr 11715460, rd 8526080, flush 29099, corrupt
> > 0, gen 0 BTRFS: lost page write due to I/O error on /dev/sde5
>
> Even aside from the below mentioned issues, if your disk is showing that
> many errors, you should probably run a SMART self-test routine on it to
> determine whether this is just a transient issue or an indication of an
> impending disk failure.  The commands I'd suggest are:
> smartctl -t short /dev/sde

Yes, I have even run the long test, as reported in the first message.
No problem has been found. The cause was a sudden stop of the disk's
SATA communication after many months of uninterrupted
communication/service. When the connection was restored by
disconnecting/reconnecting the HDD power cable, the disk was
OK: no SMART problems, no problem reading/writing other
filesystems.

So it seems to be BTRFS's internal prevention of writes to that
portion of the FS (the whole block device?) on the temporarily
disconnected drive where the transids do not match. The situation
changed after a reboot (the only way to get a new mount), when BTRFS
somehow restored operation.

> That will tell you some time to wait for the test to complete, after
> waiting  that long, run:
> smartctl -H /dev/sde
> If that says the health check failed, replace the disk as soon as
> possible, and don't use it for storing any data you can't afford to lose.
>
> > that changed to next mesages after reboot
> >
> > Btrfs loaded
> > BTRFS: device label riki-pool devid 1 transid 282383 /dev/sda3
> > BTRFS: device label riki-pool devid 2 transid 249562 /dev/sdb5
> > BTRFS info (device sda3): disk space caching is enabled
> > BTRFS (device sda3): parent transid verify failed on 44623216640 wanted
> > 263476 found 212766 BTRFS (device sda3): parent transid verify failed on
> > 45201899520 wanted 282383 found 246891 BTRFS (device sda3): parent
> > transid verify failed on 45202571264 wanted 282383 found 246890 BTRFS
> > (device sda3): parent transid verify failed on 45201965056 wanted 282383
> > found 246889 BTRFS (device sda3): parent transid verify failed on
> > 45202505728 wanted 282383 found 246890 BTRFS (device sda3): parent
> > transid verify failed on 45202866176 wanted 282383 found 246890 BTRFS
> > (device sda3): parent transid verify failed on 45207126016 wanted 282383
> > found 246894 BTRFS (device sda3): parent transid verify failed on
> > 45202522112 wanted 282383 found 246890 BTRFS: bdev
> > /dev/disk/by-uuid/1627e557-d063-40b6-9450-3694dd1fd1ba errs: wr 11723314,
> > rd 8526080, flush 2 BTRFS (device sda3): parent transid verify failed on
> > 45206945792 wanted 282383 found 67960 BTRFS (device sda3): parent transid
> > verify failed on 45204471808 wanted 282382 found 67960
> >
> > which looks really frightening to me. The temporarily disconnected
> > drive has an old transid at the start (OK). But what do the rest of
> > the lines mean? If it means that files with an older transaction ID
> > are used from the temporarily disconnected drive (now /dev/sdb5) and
> > newer versions from /dev/sda3 are ignored and reported as invalid,
> > then this means severe data loss, and there may be a mismatch because
> > all transactions after the disk disconnect are lost (i.e. the FS root
> > has been taken from the misbehaving drive at an old version).
> >
> > BTRFS does not even fall back to read-only/degraded mode after the
> > system restart.
>
> This actually surprises me.

Both drives have been present the whole time, except that for about one
week one drive (in fact the corresponding SATA controller) reported a
permanent error for each access.

> > On the other hand, from logs (all stored on the possibly damaged root FS)
> > it seems that there there are not missing messages from days when discs
> > has been out of sync, so it looks like all data are OK. So should I
> > expect that BTRFS managed problems well and all data are consistent?
>
> I would be very careful in that situation, you may still have issues, at
> the very least, make a backup of the system as soon as possible.

I made a backup to an external drive before attempting to reconnect the
failed drive.

I have done a btrfs replace of the temporarily failed HDD to a newly
bought HDD. I have planned to replace the old drive (the one which did
not experience

Re: BTRFS RAID1 behavior after one drive's temporary disconnection

2015-10-08 Thread Pavel Pisa
Hello Hugo,

On Thursday 08 of October 2015 23:13:52 Hugo Mills wrote:
> On Thu, Oct 08, 2015 at 07:47:33AM -0400, Austin S Hemmelgarn wrote:
> > On 2015-10-08 04:28, Pavel Pisa wrote:
> > >I go to use "btrfs replace" because there has not been any reply to my
> > > inplace correction question. But I expect that clarification if
> > > possible/how to resync RAID1 after one drive temporal disappear is
> > > really important to many of BTRFS users.
> >
> > As of right now, there is no way that I know of to safely re-sync a
> > drive that's been disconnected for a while.  The best bet is
> > probably to use replace, but for that to work reliably, you would
> > need to tell it to ignore the now stale drive when trying to read
> > each chunk.
>
>Scrub is officially what you need there. I can confirm that it
> works correctly, having used it myself after accidentally unplugging
> the wrong drive.
>

Thanks for the reply.

I tried to run scrub after the reconnect, but it counted errors in its
console output and write errors were logged by the kernel like crazy.
I have to admit I did not wait for it to finish because I did not have
a good feeling about it.
It may be that this was the result of a not fully correct reconnect.
But another partition on the drive, using ext4, had no problems writing.

But if mount/unmount (in my case requiring a reboot) and then scrub
worked, it would be much simpler than a series of replaces.

I hope I will not need that (at least not soon / for years), but I
will give scrub another try.

Maybe the problem is the old version of the btrfs tools on the server --
the Wheezy Btrfs v3.17 backport. The kernel is Linux 4.1.2 #1 SMP PREEMPT.

Thanks,

 Pavel


Re: BTRFS RAID1 behavior after one drive's temporary disconnection

2015-10-08 Thread Hugo Mills
On Fri, Oct 09, 2015 at 12:16:43AM +0200, Pavel Pisa wrote:
> Hello Hugo,
> 
> On Thursday 08 of October 2015 23:13:52 Hugo Mills wrote:
> > On Thu, Oct 08, 2015 at 07:47:33AM -0400, Austin S Hemmelgarn wrote:
> > > On 2015-10-08 04:28, Pavel Pisa wrote:
> > > >I go to use "btrfs replace" because there has not been any reply to my
> > > > inplace correction question. But I expect that clarification if
> > > > possible/how to resync RAID1 after one drive temporal disappear is
> > > > really important to many of BTRFS users.
> > >
> > > As of right now, there is no way that I know of to safely re-sync a
> > > drive that's been disconnected for a while.  The best bet is
> > > probably to use replace, but for that to work reliably, you would
> > > need to tell it to ignore the now stale drive when trying to read
> > > each chunk.
> >
> >Scrub is officially what you need there. I can confirm that it
> > works correctly, having used it myself after accidentally unplugging
> > the wrong drive.
> >
> 
> Thanks for the reply.
> 
> I have tried to run scrub after reconnect but it counted errors in
> its console output and write errors has been logged by kernel as crazy.
> I have to admit I have not wait to finish it because I have not good
> feeling from it.

   If the scrub works OK, you will still get lots of scary-looking
errors in the logs, but they'll usually say it's repaired the problem.

   Getting write errors at this point indicates that you have hardware
problems of some kind, and (usually) that device needs to be replaced.
(Or the controller, or the cabling).

> May it be it was result of not fully correct reconnect.
> But other partition worked with ext4 has no problems to write.
> 
> But if mount/unmount (in my case requiring reboot) and then scrub
> worked it would be much simpler than replaces series.
> 
> I hope I would not need that (at least soon/in years) but I
> give try to scrub again.
> 
> May it be problem is my btrfs tools old version on the server --
> Wheezy Btrfs v3.17 backport. Kernel is Linux 4.1.2 #1 SMP PREEMPT.

   No, the version of the tools has no effect on any of this. It
really sounds like you have hardware issues.

   Hugo.

-- 
Hugo Mills | ©1973 Unclear Research Ltd
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: [PATCH] btrfs: test unmount during quota rescan

2015-10-08 Thread Dave Chinner
On Thu, Oct 08, 2015 at 10:39:48AM +0100, Filipe Manana wrote:
> On Wed, Oct 7, 2015 at 4:08 PM, Justin Maggard  wrote:
> > This test case tests if we are able to unmount a filesystem while
> > a quota rescan is running.  Up to now (4.3-rc4) this would result
> > in a kernel NULL pointer dereference.
> 
> Please mention here the title of the patch that fixes this problem
> ("btrfs: qgroup: exit the rescan worker during umount").
> 
> >
> > Signed-off-by: Justin Maggard 
> > ---
> >  tests/btrfs/104 | 85 
> > +
> >  tests/btrfs/104.out |  1 +
> >  tests/btrfs/group   |  1 +
> >  3 files changed, 87 insertions(+)
> >  create mode 100755 tests/btrfs/104
> >  create mode 100644 tests/btrfs/104.out
> >
> > diff --git a/tests/btrfs/104 b/tests/btrfs/104
> > new file mode 100755
> > index 000..7c51298
> > --- /dev/null
> > +++ b/tests/btrfs/104
> > @@ -0,0 +1,85 @@
> > +#! /bin/bash
> > +# FS QA Test No. btrfs/200
> 
> 104 (should match the file name)
> 
> > +#
> > +# btrfs quota scan/unmount sanity test
> > +#
> > +#
> > +#---
> > +# Copyright (c) 2015 NETGEAR, Inc.  All Rights Reserved.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#
> > +#---
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +
> > +status=1   # failure is the default!
> > +
> > +_cleanup()
> > +{
> > +   cd /
> > +   rm -f $tmp.*
> > +   $UMOUNT_PROG $loop_mnt >/dev/null 2>&1
> > +   _destroy_loop_device $loop_dev1
> > +   rm -rf $loop_mnt
> > +   rm -f $fs_img1
> > +}
> > +
> > +trap "_cleanup ; exit \$status" 0 1 2 3 15
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# real QA test starts here
> > +_supported_fs btrfs
> > +_supported_os Linux
> > +_require_test
> > +
> > +rm -f $seqres.full
> > +
> > +loop_mnt=$TEST_DIR/$seq.$$.mnt
> > +fs_img1=$TEST_DIR/$seq.$$.img1
> > +mkdir $loop_mnt
> > +$XFS_IO_PROG -f -c "truncate 1G" $fs_img1 >>$seqres.full 2>&1
> 
> Don't redirect stdout/stderr here. If the truncate fails, xfs_io
> prints something to stderr, resulting in a test failure due to
> mismatch with the golden output.
> 
> > +
> > +loop_dev1=`_create_loop_device $fs_img1`
> > +
> > +_mkfs_dev $loop_dev1 >>$seqres.full 2>&1
> > +_mount $loop_dev1 $loop_mnt
> 
> Any reason to use a loop device and not the scratch device as most
> other tests do?
> 
> > +for i in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p
> > +do
> > +  for j in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p
> 
> Coding style in xfstests is:
> 
> for ... ; do
> done
> 
> Also use 8 spaces tabs for indentation.
> 
> > +  do
> > +for k in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p
> > +do
> > +  for l in 0 1 2 3 4 5 6 7 8 9 a b c d e f g h i j k l m n o p
> > +  do
> > +   touch $loop_mnt/${i}${j}${k}${l}
> > +  done
> > +done
> > +  done
> > +done >>$seqres.full 2>&1

And, well, it's a strange way of creating 26**4 files.

Why not just:

for i in `seq 0 1 45`; do
echo -n > $SCRATCH_MNT/file.$i
done

Note that the use of 'echo -n' rather than touch means the test
does not need to fork a new process just to create each file, and
so runs much, much faster...

> > +echo 3 > /proc/sys/vm/drop_caches
> > +$BTRFS_UTIL_PROG quota enable $loop_mnt
> > +$UMOUNT_PROG $loop_mnt
> > +
> > +status=0

What are the failure criteria here?

i.e. shouldn't you at least do a quota report and check that it
reports the correct number of inodes in the quota group?

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: BTRFS RAID1 behavior after one drive's temporary disconnection

2015-10-08 Thread Hugo Mills
On Thu, Oct 08, 2015 at 07:47:33AM -0400, Austin S Hemmelgarn wrote:
> On 2015-10-08 04:28, Pavel Pisa wrote:
> >I go to use "btrfs replace" because there has not been any reply to my 
> >inplace correction
> >question. But I expect that clarification if possible/how to resync RAID1 
> >after one
> >drive temporal disappear is really important to many of BTRFS users.
> As of right now, there is no way that I know of to safely re-sync a
> drive that's been disconnected for a while.  The best bet is
> probably to use replace, but for that to work reliably, you would
> need to tell it to ignore the now stale drive when trying to read
> each chunk.

   Scrub is officially what you need there. I can confirm that it
works correctly, having used it myself after accidentally unplugging
the wrong drive.

   Hugo.

(Sorry for the delay, I wrote this earlier, but had trouble sending it)

> It is theoretically possible to wipe the FS signature on the out-of
> sync drive, run a device scan, then run 'replace missing' pointing
> at the now 'blank' device, although going that route is really
> risky.
> 



-- 
Hugo Mills | Gomez, darling, don't torture yourself.
hugo@... carfax.org.uk | That's my job.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |   Morticia Addams




Re: Using BtrFS and backup tools for keeping two systems in sync

2015-10-08 Thread Duncan
Hugo Mills posted on Thu, 08 Oct 2015 06:37:45 + as excerpted:

> On Thu, Oct 08, 2015 at 08:05:09AM +0530, Shriramana Sharma wrote:
>> Hello. I see there are some backup tools taking advantage of BtrFS's
>> incremental send/receive feature:
>> https://btrfs.wiki.kernel.org/index.php/Incremental_Backup. [BTW Ames
>> Cornish's ButterSink (https://github.com/AmesCornish/buttersink) seems
>> to be missing from that page.]
>> 
>> Now I'd like to know if anyone has evolved some good practices w.r.t
>> maintaining the data of two systems in sync using this feature of
>> BtrFS. What I have in mind is: I work on my desktop by default, and for
>> ergonomics reasons only use my laptop when I need the mobility.
>> I'd like to keep the main data (documents I create, programs I write
>> etc) in sync between the two. (The profile data such as in the ~/.*
>> hidden folders had better stay separate though, I guess.)
>> 
>> I figure with the existing tools it would not be too difficult to
>> maintain a synced set of snapshots between the two systems if I only
>> use the desktop vs laptop alternatingly and sync at each switchover,
>> but the potential problem only would come if I modify both (something
>> like having to do git merge, I guess).
>> 
>> Has anyone come across this situation and evolved any policies to
>> handle it?
> 
>You can't currently do this efficiently with send/receive. It
> should be possible, but it needs a change to the send stream format.

Elucidating somewhat...

AFAIK (as a list regular but not a dev or a user, personally, of the send/
receive functionality), currently, btrfs send/receive incremental works 
only one way.  That is, the send-stream format provides sufficient 
information for incremental sends after an original send, but there's no 
way to reverse the process and sync the other way, from the original 
receiver back to the sender.  A full send can be done, but then it's no 
longer linked to the original.

As Hugo says, the base functionality is available, but actually hooking 
it up to work will require a bump to the send stream format, as the 
required information simply isn't sent, ATM.  That send stream format 
bump is likely to eventually happen, but there's a very strong interest 
in keeping the number of formats that must be supported for backward 
compatibility to a minimum, and thus in a minimum number of format 
bumps.  So the devs want to delay the bump as long as possible, 
identifying anything else that might need to change in the mean time, and 
make, ideally, one final bump, including all changes discovered to be 
needed since the last one, and then no more.

So while this necessary change is known, it could be some time, yet, 
before it's actually done, with hopefully no further changes necessary or 
allowed after that.

Which back on the reoccurring theme of btrfs stability...

... is another point toward btrfs "definitely stabilizing now, but not 
yet fully stable and mature."  When the devs decide there are likely no 
further as-yet undiscovered necessary changes coming and finally do this 
bump, we'll know they really do consider btrfs to be settling down into 
stability.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 21/23] btrfs: fallocate: Add support to accurate qgroup reserve

2015-10-08 Thread Qu Wenruo
Now fallocate will do an accurate qgroup reserve space check, unlike the
old method, which always reserved the whole length of the range.

With this patch, fallocate will:
1) Iterate the desired range and mark it in the data rsv map
   Only ranges which are going to be allocated will be recorded in the
   data rsv map and have their space reserved.
   Already allocated ranges (normal/prealloc extents) will be skipped.
   Also, record the marked ranges into a new list for later use.

2) If 1) succeeded, do the real file extent allocation.
   At file extent allocation time, the corresponding range will be
   removed from the data rsv map.

Signed-off-by: Qu Wenruo 
---
v2:
  Fix comment typo
  Add missing cleanup for falloc list
---
 fs/btrfs/file.c | 159 +---
 1 file changed, 116 insertions(+), 43 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index c97b24f..d638d34 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2545,17 +2545,61 @@ out_only_mutex:
return err;
 }
 
+/* Helper structure to record which range is already reserved */
+struct falloc_range {
+   struct list_head list;
+   u64 start;
+   u64 len;
+};
+
+/*
+ * Helper function to add falloc range
+ *
+ * Caller should have locked the larger range of extent containing
+ * [start, len)
+ */
+static int add_falloc_range(struct list_head *head, u64 start, u64 len)
+{
+   struct falloc_range *prev = NULL;
+   struct falloc_range *range = NULL;
+
+   if (list_empty(head))
+   goto insert;
+
+   /*
+* As fallocate iterates in bytenr order, we only need to check
+* the last range.
+*/
+   prev = list_entry(head->prev, struct falloc_range, list);
+   if (prev->start + prev->len == start) {
+   prev->len += len;
+   return 0;
+   }
+insert:
+   range = kmalloc(sizeof(*range), GFP_NOFS);
+   if (!range)
+   return -ENOMEM;
+   range->start = start;
+   range->len = len;
+   list_add_tail(&range->list, head);
+   return 0;
+}
+
 static long btrfs_fallocate(struct file *file, int mode,
loff_t offset, loff_t len)
 {
struct inode *inode = file_inode(file);
struct extent_state *cached_state = NULL;
+   struct falloc_range *range;
+   struct falloc_range *tmp;
+   struct list_head reserve_list;
u64 cur_offset;
u64 last_byte;
u64 alloc_start;
u64 alloc_end;
u64 alloc_hint = 0;
u64 locked_end;
+   u64 actual_end = 0;
struct extent_map *em;
int blocksize = BTRFS_I(inode)->root->sectorsize;
int ret;
@@ -2571,13 +2615,11 @@ static long btrfs_fallocate(struct file *file, int mode,
return btrfs_punch_hole(inode, offset, len);
 
/*
-* Make sure we have enough space before we do the
-* allocation.
-* XXX: The behavior must be changed to do accurate check first
-* and then check data reserved space.
+* Only trigger disk allocation, don't trigger qgroup reserve
+*
+* For qgroup space, it will be checked later.
 */
-   ret = btrfs_check_data_free_space(inode, alloc_start,
- alloc_end - alloc_start);
+   ret = btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start);
if (ret)
return ret;
 
@@ -2586,6 +2628,13 @@ static long btrfs_fallocate(struct file *file, int mode,
if (ret)
goto out;
 
+   /*
+* TODO: Move these two operations after we have checked
+* accurate reserved space, or fallocate can still fail but
+* with page truncated or size expanded.
+*
+* But that's a minor problem and won't do much harm BTW.
+*/
if (alloc_start > inode->i_size) {
ret = btrfs_cont_expand(inode, i_size_read(inode),
alloc_start);
@@ -2644,10 +2693,10 @@ static long btrfs_fallocate(struct file *file, int mode,
}
}
 
+   /* First, check if we exceed the qgroup limit */
+   INIT_LIST_HEAD(&reserve_list);
cur_offset = alloc_start;
while (1) {
-   u64 actual_end;
-
em = btrfs_get_extent(inode, NULL, 0, cur_offset,
  alloc_end - cur_offset, 0);
if (IS_ERR_OR_NULL(em)) {
@@ -2660,54 +2709,78 @@ static long btrfs_fallocate(struct file *file, int mode,
last_byte = min(extent_map_end(em), alloc_end);
actual_end = min_t(u64, extent_map_end(em), offset + len);
last_byte = ALIGN(last_byte, blocksize);
-
if (em->block_start == EXTENT_MAP_HOLE ||
(cur_offset >= inode->i_size &&
 !test_bit(EXTENT_FLAG_PREALLOC, &em->flags))) {
-   

[PATCH v2 23/23] btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hook

2015-10-08 Thread Qu Wenruo
In clear_bit_hook, qgroup reserved data is already handled quite well,
either released by finish_ordered_io or invalidatepage.

So calling btrfs_qgroup_free_data() here is completely meaningless, and
since btrfs_qgroup_free_data() may sleep to allocate memory, it will
cause a lockdep warning.

This patch adds a new function,
btrfs_free_reserved_data_space_noquota(), for clear_bit_hook() to use,
which silences the lockdep warning.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/extent-tree.c | 28 ++--
 fs/btrfs/inode.c   |  4 ++--
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f20b901..3970426 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3455,6 +3455,8 @@ enum btrfs_reserve_flush_enum {
 int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
+void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
+   u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 765f7e0..af221eb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4070,10 +4070,12 @@ int btrfs_check_data_free_space(struct inode *inode, 
u64 start, u64 len)
  * Called if we need to clear a data reservation for this inode
  * Normally in a error case.
  *
- * This one will handle the per-indoe data rsv map for accurate reserved
- * space framework.
+ * This one will *NOT* use the accurate qgroup reserved space API, just for
+ * cases where we can't sleep and are sure it won't affect qgroup reserved space.
+ * Like clear_bit_hook().
  */
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
+   u64 len)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_space_info *data_sinfo;
@@ -4083,13 +4085,6 @@ void btrfs_free_reserved_data_space(struct inode *inode, 
u64 start, u64 len)
  round_down(start, root->sectorsize);
start = round_down(start, root->sectorsize);
 
-   /*
-* Free any reserved qgroup data space first
-* As it will alloc memory, we can't do it with data sinfo
-* spinlock hold.
-*/
-   btrfs_qgroup_free_data(inode, start, len);
-
data_sinfo = root->fs_info->data_sinfo;
spin_lock(&data_sinfo->lock);
if (WARN_ON(data_sinfo->bytes_may_use < len))
@@ -4101,6 +4096,19 @@ void btrfs_free_reserved_data_space(struct inode *inode, 
u64 start, u64 len)
spin_unlock(&data_sinfo->lock);
 }
 
+/*
+ * Called if we need to clear a data reservation for this inode
+ * Normally in an error case.
+ *
+ * This one will handle the per-inode data rsv map for accurate reserved
+ * space framework.
+ */
+void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+{
+   btrfs_free_reserved_data_space_noquota(inode, start, len);
+   btrfs_qgroup_free_data(inode, start, len);
+}
+
 static void force_metadata_allocation(struct btrfs_fs_info *info)
 {
struct list_head *head = &info->space_info;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 85b06d1..bd3935c 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1773,8 +1773,8 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 
if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
&& do_list && !(state->state & EXTENT_NORESERVE))
-   btrfs_free_reserved_data_space(inode, state->start,
-  len);
+   btrfs_free_reserved_data_space_noquota(inode,
+   state->start, len);
 
__percpu_counter_add(&root->fs_info->delalloc_bytes, -len,
 root->fs_info->delalloc_batch);
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 22/23] btrfs: Avoid truncate tailing page if fallocate range doesn't exceed inode size

2015-10-08 Thread Qu Wenruo
The current code will always truncate the tailing page if its alloc_start
is smaller than the inode size.

This behavior will cause a lot of unneeded COW of page-sized extents.

This patch avoids that problem.

Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
---
 fs/btrfs/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index d638d34..ad30b37 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2640,7 +2640,7 @@ static long btrfs_fallocate(struct file *file, int mode,
alloc_start);
if (ret)
goto out;
-   } else {
+   } else if (offset + len > inode->i_size) {
/*
 * If we are fallocating from the end of the file onward we
 * need to zero out the end of the page if i_size lands in the
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 20/23] btrfs: qgroup: Add new trace point for qgroup data reserve

2015-10-08 Thread Qu Wenruo
Now each qgroup data reserve will have its own ftrace event for better
debugging.

Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
---
 fs/btrfs/qgroup.c|  15 +-
 fs/btrfs/qgroup.h|   8 +++
 include/trace/events/btrfs.h | 113 +++
 3 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 6f397ce..54ba9fc 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2818,6 +2818,7 @@ int btrfs_qgroup_reserve_data(struct inode *inode, u64 
start, u64 len)
struct btrfs_qgroup_data_rsv_map *reserve_map;
struct data_rsv_range *tmp = NULL;
struct ulist *insert_list;
+   u64 reserved = 0;
int ret;
 
if (!root->fs_info->quota_enabled || !is_fstree(root->objectid) ||
@@ -2841,7 +2842,9 @@ int btrfs_qgroup_reserve_data(struct inode *inode, u64 
start, u64 len)
 
spin_lock(&reserve_map->lock);
ret = reserve_data_range(root, reserve_map, tmp, insert_list, start,
-len, NULL);
+len, &reserved);
+   trace_btrfs_qgroup_reserve_data(inode, start, len, reserved,
+   QGROUP_RESERVE);
/*
 * For error and already exists case, free tmp memory.
 * For tmp used case, set ret to 0, as some careless
@@ -2995,6 +2998,7 @@ static int __btrfs_qgroup_release_data(struct inode 
*inode, u64 start, u64 len,
struct data_rsv_range *tmp;
struct btrfs_qgroup_data_rsv_map *map;
u64 reserved = 0;
+   int trace_op = QGROUP_RELEASE;
int ret;
 
spin_lock(&BTRFS_I(inode)->qgroup_init_lock);
@@ -3011,8 +3015,11 @@ static int __btrfs_qgroup_release_data(struct inode 
*inode, u64 start, u64 len,
/* release_data_range() won't fail only check if memory is used */
if (ret == 0)
kfree(tmp);
-   if (free_reserved)
+   if (free_reserved) {
qgroup_free(BTRFS_I(inode)->root, reserved);
+   trace_op = QGROUP_FREE;
+   }
+   trace_btrfs_qgroup_release_data(inode, start, len, reserved, trace_op);
spin_unlock(&map->lock);
return 0;
 }
@@ -3084,6 +3091,7 @@ int btrfs_qgroup_init_data_rsv_map(struct inode *inode)
}
binode->qgroup_rsv_map = dirty_map;
 out:
+   trace_btrfs_qgroup_init_data_rsv_map(inode, 0);
spin_unlock(&binode->qgroup_init_lock);
return 0;
 }
@@ -3094,6 +3102,7 @@ void btrfs_qgroup_free_data_rsv_map(struct inode *inode)
struct btrfs_root *root = binode->root;
struct btrfs_qgroup_data_rsv_map *dirty_map = binode->qgroup_rsv_map;
struct rb_node *node;
+   u64 free_reserved = 0;
 
/*
 * this function is called at inode destroy routine, so no concurrency
@@ -3108,6 +3117,7 @@ void btrfs_qgroup_free_data_rsv_map(struct inode *inode)
/* Reserve map should be empty, or we are leaking */
WARN_ON(dirty_map->reserved);
 
+   free_reserved = dirty_map->reserved;
qgroup_free(root, dirty_map->reserved);
spin_lock(&dirty_map->lock);
while ((node = rb_first(&dirty_map->root)) != NULL) {
@@ -3121,6 +3131,7 @@ void btrfs_qgroup_free_data_rsv_map(struct inode *inode)
rb_erase(node, &dirty_map->root);
kfree(range);
}
+   trace_btrfs_qgroup_free_data_rsv_map(inode, free_reserved);
spin_unlock(&dirty_map->lock);
kfree(dirty_map);
binode->qgroup_rsv_map = NULL;
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 3f6ad43..cd3e515 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -33,6 +33,13 @@ struct btrfs_qgroup_extent_record {
struct ulist *old_roots;
 };
 
+/*
+ * For qgroup event trace points only
+ */
+#define QGROUP_RESERVE (1<<0)
+#define QGROUP_RELEASE (1<<1)
+#define QGROUP_FREE(1<<2)
+
 /* For per-inode dirty range reserve */
 struct btrfs_qgroup_data_rsv_map;
 
@@ -84,6 +91,7 @@ static inline void btrfs_qgroup_free_delayed_ref(struct 
btrfs_fs_info *fs_info,
 u64 ref_root, u64 num_bytes)
 {
btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
+   trace_btrfs_qgroup_free_delayed_ref(ref_root, num_bytes);
 }
 void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0b73af9..b4473da 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1117,6 +1117,119 @@ DEFINE_EVENT(btrfs__workqueue_done, 
btrfs_workqueue_destroy,
TP_ARGS(wq)
 );
 
+DECLARE_EVENT_CLASS(btrfs__qgroup_data_map,
+
+   TP_PROTO(struct inode *inode, u64 free_reserved),
+
+   TP_ARGS(inode, free_reserved),
+
+   TP_STRUCT__entry(
+   __field(u64,rootid  )
+   __field(unsigned long,  ino )
+  

[PATCH] btrfs: fix waitqueue_active without memory barrier in btrfs

2015-10-08 Thread Kosuke Tatsukawa
btrfs_bio_counter_sub() seems to be missing a memory barrier which might
cause the waker to not notice the waiter and miss sending a wake_up as
in the following figure.

btrfs_bio_counter_sub                     btrfs_rm_dev_replace_blocked

if (waitqueue_active(&fs_info->replace_wait))
/* The CPU might reorder the test for
   the waitqueue up here, before
   prior writes complete */
                                          /* wait_event */
                                          /*  __wait_event */
                                          /*   ___wait_event */
                                          long __int = prepare_to_wait_event(
                                                  &wq, &__wait, state);
                                          if (!percpu_counter_sum(
                                                  &fs_info->bio_counter))
percpu_counter_sub(&fs_info->bio_counter,
                   amount);
                                          schedule()


This patch removes the call to waitqueue_active() leaving just wake_up()
behind.  This fixes the problem because the call to spin_lock_irqsave()
in wake_up() will be an ACQUIRE operation.

I found this issue when I was looking through the linux source code
for places calling waitqueue_active() before wake_up*(), but without
preceding memory barriers, after sending a patch to fix a similar
issue in drivers/tty/n_tty.c  (Details about the original issue can be
found here: https://lkml.org/lkml/2015/9/28/849).
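
As a side note, a minimal sketch of the two patterns commonly used to avoid
this kind of lost wake-up, assuming only the standard <linux/wait.h> API; the
helper names and the "counter"/"wq" parameters below are purely illustrative
and not btrfs code:

#include <linux/atomic.h>
#include <linux/wait.h>

/* (a) What this patch does: always call wake_up(); the spin_lock_irqsave()
 *     inside it orders the counter update before the waiter-list check.
 */
static void waker_unconditional(atomic_t *counter, wait_queue_head_t *wq)
{
        atomic_dec(counter);
        wake_up(wq);
}

/* (b) Alternative: keep the waitqueue_active() fast path, but add an
 *     explicit full barrier so the counter update cannot be reordered
 *     after the emptiness check; it pairs with the barrier implied by
 *     prepare_to_wait_event()/set_current_state() on the waiter side.
 */
static void waker_with_barrier(atomic_t *counter, wait_queue_head_t *wq)
{
        atomic_dec(counter);
        smp_mb();
        if (waitqueue_active(wq))
                wake_up(wq);
}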

Signed-off-by: Kosuke Tatsukawa 
---
 fs/btrfs/dev-replace.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index e54dd59..ecb3e71 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -918,9 +918,7 @@ void btrfs_bio_counter_inc_noblocked(struct btrfs_fs_info 
*fs_info)
 void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
 {
percpu_counter_sub(_info->bio_counter, amount);
-
-   if (waitqueue_active(&fs_info->replace_wait))
-   wake_up(&fs_info->replace_wait);
+   wake_up(&fs_info->replace_wait);
 }
 
 void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
-- 
1.7.1
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 19/23] btrfs: Add handler for invalidate page

2015-10-08 Thread Qu Wenruo
For btrfs_invalidatepage() and its variant evict_inode_truncate_pages(),
there will be pages that never reach disk.
In that case, their reserved space won't be released nor freed by
finish_ordered_io() or the delayed_ref handler.

So we must free their qgroup reserved space, or we will be leaking
reserved space again.

So this patch will call btrfs_qgroup_free_data() for
invalidatepage() and its variant evict_inode_truncate_pages().

And due to the nature of the new btrfs_qgroup_reserve/free_data(),
reserved space will only be reserved or freed once, so for pages which
are already flushed to disk, their reserved space will be released and
freed by the delayed_ref handler.

Double free won't be a problem.

Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
---
 fs/btrfs/inode.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ee0b239..85b06d1 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5075,6 +5075,18 @@ static void evict_inode_truncate_pages(struct inode 
*inode)
spin_unlock(&io_tree->lock);

lock_extent_bits(io_tree, start, end, 0, &cached_state);
+
+   /*
+* If still has DELALLOC flag, the extent didn't reach disk,
+* and its reserved space won't be freed by delayed_ref.
+* So we need to free its reserved space here.
+* (Refer to comment in btrfs_invalidatepage, case 2)
+*
+* Note, end is the bytenr of last byte, so we need + 1 here.
+*/
+   if (state->state & EXTENT_DELALLOC)
+   btrfs_qgroup_free_data(inode, start, end - start + 1);
+
clear_extent_bit(io_tree, start, end,
 EXTENT_LOCKED | EXTENT_DIRTY |
 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
@@ -8592,6 +8604,18 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
}
}
 
+   /*
+* Qgroup reserved space handler
+* Page here will be either
+* 1) Already written to disk
+*In this case, its reserved space is released from data rsv map
+*and will be freed by delayed_ref handler finally.
+*So even we call qgroup_free_data(), it won't decrease reserved
+*space.
+* 2) Not written to disk
+*This means the reserved space should be freed here.
+*/
+   btrfs_qgroup_free_data(inode, page_start, PAGE_CACHE_SIZE);
if (!inode_evicting) {
clear_extent_bit(tree, page_start, page_end,
 EXTENT_LOCKED | EXTENT_DIRTY |
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 14/23] btrfs: extent-tree: Switch to new check_data_free_space and free_reserved_data_space

2015-10-08 Thread Qu Wenruo
Use new reserve/free for buffered write and inode cache.

For the buffered write case, as a nodatacow write won't increase quota
accounting, unlike the old behavior which reserved before checking nocow,
we now check nocow first and only reserve data space if we can't do a
nocow write.

Signed-off-by: Qu Wenruo 
---
v2:
  Add call for new free function too. Or we will leak reserved space in
  case of data reservation succeeded but metadata reservation failed.
---
 fs/btrfs/extent-tree.c |  4 ++--
 fs/btrfs/file.c| 34 +-
 fs/btrfs/relocation.c  |  8 
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0cd6baa..f4b9db8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3356,7 +3356,7 @@ again:
num_pages *= 16;
num_pages *= PAGE_CACHE_SIZE;
 
-   ret = btrfs_check_data_free_space(inode, num_pages, num_pages);
+   ret = __btrfs_check_data_free_space(inode, 0, num_pages);
if (ret)
goto out_put;
 
@@ -3365,7 +3365,7 @@ again:
  &alloc_hint);
if (!ret)
dcs = BTRFS_DC_SETUP;
-   btrfs_free_reserved_data_space(inode, num_pages);
+   __btrfs_free_reserved_data_space(inode, 0, num_pages);
 
 out_put:
iput(inode);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..142b217 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1510,12 +1510,17 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
}
 
reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
-   ret = btrfs_check_data_free_space(inode, reserve_bytes, 
write_bytes);
-   if (ret == -ENOSPC &&
-   (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
- BTRFS_INODE_PREALLOC))) {
+
+   if (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+BTRFS_INODE_PREALLOC)) {
ret = check_can_nocow(inode, pos, _bytes);
+   if (ret < 0)
+   break;
if (ret > 0) {
+   /*
+* For nodata cow case, no need to reserve
+* data space.
+*/
only_release_metadata = true;
/*
 * our prealloc extent may be smaller than
@@ -1524,20 +1529,19 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
num_pages = DIV_ROUND_UP(write_bytes + offset,
 PAGE_CACHE_SIZE);
reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
-   ret = 0;
-   } else {
-   ret = -ENOSPC;
+   goto reserve_metadata;
}
}
-
-   if (ret)
+   ret = __btrfs_check_data_free_space(inode, pos, write_bytes);
+   if (ret < 0)
break;
 
+reserve_metadata:
ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
if (ret) {
if (!only_release_metadata)
-   btrfs_free_reserved_data_space(inode,
-  reserve_bytes);
+   __btrfs_free_reserved_data_space(inode, pos,
+write_bytes);
else
btrfs_end_write_no_snapshoting(root);
break;
@@ -2569,8 +2573,11 @@ static long btrfs_fallocate(struct file *file, int mode,
/*
 * Make sure we have enough space before we do the
 * allocation.
+* XXX: The behavior must be changed to do accurate check first
+* and then check data reserved space.
 */
-   ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start, 
alloc_end - alloc_start);
+   ret = btrfs_check_data_free_space(inode, alloc_start,
+ alloc_end - alloc_start);
if (ret)
return ret;
 
@@ -2703,7 +2710,8 @@ static long btrfs_fallocate(struct file *file, int mode,
 out:
mutex_unlock(>i_mutex);
/* Let go of our reservation. */
-   btrfs_free_reserved_data_space(inode, alloc_end - alloc_start);
+   __btrfs_free_reserved_data_space(inode, alloc_start,
+alloc_end - alloc_start);
return ret;
 }
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 

[PATCH v2 15/23] btrfs: extent-tree: Add new version of btrfs_delalloc_reserve/release_space

2015-10-08 Thread Qu Wenruo
Add new versions of the btrfs_delalloc_reserve_space() and
btrfs_delalloc_release_space() functions, which support accurate qgroup
reserve.

Signed-off-by: Qu Wenruo 
---
v2:
  Add new function btrfs_delalloc_release_space() to handle error case.
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/extent-tree.c | 59 ++
 2 files changed, 61 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19450a1..4221bfd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3473,7 +3473,9 @@ void btrfs_subvolume_release_metadata(struct btrfs_root 
*root,
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
 int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
+int __btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
 void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
+void __btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f4b9db8..32455e0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5723,6 +5723,44 @@ void btrfs_delalloc_release_metadata(struct inode 
*inode, u64 num_bytes)
 }
 
 /**
+ * __btrfs_delalloc_reserve_space - reserve data and metadata space for
+ * delalloc
+ * @inode: inode we're writing to
+ * @start: start range we are writing to
+ * @len: how long the range we are writing to
+ *
+ * TODO: This function will finally replace old btrfs_delalloc_reserve_space()
+ *
+ * This will do the following things
+ *
+ * o reserve space in data space info for num bytes
+ *   and reserve precious corresponding qgroup space
+ *   (Done in check_data_free_space)
+ *
+ * o reserve space for metadata space, based on the number of outstanding
+ *   extents and how much csums will be needed
+ *   also reserve metadata space in a per root over-reserve method.
+ * o add to the inodes->delalloc_bytes
+ * o add it to the fs_info's delalloc inodes list.
+ *   (Above 3 all done in delalloc_reserve_metadata)
+ *
+ * Return 0 for success
+ * Return <0 for error (-ENOSPC or -EDQUOT)
+ */
+int __btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
+{
+   int ret;
+
+   ret = __btrfs_check_data_free_space(inode, start, len);
+   if (ret < 0)
+   return ret;
+   ret = btrfs_delalloc_reserve_metadata(inode, len);
+   if (ret < 0)
+   __btrfs_free_reserved_data_space(inode, start, len);
+   return ret;
+}
+
+/**
  * btrfs_delalloc_reserve_space - reserve data and metadata space for delalloc
  * @inode: inode we're writing to
  * @num_bytes: the number of bytes we want to allocate
@@ -5755,6 +5793,27 @@ int btrfs_delalloc_reserve_space(struct inode *inode, 
u64 num_bytes)
 }
 
 /**
+ * __btrfs_delalloc_release_space - release data and metadata space for 
delalloc
+ * @inode: inode we're releasing space for
+ * @start: start position of the space already reserved
+ * @len: the len of the space already reserved
+ *
+ * This must be matched with a call to btrfs_delalloc_reserve_space.  This is
+ * called in the case that we don't need the metadata AND data reservations
+ * anymore.  So if there is an error or we insert an inline extent.
+ *
+ * This function will release the metadata space that was not used and will
+ * decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes
+ * list if there are no delalloc bytes left.
+ * Also it will handle the qgroup reserved space.
+ */
+void __btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len)
+{
+   btrfs_delalloc_release_metadata(inode, len);
+   __btrfs_free_reserved_data_space(inode, start, len);
+}
+
+/**
  * btrfs_delalloc_release_space - release data and metadata space for delalloc
  * @inode: inode we're releasing space for
  * @num_bytes: the number of bytes we want to free up
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 10/23] btrfs: delayed_ref: release and free qgroup reserved at proper timing

2015-10-08 Thread Qu Wenruo
Qgroup reserved space needs to be released from the inode dirty map and
freed at different times:

1) Release when the metadata is written into the tree
After the corresponding metadata is written into the tree, any newer write
will be COWed (the NOCOW case is not included yet).
So we must release its range from the inode dirty range map, or we will
forget to reserve space for that range again, allowing accounting to
exceed the limit.

2) Free reserved bytes when the delayed ref is run
When delayed refs are run, qgroup accounting will follow soon and turn
the reserved bytes into rfer/excl numbers.
As run_delayed_refs() and qgroup accounting are both done at
commit_transaction() time, it is safe to free reserved space at
run_delayed_refs() time.

With this timing for releasing/freeing reserved space, we should be able
to resolve the long-standing qgroup reserved space leak problem.
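
As a rough illustration of the two timings above, here is a small
user-space model; all names are made up for the sketch and none of this is
btrfs code:

#include <assert.h>
#include <stdio.h>

struct model {
        unsigned long long dirty_map_rsv;   /* tracked in the inode dirty map */
        unsigned long long delayed_ref_rsv; /* carried by the delayed ref head */
        unsigned long long qgroup_rsv;      /* total still counted as reserved */
};

/* 1) the file extent reaches the tree: release from the dirty map,
 *    but keep the qgroup reservation alive via the delayed ref
 */
static void write_reaches_tree(struct model *m, unsigned long long bytes)
{
        assert(m->dirty_map_rsv >= bytes);
        m->dirty_map_rsv -= bytes;
        m->delayed_ref_rsv += bytes;
}

/* 2) commit time: the delayed ref is run, the reservation is freed as
 *    rfer/excl numbers take over
 */
static void run_delayed_ref(struct model *m, unsigned long long bytes)
{
        assert(m->delayed_ref_rsv >= bytes && m->qgroup_rsv >= bytes);
        m->delayed_ref_rsv -= bytes;
        m->qgroup_rsv -= bytes;
}

int main(void)
{
        struct model m = { .dirty_map_rsv = 8192, .qgroup_rsv = 8192 };

        write_reaches_tree(&m, 8192);   /* metadata written into the tree */
        run_delayed_ref(&m, 8192);      /* commit_transaction() */
        printf("dirty=%llu delayed=%llu reserved=%llu\n",
               m.dirty_map_rsv, m.delayed_ref_rsv, m.qgroup_rsv);
        return 0;
}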

Signed-off-by: Qu Wenruo 
---
v2:
  Use a better wrapped function for delayed_ref reserved space release.
  As direct call to btrfs_qgroup_free_ref() will make it hard to add
  trace event.
---
 fs/btrfs/extent-tree.c |  5 +
 fs/btrfs/inode.c   | 10 ++
 fs/btrfs/qgroup.c  |  5 ++---
 fs/btrfs/qgroup.h  | 18 +-
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 601d7d4..4f6758b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2345,6 +2345,11 @@ static int run_one_delayed_ref(struct btrfs_trans_handle 
*trans,
  node->num_bytes);
}
}
+
+   /* Also free its reserved qgroup space */
+   btrfs_qgroup_free_delayed_ref(root->fs_info,
+ head->qgroup_ref_root,
+ head->qgroup_reserved);
return ret;
}
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 79ad301..8ca2993 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2112,6 +2112,16 @@ static int insert_reserved_file_extent(struct 
btrfs_trans_handle *trans,
ret = btrfs_alloc_reserved_file_extent(trans, root,
root->root_key.objectid,
btrfs_ino(inode), file_pos, &ins);
+   if (ret < 0)
+   goto out;
+   /*
+* Release the reserved range from inode dirty range map, and
+* move it to delayed ref codes, as now accounting only happens at
+* commit_transaction() time.
+*/
+   btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+   ret = btrfs_add_delayed_qgroup_reserve(root->fs_info, trans,
+   root->objectid, disk_bytenr, ram_bytes);
 out:
btrfs_free_path(path);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index dbc0d06..1f03f9d 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2169,14 +2169,13 @@ out:
return ret;
 }
 
-void btrfs_qgroup_free(struct btrfs_root *root, u64 num_bytes)
+void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
+  u64 ref_root, u64 num_bytes)
 {
struct btrfs_root *quota_root;
struct btrfs_qgroup *qgroup;
-   struct btrfs_fs_info *fs_info = root->fs_info;
struct ulist_node *unode;
struct ulist_iterator uiter;
-   u64 ref_root = root->root_key.objectid;
int ret = 0;
 
if (!is_fstree(ref_root))
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 8e69dc1..c7ee46a 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -75,7 +75,23 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 srcid, u64 objectid,
 struct btrfs_qgroup_inherit *inherit);
 int btrfs_qgroup_reserve(struct btrfs_root *root, u64 num_bytes);
-void btrfs_qgroup_free(struct btrfs_root *root, u64 num_bytes);
+void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
+  u64 ref_root, u64 num_bytes);
+static inline void btrfs_qgroup_free(struct btrfs_root *root, u64 num_bytes)
+{
+   return btrfs_qgroup_free_refroot(root->fs_info, root->objectid,
+num_bytes);
+}
+
+/*
+ * TODO: Add proper trace point for it, as btrfs_qgroup_free() is
+ * called by everywhere, can't provide good trace for delayed ref case.
+ */
+static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
+u64 ref_root, u64 num_bytes)
+{
+   btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
+}
 
 void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 16/23] btrfs: extent-tree: Switch to new delalloc space reserve and release

2015-10-08 Thread Qu Wenruo
Use new __btrfs_delalloc_reserve_space() and
__btrfs_delalloc_release_space() to reserve and release space for
delalloc.

Signed-off-by: Qu Wenruo 
---
v2:
  Also use __btrfs_delalloc_release_space() function.
---
 fs/btrfs/file.c  |  5 +++--
 fs/btrfs/inode-map.c |  6 +++---
 fs/btrfs/inode.c | 38 +++---
 fs/btrfs/ioctl.c | 14 +-
 4 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 142b217..bf4d5fb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1611,7 +1611,7 @@ again:
btrfs_delalloc_release_metadata(inode,
release_bytes);
else
-   btrfs_delalloc_release_space(inode,
+   __btrfs_delalloc_release_space(inode, pos,
 release_bytes);
}
 
@@ -1664,7 +1664,8 @@ again:
btrfs_end_write_no_snapshoting(root);
btrfs_delalloc_release_metadata(inode, release_bytes);
} else {
-   btrfs_delalloc_release_space(inode, release_bytes);
+   __btrfs_delalloc_release_space(inode, pos,
+  release_bytes);
}
}
 
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index d4a582a..78bc09c 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -488,17 +488,17 @@ again:
/* Just to make sure we have enough space */
prealloc += 8 * PAGE_CACHE_SIZE;
 
-   ret = btrfs_delalloc_reserve_space(inode, prealloc);
+   ret = __btrfs_delalloc_reserve_space(inode, 0, prealloc);
if (ret)
goto out_put;
 
ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
  prealloc, prealloc, &alloc_hint);
if (ret) {
-   btrfs_delalloc_release_space(inode, prealloc);
+   __btrfs_delalloc_release_space(inode, 0, prealloc);
goto out_put;
}
-   btrfs_free_reserved_data_space(inode, prealloc);
+   __btrfs_free_reserved_data_space(inode, 0, prealloc);
 
ret = btrfs_write_out_ino_cache(root, trans, path, inode);
 out_put:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8ca2993..38a0fb9 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1766,7 +1766,8 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 
if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
&& do_list && !(state->state & EXTENT_NORESERVE))
-   btrfs_free_reserved_data_space(inode, len);
+   __btrfs_free_reserved_data_space(inode, state->start,
+len);
 
__percpu_counter_add(&root->fs_info->delalloc_bytes, -len,
 root->fs_info->delalloc_batch);
@@ -1985,7 +1986,8 @@ again:
goto again;
}
 
-   ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+   ret = __btrfs_delalloc_reserve_space(inode, page_start,
+PAGE_CACHE_SIZE);
if (ret) {
mapping_set_error(page->mapping, ret);
end_extent_writepage(page, ret, page_start, page_end);
@@ -4581,14 +4583,17 @@ int btrfs_truncate_page(struct inode *inode, loff_t 
from, loff_t len,
if ((offset & (blocksize - 1)) == 0 &&
(!len || ((len & (blocksize - 1)) == 0)))
goto out;
-   ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+   ret = __btrfs_delalloc_reserve_space(inode,
+   round_down(from, PAGE_CACHE_SIZE), PAGE_CACHE_SIZE);
if (ret)
goto out;
 
 again:
page = find_or_create_page(mapping, index, mask);
if (!page) {
-   btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+   __btrfs_delalloc_release_space(inode,
+   round_down(from, PAGE_CACHE_SIZE),
+   PAGE_CACHE_SIZE);
ret = -ENOMEM;
goto out;
}
@@ -4656,7 +4661,8 @@ again:
 
 out_unlock:
if (ret)
-   btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+   __btrfs_delalloc_release_space(inode, page_start,
+  PAGE_CACHE_SIZE);
unlock_page(page);
page_cache_release(page);
 out:
@@ -7587,7 +7593,7 @@ unlock:
spin_unlock(&BTRFS_I(inode)->lock);
}
 
-   btrfs_free_reserved_data_space(inode, len);
+   __btrfs_free_reserved_data_space(inode, start, len);
 

[PATCH v2 01/23] btrfs: qgroup: New function declaration for new reserve implement

2015-10-08 Thread Qu Wenruo
Add new structures and functions for the dirty phase of the new qgroup
reserve implementation.
It focuses on avoiding over-reserve: for an already reserved dirty space
range, we won't reserve space again.

This patch adds the needed structure declaration and comments.

Signed-off-by: Qu Wenruo 
---
v2:
  Fix some comment spell
---
 fs/btrfs/btrfs_inode.h |  4 
 fs/btrfs/qgroup.c  | 58 ++
 fs/btrfs/qgroup.h  |  3 +++
 3 files changed, 65 insertions(+)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 0ef5cc1..6d799b8 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -24,6 +24,7 @@
 #include "extent_io.h"
 #include "ordered-data.h"
 #include "delayed-inode.h"
+#include "qgroup.h"
 
 /*
  * ordered_data_close is set by truncate when a file that used
@@ -193,6 +194,9 @@ struct btrfs_inode {
struct timespec i_otime;
 
struct inode vfs_inode;
+
+   /* qgroup dirty map for data space reserve */
+   struct btrfs_qgroup_data_rsv_map *qgroup_rsv_map;
 };
 
 extern unsigned char btrfs_filetype_table[];
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e9ace09..607ace8 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -91,6 +91,64 @@ struct btrfs_qgroup {
u64 new_refcnt;
 };
 
+/*
+ * Record one range of reserved space.
+ */
+struct data_rsv_range {
+   struct rb_node node;
+   u64 start;
+   u64 len;
+};
+
+/*
+ * Record per inode reserved range.
+ * This is mainly used to resolve reserved space leaking problem.
+ * One of the cause is the mismatch with reserve and free.
+ *
+ * New qgroup will handle reserve in two phase.
+ * 1) Dirty phase.
+ *Pages are just marked dirty, but not written to disk.
+ * 2) Flushed phase
+ *Pages are written to disk, but transaction is not committed yet.
+ *
+ * At Dirty phase, we only need to focus on avoiding over-reserve.
+ *
+ * The idea is like below.
+ * 1) Write [0,8K)
+ * 0   4K  8K  12K 16K
+ * |///////|
+ * Reserve +8K, total reserved: 8K
+ *
+ * 2) Write [0,4K)
+ * 0   4K  8K  12K 16K
+ * |///////|
+ * Reserve 0, total reserved 8K
+ *
+ * 3) Write [12K,16K)
+ * 0   4K  8K  12K 16K
+ * |///////|   |///|
+ * Reserve +4K, total reserved 12K
+ *
+ * 4) Flush [0,8K)
+ * Can happen without committing the transaction, e.g. fallocate will
+ * trigger the write.
+ * 0   4K  8K  12K 16K
+ *             |///|
+ * Reserve 0, total reserved 12K
+ * As the extent is written to disk and no longer dirty, the range gets
+ * removed.
+ * But as its delayed ref is not run yet, its reserved space will not be freed.
+ * And things continue to Flushed phase.
+ *
+ * By this method, we can avoid over-reserve, which will lead to reserved
+ * space leak.
+ */
+struct btrfs_qgroup_data_rsv_map {
+   struct rb_root root;
+   u64 reserved;
+   spinlock_t lock;
+};
+
 static void btrfs_qgroup_update_old_refcnt(struct btrfs_qgroup *qg, u64 seq,
   int mod)
 {
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 6387dcf..2f863a4 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -33,6 +33,9 @@ struct btrfs_qgroup_extent_record {
struct ulist *old_roots;
 };
 
+/* For per-inode dirty range reserve */
+struct btrfs_qgroup_data_rsv_map;
+
 int btrfs_quota_enable(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info);
 int btrfs_quota_disable(struct btrfs_trans_handle *trans,
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 18/23] btrfs: qgroup: Add handler for NOCOW and inline

2015-10-08 Thread Qu Wenruo
For the NOCOW and inline cases, no delayed_ref will be created for
them, so we should free their reserved data space at the proper
time (finish_ordered_io() for NOCOW and cow_file_range_inline() for inline).

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c |  7 ++-
 fs/btrfs/inode.c   | 15 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1dadbba..765f7e0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4056,7 +4056,12 @@ int btrfs_check_data_free_space(struct inode *inode, u64 
start, u64 len)
if (ret < 0)
return ret;
 
-   /* Use new btrfs_qgroup_reserve_data to reserve precious data space */
+   /*
+* Use new btrfs_qgroup_reserve_data to reserve precious data space
+*
+* TODO: Find a good method to avoid reserve data space for NOCOW
+* range, but don't impact performance on quota disable case.
+*/
ret = btrfs_qgroup_reserve_data(inode, start, len);
return ret;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 84c31dd..ee0b239 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -310,6 +310,13 @@ static noinline int cow_file_range_inline(struct 
btrfs_root *root,
btrfs_delalloc_release_metadata(inode, end + 1 - start);
btrfs_drop_extent_cache(inode, start, aligned_end - 1, 0);
 out:
+   /*
+* Don't forget to free the reserved space, as for inlined extent
+* it won't count as data extent, free them directly here.
+* And at reserve time, it's always aligned to page size, so
+* just free one page here.
+*/
+   btrfs_qgroup_free_data(inode, 0, PAGE_CACHE_SIZE);
btrfs_free_path(path);
btrfs_end_transaction(trans, root);
return ret;
@@ -2832,6 +2839,14 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
 
if (test_bit(BTRFS_ORDERED_NOCOW, _extent->flags)) {
BUG_ON(!list_empty(_extent->list)); /* Logic error */
+
+   /*
+* For mwrite(mmap + memset to write) case, we still reserve
+* space for NOCOW range.
+* As NOCOW won't cause a new delayed ref, just free the space
+*/
+   btrfs_qgroup_free_data(inode, ordered_extent->file_offset,
+  ordered_extent->len);
btrfs_ordered_update_i_size(inode, 0, ordered_extent);
if (nolock)
trans = btrfs_join_transaction_nolock(root);
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 08/23] btrfs: qgroup: Introduce function to release/free reserved data range

2015-10-08 Thread Qu Wenruo
Introduce functions btrfs_qgroup_release/free_data() to release/free
reserved data range.

Release means just removing the data range from the data rsv map, without
freeing the reserved space.
This is for the normal buffered write case: when data is written to disk
and its metadata is added into the tree, its reserved space should still
be kept until commit_trans().
So in that case, we only release the dirty range, but keep the reserved
space recorded somewhere else until commit_trans().

Free means not only removing the data range, but also freeing the
reserved space.
This is used for cleanup cases.

Signed-off-by: Qu Wenruo 
---
v2:
  Fix comment typo
  Update comment, to make it clear that the reserved space for any page
  cache will either be released(it goes to disk) or freed directly
  (truncated before reaching disk)
---
 fs/btrfs/qgroup.c | 55 ++-
 fs/btrfs/qgroup.h |  2 ++
 2 files changed, 56 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 9934929..dbc0d06 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -85,7 +85,7 @@ struct btrfs_qgroup {
 
/*
 * temp variables for accounting operations
-* Refer to qgroup_shared_accouting() for details.
+* Refer to qgroup_shared_accounting() for details.
 */
u64 old_refcnt;
u64 new_refcnt;
@@ -2985,6 +2985,59 @@ next:
return 0;
 }
 
+static int __btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len,
+  int free_reserved)
+{
+   struct data_rsv_range *tmp;
+   struct btrfs_qgroup_data_rsv_map *map;
+   u64 reserved = 0;
+   int ret;
+
+   spin_lock(&BTRFS_I(inode)->qgroup_init_lock);
+   map = BTRFS_I(inode)->qgroup_rsv_map;
+   spin_unlock(&BTRFS_I(inode)->qgroup_init_lock);
+   if (!map)
+   return 0;
+
+   tmp = kmalloc(sizeof(*tmp), GFP_NOFS);
+   if (!tmp)
+   return -ENOMEM;
+   spin_lock(&map->lock);
+   ret = release_data_range(map, tmp, start, len, &reserved);
+   /* release_data_range() won't fail only check if memory is used */
+   if (ret == 0)
+   kfree(tmp);
+   if (free_reserved)
+   btrfs_qgroup_free(BTRFS_I(inode)->root, reserved);
+   spin_unlock(&map->lock);
+   return 0;
+}
+
+/*
+ * Free a reserved space range from its qgroup.
+ *
+ * Should be called when a delalloc page cache is going to be invalidated.
+ * For a page cache, it will either be released (as it's written to disk) or
+ * freed directly (if it doesn't reach disk).
+ */
+int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len)
+{
+   return __btrfs_qgroup_release_data(inode, start, len, 1);
+}
+
+/*
+ * Release a reserved space range, but don't free its qgroup reserved space.
+ * The reserved space still counts as reserved until the delayed refs are run.
+ *
+ * As qgroup accounting happens at commit time, for data written to disk
+ * its reserved space should not be freed until commit.
+ * Or we may exceed the limit.
+ */
+int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len)
+{
+   return __btrfs_qgroup_release_data(inode, start, len, 0);
+}
+
 /*
  * Init data_rsv_map for a given inode.
  *
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 366b853..8e69dc1 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -88,4 +88,6 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, 
u64 qgroupid,
 int btrfs_qgroup_init_data_rsv_map(struct inode *inode);
 void btrfs_qgroup_free_data_rsv_map(struct inode *inode);
 int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len);
+int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len);
+int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len);
 #endif /* __BTRFS_QGROUP__ */
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 09/23] btrfs: delayed_ref: Add new function to record reserved space into delayed ref

2015-10-08 Thread Qu Wenruo
Add a new function, btrfs_add_delayed_qgroup_reserve(), to record
how much space is reserved for a given extent.

As btrfs only does qgroup accounting at run_delayed_refs() time, a newly
allocated extent should keep its reserved space until then.

So add the needed function and related members to do it.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/delayed-ref.c | 29 +
 fs/btrfs/delayed-ref.h | 14 ++
 2 files changed, 43 insertions(+)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ac3e81d..bd9b63b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -476,6 +476,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&head_ref->ref_list);
head_ref->processing = 0;
head_ref->total_ref_mod = count_mod;
+   head_ref->qgroup_reserved = 0;
+   head_ref->qgroup_ref_root = 0;
 
/* Record qgroup extent info if provided */
if (qrecord) {
@@ -746,6 +748,33 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
return 0;
 }
 
+int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
+struct btrfs_trans_handle *trans,
+u64 ref_root, u64 bytenr, u64 num_bytes)
+{
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *ref_head;
+   int ret = 0;
+
+   if (!fs_info->quota_enabled || !is_fstree(ref_root))
+   return 0;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   ref_head = find_ref_head(&delayed_refs->href_root, bytenr, 0);
+   if (!ref_head) {
+   ret = -ENOENT;
+   goto out;
+   }
+   WARN_ON(ref_head->qgroup_reserved || ref_head->qgroup_ref_root);
+   ref_head->qgroup_ref_root = ref_root;
+   ref_head->qgroup_reserved = num_bytes;
+out:
+   spin_unlock(&delayed_refs->lock);
+   return ret;
+}
+
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 13fb5e6..d4c41e2 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -113,6 +113,17 @@ struct btrfs_delayed_ref_head {
int total_ref_mod;
 
/*
+* For qgroup reserved space freeing.
+*
+* ref_root and reserved will be recorded after
+* BTRFS_ADD_DELAYED_EXTENT is called.
+* And will be used to free reserved qgroup space at
+* run_delayed_refs() time.
+*/
+   u64 qgroup_ref_root;
+   u64 qgroup_reserved;
+
+   /*
 * when a new extent is allocated, it is just reserved in memory
 * The actual extent isn't inserted into the extent allocation tree
 * until the delayed ref is processed.  must_insert_reserved is
@@ -242,6 +253,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
   u64 owner, u64 offset, int action,
   struct btrfs_delayed_extent_op *extent_op,
   int no_quota);
+int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
+struct btrfs_trans_handle *trans,
+u64 ref_root, u64 bytenr, u64 num_bytes);
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 06/23] btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function

2015-10-08 Thread Qu Wenruo
This new function will do all the hard work to reserve precious space
for a write.

The overall work flow will be the following.

File A already has some dirty pages:

0   4K  8K  12K 16K
|///|   |///|

And then, someone wants to write some data into the range [4K, 16K).
    |<-desired->|

Unlike the old and wrong implementation, which reserves 12K, this function
will only reserve space for the newly dirtied parts:
    |\\\|   |\\\|
Which only takes 8K of reserved space, as the other parts have already
reserved their own space.

So the final reserve map will be:
|///////////////|

This provides the basis to resolve the long-standing qgroup limit bug.
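
A small user-space sketch of the reservation math described above; the
names are illustrative only and this is not the btrfs implementation:

#include <stdio.h>

struct range { unsigned long long start, len; };

/* Given already reserved, non-overlapping ranges sorted by start, compute
 * how many new bytes the write [start, start + len) still has to reserve.
 */
static unsigned long long bytes_to_reserve(const struct range *rsv, int nr,
                                           unsigned long long start,
                                           unsigned long long len)
{
        unsigned long long end = start + len;
        unsigned long long covered = 0;
        int i;

        for (i = 0; i < nr; i++) {
                unsigned long long rs = rsv[i].start;
                unsigned long long re = rsv[i].start + rsv[i].len;

                if (re <= start || rs >= end)
                        continue;       /* no overlap with the write */
                if (rs < start)
                        rs = start;
                if (re > end)
                        re = end;
                covered += re - rs;     /* already reserved part of the write */
        }
        return len - covered;
}

int main(void)
{
        /* dirty ranges from the example above: [0,4K) and [8K,12K) */
        struct range rsv[] = { { 0, 4096 }, { 8192, 4096 } };

        /* write [4K,16K): only 8K (8192 bytes) of new space is needed */
        printf("%llu\n", bytes_to_reserve(rsv, 2, 4096, 12288));
        return 0;
}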

Signed-off-by: Qu Wenruo 
---
v2:
  Add needed parameter for later trace functions
---
 fs/btrfs/qgroup.c | 57 +++
 fs/btrfs/qgroup.h |  1 +
 2 files changed, 58 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 3bdf28e..e840f5c 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2796,6 +2796,63 @@ insert:
 }
 
 /*
+ * Make sure the data space for [start, start + len) is reserved.
+ * It will either reserve new space from given qgroup or reuse the already
+ * reserved space.
+ *
+ * Return 0 for successful reserve.
+ * Return <0 for error.
+ *
+ * TODO: to handle nocow case, like NODATACOW or write into prealloc space
+ * along with other mixed case.
+ * Like write 2M, first 1M can be nocowed, but next 1M is on hole and need COW.
+ */
+int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len)
+{
+   struct btrfs_inode *binode = BTRFS_I(inode);
+   struct btrfs_root *root = binode->root;
+   struct btrfs_qgroup_data_rsv_map *reserve_map;
+   struct data_rsv_range *tmp = NULL;
+   struct ulist *insert_list;
+   int ret;
+
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid) ||
+   len == 0)
+   return 0;
+
+   if (!binode->qgroup_rsv_map) {
+   ret = btrfs_qgroup_init_data_rsv_map(inode);
+   if (ret < 0)
+   return ret;
+   }
+   reserve_map = binode->qgroup_rsv_map;
+   insert_list = ulist_alloc(GFP_NOFS);
+   if (!insert_list)
+   return -ENOMEM;
+   tmp = kzalloc(sizeof(*tmp), GFP_NOFS);
+   if (!tmp) {
+   ulist_free(insert_list);
+   return -ENOMEM;
+   }
+
+   spin_lock(&reserve_map->lock);
+   ret = reserve_data_range(root, reserve_map, tmp, insert_list, start,
+len, NULL);
+   /*
+* For error and already exists case, free tmp memory.
+* For tmp used case, set ret to 0, as some careless
+* caller consider >0 as error.
+*/
+   if (ret <= 0)
+   kfree(tmp);
+   else
+   ret = 0;
+   spin_unlock(&reserve_map->lock);
+   ulist_free(insert_list);
+   return ret;
+}
+
+/*
  * Init data_rsv_map for a given inode.
  *
  * This is needed at write time as quota can be disabled and then enabled
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index c87b7dc..366b853 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -87,4 +87,5 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, 
u64 qgroupid,
 /* for qgroup reserve */
 int btrfs_qgroup_init_data_rsv_map(struct inode *inode);
 void btrfs_qgroup_free_data_rsv_map(struct inode *inode);
+int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len);
 #endif /* __BTRFS_QGROUP__ */
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 07/23] btrfs: qgroup: Introduce function to release reserved range

2015-10-08 Thread Qu Wenruo
Introduce new function release_data_range() to release reserved ranges.
It will iterate through all existing ranges and remove/shrink them.

Note this function will not free reserved space, as the range can be
released in the following conditions:
1) The dirty range gets written to disk.
   In this case, reserved range will be released but reserved bytes
   will not be freed until the delayed_ref is run.

2) Truncate
   In this case, dirty ranges will be released and reserved bytes will
   also be freed.

So the new function won't free reserved space, but will record the bytes
to free into a parameter if the caller needs them.

Signed-off-by: Qu Wenruo 
---
v2:
  Add parameter for later trace functions
---
 fs/btrfs/qgroup.c | 133 ++
 1 file changed, 133 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e840f5c..9934929 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2852,6 +2852,139 @@ int btrfs_qgroup_reserve_data(struct inode *inode, u64 
start, u64 len)
return ret;
 }
 
+/* Small helper used in release_data_range() to update rsv map */
+static inline void __update_rsv(struct btrfs_qgroup_data_rsv_map *map,
+   u64 *reserved, u64 cur_rsv)
+{
+   if (WARN_ON(map->reserved < cur_rsv)) {
+   if (reserved)
+   *reserved += map->reserved;
+   map->reserved = 0;
+   } else {
+   if (reserved)
+   *reserved += cur_rsv;
+   map->reserved -= cur_rsv;
+   }
+}
+
+/*
+ * Release the range [start, start + len) from rsv map.
+ *
+ * The behavior should be much like reserve_data_range().
+ * @tmp: the allocated memory for case which need to split existing
+ *   range into two.
+ * @reserved: the number of bytes that may need to free
+ * Return > 0 if 'tmp' memory is used and release range successfully
+ * Return 0 if 'tmp' memory is not used and release range successfully
+ * Return < 0 for error
+ */
+static int release_data_range(struct btrfs_qgroup_data_rsv_map *map,
+ struct data_rsv_range *tmp,
+ u64 start, u64 len, u64 *reserved)
+{
+   struct data_rsv_range *range;
+   u64 cur_rsv = 0;
+   int ret = 0;
+
+   range = find_reserve_range(map, start);
+   /* empty tree, just return */
+   if (!range)
+   return 0;
+   /*
+* For split case
+*      |<--- desired --->|
+* |///////////////////////////|
+* In this case, we need to insert one new range.
+*/
+   if (range->start < start && range->start + range->len > start + len) {
+   u64 new_start = start + len;
+   u64 new_len = range->start + range->len - start - len;
+
+   cur_rsv = len;
+   if (reserved)
+   *reserved += cur_rsv;
+   map->reserved -= cur_rsv;
+
+   range->len = start - range->start;
+   ret = insert_data_range(map, tmp, new_start, new_len);
+   WARN_ON(ret <= 0);
+   return 1;
+   }
+
+   /*
+* Iterate until the end of the range and free release all
+* reserved data from map.
+* We iterate by existing range, as that will makes codes a
+* little more clean.
+*
+*  |<-desired>|
+* |//1//|  |//2//| |//3//| |//4//|
+*/
+   while (range->start < start + len) {
+   struct rb_node *next = NULL;
+   int range_freed = 0;
+
+   /*
+*  |<---desired>|
+* |///|
+*/
+   if (unlikely(range->start + range->len <= start))
+   goto next;
+
+   /*
+*  |<--------- desired --------->|
+* |///|
+*/
+   if (range->start < start &&
+   range->start + range->len > start) {
+   cur_rsv = range->start + range->len - start;
+
+   range->len = start - range->start;
+   goto next;
+   }
+
+   /*
+*  |<--desired-->|
+*  |/|
+* Including the same start/end case, so other cases don't need
+* to check the start/end equal case and don't need to bother
+* deleting the range.
+*/
+   if (range->start >= start &&
+   range->start + range->len <= start + len) {
+   cur_rsv = range->len;
+
+   range_freed = 1;
+   next = rb_next(&range->node);
+   rb_erase(&range->node, &map->root);
+   kfree(range);
+   goto next;
+
+   }

[PATCH v2 02/23] btrfs: qgroup: Implement data_rsv_map init/free functions

2015-10-08 Thread Qu Wenruo
New functions btrfs_qgroup_init/free_data_rsv_map() to init/free data
reserve map.

Data reserve map is used to mark which ranges already hold reserved
space, to avoid the current reserved space leak.

Signed-off-by: Qu Wenruo 
---
v2:
  Add reserved space leaking check at free_data_rsv_map time
---
 fs/btrfs/btrfs_inode.h |  2 ++
 fs/btrfs/inode.c   | 10 ++
 fs/btrfs/qgroup.c  | 84 ++
 fs/btrfs/qgroup.h  |  3 ++
 4 files changed, 99 insertions(+)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 6d799b8..c2da3a9 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -197,6 +197,8 @@ struct btrfs_inode {
 
/* qgroup dirty map for data space reserve */
struct btrfs_qgroup_data_rsv_map *qgroup_rsv_map;
+   /* lock to ensure rsv_map will only be initialized once */
+   spinlock_t qgroup_init_lock;
 };
 
 extern unsigned char btrfs_filetype_table[];
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b7e439b..79ad301 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8940,6 +8940,14 @@ struct inode *btrfs_alloc_inode(struct super_block *sb)
INIT_LIST_HEAD(&ei->delalloc_inodes);
RB_CLEAR_NODE(&ei->rb_node);
 
+   /*
+* Init qgroup info to empty, as they will be initialized at write
+* time.
+* This behavior is needed for the case where quota is enabled later.
+*/
+   spin_lock_init(&ei->qgroup_init_lock);
+   ei->qgroup_rsv_map = NULL;
+
return inode;
 }
 
@@ -8997,6 +9005,8 @@ void btrfs_destroy_inode(struct inode *inode)
btrfs_put_ordered_extent(ordered);
}
}
+   /* free and check data rsv map */
+   btrfs_qgroup_free_data_rsv_map(inode);
inode_tree_del(inode);
btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 607ace8..c275312 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2539,3 +2539,87 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
btrfs_queue_work(fs_info->qgroup_rescan_workers,
 &fs_info->qgroup_rescan_work);
 }
+
+/*
+ * Init data_rsv_map for a given inode.
+ *
+ * This is needed at write time as quota can be disabled and then enabled
+ */
+int btrfs_qgroup_init_data_rsv_map(struct inode *inode)
+{
+   struct btrfs_inode *binode = BTRFS_I(inode);
+   struct btrfs_root *root = binode->root;
+   struct btrfs_qgroup_data_rsv_map *dirty_map;
+
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid))
+   return 0;
+
+   spin_lock(&binode->qgroup_init_lock);
+   /* Quick route for init */
+   if (likely(binode->qgroup_rsv_map))
+   goto out;
+   spin_unlock(&binode->qgroup_init_lock);
+
+   /*
+* Slow allocation route
+*
+* TODO: Use kmem_cache to speedup allocation
+*/
+   dirty_map = kmalloc(sizeof(*dirty_map), GFP_NOFS);
+   if (!dirty_map)
+   return -ENOMEM;
+
+   dirty_map->reserved = 0;
+   dirty_map->root = RB_ROOT;
+   spin_lock_init(&dirty_map->lock);
+
+   /* Lock again to ensure no one has already init it before */
+   spin_lock(&binode->qgroup_init_lock);
+   if (binode->qgroup_rsv_map) {
+   spin_unlock(&binode->qgroup_init_lock);
+   kfree(dirty_map);
+   return 0;
+   }
+   binode->qgroup_rsv_map = dirty_map;
+out:
+   spin_unlock(&binode->qgroup_init_lock);
+   return 0;
+}
+
+void btrfs_qgroup_free_data_rsv_map(struct inode *inode)
+{
+   struct btrfs_inode *binode = BTRFS_I(inode);
+   struct btrfs_root *root = binode->root;
+   struct btrfs_qgroup_data_rsv_map *dirty_map = binode->qgroup_rsv_map;
+   struct rb_node *node;
+
+   /*
+* this function is called at inode destroy routine, so no concurrency
+* will happen, no need to get the lock.
+*/
+   if (!dirty_map)
+   return;
+
+   /* insanity check */
+   WARN_ON(!is_fstree(root->objectid));
+
+   /* Reserve map should be empty, or we are leaking */
+   WARN_ON(dirty_map->reserved);
+
+   btrfs_qgroup_free(root, dirty_map->reserved);
+   spin_lock(&dirty_map->lock);
+   while ((node = rb_first(&dirty_map->root)) != NULL) {
+   struct data_rsv_range *range;
+
+   range = rb_entry(node, struct data_rsv_range, node);
+   btrfs_warn(root->fs_info,
+  "leaking reserved range, root: %llu, ino: %lu, start: %llu, len: %llu\n",
+  root->objectid, inode->i_ino, range->start,
+  range->len);
+   rb_erase(node, &dirty_map->root);
+   kfree(range);
+   }
+   spin_unlock(&dirty_map->lock);
+   kfree(dirty_map);
+   binode->qgroup_rsv_map = NULL;
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 

[PATCH v2 00/23] Rework btrfs qgroup reserved space framework

2015-10-08 Thread Qu Wenruo
In previous rework of qgroup, we succeeded in fixing qgroup accounting
part, making the rfer/excl numbers accurate.

But that's just part of the qgroup work; another part, the qgroup reserve
space handling, still has quite a lot of problems and will lead to EDQUOT
even when we are far from the limit.

[[BUG]]
The easiest way to trigger the bug is,
1) Enable quota
2) Limit excl of qgroup 5 to 16M
3) Write [0,2M) of a file inside subvol 5 10 times without sync

EDQUOT will be triggered at about the 8th write (8 x 2M = 16M, the
configured limit, since every write reserves the same 2M again).
But after remount, we can still write until about 15M.

[[CAUSE]]
The problem is caused by the fact that qgroup will reserve space even
when the data space is already reserved.

In the above reproducer, each time we buffered write [0,2M) qgroup will
reserve 2M of space, but in fact, at the 1st time, we have already reserved
2M and from then on, we don't need to reserve any data space as we are
only writing [0,2M).

Also, the reserved space will only be freed *ONCE* when its backref is
run at commit_transaction() time.

That's causing the reserved space leaking.

[[FIX]]
The fix is not a simple one, as currently btrfs_qgroup_reserve() will
allocate whatever caller asked for.

So for accurate qgroup reserve, we introduce a completely new framework
for data and metadata.
1) Per-inode data reserve map
   Now, each inode will have a data reserve map, recording which range
   of data is already reserved.
   If we are writing a range which is already reserved, we won't need to
   reserve space again.

   Also, since qgroup is only accounted at commit_trans() time, when data
   is committed to disk and its metadata is inserted into the current
   tree, we should release the data reserved range, but still keep the
   reserved space until commit_trans().

   So delayed_ref_head will have new members to record how much space is
   reserved and free them at commit_trans() time.

2) Per-root metadata reserve counter
   For metadata (tree blocks), it's impossible to know exactly how much
   space it will use in advance.
   And due to the new qgroup accounting framework, the old
   free-at-end-trans may lead to exceeding limit.

   So we record how much metadata space is reserved for each root, and
   free them at commit_trans() time.
   This method is not perfect, but thanks to the comparatively small size
   of metadata, it should be quite good.

The new API itself is quite safe, any stupid caller reserve or free a
range twice or more won't cause any problem, due to the nature of the
design.
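
For reference, a rough sketch of the core structures this framework builds
on (field names follow how the later patches use them; the authoritative
definitions are introduced by patch 1, which is not quoted here):

/* one node per byte range that already holds a data reservation */
struct data_rsv_range {
        struct rb_node node;
        u64 start;
        u64 len;
};

/* per-inode map of reserved data ranges */
struct btrfs_qgroup_data_rsv_map {
        spinlock_t lock;
        struct rb_root root;    /* data_rsv_range nodes, sorted by start */
        u64 reserved;           /* total data bytes currently reserved */
};

/* per-root metadata reservation is a plain counter (see patch 11):
 *      atomic_t qgroup_meta_rsv;       added to struct btrfs_root */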

[[PATCH STRUCTURE]]
As the patchset is a little huge, it can be split into different parts:
1) Accurate reserve space framework API(Patch 1 ~ 13)
   Implement the mergeable reserved space map and per transaction
   metadata reserve.
   Main part of the patchset, we need to merge/split and calculate how
   many bytes we really need to reserve/free.

2) Apply needed hooks to related callers (Patch 14 ~ 22)
   The following functions need to be converted to using new qgroup
   reserve API:
   btrfs_check_free_data_space()
   btrfs_free_reserved_data_space()
   btrfs_delalloc_reserve_space()
   btrfs_delalloc_release_space()

   And the following function needs to change its behavior for accurate
   qgroup reserve space:
   btrfs_fallocate()

3) Minor fix (Patch 23)
   Fix a lockdep warning where clear_bit_hook() calls
   btrfs_qgroup_free_data() but it won't really decrease qgroup reserve
   space, as it's already handled before that.

   So add a new function btrfs_free_reserved_data_space_noquota() for
   it.

Changelog:
v2:
  Add new handlers to avoid reserved space leaking for buffered write
  followed by a truncate:
btrfs_invalidatepage()
evict_inode_truncate_page()
  Add new handlers to avoid reserved space leaking in the error handling
  routines:
btrfs_free_reserved_data_space()
btrfs_delalloc_release_space()

Qu Wenruo (23):
  btrfs: qgroup: New function declaration for new reserve implement
  btrfs: qgroup: Implement data_rsv_map init/free functions
  btrfs: qgroup: Introduce new function to search most left reserve
range
  btrfs: qgroup: Introduce function to insert non-overlap reserve range
  btrfs: qgroup: Introduce function to reserve data range per inode
  btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
  btrfs: qgroup: Introduce function to release reserved range
  btrfs: qgroup: Introduce function to release/free reserved data range
  btrfs: delayed_ref: Add new function to record reserved space into
delayed ref
  btrfs: delayed_ref: release and free qgroup reserved at proper timing
  btrfs: qgroup: Introduce new functions to reserve/free metadata
  btrfs: qgroup: Use new metadata reservation.
  btrfs: extent-tree: Add new version of btrfs_check_data_free_space and
btrfs_free_reserved_data_space.
  btrfs: extent-tree: Switch to new check_data_free_space and
free_reserved_data_space
  btrfs: extent-tree: Add new version of
btrfs_delalloc_reserve/release_space
  btrfs: extent-tree: Switch 

[PATCH v2 17/23] btrfs: qgroup: Cleanup old inaccurate facilities

2015-10-08 Thread Qu Wenruo
Clean up the old facilities which use the old btrfs_qgroup_reserve() call,
replace them with the newer versions, and remove the "__" prefix from
them.

Also, make btrfs_qgroup_reserve/free() functions private, as they are
now only used inside qgroup codes.

Now the whole btrfs qgroup code is switched to use the new reserve facilities.

Signed-off-by: Qu Wenruo 
---
v2:
  Apply newly introduced functions too.
---
 fs/btrfs/ctree.h   |  12 ++
 fs/btrfs/extent-tree.c | 109 +
 fs/btrfs/file.c|  15 ---
 fs/btrfs/inode-map.c   |   6 +--
 fs/btrfs/inode.c   |  34 +++
 fs/btrfs/ioctl.c   |   6 +--
 fs/btrfs/qgroup.c  |  19 +
 fs/btrfs/qgroup.h  |   8 
 fs/btrfs/relocation.c  |   8 ++--
 9 files changed, 61 insertions(+), 156 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4221bfd..f20b901 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3452,11 +3452,9 @@ enum btrfs_reserve_flush_enum {
BTRFS_RESERVE_FLUSH_ALL,
 };
 
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 
write_bytes);
-int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
+int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
-void __btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
@@ -3472,10 +3470,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root 
*root,
  u64 qgroup_reserved);
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
-int __btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
-void __btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 32455e0..1dadbba 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3356,7 +3356,7 @@ again:
num_pages *= 16;
num_pages *= PAGE_CACHE_SIZE;
 
-   ret = __btrfs_check_data_free_space(inode, 0, num_pages);
+   ret = btrfs_check_data_free_space(inode, 0, num_pages);
if (ret)
goto out_put;
 
@@ -3365,7 +3365,7 @@ again:
  &alloc_hint);
if (!ret)
dcs = BTRFS_DC_SETUP;
-   __btrfs_free_reserved_data_space(inode, 0, num_pages);
+   btrfs_free_reserved_data_space(inode, 0, num_pages);
 
 out_put:
iput(inode);
@@ -4038,27 +4038,11 @@ commit_trans:
 }
 
 /*
- * This will check the space that the inode allocates from to make sure we have
- * enough space for bytes.
- */
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 
write_bytes)
-{
-   struct btrfs_root *root = BTRFS_I(inode)->root;
-   int ret;
-
-   ret = btrfs_alloc_data_chunk_ondemand(inode, bytes);
-   if (ret < 0)
-   return ret;
-   ret = btrfs_qgroup_reserve(root, write_bytes);
-   return ret;
-}
-
-/*
  * New check_data_free_space() with ability for precise data reservation
  * Will replace old btrfs_check_data_free_space(), but for patch split,
  * add a new function first and then replace it.
  */
-int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
+int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
int ret;
@@ -4078,33 +4062,13 @@ int __btrfs_check_data_free_space(struct inode *inode, 
u64 start, u64 len)
 }
 
 /*
- * Called if we need to clear a data reservation for this inode.
- */
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
-{
-   struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct btrfs_space_info *data_sinfo;
-
-   /* make sure bytes are sectorsize aligned */
-   bytes = ALIGN(bytes, root->sectorsize);
-
-   data_sinfo = root->fs_info->data_sinfo;
-   spin_lock(&data_sinfo->lock);
- 

[PATCH v2 05/23] btrfs: qgroup: Introduce function to reserve data range per inode

2015-10-08 Thread Qu Wenruo
Introduce new function reserve_data_range().
This function will find the non-overlapping ranges and insert them into the
reserve map using the previously introduced functions.

This provides the basis for the later per-inode reserve map implementation.
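
For reference, a minimal sketch (an assumption, the real entry point is
introduced by patch 6) of how btrfs_qgroup_reserve_data() is expected to
drive reserve_data_range():

static int qgroup_reserve_data_sketch(struct inode *inode, u64 start, u64 len)
{
        struct btrfs_root *root = BTRFS_I(inode)->root;
        struct btrfs_qgroup_data_rsv_map *map;
        struct data_rsv_range *tmp;
        struct ulist *insert_list;
        int ret;

        if (!root->fs_info->quota_enabled || !is_fstree(root->objectid) ||
            len == 0)
                return 0;

        /* make sure the per-inode map exists (patch 2) */
        ret = btrfs_qgroup_init_data_rsv_map(inode);
        if (ret < 0)
                return ret;

        /* pre-allocate outside of map->lock; may stay unused */
        tmp = kmalloc(sizeof(*tmp), GFP_NOFS);
        if (!tmp)
                return -ENOMEM;
        insert_list = ulist_alloc(GFP_NOFS);
        if (!insert_list) {
                kfree(tmp);
                return -ENOMEM;
        }

        map = BTRFS_I(inode)->qgroup_rsv_map;
        spin_lock(&map->lock);
        ret = reserve_data_range(root, map, tmp, insert_list, start, len, NULL);
        spin_unlock(&map->lock);

        ulist_free(insert_list);
        /* tmp is consumed only when a new node was linked into the tree */
        if (ret <= 0)
                kfree(tmp);
        return ret < 0 ? ret : 0;
}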

Signed-off-by: Qu Wenruo 
---
v2:
  Add needed parameter for later trace functions
---
 fs/btrfs/qgroup.c | 95 +++
 1 file changed, 95 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b690b02..3bdf28e 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2701,6 +2701,101 @@ static int insert_data_ranges(struct 
btrfs_qgroup_data_rsv_map *map,
 }
 
 /*
+ * Check qgroup limit and insert dirty range into reserve_map.
+ *
+ * Must be called with map->lock hold
+ */
+static int reserve_data_range(struct btrfs_root *root,
+ struct btrfs_qgroup_data_rsv_map *map,
+ struct data_rsv_range *tmp,
+ struct ulist *insert_list, u64 start, u64 len,
+ u64 *reserved)
+{
+   struct data_rsv_range *range;
+   u64 cur_start = 0;
+   u64 cur_len = 0;
+   u64 reserve = 0;
+   int ret = 0;
+
+   range = find_reserve_range(map, start);
+   /* empty tree, insert the whole range */
+   if (!range) {
+   reserve = len;
+   ret = ulist_add(insert_list, start, len, GFP_ATOMIC);
+   if (ret < 0)
+   return ret;
+   goto insert;
+   }
+
+   /* For case range is covering the leading part */
+   if (range->start <= start && range->start + range->len > start)
+   cur_start = range->start + range->len;
+   else
+   cur_start = start;
+
+   /*
+* iterate until the end of the range.
+* Like the following:
+*
+*  |<---------------- desired ---------------->|
+*|//1//|   |//2//|   |///3///|   <- exists
+* Then we will need to insert the following
+*  |\\\4\\\|   |\\\5\\\|   |\\\6\\\|
+* And only add qgroup->reserved for ranges 4, 5 and 6.
+*/
+   while (cur_start < start + len) {
+   struct rb_node *next_node;
+   u64 next_start;
+
+   if (range->start + range->len <= cur_start) {
+   /*
+* Move to next range if current range is before
+* cur_start
+* e.g range is 1, cur_start is the end of range 1.
+*/
+   next_node = rb_next(&range->node);
+   if (!next_node) {
+   /*
+* no next range, fill the rest
+* e.g range is 3, cur_start is end of range 3.
+*/
+   cur_len = start + len - cur_start;
+   next_start = start + len;
+   } else {
+   range = rb_entry(next_node,
+struct data_rsv_range, node);
+   cur_len = min(range->start, start + len) -
+ cur_start;
+   next_start = range->start + range->len;
+   }
+   } else {
+   /*
+* current range is already after cur_start
+* e.g range is 2, cur_start is end of range 1.
+*/
+   cur_len = min(range->start, start + len) - cur_start;
+   next_start = range->start + range->len;
+   }
+   reserve += cur_len;
+   ret = ulist_add(insert_list, cur_start, cur_len, GFP_ATOMIC);
+   if (ret < 0)
+   return ret;
+
+   cur_start = next_start;
+   }
+insert:
+   ret = btrfs_qgroup_reserve(root, reserve);
+   if (ret < 0)
+   return ret;
+   /* ranges must be inserted after we are sure it has enough space */
+   ret = insert_data_ranges(map, tmp, insert_list);
+   map->reserved += reserve;
+   if (reserved)
+   *reserved = reserve;
+   return ret;
+}
+
+/*
  * Init data_rsv_map for a given inode.
  *
  * This is needed at write time as quota can be disabled and then enabled
-- 
2.6.1



[PATCH v2 03/23] btrfs: qgroup: Introduce new function to search most left reserve range

2015-10-08 Thread Qu Wenruo
Introduce the new function to search for the left-most reserved range in a
reserve map.

It provides the basis for the later reserve map implementation.
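
A short behavior sketch with hypothetical ranges; note that the caller must
still check whether the returned range actually covers the requested offset:

/*
 * Assume the map holds two ranges, [0, 4K) and [8K, 12K):
 *
 *   find_reserve_range(map, 0)   -> [0, 4K)    exact start match
 *   find_reserve_range(map, 6K)  -> [0, 4K)    nearest range on the left,
 *                                              it does not cover 6K
 *   find_reserve_range(map, 10K) -> [8K, 12K)  covers 10K
 *
 * Only when the queried offset is left of every range does the function
 * return the left-most range, which then starts beyond the offset.
 */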

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/qgroup.c | 36 
 1 file changed, 36 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index c275312..c771029 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2541,6 +2541,42 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
 }
 
 /*
+ * Return the nearest range left of (or at) the given start.
+ * There is no guarantee that the returned range covers start.
+ */
+static struct data_rsv_range *
+find_reserve_range(struct btrfs_qgroup_data_rsv_map *map, u64 start)
+{
+   struct rb_node **p = &map->root.rb_node;
+   struct rb_node *parent = NULL;
+   struct rb_node *prev = NULL;
+   struct data_rsv_range *range = NULL;
+
+   while (*p) {
+   parent = *p;
+   range = rb_entry(parent, struct data_rsv_range, node);
+   if (range->start < start)
+   p = &(*p)->rb_right;
+   else if (range->start > start)
+   p = &(*p)->rb_left;
+   else
+   return range;
+   }
+
+   /* empty tree */
+   if (!parent)
+   return NULL;
+   if (range->start <= start)
+   return range;
+
+   prev = rb_prev(parent);
+   /* Already most left one */
+   if (!prev)
+   return range;
+   return rb_entry(prev, struct data_rsv_range, node);
+}
+
+/*
  * Init data_rsv_map for a given inode.
  *
  * This is needed at write time as quota can be disabled and then enabled
-- 
2.6.1



[PATCH v2 04/23] btrfs: qgroup: Introduce function to insert non-overlap reserve range

2015-10-08 Thread Qu Wenruo
New function insert_data_ranges() will insert non-overlapping reserve ranges
into the reserve map.

It provides the basis for the later qgroup reserve map implementation.
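
A short worked example of the merge behavior, with hypothetical ranges:

/*
 * Assume the map already holds [0, 4K) and [8K, 12K), and the caller now
 * inserts the non-overlapping range [4K, 8K):
 *
 *   - [4K, 8K) is adjacent to [0, 4K), so the previous range grows
 *     to [0, 8K)
 *   - [0, 8K) is now adjacent to [8K, 12K), so the two collapse into
 *     [0, 12K) and the trailing [8K, 12K) node is erased and freed
 *
 * The pre-allocated 'tmp' node is only linked in (return value > 0) when
 * the new range cannot be merged with either neighbour.
 */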

Signed-off-by: Qu Wenruo 
---
v2:
  Fix comment typo
---
 fs/btrfs/qgroup.c | 124 ++
 1 file changed, 124 insertions(+)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index c771029..b690b02 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2577,6 +2577,130 @@ find_reserve_range(struct btrfs_qgroup_data_rsv_map 
*map, u64 start)
 }
 
 /*
+ * Insert one data range
+ * The [start, len) ranges passed in here won't overlap with each other.
+ *
+ * Return 0 if range is inserted and tmp is not used.
+ * Return > 0 if range is inserted and tmp is used.
+ * No catchable error case. The only possible error will cause BUG_ON(), as
+ * that's a logic error.
+ */
+static int insert_data_range(struct btrfs_qgroup_data_rsv_map *map,
+struct data_rsv_range *tmp,
+u64 start, u64 len)
+{
+   struct rb_node **p = &map->root.rb_node;
+   struct rb_node *parent = NULL;
+   struct rb_node *tmp_node = NULL;
+   struct data_rsv_range *range = NULL;
+   struct data_rsv_range *prev_range = NULL;
+   struct data_rsv_range *next_range = NULL;
+   int prev_merged = 0;
+   int next_merged = 0;
+   int ret = 0;
+
+   while (*p) {
+   parent = *p;
+   range = rb_entry(parent, struct data_rsv_range, node);
+   if (range->start < start)
+   p = &(*p)->rb_right;
+   else if (range->start > start)
+   p = &(*p)->rb_left;
+   else
+   BUG_ON(1);
+   }
+
+   /* Empty tree, goto isolated case */
+   if (!range)
+   goto insert_isolated;
+
+   /* get adjacent ranges */
+   if (range->start < start) {
+   prev_range = range;
+   tmp_node = rb_next(parent);
+   if (tmp_node)
+   next_range = rb_entry(tmp_node, struct data_rsv_range,
+ node);
+   } else {
+   next_range = range;
+   tmp_node = rb_prev(parent);
+   if (tmp_node)
+   prev_range = rb_entry(tmp_node, struct data_rsv_range,
+ node);
+   }
+
+   /* try to merge with previous and next ranges */
+   if (prev_range && prev_range->start + prev_range->len == start) {
+   prev_merged = 1;
+   prev_range->len += len;
+   }
+   if (next_range && start + len == next_range->start) {
+   next_merged = 1;
+
+   /*
+* the range can be merged with the two adjacent ranges into one,
+* remove the trailing range.
+*/
+   if (prev_merged) {
+   prev_range->len += next_range->len;
+   rb_erase(&next_range->node, &map->root);
+   kfree(next_range);
+   } else {
+   next_range->start = start;
+   next_range->len += len;
+   }
+   }
+
+insert_isolated:
+   /* isolated case, need to insert range now */
+   if (!next_merged && !prev_merged) {
+   BUG_ON(!tmp);
+
+   tmp->start = start;
+   tmp->len = len;
+   rb_link_node(&tmp->node, parent, p);
+   rb_insert_color(&tmp->node, &map->root);
+   ret = 1;
+   }
+   return ret;
+}
+
+/*
+ * insert reserve range and merge them if possible
+ *
+ * Return 0 if all inserted and tmp not used
+ * Return > 0 if all inserted and tmp used
+ * No catchable error return value.
+ */
+static int insert_data_ranges(struct btrfs_qgroup_data_rsv_map *map,
+ struct data_rsv_range *tmp,
+ struct ulist *insert_list)
+{
+   struct ulist_node *unode;
+   struct ulist_iterator uiter;
+   int tmp_used = 0;
+   int ret = 0;
+
+   ULIST_ITER_INIT(&uiter);
+   while ((unode = ulist_next(insert_list, &uiter))) {
+   ret = insert_data_range(map, tmp, unode->val, unode->aux);
+
+   /*
+* insert_data_range() won't return an error value, so there is
+* no need to handle the < 0 case.
+*
+* Also tmp should be used at most one time, so clear it to
+* NULL to cooperate with sanity check in insert_data_range().
+*/
+   if (ret > 0) {
+   tmp_used = 1;
+   tmp = NULL;
+   }
+   }
+   return tmp_used;
+}
+
+/*
  * Init data_rsv_map for a given inode.
  *
  * This is needed at write time as quota can be disabled and then enabled
-- 
2.6.1


[PATCH v2 11/23] btrfs: qgroup: Introduce new functions to reserve/free metadata

2015-10-08 Thread Qu Wenruo
Introduce new functions btrfs_qgroup_reserve/free_meta() to reserve/free
metadata reserved space.
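
For reference, a sketch of the intended call pattern (patch 12 wires this
up for real; the function below is only an illustration, not real code):

static int qgroup_meta_usage_sketch(struct btrfs_root *root, int nr_items,
                                    bool failed)
{
        int ret;

        /* e.g. at transaction start: one nodesize per touched tree item */
        ret = btrfs_qgroup_reserve_meta(root, nr_items * root->nodesize);
        if (ret)
                return ret;

        if (failed) {
                /* error path: give back exactly what this caller reserved */
                btrfs_qgroup_free_meta(root, nr_items * root->nodesize);
                return -ENOSPC;
        }

        /* at commit time (commit_fs_roots): drop whatever accumulated */
        btrfs_qgroup_free_meta_all(root);
        return 0;
}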

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/qgroup.c  | 40 
 fs/btrfs/qgroup.h  |  4 
 4 files changed, 48 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..ae86025 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1943,6 +1943,9 @@ struct btrfs_root {
int send_in_progress;
struct btrfs_subvolume_writers *subv_writers;
atomic_t will_be_snapshoted;
+
+   /* For qgroup metadata space reserve */
+   atomic_t qgroup_meta_rsv;
 };
 
 struct btrfs_ioctl_defrag_range_args {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 807f685..2b51705 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1259,6 +1259,7 @@ static void __setup_root(u32 nodesize, u32 sectorsize, 
u32 stripesize,
atomic_set(&root->orphan_inodes, 0);
atomic_set(&root->refs, 1);
atomic_set(&root->will_be_snapshoted, 0);
+   atomic_set(&root->qgroup_meta_rsv, 0);
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 1f03f9d..b7f6ce1 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -3120,3 +3120,43 @@ void btrfs_qgroup_free_data_rsv_map(struct inode *inode)
kfree(dirty_map);
binode->qgroup_rsv_map = NULL;
 }
+
+int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes)
+{
+   int ret;
+
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid) ||
+   num_bytes == 0)
+   return 0;
+
+   BUG_ON(num_bytes != round_down(num_bytes, root->nodesize));
+   ret = btrfs_qgroup_reserve(root, num_bytes);
+   if (ret < 0)
+   return ret;
+   atomic_add(num_bytes, &root->qgroup_meta_rsv);
+   return ret;
+}
+
+void btrfs_qgroup_free_meta_all(struct btrfs_root *root)
+{
+   int reserved;
+
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid))
+   return;
+
+   reserved = atomic_xchg(&root->qgroup_meta_rsv, 0);
+   if (reserved == 0)
+   return;
+   btrfs_qgroup_free(root, reserved);
+}
+
+void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes)
+{
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid))
+   return;
+
+   BUG_ON(num_bytes != round_down(num_bytes, root->nodesize));
+   WARN_ON(atomic_read(&root->qgroup_meta_rsv) < num_bytes);
+   atomic_sub(num_bytes, &root->qgroup_meta_rsv);
+   btrfs_qgroup_free(root, num_bytes);
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index c7ee46a..47d75cb 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -106,4 +106,8 @@ void btrfs_qgroup_free_data_rsv_map(struct inode *inode);
 int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len);
 int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len);
 int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len);
+
+int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes);
+void btrfs_qgroup_free_meta_all(struct btrfs_root *root);
+void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes);
 #endif /* __BTRFS_QGROUP__ */
-- 
2.6.1



[PATCH v2 13/23] btrfs: extent-tree: Add new version of btrfs_check_data_free_space and btrfs_free_reserved_data_space.

2015-10-08 Thread Qu Wenruo
Add new functions __btrfs_check_data_free_space() and
__btrfs_free_reserved_data_space() to work with new accurate qgroup
reserved space framework.

The new functions will replace the old btrfs_check_data_free_space() and
btrfs_free_reserved_data_space() respectively, but until all the changes
are done, let's just use the new names.

Also, export the internal function btrfs_alloc_data_chunk_ondemand(): now
that qgroup reserve requires precise byte ranges, some operations can't get
the accurate number in advance (like fallocate).
But the data space info check and data chunk allocation don't need to be
that accurate, and can be called at the beginning.

So export it for later operations.
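
For reference, a sketch of the kind of caller this enables (an assumption,
not the actual fallocate change in this series; find_next_hole() is a
hypothetical placeholder for the extent walk):

static int fallocate_reserve_sketch(struct inode *inode, u64 start, u64 len)
{
        u64 cur = start;
        int ret;

        /* chunk allocation does not need byte-precise ranges, do it once */
        ret = btrfs_alloc_data_chunk_ondemand(inode, len);
        if (ret < 0)
                return ret;

        while (cur < start + len) {
                u64 hole_start, hole_len;

                /* hypothetical helper: returns non-zero when no hole left */
                if (find_next_hole(inode, cur, start + len, &hole_start,
                                   &hole_len))
                        break;

                /* precise qgroup reservation, only for the actual holes */
                ret = btrfs_qgroup_reserve_data(inode, hole_start, hole_len);
                if (ret < 0)
                        return ret;
                cur = hole_start + hole_len;
        }
        return 0;
}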

Signed-off-by: Qu Wenruo 
---
v2:
  Fix comment typo
  Add __btrfs_free_reserved_data_space() function, or we would leak
  reserved space in the EDQUOT error handling routine.
---
 fs/btrfs/ctree.h   |  3 ++
 fs/btrfs/extent-tree.c | 85 --
 2 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ae86025..19450a1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3453,7 +3453,10 @@ enum btrfs_reserve_flush_enum {
 };
 
 int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 
write_bytes);
+int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
+int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
+void __btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 22702bd..0cd6baa 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3908,11 +3908,7 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int 
data)
return ret;
 }
 
-/*
- * This will check the space that the inode allocates from to make sure we have
- * enough space for bytes.
- */
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 
write_bytes)
+int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 {
struct btrfs_space_info *data_sinfo;
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -4033,19 +4029,55 @@ commit_trans:
  data_sinfo->flags, bytes, 1);
return -ENOSPC;
}
-   ret = btrfs_qgroup_reserve(root, write_bytes);
-   if (ret)
-   goto out;
data_sinfo->bytes_may_use += bytes;
trace_btrfs_space_reservation(root->fs_info, "space_info",
  data_sinfo->flags, bytes, 1);
-out:
spin_unlock(&data_sinfo->lock);
 
return ret;
 }
 
 /*
+ * This will check the space that the inode allocates from to make sure we have
+ * enough space for bytes.
+ */
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 
write_bytes)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   int ret;
+
+   ret = btrfs_alloc_data_chunk_ondemand(inode, bytes);
+   if (ret < 0)
+   return ret;
+   ret = btrfs_qgroup_reserve(root, write_bytes);
+   return ret;
+}
+
+/*
+ * New check_data_free_space() with ability for precise data reservation
+ * Will replace old btrfs_check_data_free_space(), but for patch split,
+ * add a new function first and then replace it.
+ */
+int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   int ret;
+
+   /* align the range */
+   len = round_up(start + len, root->sectorsize) -
+ round_down(start, root->sectorsize);
+   start = round_down(start, root->sectorsize);
+
+   ret = btrfs_alloc_data_chunk_ondemand(inode, len);
+   if (ret < 0)
+   return ret;
+
+   /* Use new btrfs_qgroup_reserve_data to reserve precious data space */
+   ret = btrfs_qgroup_reserve_data(inode, start, len);
+   return ret;
+}
+
+/*
  * Called if we need to clear a data reservation for this inode.
  */
 void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
@@ -4065,6 +4097,41 @@ void btrfs_free_reserved_data_space(struct inode *inode, 
u64 bytes)
spin_unlock(&data_sinfo->lock);
 }
 
+/*
+ * Called if we need to clear a data reservation for this inode
+ * Normally in an error case.
+ *
+ * This one will handle the per-inode data rsv map for the accurate reserved
+ * space framework.
+ */
+void __btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_space_info *data_sinfo;
+
+   /* Make sure the range is aligned to sectorsize */
+   len = round_up(start + len, 

[PATCH v2 12/23] btrfs: qgroup: Use new metadata reservation.

2015-10-08 Thread Qu Wenruo
As we have the new metadata reservation functions, use them to replace
the old btrfs_qgroup_reserve() call for metadata.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/extent-tree.c | 14 ++
 fs/btrfs/transaction.c | 34 ++
 fs/btrfs/transaction.h |  1 -
 3 files changed, 12 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4f6758b..22702bd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5345,7 +5345,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
if (root->fs_info->quota_enabled) {
/* One for parent inode, two for dir entries */
num_bytes = 3 * root->nodesize;
-   ret = btrfs_qgroup_reserve(root, num_bytes);
+   ret = btrfs_qgroup_reserve_meta(root, num_bytes);
if (ret)
return ret;
} else {
@@ -5363,10 +5363,8 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
if (ret == -ENOSPC && use_global_rsv)
ret = btrfs_block_rsv_migrate(global_rsv, rsv, num_bytes);
 
-   if (ret) {
-   if (*qgroup_reserved)
-   btrfs_qgroup_free(root, *qgroup_reserved);
-   }
+   if (ret && *qgroup_reserved)
+   btrfs_qgroup_free_meta(root, *qgroup_reserved);
 
return ret;
 }
@@ -5527,15 +5525,15 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
spin_unlock(&BTRFS_I(inode)->lock);
 
if (root->fs_info->quota_enabled) {
-   ret = btrfs_qgroup_reserve(root, nr_extents * root->nodesize);
+   ret = btrfs_qgroup_reserve_meta(root,
+   nr_extents * root->nodesize);
if (ret)
goto out_fail;
}
 
ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
if (unlikely(ret)) {
-   if (root->fs_info->quota_enabled)
-   btrfs_qgroup_free(root, nr_extents * root->nodesize);
+   btrfs_qgroup_free_meta(root, nr_extents * root->nodesize);
goto out_fail;
}
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 376191c..5ed06b8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -478,13 +478,10 @@ start_transaction(struct btrfs_root *root, u64 num_items, 
unsigned int type,
 * the appropriate flushing if need be.
 */
if (num_items > 0 && root != root->fs_info->chunk_root) {
-   if (root->fs_info->quota_enabled &&
-   is_fstree(root->root_key.objectid)) {
-   qgroup_reserved = num_items * root->nodesize;
-   ret = btrfs_qgroup_reserve(root, qgroup_reserved);
-   if (ret)
-   return ERR_PTR(ret);
-   }
+   qgroup_reserved = num_items * root->nodesize;
+   ret = btrfs_qgroup_reserve_meta(root, qgroup_reserved);
+   if (ret)
+   return ERR_PTR(ret);
 
num_bytes = btrfs_calc_trans_metadata_size(root, num_items);
/*
@@ -553,7 +550,6 @@ again:
h->block_rsv = NULL;
h->orig_rsv = NULL;
h->aborted = 0;
-   h->qgroup_reserved = 0;
h->delayed_ref_elem.seq = 0;
h->type = type;
h->allocating_chunk = false;
@@ -579,7 +575,6 @@ again:
h->bytes_reserved = num_bytes;
h->reloc_reserved = reloc_reserved;
}
-   h->qgroup_reserved = qgroup_reserved;
 
 got_it:
btrfs_record_root_in_trans(h, root);
@@ -597,8 +592,7 @@ alloc_fail:
btrfs_block_rsv_release(root, &root->fs_info->trans_block_rsv,
num_bytes);
 reserve_fail:
-   if (qgroup_reserved)
-   btrfs_qgroup_free(root, qgroup_reserved);
+   btrfs_qgroup_free_meta(root, qgroup_reserved);
return ERR_PTR(ret);
 }
 
@@ -815,15 +809,6 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
must_run_delayed_refs = 2;
}
 
-   if (trans->qgroup_reserved) {
-   /*
-* the same root has to be passed here between start_transaction
-* and end_transaction. Subvolume quota depends on this.
-*/
-   btrfs_qgroup_free(trans->root, trans->qgroup_reserved);
-   trans->qgroup_reserved = 0;
-   }
-
btrfs_trans_release_metadata(trans, root);
trans->block_rsv = NULL;
 
@@ -1238,6 +1223,7 @@ static noinline int commit_fs_roots(struct 
btrfs_trans_handle *trans,
spin_lock(&fs_info->fs_roots_radix_lock);
if (err)
break;
+   btrfs_qgroup_free_meta_all(root);
}
  

Re: [PATCH v2 00/23] Rework btrfs qgroup reserved space framework

2015-10-08 Thread Qu Wenruo



Josef Bacik wrote on 2015/10/08 21:36 -0700:

On 10/08/2015 07:11 PM, Qu Wenruo wrote:

In previous rework of qgroup, we succeeded in fixing qgroup accounting
part, making the rfer/excl numbers accurate.

But that's just part of qgroup work, another part of qgroup still has
quite a lot problem, that's qgroup reserve space part which will lead to
EQUOT even we are far from the limit.

[[BUG]]
The easiest way to trigger the bug is,
1) Enable quota
2) Limit excl of qgroup 5 to 16M
3) Write [0,2M) of a file inside subvol 5 10 times without sync

EQUOT will be triggered at about the 8th write.
But after remount, we can still write until about 15M.

[[CAUSE]]
The problem is caused by the fact that qgroup will reserve space even
the data space is already reserved.

In above reproducer, each time we buffered write [0,2M) qgroup will
reserve 2M space, but in fact, at the 1st time, we have already reserved
2M and from then on, we don't need to reserved any data space as we are
only writing [0,2M).

Also, the reserved space will only be freed *ONCE* when its backref is
run at commit_transaction() time.

That's causing the reserved space leaking.

[[FIX]]
The fix is not a simple one, as currently btrfs_qgroup_reserve() will
allocate whatever caller asked for.

So for accurate qgroup reserve, we introduce a completely new framework
for data and metadata.
1) Per-inode data reserve map
Now, each inode will have a data reserve map, recording which range
of data is already reserved.
If we are writing a range which is already reserved, we won't need to
reserve space again.

Also, for the fact that qgroup is only accounted at commit_trans(),
for data commit into disc and its metadata is also inserted into
current tree, we should free the data reserved range, but still keep
the reserved space until commit_trans().

So delayed_ref_head will have new members to record how much space is
reserved and free them at commit_trans() time.


This is already handled by setting DELALLOC in the io_tree, we do
similar sort of stuff for the normal enospc accounting, why not use
that?  Thanks,

Josef


Thanks for pointing this out.

I was also searching for an existing facility, but didn't find one as I'm
not familiar with the io_tree.


After a quick glance, it seems to fit the need quite well, but I'm not completely sure.

I'll keep investigating on it and try to use it.

BTW, from what I understand, __btrfs_buffered_write() should cause the
range to be marked DELALLOC, but I didn't find any call to
set_extent_delalloc(); is that done somewhere else?

Thanks,
Qu


Re: [PATCH v2 00/23] Rework btrfs qgroup reserved space framework

2015-10-08 Thread Josef Bacik

On 10/08/2015 07:11 PM, Qu Wenruo wrote:

In previous rework of qgroup, we succeeded in fixing qgroup accounting
part, making the rfer/excl numbers accurate.

But that's just part of qgroup work, another part of qgroup still has
quite a lot problem, that's qgroup reserve space part which will lead to
EQUOT even we are far from the limit.

[[BUG]]
The easiest way to trigger the bug is,
1) Enable quota
2) Limit excl of qgroup 5 to 16M
3) Write [0,2M) of a file inside subvol 5 10 times without sync

EQUOT will be triggered at about the 8th write.
But after remount, we can still write until about 15M.

[[CAUSE]]
The problem is caused by the fact that qgroup will reserve space even
the data space is already reserved.

In above reproducer, each time we buffered write [0,2M) qgroup will
reserve 2M space, but in fact, at the 1st time, we have already reserved
2M and from then on, we don't need to reserved any data space as we are
only writing [0,2M).

Also, the reserved space will only be freed *ONCE* when its backref is
run at commit_transaction() time.

That's causing the reserved space leaking.

[[FIX]]
The fix is not a simple one, as currently btrfs_qgroup_reserve() will
allocate whatever caller asked for.

So for accurate qgroup reserve, we introduce a completely new framework
for data and metadata.
1) Per-inode data reserve map
Now, each inode will have a data reserve map, recording which range
of data is already reserved.
If we are writing a range which is already reserved, we won't need to
reserve space again.

Also, for the fact that qgroup is only accounted at commit_trans(),
for data commit into disc and its metadata is also inserted into
current tree, we should free the data reserved range, but still keep
the reserved space until commit_trans().

So delayed_ref_head will have new members to record how much space is
reserved and free them at commit_trans() time.


This is already handled by setting DELALLOC in the io_tree, we do 
similar sort of stuff for the normal enospc accounting, why not use 
that?  Thanks,


Josef


Re: [PATCH] btrfs: fix waitqueue_active without memory barrier in btrfs

2015-10-08 Thread Kosuke Tatsukawa
Josef Bacik wrote:
> On 10/08/2015 05:35 PM, Kosuke Tatsukawa wrote:
>> btrfs_bio_counter_sub() seems to be missing a memory barrier which might
>> cause the waker to not notice the waiter and miss sending a wake_up as
>> in the following figure.
>> 
>>  btrfs_bio_counter_sub   btrfs_rm_dev_replace_blocked
>> 
>> if (waitqueue_active(&fs_info->replace_wait))
>> /* The CPU might reorder the test for
>> the waitqueue up here, before
>> prior writes complete */
>>  /* wait_event */
>>   /* __wait_event */
>>/* ___wait_event */
>>long __int = 
>> prepare_to_wait_event(&fs_info->replace_wait,
>>  &__wait, state);
>>if 
>> (!percpu_counter_sum(_info->bio_counter))
>> percpu_counter_sub(_info->bio_counter,
>>amount);
>>schedule()
>
> percpu_counter_sub can't be reordered, in its most basic form it does
> preempt_disable/enable which in its most basic form does barrier().  Thanks,

It's not the compiler, but the CPU that is doing the reordering.

The CPU can delay the write of the counter, so that the following read
of fs_info->replace_wait is completed first.  Hence a memory barrier is
required, and not just a barrier.
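
For reference, a sketch of the kind of fix being discussed (an assumption,
not the patch as posted):

void btrfs_bio_counter_sub(struct btrfs_fs_info *fs_info, s64 amount)
{
        percpu_counter_sub(&fs_info->bio_counter, amount);
        /*
         * Make the counter update visible before testing the waitqueue;
         * pairs with the barrier done on the waiter side by
         * prepare_to_wait_event()/set_current_state().
         */
        smp_mb();
        if (waitqueue_active(&fs_info->replace_wait))
                wake_up(&fs_info->replace_wait);
}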
---
Kosuke TATSUKAWA  | 3rd IT Platform Department
  | IT Platform Division, NEC Corporation
  | ta...@ab.jp.nec.com