btrfs goes readonly + No space left on 4.3

2015-10-12 Thread Stefan Priebe

Hi,

While trying to write to my volume, btrfs goes read-only:

btrfs fi show /vmbackup/
Label: none  uuid: f4afaac2-c587-4ff7-87b1-19e6a483215f
        Total devices 1 FS bytes used 35.56TiB
        devid    1 size 50.93TiB used 35.72TiB path /dev/mapper/stripe0-vmbackup


btrfs-progs v4.1.2

btrfs fi df /vmbackup/
Data, single: total=35.40TiB, used=35.40TiB
System, DUP: total=8.00MiB, used=3.75MiB
Metadata, DUP: total=162.00GiB, used=160.72GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

The kernel is 4.1.10 with all btrfs patches up to 4.3-rc3 applied.

[ 6230.406369] ------------[ cut here ]------------
[ 6230.411594] BTRFS warning (device dm-0): btrfs_finish_ordered_io:2840: Aborting unused transaction(No space left).
[ 6230.463718] BTRFS warning (device dm-0): btrfs_finish_ordered_io:2840: Aborting unused transaction(No space left).
[ 6230.466681] BTRFS warning (device dm-0): btrfs_finish_ordered_io:2840: Aborting unused transaction(No space left).
[ 6230.475887] BTRFS warning (device dm-0): btrfs_finish_ordered_io:2840: Aborting unused transaction(No space left).
[ 6230.505852] BTRFS warning (device dm-0): btrfs_finish_ordered_io:2840: Aborting unused transaction(No space left).
[ 6230.525233] BTRFS warning (device dm-0): btrfs_finish_ordered_io:2840: Aborting unused transaction(No space left).
[ 6230.851050] WARNING: CPU: 8 PID: 8230 at fs/btrfs/extent-tree.c:6356 __btrfs_free_extent.isra.83+0x2cc/0xce0 [btrfs]()
[ 6230.911727] BTRFS: Transaction aborted (error -28)
[ 6230.911729] Modules linked in: netconsole ipt_REJECT nf_reject_ipv4 xt_multiport iptable_filter ip_tables x_tables bonding ext2 coretemp loop usbhid ehci_pci ehci_hcd sb_edac i2c_i801 ipmi_si usbcore edac_core i2c_core shpchp usb_common ipmi_msghandler button btrfs lzo_compress dm_mod raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq raid1 md_mod i40e(O) ixgbe vxlan ip6_udp_tunnel mdio udp_tunnel sg sd_mod ahci ptp aacraid libahci pps_core
[ 6231.175922] CPU: 8 PID: 8230 Comm: btrfs-transacti Tainted: G O 4.1.10 #1
[ 6231.243585] Hardware name: Supermicro X10DRH/X10DRH-IT, BIOS 1.0c 02/18/2015
[ 6231.311584]  c0433895 88085a8db9b8 8a639186 0001
[ 6231.380600]  88085a8dba08 88085a8db9f8 8a07fd57 88080850
[ 6231.449888]  880311fef2d0 1fcbf245
[ 6231.519402] Call Trace:
[ 6231.586837]  [] dump_stack+0x45/0x57
[ 6231.654260]  [] warn_slowpath_common+0x97/0xe0
[ 6231.721958]  [] warn_slowpath_fmt+0x46/0x50
[ 6231.789246]  [] __btrfs_free_extent.isra.83+0x2cc/0xce0 [btrfs]
[ 6231.857732]  [] ? block_group_cache_tree_search+0x98/0xf0 [btrfs]
[ 6231.926972]  [] ? find_ref_head+0x6c/0x90 [btrfs]
[ 6231.995254]  [] __btrfs_run_delayed_refs+0x730/0x11a0 [btrfs]
[ 6232.062995]  [] btrfs_run_delayed_refs+0x7f/0x290 [btrfs]
[ 6232.130929]  [] btrfs_write_dirty_block_groups+0x103/0x2a0 [btrfs]
[ 6232.199577]  [] commit_cowonly_roots+0x225/0x2cf [btrfs]
[ 6232.268690]  [] btrfs_commit_transaction+0x538/0xa90 [btrfs]
[ 6232.338554]  [] transaction_kthread+0x1c5/0x240 [btrfs]
[ 6232.407982]  [] ? open_ctree+0x2390/0x2390 [btrfs]
[ 6232.476521]  [] kthread+0xc9/0xe0
[ 6232.544430]  [] ? kthread_create_on_node+0x1a0/0x1a0
[ 6232.612249]  [] ret_from_fork+0x42/0x70
[ 6232.680322]  [] ? kthread_create_on_node+0x1a0/0x1a0
[ 6232.748832] ---[ end trace 4977add5c48cdc47 ]---
[ 6232.816367] BTRFS: error (device dm-0) in __btrfs_free_extent:6356: errno=-28 No space left
[ 6232.885109] BTRFS info (device dm-0): forced readonly
[ 6232.953994] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2854: errno=-28 No space left
[ 6233.027127] BTRFS warning (device dm-0): Skipping commit of aborted transaction.
[ 6233.098974] BTRFS: error (device dm-0) in cleanup_transaction:1726: errno=-28 No space left


Stefan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Questions about FIEMAP

2015-10-12 Thread Hugo Mills
On Mon, Oct 12, 2015 at 04:37:55AM +, Wang, Zhiye wrote:
> Hello everyone,
> 
> After googling a bit, I learned that btrfs supports FIEMAP (as "cp"
> needs it), but that the result is not valid for "write" operations.
> 
> I guess we cannot write to the block device directly after getting the
> block list using FIEMAP. This is because of:
> 
> 1. The COW feature of btrfs (but this can be disabled using NOCOW)
> 2. File system rebalance
> 3. Defragmentation
> 
> Aren't items #2 and #3 also a problem for "read" operations? For example,
> after "cp" gets the block list using FIEMAP, a file system rebalance
> occurs, so the previous result of FIEMAP is not valid anymore.

   That's correct. If you use FIEMAP to get the blocks of a file, and
then balance the FS or defrag the file, then even without an explicit
write to the file, the file's location will have changed. This is the
same reason that btrfs doesn't support swap files (although I don't
know if swapon uses FIEMAP directly, or if there's just some
equivalent mechanism to get the blocks).
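
As a hedged illustration of where such a block list comes from, here is
a minimal C sketch that asks the kernel for a file's extents through the
FS_IOC_FIEMAP ioctl. The helper name print_extents and the cap of 32
extents are assumptions made for this example; nothing here is taken
from cp or from btrfs itself. The mapping it prints is only a snapshot:
a later balance or defrag can move the data without the file ever being
written.

```c
#include <fcntl.h>
#include <linux/fiemap.h>   /* struct fiemap, struct fiemap_extent */
#include <linux/fs.h>       /* FS_IOC_FIEMAP */
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_EXTENTS 32  /* arbitrary cap chosen for this sketch */

/*
 * Print up to MAX_EXTENTS extents of 'path' as reported by FIEMAP.
 * Returns the number of extents printed, or -1 on error.
 */
int print_extents(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    size_t sz = sizeof(struct fiemap) +
                MAX_EXTENTS * sizeof(struct fiemap_extent);
    struct fiemap *fm = calloc(1, sz);
    if (!fm) {
        close(fd);
        return -1;
    }

    fm->fm_start = 0;
    fm->fm_length = FIEMAP_MAX_OFFSET;  /* map the whole file */
    fm->fm_flags = FIEMAP_FLAG_SYNC;    /* flush dirty data first */
    fm->fm_extent_count = MAX_EXTENTS;

    int ret = -1;
    if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
        for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
            struct fiemap_extent *e = &fm->fm_extents[i];
            printf("logical=%llu physical=%llu length=%llu\n",
                   (unsigned long long)e->fe_logical,
                   (unsigned long long)e->fe_physical,
                   (unsigned long long)e->fe_length);
        }
        ret = (int)fm->fm_mapped_extents;
    }

    free(fm);
    close(fd);
    return ret;
}
```

Calling print_extents() on the same file before and after a balance
would be one way to observe the change in location.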

   Hugo.

-- 
Hugo Mills | Have found Lost City of Atlantis. High Priest is
hugo@... carfax.org.uk | winning at quoits
http://carfax.org.uk/  |
PGP: E2AB1DE4  |


Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks

2015-10-12 Thread Pádraig Brady
On 11/10/15 15:29, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:53PM -0400, Anna Schumaker wrote:
>> Reject copies that don't have the COPY_FR_REFLINK flag set.
> 
> I think a reflink actually is a perfectly valid copy, and I don't buy
> the duplicate arguments in earlier threads.  We really need to think
> more in terms of how this impacts a user and not how it's implemented
> internally.  How does a user notice it's a reflink?  They don't as
> implemented in btrfs and co.

You're right that if the user doesn't notice, then there is no
point exposing this. However, I think the user does notice, as
there is a difference in the end state of the copy. That is, a
different end state generally warrants an option, while a mere
difference in copying mechanism does not.
I think the different end state of a reflink warrants an option for 3 reasons:

 - The user might want separate bits for resiliency. This is
   a weak argument due to possible deduplication in lower layers,
   but it is still valid in some setups.

 - The user might want to avoid CoW at a later time-critical stage.

 - The user might want to avoid ENOSPC at a later critical stage.
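
To make the end-state difference concrete, here is a minimal sketch with
an explicit option. It does not use the copy_file_range() interface from
the patch series; it swaps in the FICLONE ioctl from <linux/fs.h> for
the reflink case and a plain read/write loop for the real copy. The
function name copy_file and the want_reflink parameter are hypothetical.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FICLONE */
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy src to dst. If want_reflink is non-zero, ask the filesystem to
 * share extents (FICLONE) and fail if it cannot. Otherwise copy the
 * bytes, so that dst is written out as new data of its own.
 * Returns 0 on success, -1 on error with errno set.
 */
int copy_file(const char *src, const char *dst, int want_reflink)
{
    int in = open(src, O_RDONLY);
    if (in < 0)
        return -1;

    int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) {
        int saved = errno;
        close(in);
        errno = saved;
        return -1;
    }

    int ret = 0;
    if (want_reflink) {
        /* Shared extents: cheap now, CoW (and possibly ENOSPC) later. */
        if (ioctl(out, FICLONE, in) < 0)
            ret = -1;
    } else {
        /* Real copy: the space for dst is consumed right now. */
        char buf[65536];
        ssize_t n = 0;
        while (ret == 0 && (n = read(in, buf, sizeof(buf))) > 0) {
            ssize_t done = 0;
            while (done < n) {
                ssize_t w = write(out, buf + done, (size_t)(n - done));
                if (w < 0) {
                    ret = -1;
                    break;
                }
                done += w;
            }
        }
        if (n < 0)
            ret = -1;
    }

    int saved = errno;
    close(in);
    close(out);
    errno = saved;
    return ret;
}
```

With want_reflink set, the call fails on filesystems that cannot share
extents instead of silently falling back, which keeps the end state
predictable for the caller.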

> Now on filesystems that don't always do
> copy on write but might support reflinks (ocfs2, XFS in the future)
> this becomes a bit more interesting - the difference here is that we
> get an implicit fallocate when doing a real copy.  But if that's
> something we have actual requests for that's how we should specify
> it rather than in terms of arcane implementation details.

thanks,
Pádraig.



Re: Questions about FIEMAP

2015-10-12 Thread Duncan
Wang, Zhiye posted on Mon, 12 Oct 2015 04:37:55 + as excerpted:

> I guess we cannot write to the block device directly after getting the
> block list using FIEMAP. This is because of:
> 
> 1. The COW feature of btrfs (but this can be disabled using NOCOW)

I'm a user, not a dev, and many of the specifics of this discussion are 
no doubt above my head, but here is a warning about this assumption, 
just in case you overlooked it...

Btrfs' snapshot feature conflicts with nocow, because a snapshot locks in 
place existing extents, relying on cow for any rewrite, to write the new 
blocks elsewhere.

So what happens when a nocow file is snapshotted and then written into?

Simple enough, it's effectively cow1.  That is, the first write to a 
particular block of a nocow file after a snapshot will still cow it, but 
the file retains its nocow attribute, and further writes to the same 
block will rewrite the block in its now existing new location... until 
the next snapshot locks that too in place, of course.
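
The cow1 behaviour can be modelled with a few lines of C. This is a toy
model only, under the assumption that each block has one current
location and a flag saying whether a snapshot holds it; the names
nocow_file, take_snapshot and write_block are made up for the sketch and
have nothing to do with the real btrfs data structures.

```c
#include <stdbool.h>
#include <stddef.h>

#define NBLOCKS 4

struct nocow_file {
    long location[NBLOCKS]; /* where each block currently lives */
    bool pinned[NBLOCKS];   /* true if a snapshot holds that location */
    long next_free;         /* next unused location for a forced cow */
};

void file_init(struct nocow_file *f)
{
    for (size_t i = 0; i < NBLOCKS; i++) {
        f->location[i] = (long)i;
        f->pinned[i] = false;
    }
    f->next_free = NBLOCKS;
}

/* A snapshot locks every existing block in place. */
void take_snapshot(struct nocow_file *f)
{
    for (size_t i = 0; i < NBLOCKS; i++)
        f->pinned[i] = true;
}

/* Write to block i; returns the location the new data ends up at. */
long write_block(struct nocow_file *f, size_t i)
{
    if (f->pinned[i]) {
        /* First write after a snapshot: cow once to a new location. */
        f->location[i] = f->next_free++;
        f->pinned[i] = false;
    }
    /* Otherwise the file is nocow and the block is rewritten in place. */
    return f->location[i];
}
```

In this model a block only moves on the first write after each
snapshot; every later write to it stays where it is until the next
snapshot.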

Bottom line, in the presence of snapshotting, particularly scheduled 
snapshotting that the admin may have forgotten about and/or doesn't know 
the consequences of, you can't rely on nocow actually being absolute 
rewrite-in-place nocow.

So just in case you weren't aware, don't assume what can't be assumed. 
=:^)

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: Issue: With an OverlayFS that has Btrfs as the upper layer, removal of directories silently fails

2015-10-12 Thread Miklos Szeredi
On Fri, Aug 21, 2015 at 9:23 PM, Neal Gompa wrote:
> Hello,
>
> For those who have received this message twice already, I sincerely apologize.
>
> Yesterday I filed an issue in the CentOS bug tracker (#9297[1]) and
> the Red Hat Bugzilla (#1255512[2]) about OverlayFS with Btrfs as the
> upper layer.
>
> The issue was discovered by a colleague of mine, and we verified the
> issue exists in the EL7 kernel and mainline kernels (tested on Arch
> Linux). I additionally verified the problem exists in Fedora kernels.
>
> Rather than quoting the issue, I'll just note that the issue is
> described quite well in the filed bugs noted earlier. In the linked
> bugs, there's a simple Bash script that will reliably reproduce the
> problem. While the script uses tmpfs at the lower layer, we originally
> discovered the problem with ext4 as the lower layer and verified that
> it doesn't matter which lower layer filesystem it is, the problem
> exists. Additionally, the problem doesn't exist when using ext4 as the
> upper layer.
>
> I was advised by Josh Boyer to email you guys (and the mailing lists,
> to be sure to get everyone a chance to look over it) about the
> problem, since it affects mainline.
>
> [1]: https://bugs.centos.org/view.php?id=9297
> [2]: https://bugzilla.redhat.com/show_bug.cgi?id=1255512

AFAICS btrfs doesn't support RENAME_EXCHANGE and RENAME_WHITEOUT.
These flags are needed for any filesystem that wants to be a fully
functional overlayfs upper layer.

Thanks,
Miklos


Re: BTRFS with 8TB SMR drives

2015-10-12 Thread Henk Slager
Hi Warren,

from your dmesg I see:
Oct 10 07:42:36 cloud.warrenhughes.net kernel: scsi 0:0:1:0: Direct-Access ATA  ST8000AS0002-1NA AR13 PQ: 0 ANSI: 5
Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)

Oct 11 23:57:56 cloud.warrenhughes.net kernel: scsi 0:0:1:0: Direct-Access ATA  ST8000AS0002-1NA AR13 PQ: 0 ANSI: 5
Oct 11 23:57:56 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdo] 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)

and looking at this spec:
http://www.seagate.com/files/www-content/product-content/hdd-fam/seagate-archive-hdd/en-us/docs/archive-hdd-dS1834-3-1411us.pdf

it seems that it is a drive-managed SMR disk. I am not sure why David
assumes it is host-managed; maybe the drive firmware/functionality can
be bypassed.

As far as I can see, the drive should not have a problem with btrfs as
such, but I have read quite worrying stories w.r.t. RAID. I think the
write characteristics of the balance operation, in combination with the
connection via the LSI controller, are not really compatible with the
'archive' use case of the drive. 'Simple', 'relaxed' write operations
should be OK, but beyond that, it might fail. See also:
http://www.storagereview.com/seagate_archive_hdd_review_8tb

How much data is already on the drive? Is it an option to mount with
skip_balance and try to remove the device and then do some tests on it
in single independent mode?

/Henk


On Mon, Oct 12, 2015 at 3:21 PM, David Sterba wrote:
> On Mon, Oct 12, 2015 at 07:43:50AM +1300, Warren Hughes wrote:
>> Hi guys, just added a new Seagate Archive 8TB drive to my BTRFS volume
>> and I'm getting a tonne of errors when balancing or scrubbing.
>>
>> A short smartctl test reports fine, running a long one now. Will also
>> run seatools from a bootable DOS USB while at work today.
>>
>> Running latest firmware on my 9240-8i which explicitly supports this drive.
>>
>> I'm finding it very hard to tell if SMR drives are OK with BTRFS
>> currently - anyone chime in?
>
> I assume you have the host-managed SMR drives. This type needs tweaks to
> the operating system so the write patterns play well with the SMR
> constraints. Btrfs does not support that out of the box, but my
> colleague Hannes Reinecke managed to get it working with some minor
> changes to the allocator and by disabling writing of superblock copies.
>
> For full support of SMR we'd have to change more than that. Currently
> nothing prevents writing "backwards" in a given chunk that is only
> allowed to be written in an append-only way. So you can get mixed
> results when trying to use SMR devices, but I'd say it will mostly not
> work.
>
> But btrfs has all the fundamental features in place; we'd just have to
> make adjustments to follow the SMR constraints:
>
> * we can map the blockgroups to the SMR chunks (in some multiples)
> * remember the write pointers and do only append writes (easy with COW)
> * if the chunk is getting full, mark it read-only, rebalance the live
>   data somewhere else and reset the chunk and the pointer
>
> I have some notes at
> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt


Re: [PATCH] btrfs-progs: Add all missing close_ctree and btrfs_close_all_devices

2015-10-12 Thread David Sterba
On Mon, Oct 12, 2015 at 11:27:39AM +0800, Zhao Lei wrote:
> This patch adds all missing close_ctree and btrfs_close_all_devices
> calls to several tools in btrfs-progs, to avoid memory leaks.

With that many missing callsites, I think it's better to put it right
after the command callback:

btrfs.c:
245 exit(cmd->fn(argc, argv));

so we don't need to add it everywhere manually. The standalone tools
need that though.
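
Roughly what I have in mind, as a self-contained sketch rather than the
actual btrfs.c code: the cmd_struct here is reduced to two fields, and
close_all_devices_stub() stands in for btrfs_close_all_devices(), so
both are assumptions made for illustration.

```c
/* Reduced command descriptor; the real one has more fields. */
struct cmd_struct {
    const char *token;
    int (*fn)(int argc, char **argv);
};

int devices_closed; /* counts cleanup calls, for demonstration only */

/* Stand-in for btrfs_close_all_devices(). */
void close_all_devices_stub(void)
{
    devices_closed++;
}

/* Example command callback that does no cleanup of its own. */
int demo_cmd(int argc, char **argv)
{
    (void)argc;
    (void)argv;
    return 3;
}

/*
 * Run the command callback, then clean up in one place, so that the
 * individual commands do not each need their own cleanup call.
 */
int run_command(const struct cmd_struct *cmd, int argc, char **argv)
{
    int ret = cmd->fn(argc, argv);

    close_all_devices_stub();
    return ret;
}
```

The caller would then do exit(run_command(cmd, argc, argv)) instead of
calling cmd->fn() directly.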


Re: BTRFS with 8TB SMR drives

2015-10-12 Thread David Sterba
On Mon, Oct 12, 2015 at 07:43:50AM +1300, Warren Hughes wrote:
> Hi guys, just added a new Seagate Archive 8TB drive to my BTRFS volume
> and I'm getting a tonne of errors when balancing or scrubbing.
> 
> A short smartctl test reports fine, running a long one now. Will also
> run seatools from a bootable DOS USB while at work today.
> 
> Running latest firmware on my 9240-8i which explicitly supports this drive.
> 
> I'm finding it very hard to tell if SMR drives are OK with BTRFS
> currently - anyone chime in?

I assume you have the host-managed SMR drives. This type needs tweaks to
the operating system so the write patterns play well with the SMR
constraints. Btrfs does not support that out of the box, but my
colleague Hannes Reinecke managed to get it working with some minor
changes to the allocator and by disabling writing of superblock copies.

For full support of SMR we'd have to change more than that. Currently
nothing prevents writing "backwards" in a given chunk that is only
allowed to be written in an append-only way. So you can get mixed
results when trying to use SMR devices, but I'd say it will mostly not
work.

But btrfs has all the fundamental features in place; we'd just have to
make adjustments to follow the SMR constraints:

* we can map the blockgroups to the SMR chunks (in some multiples)
* remember the write pointers and do only append writes (easy with COW)
* if the chunk is getting full, mark it read-only, rebalance the live
  data somewhere else and reset the chunk and the pointer
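
The constraints in the list above can be sketched as a toy model of one
append-only chunk. This is an illustration only, not btrfs or kernel
code; the struct zone layout and the function names are assumptions made
for the example.

```c
#include <stdbool.h>
#include <stdint.h>

struct zone {
    uint64_t start;     /* first byte of the zone on the device */
    uint64_t len;       /* zone size in bytes */
    uint64_t wp;        /* write pointer: next writable offset in the zone */
    bool read_only;     /* set once the zone is full */
};

/*
 * Append-only write: it succeeds only if it starts exactly at the write
 * pointer and fits in the remaining space. Returns 0 on success, -1 if
 * the write would go "backwards", skip ahead, or overflow the zone.
 */
int zone_write(struct zone *z, uint64_t offset, uint64_t nbytes)
{
    if (z->read_only || offset != z->wp || nbytes > z->len - z->wp)
        return -1;

    z->wp += nbytes;
    if (z->wp == z->len)
        z->read_only = true;    /* full: no more writes until reset */
    return 0;
}

/* Once the live data has been relocated elsewhere, reuse the zone. */
void zone_reset(struct zone *z)
{
    z->wp = 0;
    z->read_only = false;
}
```

With COW, new data can always be directed at the current write pointer,
which is why the append-only rule is the easy part.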

I have some notes at
https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt


[PATCH 01/11] btrfs-progs: subvolume: use btrfs_open_dir for btrfs subvolume command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is on a
btrfs mount point before opening it, instead of leaving the check to
the ioctl in kernel space, which returns a vague error message.

Before patch:
  # (/mnt/tmp is not btrfs mountpoint)
  #
  # btrfs subvolume create /mnt/tmp/123
  Create subvolume '/mnt/tmp/123'
  ERROR: cannot create subvolume - Inappropriate ioctl for device
  #

After patch:
  # btrfs subvolume create /mnt/tmp/123
  ERROR: not btrfs filesystem: /mnt/tmp
  #
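
For illustration, a user-space check of this kind can be as small as the
sketch below. It assumes the test is a statfs() call compared against
the btrfs superblock magic; path_is_btrfs is a hypothetical helper name
and is not the btrfs_open_dir() implementation.

```c
#include <stdint.h>
#include <sys/vfs.h>    /* statfs() */

#ifndef BTRFS_SUPER_MAGIC
#define BTRFS_SUPER_MAGIC 0x9123683E    /* value from <linux/magic.h> */
#endif

/*
 * Returns 1 if 'path' is on a btrfs filesystem, 0 if it is on some
 * other filesystem, and -1 if statfs() fails (errno is set).
 */
int path_is_btrfs(const char *path)
{
    struct statfs st;

    if (statfs(path, &st) < 0)
        return -1;

    /* Compare as 32-bit unsigned so the check also works on 32-bit. */
    return (uint32_t)st.f_type == (uint32_t)BTRFS_SUPER_MAGIC ? 1 : 0;
}
```

A caller can then print one clear message and bail out before issuing
any btrfs ioctl.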

Signed-off-by: Zhao Lei 
---
 cmds-subvolume.c | 56 +++-
 1 file changed, 19 insertions(+), 37 deletions(-)

diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index c40330a..be1a54a 100644
--- a/cmds-subvolume.c
+++ b/cmds-subvolume.c
@@ -181,11 +181,9 @@ static int cmd_subvol_create(int argc, char **argv)
goto out;
}
 
-   fddst = open_file_or_dir(dstdir, );
-   if (fddst < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", dstdir);
+   fddst = btrfs_open_dir(dstdir, , 1);
+   if (fddst < 0)
goto out;
-   }
 
printf("Create subvolume '%s/%s'\n", dstdir, newname);
if (inherit) {
@@ -348,9 +346,8 @@ again:
vname = basename(dupvname);
free(cpath);
 
-   fd = open_file_or_dir(dname, );
+   fd = btrfs_open_dir(dname, , 1);
if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", dname);
ret = 1;
goto out;
}
@@ -564,7 +561,7 @@ static int cmd_subvol_list(int argc, char **argv)
}
 
subvol = argv[optind];
-   fd = open_file_or_dir(subvol, );
+   fd = btrfs_open_dir(subvol, , 1);
if (fd < 0) {
ret = -1;
fprintf(stderr, "ERROR: can't access '%s'\n", subvol);
@@ -723,17 +720,13 @@ static int cmd_subvol_snapshot(int argc, char **argv)
goto out;
}
 
-   fddst = open_file_or_dir(dstdir, );
-   if (fddst < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", dstdir);
+   fddst = btrfs_open_dir(dstdir, , 1);
+   if (fddst < 0)
goto out;
-   }
 
-   fd = open_file_or_dir(subvol, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", dstdir);
+   fd = btrfs_open_dir(subvol, , 1);
+   if (fd < 0)
goto out;
-   }
 
if (readonly) {
args.flags |= BTRFS_SUBVOL_RDONLY;
@@ -791,11 +784,9 @@ static int cmd_subvol_get_default(int argc, char **argv)
usage(cmd_subvol_get_default_usage);
 
subvol = argv[1];
-   fd = open_file_or_dir(subvol, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", subvol);
+   fd = btrfs_open_dir(subvol, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = btrfs_list_get_default_subvolume(fd, _id);
if (ret) {
@@ -859,11 +850,9 @@ static int cmd_subvol_set_default(int argc, char **argv)
 
objectid = arg_strtou64(subvolid);
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_DEFAULT_SUBVOL, );
e = errno;
@@ -906,11 +895,9 @@ static int cmd_subvol_find_new(int argc, char **argv)
return 1;
}
 
-   fd = open_file_or_dir(subvol, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", subvol);
+   fd = btrfs_open_dir(subvol, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_SYNC);
if (ret < 0) {
@@ -980,11 +967,9 @@ static int cmd_subvol_show(int argc, char **argv)
ret = 1;
svpath = get_subvol_name(mnt, fullpath);
 
-   fd = open_file_or_dir(fullpath, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", fullpath);
+   fd = btrfs_open_dir(fullpath, , 1);
+   if (fd < 0)
goto out;
-   }
 
ret = btrfs_list_get_path_rootid(fd, _id);
if (ret) {
@@ -993,11 +978,9 @@ static int cmd_subvol_show(int argc, char **argv)
goto out;
}
 
-   mntfd = open_file_or_dir(mnt, );
-   if (mntfd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", mnt);
+   mntfd = btrfs_open_dir(mnt, , 1);
+   if (mntfd < 0)
goto out;
-   }
 
if (sv_id == BTRFS_FS_TREE_OBJECTID) {
printf("%s is btrfs root\n", fullpath);
@@ -1271,9 +1254,8 @@ static int cmd_subvol_sync(int argc, char **argv)
if (check_argc_min(argc - optind, 1))
usage(cmd_subvol_sync_usage);
 
-   fd = open_file_or_dir(argv[optind], );
+   fd = btrfs_open_dir(argv[optind], , 1);
 

[PATCH 07/11] btrfs-progs: qgroup: use btrfs_open_dir for btrfs qgroup command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is on a
btrfs mount point before opening it, instead of leaving the check to
the ioctl in kernel space, which returns a vague error message.

Before patch:
  # ./btrfs qgroup create 1/5  /mnt/tmp1
  ERROR: unable to create quota group: Inappropriate ioctl for device
  #
  # ./btrfs qgroup assign 1/5 2/5 /mnt/tmp1
  ERROR: unable to assign quota group: Inappropriate ioctl for device
  #
  # ./btrfs qgroup show /mnt/tmp1
  ERROR: can't perform the search - Inappropriate ioctl for device
  ERROR: can't list qgroups: Inappropriate ioctl for device
  #
  # ./btrfs qgroup limit 1G 1/5  /mnt/tmp1
  ERROR: unable to limit requested quota group: Inappropriate ioctl for device

After patch:
  # ./btrfs qgroup create 1/5 /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs qgroup assign 1/5 2/5 /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs qgroup show /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs qgroup limit 1G 1/5 /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1

Signed-off-by: Zhao Lei 
---
 cmds-qgroup.c | 24 
 1 file changed, 8 insertions(+), 16 deletions(-)

diff --git a/cmds-qgroup.c b/cmds-qgroup.c
index 48c1733..0ad99f4 100644
--- a/cmds-qgroup.c
+++ b/cmds-qgroup.c
@@ -79,11 +79,9 @@ static int qgroup_assign(int assign, int argc, char **argv)
fprintf(stderr, "ERROR: bad relation requested '%s'\n", path);
return 1;
}
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_QGROUP_ASSIGN, );
e = errno;
@@ -137,11 +135,9 @@ static int qgroup_create(int create, int argc, char **argv)
args.create = create;
args.qgroupid = parse_qgroupid(argv[1]);
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_QGROUP_CREATE, );
e = errno;
@@ -351,11 +347,9 @@ static int cmd_qgroup_show(int argc, char **argv)
usage(cmd_qgroup_show_usage);
 
path = argv[optind];
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
if (filter_flag) {
qgroupid = btrfs_get_path_rootid(fd);
@@ -460,11 +454,9 @@ static int cmd_qgroup_limit(int argc, char **argv)
} else
usage(cmd_qgroup_limit_usage);
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_QGROUP_LIMIT, );
e = errno;
-- 
1.8.5.1



[PATCH 09/11] btrfs-progs: use btrfs_open_dir in open_path_or_dev_mnt

2015-10-12 Thread Zhao Lei
Use btrfs_open_dir() in open_path_or_dev_mnt() so that the function
returns an error when the target is neither a block device nor a btrfs
mount point.

Also add a "verbose" argument so that the function prints the common
error message itself, instead of duplicating those lines in every caller.

Before patch:
  # ./btrfs device stats /mnt/tmp1
  ERROR: getting dev info for devstats failed: Inappropriate ioctl for device
  # ./btrfs replace start /dev/vdd /dev/vde /mnt/tmp1
  ERROR: ioctl(DEV_REPLACE_STATUS) failed on "/mnt/tmp1": Inappropriate ioctl for device

After patch:
  # ./btrfs device stats /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs replace start /dev/vdd /dev/vde /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1

Signed-off-by: Zhao Lei 
---
 cmds-device.c  | 13 ++---
 cmds-replace.c | 13 ++---
 cmds-scrub.c   | 28 +---
 utils.c| 21 +++--
 utils.h|  2 +-
 5 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/cmds-device.c b/cmds-device.c
index 5f2b952..a9354f5 100644
--- a/cmds-device.c
+++ b/cmds-device.c
@@ -385,18 +385,9 @@ static int cmd_device_stats(int argc, char **argv)
 
dev_path = argv[optind];
 
-   fdmnt = open_path_or_dev_mnt(dev_path, );
-
-   if (fdmnt < 0) {
-   if (errno == EINVAL)
-   fprintf(stderr,
-   "ERROR: '%s' is not a mounted btrfs device\n",
-   dev_path);
-   else
-   fprintf(stderr, "ERROR: can't access '%s': %s\n",
-   dev_path, strerror(errno));
+   fdmnt = open_path_or_dev_mnt(dev_path, , 1);
+   if (fdmnt < 0)
return 1;
-   }
 
ret = get_fs_info(dev_path, _args, _args);
if (ret) {
diff --git a/cmds-replace.c b/cmds-replace.c
index 9ab8438..385b764 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -170,18 +170,9 @@ static int cmd_replace_start(int argc, char **argv)
usage(cmd_replace_start_usage);
path = argv[optind + 2];
 
-   fdmnt = open_path_or_dev_mnt(path, );
-
-   if (fdmnt < 0) {
-   if (errno == EINVAL)
-   fprintf(stderr,
-   "ERROR: '%s' is not a mounted btrfs device\n",
-   path);
-   else
-   fprintf(stderr, "ERROR: can't access '%s': %s\n",
-   path, strerror(errno));
+   fdmnt = open_path_or_dev_mnt(path, , 1);
+   if (fdmnt < 0)
goto leave_with_error;
-   }
 
/* check for possible errors before backgrounding */
status_args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_STATUS;
diff --git a/cmds-scrub.c b/cmds-scrub.c
index ea6ffc9..da614f2 100644
--- a/cmds-scrub.c
+++ b/cmds-scrub.c
@@ -1198,17 +1198,9 @@ static int scrub_start(int argc, char **argv, int resume)
 
path = argv[optind];
 
-   fdmnt = open_path_or_dev_mnt(path, );
-
-   if (fdmnt < 0) {
-   if (errno == EINVAL)
-   error_on(!do_quiet, "'%s' is not a mounted btrfs device",
-   path);
-   else
-   error_on(!do_quiet, "can't access '%s': %s",
-   path, strerror(errno));
+   fdmnt = open_path_or_dev_mnt(path, , !do_quiet);
+   if (fdmnt < 0)
return 1;
-   }
 
ret = get_fs_info(path, _args, _args);
if (ret) {
@@ -1604,12 +1596,8 @@ static int cmd_scrub_cancel(int argc, char **argv)
 
path = argv[1];
 
-   fdmnt = open_path_or_dev_mnt(path, );
+   fdmnt = open_path_or_dev_mnt(path, , 1);
if (fdmnt < 0) {
-   if (errno == EINVAL)
-   error("'%s' is not a mounted btrfs device", path);
-   else
-   error("can't access '%s': %s", path, strerror(errno));
ret = 1;
goto out;
}
@@ -1705,15 +1693,9 @@ static int cmd_scrub_status(int argc, char **argv)
 
path = argv[optind];
 
-   fdmnt = open_path_or_dev_mnt(path, );
-
-   if (fdmnt < 0) {
-   if (errno == EINVAL)
-   error("'%s' is not a mounted btrfs device", path);
-   else
-   error("can't access '%s': %s", path, strerror(errno));
+   fdmnt = open_path_or_dev_mnt(path, , 1);
+   if (fdmnt < 0)
return 1;
-   }
 
ret = get_fs_info(path, _args, _args);
if (ret) {
diff --git a/utils.c b/utils.c
index f1e3248..6f5df23 100644
--- a/utils.c
+++ b/utils.c
@@ -1081,27 +1081,28 @@ out:
  *
  * On error, return -1, errno should be set.
  */
-int open_path_or_dev_mnt(const char *path, DIR **dirstream)
+int open_path_or_dev_mnt(const char *path, DIR **dirstream, int verbose)
 {
char mp[PATH_MAX];
-   int fdmnt;
-
- 

[PATCH 04/11] btrfs-progs: inspect: Bypass unnecessary clean function in open_error

2015-10-12 Thread Zhao Lei
No need to clean up fd in the open_fail case, because it was never opened.

Signed-off-by: Zhao Lei 
---
 cmds-inspect.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index fc3db99..879fd43 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -626,9 +626,8 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
}
 
ret = print_min_dev_size(fd, devid);
-out:
close_file_or_dir(fd, dirstream);
-
+out:
return !!ret;
 }
 
-- 
1.8.5.1



[PATCH 02/11] btrfs-progs: filesystem: use btrfs_open_dir for btrfs filesystem command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is on a
btrfs mount point before opening it, instead of leaving the check to
the ioctl in kernel space, which returns a vague error message.

Before patch:
  # (/mnt/tmp is not btrfs mountpoint)
  #
  # btrfs filesystem df /mnt/tmp
  ERROR: couldn't get space info - Inappropriate ioctl for device
  ERROR: get_df failed Inappropriate ioctl for device
  #

After patch:
  # ./btrfs filesystem df /mnt/tmp
  ERROR: not btrfs filesystem: /mnt/tmp
  #

Signed-off-by: Zhao Lei 
---
 cmds-fi-usage.c   |  4 +---
 cmds-filesystem.c | 19 +++
 2 files changed, 8 insertions(+), 15 deletions(-)

diff --git a/cmds-fi-usage.c b/cmds-fi-usage.c
index 50d6333..5fefed4 100644
--- a/cmds-fi-usage.c
+++ b/cmds-fi-usage.c
@@ -901,10 +901,8 @@ int cmd_filesystem_usage(int argc, char **argv)
int chunkcount = 0;
int devcount = 0;
 
-   fd = open_file_or_dir(argv[i], );
+   fd = btrfs_open_dir(argv[i], , 1);
if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n",
-   argv[i]);
ret = 1;
goto out;
}
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 3663734..91bf1fa 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -205,11 +205,10 @@ static int cmd_filesystem_df(int argc, char **argv)
 
path = argv[1];
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
+
ret = get_df(fd, );
 
if (ret == 0) {
@@ -939,11 +938,9 @@ static int cmd_filesystem_sync(int argc, char **argv)
 
path = argv[1];
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
printf("FSSync '%s'\n", path);
res = ioctl(fd, BTRFS_IOC_SYNC);
@@ -1229,11 +1226,9 @@ static int cmd_filesystem_resize(int argc, char **argv)
return 1;
}
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
printf("Resize '%s' of '%s'\n", path, amount);
memset(, 0, sizeof(args));
-- 
1.8.5.1



[PATCH 08/11] btrfs-progs: quota: use btrfs_open_dir for btrfs quota command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is on a
btrfs mount point before opening it, instead of leaving the check to
the ioctl in kernel space, which returns a vague error message.

Before patch:
  # ./btrfs quota enable /mnt/tmp1
  ERROR: quota command failed: Inappropriate ioctl for device
  # ./btrfs quota disable /mnt/tmp1
  ERROR: quota command failed: Inappropriate ioctl for device
  # ./btrfs quota rescan /mnt/tmp1
  ERROR: quota rescan failed: Inappropriate ioctl for device
  #

After patch:
  # ./btrfs quota enable /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs quota disable /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs quota rescan /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  #

Signed-off-by: Zhao Lei 
---
 cmds-quota.c | 12 
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/cmds-quota.c b/cmds-quota.c
index 8adc1bf..efbc3ef 100644
--- a/cmds-quota.c
+++ b/cmds-quota.c
@@ -45,11 +45,9 @@ static int quota_ctl(int cmd, int argc, char **argv)
memset(, 0, sizeof(args));
args.cmd = cmd;
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_QUOTA_CTL, );
e = errno;
@@ -141,11 +139,9 @@ static int cmd_quota_rescan(int argc, char **argv)
memset(, 0, sizeof(args));
 
path = argv[optind];
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, ioctlnum, );
e = errno;
-- 
1.8.5.1



[PATCH 06/11] btrfs-progs: inspect: use btrfs_open_dir for btrfs inspect command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is
on a btrfs mount point before opening it, instead of leaving the check
to the ioctl in kernel space, which returns a fuzzy error message.

Before patch:
  # ./btrfs inspect-internal rootid /mnt/tmp1
  ERROR: Failed to lookup root id - Inappropriate ioctl for device
  btrfs inspect-internal rootid: rootid failed with ret=-1
  # ./btrfs inspect-internal inode-resolve 256 /mnt/tmp1
  ioctl ret=-1, error: Inappropriate ioctl for device
  # ./btrfs inspect-internal min-dev-size /mnt/tmp1
  Error invoking tree search ioctl: Inappropriate ioctl for device

After patch:
  # ./btrfs inspect-internal rootid /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs inspect-internal inode-resolve 256 /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs inspect-internal min-dev-size /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1

Signed-off-by: Zhao Lei 
---
 cmds-inspect.c | 22 +++---
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/cmds-inspect.c b/cmds-inspect.c
index a13a170..40ab49b 100644
--- a/cmds-inspect.c
+++ b/cmds-inspect.c
@@ -116,11 +116,9 @@ static int cmd_inspect_inode_resolve(int argc, char **argv)
if (check_argc_exact(argc - optind, 2))
usage(cmd_inspect_inode_resolve_usage);
 
-   fd = open_file_or_dir(argv[optind+1], );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", argv[optind+1]);
+   fd = btrfs_open_dir(argv[optind + 1], , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = __ino_to_path_fd(arg_strtou64(argv[optind]), fd, verbose,
   argv[optind+1]);
@@ -189,9 +187,8 @@ static int cmd_inspect_logical_resolve(int argc, char **argv)
loi.size = size;
loi.inodes = ptr_to_u64(inodes);
 
-   fd = open_file_or_dir(argv[optind+1], );
+   fd = btrfs_open_dir(argv[optind + 1], , 1);
if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", argv[optind+1]);
ret = 12;
goto out;
}
@@ -239,10 +236,8 @@ static int cmd_inspect_logical_resolve(int argc, char **argv)
name);
BUG_ON(ret >= bytes_left);
free(name);
-   path_fd = open_file_or_dir(full_path, );
+   path_fd = btrfs_open_dir(full_path, , 1);
if (path_fd < 0) {
-   fprintf(stderr, "ERROR: can't access "
-   "'%s'\n", full_path);
ret = -ENOENT;
goto out;
}
@@ -279,9 +274,8 @@ static int cmd_inspect_subvolid_resolve(int argc, char **argv)
if (check_argc_exact(argc, 3))
usage(cmd_inspect_subvolid_resolve_usage);
 
-   fd = open_file_or_dir(argv[2], );
+   fd = btrfs_open_dir(argv[2], , 1);
if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", argv[2]);
ret = -ENOENT;
goto out;
}
@@ -320,9 +314,8 @@ static int cmd_inspect_rootid(int argc, char **argv)
if (check_argc_exact(argc, 2))
usage(cmd_inspect_rootid_usage);
 
-   fd = open_file_or_dir(argv[1], );
+   fd = btrfs_open_dir(argv[1], , 1);
if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", argv[1]);
ret = -ENOENT;
goto out;
}
@@ -619,9 +612,8 @@ static int cmd_inspect_min_dev_size(int argc, char **argv)
if (check_argc_exact(argc - optind, 1))
usage(cmd_inspect_min_dev_size_usage);
 
-   fd = open_file_or_dir(argv[optind], );
+   fd = btrfs_open_dir(argv[optind], , 1);
if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", argv[optind]);
ret = -ENOENT;
goto out;
}
-- 
1.8.5.1



[PATCH 10/11] btrfs-progs: replace: use btrfs_open_dir for btrfs replace command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is
on a btrfs mount point before opening it, instead of leaving the check
to the ioctl in kernel space, which returns a fuzzy error message.

Before patch:
  # ./btrfs replace cancel /mnt/tmp1
  ERROR: ioctl(DEV_REPLACE_CANCEL) failed on "/mnt/tmp1": Inappropriate ioctl for device
  # ./btrfs replace status /mnt/tmp1
  ERROR: ioctl(DEV_REPLACE_STATUS) failed on "/mnt/tmp1": Inappropriate ioctl for device

After patch:
  # ./btrfs replace cancel /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1
  # ./btrfs replace status /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1

Signed-off-by: Zhao Lei 
---
 cmds-replace.c | 16 
 1 file changed, 4 insertions(+), 12 deletions(-)

diff --git a/cmds-replace.c b/cmds-replace.c
index 385b764..9596f2a 100644
--- a/cmds-replace.c
+++ b/cmds-replace.c
@@ -348,7 +348,6 @@ static const char *const cmd_replace_status_usage[] = {
 static int cmd_replace_status(int argc, char **argv)
 {
int fd;
-   int e;
int c;
char *path;
int once = 0;
@@ -370,13 +369,9 @@ static int cmd_replace_status(int argc, char **argv)
usage(cmd_replace_status_usage);
 
path = argv[optind];
-   fd = open_file_or_dir(path, );
-   e = errno;
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access \"%s\": %s\n",
-   path, strerror(e));
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = print_replace_status(fd, path, once);
close_file_or_dir(fd, dirstream);
@@ -541,12 +536,9 @@ static int cmd_replace_cancel(int argc, char **argv)
usage(cmd_replace_cancel_usage);
 
path = argv[optind];
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access \"%s\": %s\n",
-   path, strerror(errno));
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
args.cmd = BTRFS_IOCTL_DEV_REPLACE_CMD_CANCEL;
args.result = BTRFS_IOCTL_DEV_REPLACE_RESULT_NO_RESULT;
-- 
1.8.5.1



[PATCH 03/11] btrfs-progs: balance: use btrfs_open_dir for btrfs balance command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is
on a btrfs mount point before opening it, instead of leaving the check
to the ioctl in kernel space, which returns a fuzzy error message.

Before patch:
  # btrfs balance start /mnt/tmp
  ERROR: error during balancing '/mnt/tmp' - Inappropriate ioctl for device
  There may be more info in syslog - try dmesg | tail
  #

After patch:
  # btrfs balance start /mnt/tmp
  ERROR: not btrfs filesystem: /mnt/tmp
  #

Signed-off-by: Zhao Lei 
---
 cmds-balance.c | 30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 9af218b..b02e40d 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -306,11 +306,9 @@ static int do_balance(const char *path, struct btrfs_ioctl_balance_args *args,
int e;
DIR *dirstream = NULL;
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_BALANCE_V2, args);
e = errno;
@@ -503,11 +501,9 @@ static int cmd_balance_pause(int argc, char **argv)
 
path = argv[1];
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_BALANCE_CTL, BTRFS_BALANCE_CTL_PAUSE);
e = errno;
@@ -544,11 +540,9 @@ static int cmd_balance_cancel(int argc, char **argv)
 
path = argv[1];
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
ret = ioctl(fd, BTRFS_IOC_BALANCE_CTL, BTRFS_BALANCE_CTL_CANCEL);
e = errno;
@@ -586,11 +580,9 @@ static int cmd_balance_resume(int argc, char **argv)
 
path = argv[1];
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 1;
-   }
 
memset(, 0, sizeof(args));
args.flags |= BTRFS_BALANCE_RESUME;
@@ -679,11 +671,9 @@ static int cmd_balance_status(int argc, char **argv)
 
path = argv[optind];
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
return 2;
-   }
 
ret = ioctl(fd, BTRFS_IOC_BALANCE_PROGRESS, );
e = errno;
-- 
1.8.5.1



[PATCH 11/11] btrfs-progs: fragments: use btrfs_open_dir for btrfs-fragments command

2015-10-12 Thread Zhao Lei
We can use btrfs_open_dir() to check whether the target dir is
on a btrfs mount point before opening it, instead of leaving the check
to deeper code, which returns a fuzzy error message.

Before patch:
  # ./btrfs-fragments -o 123 /mnt/tmp1
  ERROR: can't perform the search

After patch:
  # ./btrfs-fragments -o 123 /mnt/tmp1
  ERROR: not a btrfs filesystem: /mnt/tmp1

Signed-off-by: Zhao Lei 
---
 btrfs-fragments.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/btrfs-fragments.c b/btrfs-fragments.c
index d742f60..17768c3 100644
--- a/btrfs-fragments.c
+++ b/btrfs-fragments.c
@@ -436,11 +436,9 @@ int main(int argc, char **argv)
 
path = argv[optind++];
 
-   fd = open_file_or_dir(path, );
-   if (fd < 0) {
-   fprintf(stderr, "ERROR: can't access '%s'\n", path);
+   fd = btrfs_open_dir(path, , 1);
+   if (fd < 0)
exit(1);
-   }
 
if (flags == 0)
flags = BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA;
-- 
1.8.5.1



Re: Questions about FIEMAP

2015-10-12 Thread David Sterba
On Mon, Oct 12, 2015 at 04:37:55AM +, Wang, Zhiye wrote:
> After googled a bit, I got information that btrfs supports FIEMAP (as
> "cp" needs it), but it's not valid for "write" operation.

The FIEMAP output is informational only; there's no guarantee that the
extent information does not change before it reaches the caller.

> I guess we cannot write to block device directly after get block list
> using FIEMAP.

Beware that there's another layer of translation that maps the logical
offsets to physical offsets, basically the RAID layer. So even if you
get a 'physical offset' from FIEMAP, it's still a 'logical' offset from
the POV of the filesystem and has no correspondence to the block device
offset.

> This is because:
> 
> 1. COW feature of btrfs (but this can be disabled using NOCOW)
> 2. File system rebalance
> 3. Defragmentation

This always reflects the 'logical' offset from the filesystem POV.
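
To make the point concrete, here is a minimal C sketch of querying a file's extents through the FS_IOC_FIEMAP ioctl. The helper name dump_extents and the MAX_EXTENTS limit are made up for illustration. Whatever it prints is only a snapshot: the layout may change right after the call, and on btrfs the 'physical' value is still a filesystem-logical address, so it must not be used to address the block device.

```c
#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Illustrative limit on how many extents we ask for in one call. */
#define MAX_EXTENTS 32

/*
 * Hypothetical helper: print up to MAX_EXTENTS extents of a file.
 * Returns the number of extents reported, or -1 on any failure.
 */
static int dump_extents(const char *path)
{
	size_t size = sizeof(struct fiemap) +
		      MAX_EXTENTS * sizeof(struct fiemap_extent);
	struct fiemap *fm;
	int ret = -1;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	fm = calloc(1, size);
	if (!fm) {
		close(fd);
		return -1;
	}

	fm->fm_start = 0;
	fm->fm_length = FIEMAP_MAX_OFFSET;	/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush dirty data first */
	fm->fm_extent_count = MAX_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) == 0) {
		for (unsigned int i = 0; i < fm->fm_mapped_extents; i++) {
			struct fiemap_extent *e = &fm->fm_extents[i];

			/* Advisory only: may be stale as soon as it is printed. */
			printf("logical=%llu physical=%llu length=%llu flags=0x%x\n",
			       (unsigned long long)e->fe_logical,
			       (unsigned long long)e->fe_physical,
			       (unsigned long long)e->fe_length,
			       (unsigned int)e->fe_flags);
		}
		ret = (int)fm->fm_mapped_extents;
	}

	free(fm);
	close(fd);
	return ret;
}
```

Calling dump_extents() on a regular file prints one line per extent; the output is suitable for inspection and statistics, not for deciding where to read or write on the underlying device.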


Re: Questions about FIEMAP

2015-10-12 Thread Eric Sandeen
On 10/11/15 11:37 PM, Wang, Zhiye wrote:
> Hello everyone,
> 
> After googled a bit, I got information that btrfs supports FIEMAP (as "cp" 
> needs it), but it's not valid for "write" operation.

cp should not be using fiemap any more.  It was for a while, until they 
realized that copying based on fiemap output could lead to corruption because 
things changed between the fiemap call and the actual copy...

> I guess we cannot write to block device directly after get block list using 
> FIEMAP. This is because:
> 
> 1. COW feature of btrfs (but this can be disabled using NOCOW)
> 2. File system rebalance
> 3. Defragmentation
> 
> Aren't item #2 and #3 also a problem for "read" operation? For example, after 
> "cp" get block list using FIEMAP, file system rebalance occurs, So, previous 
> result of FIEMAP is not valid anymore.
> 
> Or maybe I misunderstood something. Please correct me.

That all may be true for btrfs, but more fundamentally as dsterba said, nothing 
guarantees that the layout won't change *immediately* after your fiemap call.  
This is the case on any filesystem, not just btrfs.

-Eric



[PATCH 00/11] btrfs-progs: Use btrfs_open_dir to avoid show error of ioctl or tree search

2015-10-12 Thread Zhao Lei
Use btrfs_open_dir() instead of open_file_or_dir() so that the error is
reported before the real action (ioctl or tree search), which makes the
error messages exact and uniform.

It also simplifies the code:
  85 insertions(+), 185 deletions(-)

The series also includes some small bug fixes.

Before patch:
  # grep open_file_or_dir *.c
  btrfs-fragments.c:  fd = open_file_or_dir(path, );
  cmds-balance.c: fd = open_file_or_dir(path, );
  cmds-balance.c: fd = open_file_or_dir(path, );
  cmds-balance.c: fd = open_file_or_dir(path, );
  cmds-balance.c: fd = open_file_or_dir(path, );
  cmds-balance.c: fd = open_file_or_dir(path, );
  cmds-filesystem.c:  fd = open_file_or_dir(path, );
  cmds-filesystem.c:  fd = open_file_or_dir(path, );
  cmds-filesystem.c:  fd = open_file_or_dir(argv[i], );
  cmds-filesystem.c:  fd = open_file_or_dir(path, );
  cmds-fi-usage.c:fd = open_file_or_dir(argv[i], );
  cmds-inspect.c: fd = open_file_or_dir(argv[optind+1], );
  cmds-inspect.c: fd = open_file_or_dir(argv[optind+1], );
  cmds-inspect.c: path_fd = open_file_or_dir(full_path, );
  cmds-inspect.c: fd = open_file_or_dir(argv[2], );
  cmds-inspect.c: fd = open_file_or_dir(argv[1], );
  cmds-inspect.c: fd = open_file_or_dir(argv[optind], );
  cmds-qgroup.c:  fd = open_file_or_dir(path, );
  cmds-qgroup.c:  fd = open_file_or_dir(path, );
  cmds-qgroup.c:  fd = open_file_or_dir(path, );
  cmds-qgroup.c:  fd = open_file_or_dir(path, );
  cmds-quota.c:   fd = open_file_or_dir(path, );
  cmds-quota.c:   fd = open_file_or_dir(path, );
  cmds-replace.c: fd = open_file_or_dir(path, );
  cmds-replace.c: fd = open_file_or_dir(path, );
  cmds-subvolume.c:   fddst = open_file_or_dir(dstdir, );
  cmds-subvolume.c:   fd = open_file_or_dir(dname, );
  cmds-subvolume.c:   fd = open_file_or_dir(subvol, );
  cmds-subvolume.c:   fddst = open_file_or_dir(dstdir, );
  cmds-subvolume.c:   fd = open_file_or_dir(subvol, );
  cmds-subvolume.c:   fd = open_file_or_dir(subvol, );
  cmds-subvolume.c:   fd = open_file_or_dir(path, );
  cmds-subvolume.c:   fd = open_file_or_dir(subvol, );
  cmds-subvolume.c:   fd = open_file_or_dir(fullpath, );
  cmds-subvolume.c:   mntfd = open_file_or_dir(mnt, );
  cmds-subvolume.c:   fd = open_file_or_dir(argv[optind], );
  props.c:fd = open_file_or_dir3(object, , open_flags);
  utils.c:fdmnt = open_file_or_dir(mp, dirstream);
  utils.c:fdmnt = open_file_or_dir(path, dirstream);
  utils.c: * Do the following checks before calling open_file_or_dir():
  utils.c:ret = open_file_or_dir(path, dirstream);
  utils.c:int open_file_or_dir3(const char *fname, DIR **dirstream, int open_flags)
  utils.c:int open_file_or_dir(const char *fname, DIR **dirstream)
  utils.c:return open_file_or_dir3(fname, dirstream, O_RDWR);
  utils.c:fd = open_file_or_dir(path, );
  #

After patch:
  # grep open_file_or_dir *.c
  cmds-filesystem.c:  fd = open_file_or_dir(argv[i], ); *1
  props.c:fd = open_file_or_dir3(object, , open_flags);
  utils.c:ret = open_file_or_dir(mp, dirstream);
  utils.c: * Do the following checks before calling open_file_or_dir():
  utils.c:ret = open_file_or_dir(path, dirstream);
  utils.c:int open_file_or_dir3(const char *fname, DIR **dirstream, int open_flags)
  utils.c:int open_file_or_dir(const char *fname, DIR **dirstream)
  utils.c:return open_file_or_dir3(fname, dirstream, O_RDWR);
  utils.c:fd = open_file_or_dir(path, );
  #
  *1: This call has to open either a dir or a file, so it cannot use
  btrfs_open_dir() instead.
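
As an illustration of the idea behind the series (check in userspace first, then report one uniform error), here is a minimal C sketch. It is not the btrfs-progs implementation: the helper name check_btrfs_dir, the chosen return codes, and the "not a directory" message are assumptions; only the "not a btrfs filesystem" wording comes from the examples in this series. It relies on statfs() reporting the btrfs magic number in f_type.

```c
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/vfs.h>

/* Value btrfs reports in statfs.f_type (BTRFS_SUPER_MAGIC in <linux/magic.h>). */
#define SKETCH_BTRFS_SUPER_MAGIC UINT32_C(0x9123683E)

/*
 * Hypothetical helper: returns 0 if path is a directory on a btrfs
 * filesystem, a negative errno-style value otherwise. Each failure
 * prints exactly one message, before any ioctl is attempted.
 */
static int check_btrfs_dir(const char *path)
{
	struct statfs stfs;
	struct stat st;

	if (statfs(path, &stfs) != 0) {
		int err = errno;

		fprintf(stderr, "ERROR: can't access '%s': %s\n",
			path, strerror(err));
		return -err;
	}

	/* Compare as 32-bit so the check also works where f_type is signed. */
	if ((uint32_t)stfs.f_type != SKETCH_BTRFS_SUPER_MAGIC) {
		fprintf(stderr, "ERROR: not a btrfs filesystem: %s\n", path);
		return -EINVAL;	/* return code chosen for this sketch */
	}

	if (stat(path, &st) != 0) {
		int err = errno;

		fprintf(stderr, "ERROR: can't access '%s': %s\n",
			path, strerror(err));
		return -err;
	}

	if (!S_ISDIR(st.st_mode)) {
		fprintf(stderr, "ERROR: not a directory: %s\n", path);
		return -ENOTDIR;
	}

	return 0;
}
```

A caller would run this check before open() and the ioctl, so a non-btrfs path never reaches the kernel and never produces "Inappropriate ioctl for device".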

Zhao Lei (11):
  btrfs-progs: subvolume: use btrfs_open_dir for btrfs subvolume command
  btrfs-progs: filesystem: use btrfs_open_dir for btrfs filesystem command
  btrfs-progs: balance: use btrfs_open_dir for btrfs balance command
  btrfs-progs: inspect: Bypass unnecessary clean function in open_error
  btrfs-progs: inspect: set return value of error case
  btrfs-progs: inspect: use btrfs_open_dir for btrfs inspect command
  btrfs-progs: qgroup: use btrfs_open_dir for btrfs qgroup command
  btrfs-progs: quota: use btrfs_open_dir for btrfs quota command
  btrfs-progs: use btrfs_open_dir in open_path_or_dev_mnt
  btrfs-progs: replace: use btrfs_open_dir for btrfs replace command
  btrfs-progs: fragments: use btrfs_open_dir for btrfs-fragments command

 btrfs-fragments.c |  6 ++
 cmds-balance.c    | 30 ++---
 cmds-device.c     | 13 ++---
 cmds-fi-usage.c   |  4 +---
 cmds-filesystem.c | 19 +++
 cmds-inspect.c    | 26 +-
 cmds-qgroup.c     | 24
 cmds-quota.c      | 12
 cmds-replace.c    | 29 ++--
 cmds-scrub.c      | 28 +---
 cmds-subvolume.c  | 56 +++
 utils.c           | 21 +++--
 

Re: filesystem goes ro trying to balance. "cpu stuck"

2015-10-12 Thread Donald Pearson
On Mon, Oct 12, 2015 at 12:33 AM, Duncan <1i5t5.dun...@cox.net> wrote:
> Donald Pearson posted on Sun, 11 Oct 2015 11:46:14 -0500 as excerpted:
>
>> Kernel 4.2.2-1.el7.elrepo btrfs-progs v4.2.1
>>
>> I'm attempting to convert a filesystem from raid6 to raid10.  I didn't
>> have any functional problems with it, but performance is abysmal
>> compared to basically the same arrangement in raid10 so I thought I'd
>> just get away from raid56 for a while (I also saw something about parity
>> raid code developed beyond 2-disk parity that was ignored/thrown away so
>> I'm thinking the devs don't care much about about parity raid at least
>> for now).
>
> Note on the parity-raid story:  AFAIK at least the btrfs folks aren't
> ignoring it (I don't know about the mdraid/dmraid folks).  There's simply
> more opportunities for new features than there are coders to code them
> up, and while progress is indeed occurring, some of these features may
> well take years.
>
> Consider, even standard raid56 support was originally planned for IIRC
> 3.5, but it wasn't actually added until (IIRC) 3.9, and that was only
> partial/runtime support (the parities were being calculated and written,
> but the tools to rebuild from parity were incomplete/broken/non-existent,
> so it was effectively a slow raid0 in terms of reliability, that would be
> upgraded to raid56 "for free" once the tools were done).  Complete raid56
> support wasn't even nominally there until 3.19, with the initial bugs
> still being worked out thru 4.0 and into 4.1.  So it took about /three/
> /years/ longer than initially planned.
>
> This sort of longer-to-implement-than-planned pattern has repeated
> multiple times over the life of btrfs, which is why it's taking so long
> to mature and stabilize.
>
> So it's not that multi-parity-raid is being rejected or ignored, it's
> simply that there's way more to do than people to do it, and btrfs as a
> cow-based filesystem isn't exactly the simplest thing to implement
> correctly, so initial plans turned out to be /wildly/ optimistic, and
> honestly, some of these features, while not rejected, could well be a
> decade out.  Obviously others will be implemented before then, but
> there's just so many, and so few devs working on what really is a complex
> project, so something ends up being shoved back to that decade out, and
> that's the way it's going to be unless btrfs suddenly gets way more
> developer resources working on it than it has now.
>

"Don't care" was a poor choice of words on my part and I apologize to
the group.  I understand that it's a matter of priority and resources,
and not about lack of caring.

>> Partway through the balance something goes wrong and filesystem is
>> forced read-only stopping the balance.
>>
>> I did a fschk and it didn't complain about/find any errors.  The drives
>> aren't throwing any errors or incrementing any smart attributes.  This
>> is a backup array, so it's not the end of the world if I have to just
>> blow it away and rebuild as raid10 from scratch.
>>
>> The console prints this error.
>> NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s!
>> [btrfs-balance:8015]
>
> I'm a user not a dev, tho I am a regular on this list, and backtraces
> don't mean a lot to me, so take this FWIW...
>
> 1) How old is the filesystem?  It isn't quite new, created with
> mkfs.btrfs from btrfs-progs v4.2.0 or v4.2.1, is it?  There's a known
> mkfs.btrfs bug along in there, that I don't remember whether it's fixed
> by 4.2.1 or only the latest 4.2.2, but it creates invalid filesystems.
> Btrfs check from 4.2.2 can detect the problem, but can't fix it, and as
> the filesystems as they are are unstable, it's best to get what you need
> off of them and recreate them with a non-buggy mkfs.btrfs ASAP.
>
> 2) Since you're on progs v4.2.1 ATM, that may apply to its mkfs.btrfs as
> well.  Please upgrade to 4.2.2 before creating any further btrfs, or
> failing that, downgrade to 4.1.3 or whatever the last in the progs 4.1
> series was.
>
> 3) Are you running btrfs quotas on the filesystem?  Unfortunately, btrfs
> quota handling code remains an unstable sore spot, tho they're continuing
> to work hard on fixing it.  I'm continuing to recommend, as I have for
> some time now, that people don't use it unless they're willing to deal
> with the problems and are actively working with the devs to fix them.
> Otherwise, either they need quota support and should really choose a
> filesystem where the feature is mature and stable, or they don't, in
> which case just leaving it off (or turning it off if on) avoids the
> problem.
>
> There's at least two confirmed reasonably recent cases where turning off
> btrfs quota support eliminated the issues people were reporting, so this
> isn't an idle recommendation, it really does help in at least some
> cases.  If you don't really need quotas, leave (or turn) them off.  If
> you do, you really should be using a filesystem where the quota feature
> is mature and 

Re: Add stripes filter

2015-10-12 Thread David Sterba
On Mon, Sep 28, 2015 at 05:57:05PM +, Gabríel Arthúr Pétursson wrote:
> The attached patches to linux and btrfs-progs add support for filtering
> based on the number of strips in a block when balancing.

FYI, I'm going to make the fixups myself as they're mostly cosmetic and
prepare this patch for 4.4 merge window, together with the extension to
the 'limit' filter I mentioned.


Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks

2015-10-12 Thread Christoph Hellwig
On Mon, Oct 12, 2015 at 11:23:05AM +0100, Pádraig Brady wrote:
> You're right that if the user doesn't notice, then there is no
> point exposing this. However I think the user does notice as
> there is a difference in the end state of the copy.  I.E. generally
> if there is a different end state it would require an option,
> while if only a different copying mechanism it would not.
> I think the different end state of a reflink warrants an option for 3 reasons:
> 
>  - The user might want separate bits for resiliency. Now this is
>a weak argument due to possible deduplication in lower layers,
>but still valid is some setups.

This one is completely bogus.  For one because literally every lower
layer can and increasingly will dedup or share in some form.  If we
pretend we could do this we actively mislead the user.

>  - The user might want to avoid CoW at a later time critical stage.
> 
>  - The user might want to avoid ENOSPC at a later critical stage.

These two are the same and would be the argument for the "falloc" flag
I mentioned before.  But we'd need to sit down and specify the exact
semantics for it.  For example one important question that comes to mind
is if it also applies for extents that are holes in the source range.

I'd much rather get the basic system call in ASAP and then let people
explain their use cases for this and only add it once we've made sure
we have consistent semantics that actually fit the users' needs.


[PATCH for 4.3] btrfs: check unsupported filters in balance arguments

2015-10-12 Thread David Sterba
We don't verify that all the balance filter arguments supplied via
the flags are actually known to the kernel, so unknown filters silently
pass and do nothing.

At the moment this means only the 'limit' filter, but we're going to add
a few more soon so it's better to have that fixed. The check should also
go into older stable kernels so that they work with newer userspace tools.

Cc: sta...@vger.kernel.org # 3.16+
Signed-off-by: David Sterba 
---

Please try to get it into 4.3, before the new balance filters (stripes,
enhanced 'limit') get merged. Without the kernel checks, running updated
btrfs-progs against such a kernel would cause us headaches.

 fs/btrfs/ioctl.c   | 5 +
 fs/btrfs/volumes.h | 8 
 2 files changed, 13 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf5422fce9..3e3e6130637f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -4639,6 +4639,11 @@ static long btrfs_ioctl_balance(struct file *file, void __user *arg)
bctl->flags |= BTRFS_BALANCE_TYPE_MASK;
}
 
+   if (bctl->flags & ~(BTRFS_BALANCE_ARGS_MASK | BTRFS_BALANCE_TYPE_MASK)) {
+   ret = -EINVAL;
+   goto out_bargs;
+   }
+
 do_balance:
/*
 * Ownership of bctl and mutually_exclusive_operation_running
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 2ca784a14e84..595279a8b99f 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -376,6 +376,14 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_VRANGE  (1ULL << 4)
 #define BTRFS_BALANCE_ARGS_LIMIT   (1ULL << 5)
 
+#define BTRFS_BALANCE_ARGS_MASK\
+   (BTRFS_BALANCE_ARGS_PROFILES |  \
+BTRFS_BALANCE_ARGS_USAGE | \
+BTRFS_BALANCE_ARGS_DEVID | \
+BTRFS_BALANCE_ARGS_DRANGE |\
+BTRFS_BALANCE_ARGS_VRANGE |\
+BTRFS_BALANCE_ARGS_LIMIT)
+
 /*
  * Profile changing flags.  When SOFT is set we won't relocate chunk if
  * it already has the target profile (even though it may be
-- 
2.1.3
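
The kind of check the patch above adds can be illustrated with a small userspace C sketch. This is not the kernel code: the SKETCH_ names and the function check_filter_flags are hypothetical, the bit positions for VRANGE and LIMIT follow the context lines of the diff while the other four are assumed for the example, and the BTRFS_BALANCE_TYPE_MASK part of the real check is left out. The point is only the pattern: build a mask of every flag the code understands and reject anything outside it instead of silently ignoring it.

```c
#include <errno.h>
#include <stdint.h>

/* Filter bits known to this (pretend) kernel. Bits 0-3 are assumed values. */
#define SKETCH_ARGS_PROFILES	(1ULL << 0)
#define SKETCH_ARGS_USAGE	(1ULL << 1)
#define SKETCH_ARGS_DEVID	(1ULL << 2)
#define SKETCH_ARGS_DRANGE	(1ULL << 3)
#define SKETCH_ARGS_VRANGE	(1ULL << 4)
#define SKETCH_ARGS_LIMIT	(1ULL << 5)

/* Union of everything above: the set of flags we know how to handle. */
#define SKETCH_ARGS_MASK		\
	(SKETCH_ARGS_PROFILES |		\
	 SKETCH_ARGS_USAGE |		\
	 SKETCH_ARGS_DEVID |		\
	 SKETCH_ARGS_DRANGE |		\
	 SKETCH_ARGS_VRANGE |		\
	 SKETCH_ARGS_LIMIT)

/*
 * Return 0 if every bit set in flags is known, -EINVAL otherwise.
 * A newer userspace tool that sets a bit this code has never heard of
 * gets an explicit error rather than a balance that ignores the filter.
 */
static int check_filter_flags(uint64_t flags)
{
	if (flags & ~SKETCH_ARGS_MASK)
		return -EINVAL;
	return 0;
}
```

With this in place, a request carrying only known filters passes, while one that sets bit 6 (standing in for a filter added later) is rejected with -EINVAL.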



Re: BTRFS with 8TB SMR drives

2015-10-12 Thread Warren Hughes
Yes, correct, it's drive-managed SMR.

I have been following this bug:
https://bugzilla.kernel.org/show_bug.cgi?id=93581 for a while

As a test I compiled/installed 4.3.0-rc4 as it looks like they
reverted some kernel patches that (negatively) affect SMR.

I ran a complete balance overnight and not a single error on the 8TB
SMR drive. I have a number of corrected and medium errors on one of my
3TB WD Red drives which appear to be genuine errors. Thankfully my
BTRFS is RAID1.

I'll remove and replace that 3TB drive and run a complete scrub - but
for now it looks like I was a victim of the above bug entry.

On 13 October 2015 at 05:25, Henk Slager  wrote:
> Hi Warren,
>
> from your dmesg I see:
> Oct 10 07:42:36 cloud.warrenhughes.net kernel: scsi 0:0:1:0:
> Direct-Access ATA  ST8000AS0002-1NA AR13 PQ: 0 ANSI: 5
> Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
> 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
>
> Oct 11 23:57:56 cloud.warrenhughes.net kernel: scsi 0:0:1:0:
> Direct-Access ATA  ST8000AS0002-1NA AR13 PQ: 0 ANSI: 5
> Oct 11 23:57:56 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdo]
> 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
>
> and looking at this spec:
> http://www.seagate.com/files/www-content/product-content/hdd-fam/seagate-archive-hdd/en-us/docs/archive-hdd-dS1834-3-1411us.pdf
>
> it seems that it is a drive-managed SMR disk. I am not sure why David
> assumes it is host-managed, maybe drive firmware/functionality can be
> bypassed.
>
> As far as I can see, the drive should not have a problem with btrfs as
> such, but I read quite worrying stories w.r.t. raid. I think the write
> characteristics of the balance operation, in combination with the
> connection via the LSI controller, are not really compatible with
> 'archive' use case of the drive. 'Simple', 'relaxed' write operation
> should be OK, but beyond that, it might fail. See also:
> http://www.storagereview.com/seagate_archive_hdd_review_8tb
>
> How much data is already on the drive? Is it an option to mount with
> skip_balance and try to remove the device and then do some tests on it
> in single independent mode?
>
> /Henk
>
>
> On Mon, Oct 12, 2015 at 3:21 PM, David Sterba  wrote:
>> On Mon, Oct 12, 2015 at 07:43:50AM +1300, Warren Hughes wrote:
>>> Hi guys, just added a new Seagate Archive 8TB drive to my BTRFS volume
>>> and I'm getting a tonne of errors when balancing or scrubbing.
>>>
>>> A short smartctl test reports fine, running a long one now. Will also
>>> run seatools from a bootable DOS USB while at work today.
>>>
>>> Running latest firmware on my 9240-8i which explicitly supports this drive.
>>>
>>> I'm finding it very hard to tell if SMR drives are OK with BTRFS
>>> currently - anyone chime in?
>>
>> I assume you have the host-managed SMR drives. This type needs tweaks to
>> the operating system so the write patterns play well with the SMR
>> constraints. Btrfs does not support that out of the box, but my
>> colleague Hannes Reinecke managed to get it working with some minor
>> changes to the allocator and disabled writing of superblock copies.
>>
>> For full support of SMR we'd have to change more than that, currently
>> nothing prevents to write "backwards" in a given chunk that is allowed
>> to be written only in the append way. So you can get mixed results when
>> trying to use the SMR devices but I'd say it will mostly not work.
>>
>> But, btrfs has all the fundamental features in place, we'd have to make
>> adjustments to follow the SMR constraints:
>>
>> * we can map the blockgroups to the SMR chunks (in some multiples)
>> * remember the write pointers and do only append writes (easy with COW)
>> * if the chunk is getting full, mark it read-only, rebalance the live
>>   data somewhere else and reset the chunk and the pointer
>>
>> I have some notes at
>> https://github.com/kdave/drafts/blob/master/btrfs/smr-mode.txt
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Warren Hughes
+64 21 633324
IM: gtalk + msn: this email address, skype: akawsh


Re: BTRFS with 8TB SMR drives

2015-10-12 Thread Chris Murphy
I get a lot of these from both sdb and sdc

Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb] Sense
Key : 0x3 [current]
Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
ASC=0x11 ASCQ=0x0
Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb] CDB:
opcode=0x88 88 00 00 00 00 00 11 b3 e1 98 00 00 00 08 00 00
Oct 11 23:00:03 cloud.warrenhughes.net kernel: blk_update_request:
critical medium error, dev sdb, sector 297001368



Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc]
UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc] Sense
Key : 0x3 [current] [descriptor]
Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc]
ASC=0x11 ASCQ=0x0
Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc] CDB:
opcode=0x88 88 00 00 00 00 01 3e 0a 7d 80 00 00 01 00 00 00
Oct 11 23:47:32 cloud.warrenhughes.net kernel: blk_update_request:
critical medium error, dev sdc, sector 5335842176

There are a lot of these kinds of errors and they aren't all for the
same LBA +/- 8, so there are different physical sectors affected on
both drives, but I don't know what the error is.
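
As a cross-check on the log lines above, here is a minimal sketch (not something posted in the thread) that decodes the two logged CDBs. Opcode 0x88 is READ(16), where bytes 2..9 carry the LBA and bytes 10..13 the transfer length, both big-endian. The helper names be_bytes, read16_lba and read16_blocks are made up for illustration. For what it's worth, in the standard SCSI tables sense key 0x3 is MEDIUM ERROR and ASC 0x11 / ASCQ 0x00 is "unrecovered read error", which fits the "critical medium error" lines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assemble an unsigned integer from n big-endian bytes (n <= 8). */
static uint64_t be_bytes(const uint8_t *p, size_t n)
{
    uint64_t v = 0;

    assert(n <= 8);
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}

/*
 * READ(16): byte 0 is the opcode (0x88), bytes 2..9 the LBA,
 * bytes 10..13 the transfer length in logical blocks.
 */
static uint64_t read16_lba(const uint8_t cdb[16])
{
    return be_bytes(cdb + 2, 8);
}

static uint32_t read16_blocks(const uint8_t cdb[16])
{
    return (uint32_t)be_bytes(cdb + 10, 4);
}

/* The two CDBs exactly as they appear in the kernel log above. */
static const uint8_t sdb_cdb[16] = {
    0x88, 0x00, 0x00, 0x00, 0x00, 0x00, 0x11, 0xb3,
    0xe1, 0x98, 0x00, 0x00, 0x00, 0x08, 0x00, 0x00,
};
static const uint8_t sdc_cdb[16] = {
    0x88, 0x00, 0x00, 0x00, 0x00, 0x01, 0x3e, 0x0a,
    0x7d, 0x80, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00,
};
```

Decoded this way, the sdb request is an 8-block read at LBA 297001368 and the sdc request a 256-block read at LBA 5335842176, matching the sector numbers in the blk_update_request lines.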


Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
4096-byte physical blocks
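
To make the "LBA +/- 8" remark concrete: with 512-byte logical and 4096-byte physical blocks, eight consecutive LBAs share one physical block. A small sketch of that arithmetic, assuming the two block sizes reported above (the function names are illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* Total capacity in bytes from the logical block count and size. */
static uint64_t capacity_bytes(uint64_t logical_blocks, uint32_t logical_size)
{
    return logical_blocks * logical_size;
}

/* Index of the physical block that contains a given logical block. */
static uint64_t physical_block(uint64_t lba, uint32_t logical_size,
                               uint32_t physical_size)
{
    assert(logical_size != 0 && physical_size % logical_size == 0);
    return lba / (physical_size / logical_size);
}
```

For this drive, capacity_bytes(15628053168, 512) gives 8001563222016 bytes, the 8.00 TB in the log, and two failing LBAs only hit the same physical block if physical_block() returns the same index for both.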

sd 0:0:1:0 starts out as sdb, but then goes a bit crazy somehow and
eventually gets offlined
Oct 11 23:55:24 cloud.warrenhughes.net kernel: sd 0:0:1:0: rejecting
I/O to offline device

And then reappears as sdo

Oct 11 23:57:56 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdo]
15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)

But no further scsi messages for this drive while Btrfs now complains
about sdo instead of sdb. Seems to me that this device is confused
even about its own error reporting. Anyway both sdb and sdc were
having problems at the same time.


Chris Murphy


Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies

2015-10-12 Thread Darrick J. Wong
On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
> > This allows us to have an in-kernel copy mechanism that avoids frequent
> > switches between kernel and user space.  This is especially useful so
> > NFSD can support server-side copies.
> > 
> > I make pagecache copies configurable by adding three new (exclusive)
> > flags:
> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
> >   ranges are identical.
> 
> All but FR_COPY really should be a separate system call.  Clones (and
> dedup as a special case of clones) are really a separate beast from file
> copies.
> 
> If I want to clone a file I either want it clone fully or fail, not copy
> a certain amount.  That means that a) we need to return an error not
> short "write", and b) locking implementations are important - we need to
> prevent other applications from racing with our clone even if it is
> large, while to get these semantics for the possible short returning
> file copy will require a proper userland locking protocol. Last but not
> least file copies need to be interruptible while clones should be not.
> All this is already important for local file systems and even more
> important for NFS exporting.
> 
> So I'd suggest to drop this patch and just let your syscall handle
> actual copies with all their horrors.  We can go with Peng's patches
> to generalize the btrfs ioctls for clones for now which is what everyone
> already uses anyway, and then add a separate sys_file_clone later.

Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.

What does everyone think about generalizing EXTENT_SAME?  The interface enables
one to ask the kernel to dedupe multiple file ranges in a single call.  That's
more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
that the extra complexity buys us the ability to ... multi-dedupe at the same
time, with locks held on the source file?

I'm happy to generalize the existing EXTENT_SAME, but please yell if you really
hate the interface.

--D

> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: fix resending received snapshot with parent

2015-10-12 Thread Ed Tomlinson

On Friday, October 9, 2015 4:24:10 PM EDT, Filipe Manana wrote:
> On Wed, Sep 30, 2015 at 8:23 PM, Robin Ruede wrote:
>> This fixes a regression introduced by 37b8d27d between v4.1 and v4.2.
>>
>> When a snapshot is received, its received_uuid is set to the original
>> uuid of the subvolume. When that snapshot is then resent to a third
>> filesystem, its received_uuid is set to the second uuid
>> instead of the original one. The same was true for the parent_uuid. ...
>
> Reviewed-by: Filipe Manana
>
> Thanks for fixing this.
> I've added this to my integration branch [1] and will soon send a pull
> request to Chris for 4.4, including this fix plus a few others for
> send/receive, after some more testing.
>
> I've also made an xfstest for it [1, 2]


Another thanks for this fix.  It fixes things here.  I am running 4.2.3 with
the 4.3 btrfs tree pulled on top of it along with this fix.  Incremental
sends are now working again.
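
For anyone following along, the behaviour described in the commit message can be modelled in a few lines. This is only a toy sketch in userspace C, not the kernel change itself: struct toy_subvol and the toy_* helpers are invented here, and UUIDs are filled with one repeated byte for readability. The assumption is that the sender advertises a subvolume's received_uuid when it has one and its own uuid otherwise, so a snapshot relayed through a second filesystem still carries the original uuid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_UUID_SIZE 16

struct toy_subvol {
    uint8_t uuid[TOY_UUID_SIZE];          /* uuid on the local filesystem */
    uint8_t received_uuid[TOY_UUID_SIZE]; /* all zero = never received */
};

static bool toy_uuid_is_empty(const uint8_t *u)
{
    static const uint8_t zero[TOY_UUID_SIZE];

    return memcmp(u, zero, TOY_UUID_SIZE) == 0;
}

/* The uuid a sender should put into the stream for this subvolume. */
static const uint8_t *toy_uuid_to_send(const struct toy_subvol *sv)
{
    return toy_uuid_is_empty(sv->received_uuid) ? sv->uuid
                                                : sv->received_uuid;
}

/* Receive src: pick a fresh local uuid, remember the advertised one. */
static void toy_receive(struct toy_subvol *dst, const struct toy_subvol *src,
                        uint8_t local_tag)
{
    assert(local_tag != 0);
    memset(dst->uuid, local_tag, TOY_UUID_SIZE);
    memcpy(dst->received_uuid, toy_uuid_to_send(src), TOY_UUID_SIZE);
}

/* Original -> second fs -> third fs; does the third copy still
 * point at the original uuid? */
static bool toy_chain_keeps_original_uuid(void)
{
    struct toy_subvol a, b, c;

    memset(&a, 0, sizeof(a));
    memset(a.uuid, 0xA1, TOY_UUID_SIZE);
    toy_receive(&b, &a, 0xB2);
    toy_receive(&c, &b, 0xC3);
    return memcmp(c.received_uuid, a.uuid, TOY_UUID_SIZE) == 0;
}
```

If toy_uuid_to_send() always returned sv->uuid instead, the third copy would record the second filesystem's uuid, which is the regression as described above.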


Tested-by: Ed Tomlinson 

This fixes a regression, can we please get it into 4.3?

Thanks
Ed Tomlinson



Re: [PATCH v5 9/9] btrfs: btrfs_copy_file_range() only supports reflinks

2015-10-12 Thread Darrick J. Wong
On Mon, Oct 12, 2015 at 07:34:44AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 12, 2015 at 11:23:05AM +0100, Pádraig Brady wrote:
> > You're right that if the user doesn't notice, then there is no
> > point exposing this. However I think the user does notice as
> > there is a difference in the end state of the copy.  I.E. generally
> > if there is a different end state it would require an option,
> > while if only a different copying mechanism it would not.
> > I think the different end state of a reflink warrants an option for 3 
> > reasons:
> > 
> >  - The user might want separate bits for resiliency. Now this is
> >    a weak argument due to possible deduplication in lower layers,
> >    but still valid in some setups.
> 
> This one is completely bogus.  For one because literally every lower
> layer can and increasingly will dedup or share in some form.  If we
> pretend we could do this we actively mislead the user.
> 
> >  - The user might want to avoid CoW at a later time critical stage.
> > 
> >  - The user might want to avoid ENOSPC at a later critical stage.
> 
> These two are the same and would be the argument for the "falloc" flag
> I mention before.  But we'd need to sit down and specify the exact
> semantics for it.  For example one important question that comes to mind
> is if it also applies for extents that are holes in the source range.

One of the patches in last week's XFS reflink patchbomb adds FALLOC_FL_UNSHARE
flag; at the moment it _only_ forces copy-on-write of shared blocks, and it
leaves holes alone.

Obviously we haven't yet figured out what are peoples' preferences in terms of
"fill the holes and unshare the shared" vs. "only unshare the shared" vs. "only
fill the holes".  It isn't that hard to add a FALLOC_FL_UNSHARE_FILL_HOLES flag
that fills the holes while unsharing is going on.

Personally I suspect that the most interest is in filling holes and unsharing,
because they don't want to pay for allocation at a critical stage for anywhere
in the file.  But I could be wrong, so allowing both goals to be expressed via
mode allows flexibility.
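
To make the three preferences concrete, here is a toy model in plain C. It is not the XFS implementation and makes no fallocate() call: the enum, the TOY_UNSHARE and TOY_FILL_HOLES bits and all function names are hypothetical. A "file" is an array of blocks that are each a hole, shared, or private, and the model counts how many blocks would still need a fresh allocation on a later overwrite.

```c
#include <assert.h>
#include <stddef.h>

enum toy_extent { TOY_HOLE, TOY_SHARED, TOY_PRIVATE };

#define TOY_UNSHARE    0x1u  /* copy shared blocks now */
#define TOY_FILL_HOLES 0x2u  /* allocate holes now */

/* Apply the mode in place; return how many blocks were allocated now. */
static size_t toy_prealloc(enum toy_extent *file, size_t n, unsigned mode)
{
    size_t allocated = 0;

    assert(file != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if ((file[i] == TOY_SHARED && (mode & TOY_UNSHARE)) ||
            (file[i] == TOY_HOLE && (mode & TOY_FILL_HOLES))) {
            file[i] = TOY_PRIVATE;
            allocated++;
        }
    }
    return allocated;
}

/* Blocks whose next overwrite still needs an allocation (and so could
 * still hit ENOSPC or pay for CoW at a critical moment). */
static size_t toy_deferred_allocs(const enum toy_extent *file, size_t n)
{
    size_t deferred = 0;

    for (size_t i = 0; i < n; i++)
        if (file[i] != TOY_PRIVATE)
            deferred++;
    return deferred;
}

/* Fixed sample layout: 3 holes, 2 shared blocks, 1 private block. */
static size_t toy_deferred_after(unsigned mode)
{
    enum toy_extent file[] = { TOY_HOLE, TOY_SHARED, TOY_PRIVATE,
                               TOY_SHARED, TOY_HOLE, TOY_HOLE };
    size_t n = sizeof(file) / sizeof(file[0]);

    (void)toy_prealloc(file, n, mode);
    return toy_deferred_allocs(file, n);
}
```

On the sample layout, only the combined mode leaves nothing to allocate later, which is the case I suspect most people care about.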

--D

> 
> I'd much rather get the basic system call in ASAP and then let people
> explain their use cases for this and only add it once we've made sure
> we have consistent semantics that actually fit the users needs.


Re: BTRFS with 8TB SMR drives

2015-10-12 Thread Justin Maggard
Sounds to me like this: https://bugzilla.kernel.org/show_bug.cgi?id=93581

On Mon, Oct 12, 2015 at 11:37 AM, Chris Murphy  wrote:
> I get a lot of these from both sdb and sdc
>
> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
> UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb] Sense
> Key : 0x3 [current]
> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
> ASC=0x11 ASCQ=0x0
> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb] CDB:
> opcode=0x88 88 00 00 00 00 00 11 b3 e1 98 00 00 00 08 00 00
> Oct 11 23:00:03 cloud.warrenhughes.net kernel: blk_update_request:
> critical medium error, dev sdb, sector 297001368
>
>
>
> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc]
> UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc] Sense
> Key : 0x3 [current] [descriptor]
> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc]
> ASC=0x11 ASCQ=0x0
> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc] CDB:
> opcode=0x88 88 00 00 00 00 01 3e 0a 7d 80 00 00 01 00 00 00
> Oct 11 23:47:32 cloud.warrenhughes.net kernel: blk_update_request:
> critical medium error, dev sdc, sector 5335842176
>
> There are a lot of these kinds of errors and they aren't all for the
> same LBA +/- 8 so they're are different physical sectors affected on
> both drives, but I don't know what the error is.
>
>
> Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
> 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
> Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
> 4096-byte physical blocks
>
> sd 0:0:1:0 starts out as sdb, but then goes a bit crazy somehow and
> eventually gets offlined
> Oct 11 23:55:24 cloud.warrenhughes.net kernel: sd 0:0:1:0: rejecting
> I/O to offline device
>
> And then reappears as sdo
>
> Oct 11 23:57:56 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdo]
> 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
>
> But no further scsi messages for this drive while Btrfs now complains
> about sdo instead of sdb. Seems to me that this device is confused
> even about its own error reporting. Anyway both sdb and sdc were
> having problems at the same time.
>
>
> Chris Murphy


Re: BTRFS with 8TB SMR drives

2015-10-12 Thread Warren Hughes
yes indeed - referenced it in my update here
https://mail-archive.com/linux-btrfs@vger.kernel.org/msg47380.html

On 13 October 2015 at 13:04, Justin Maggard  wrote:
> Sounds to me like this: https://bugzilla.kernel.org/show_bug.cgi?id=93581
>
> On Mon, Oct 12, 2015 at 11:37 AM, Chris Murphy  
> wrote:
>> I get a lot of these from both sdb and sdc
>>
>> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
>> UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
>> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb] Sense
>> Key : 0x3 [current]
>> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
>> ASC=0x11 ASCQ=0x0
>> Oct 11 23:00:03 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb] CDB:
>> opcode=0x88 88 00 00 00 00 00 11 b3 e1 98 00 00 00 08 00 00
>> Oct 11 23:00:03 cloud.warrenhughes.net kernel: blk_update_request:
>> critical medium error, dev sdb, sector 297001368
>>
>>
>>
>> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc]
>> UNKNOWN(0x2003) Result: hostbyte=0x00 driverbyte=0x08
>> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc] Sense
>> Key : 0x3 [current] [descriptor]
>> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc]
>> ASC=0x11 ASCQ=0x0
>> Oct 11 23:47:32 cloud.warrenhughes.net kernel: sd 0:0:2:0: [sdc] CDB:
>> opcode=0x88 88 00 00 00 00 01 3e 0a 7d 80 00 00 01 00 00 00
>> Oct 11 23:47:32 cloud.warrenhughes.net kernel: blk_update_request:
>> critical medium error, dev sdc, sector 5335842176
>>
>> There are a lot of these kinds of errors and they aren't all for the
>> same LBA +/- 8 so they're are different physical sectors affected on
>> both drives, but I don't know what the error is.
>>
>>
>> Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
>> 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
>> Oct 10 07:42:36 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdb]
>> 4096-byte physical blocks
>>
>> sd 0:0:1:0 starts out as sdb, but then goes a bit crazy somehow and
>> eventually gets offlined
>> Oct 11 23:55:24 cloud.warrenhughes.net kernel: sd 0:0:1:0: rejecting
>> I/O to offline device
>>
>> And then reappears as sdo
>>
>> Oct 11 23:57:56 cloud.warrenhughes.net kernel: sd 0:0:1:0: [sdo]
>> 15628053168 512-byte logical blocks: (8.00 TB/7.27 TiB)
>>
>> But no further scsi messages for this drive while Btrfs now complains
>> about sdo instead of sdb. Seems to me that this device is confused
>> even about its own error reporting. Anyway both sdb and sdc were
>> having problems at the same time.
>>
>>
>> Chris Murphy



-- 
Warren Hughes
+64 21 633324
IM: gtalk + msn: this email address, skype: akawsh


[PATCH v3 00/21] Rework btrfs qgroup reserved space framework

2015-10-12 Thread Qu Wenruo
In the previous qgroup rework, we succeeded in fixing the qgroup accounting
part, making the rfer/excl numbers accurate.

But that is only part of the qgroup work. The qgroup reserved space part
still has quite a lot of problems, and it can lead to EDQUOT even when we
are far from the limit.

[[BUG]]
The easiest way to trigger the bug is,
1) Enable quota
2) Limit excl of qgroup 5 to 16M
3) Write [0,2M) of a file inside subvol 5 10 times without sync

EDQUOT is triggered at about the 8th write.
But after a remount, we can still write until about 15M.

[[CAUSE]]
The problem is caused by the fact that qgroup reserves space even when
the data space is already reserved.

In the above reproducer, each time we do a buffered write to [0,2M), qgroup
reserves 2M of space. But in fact the 1st write has already reserved 2M,
and from then on we don't need to reserve any more data space, as we are
only writing [0,2M).

Also, the reserved space is only freed *ONCE*, when its backref is run at
commit_transaction() time.

That is what leaks the reserved space.
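
The leak can be illustrated with a toy model in userspace C. This is only a sketch under simplifying assumptions: struct toy_qgroup and both functions are invented for illustration, only data reservations are counted, and nothing is ever freed because no sync happens between writes.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)

/* Toy qgroup: a limit and a running reservation counter, nothing else. */
struct toy_qgroup {
    uint64_t limit;
    uint64_t reserved;
};

/* Naive reservation: charge len on every buffered write, whether or
 * not the same range was already charged by an earlier write. */
static int naive_reserve(struct toy_qgroup *qg, uint64_t len)
{
    assert(qg != NULL);
    if (qg->reserved + len > qg->limit)
        return -EDQUOT;
    qg->reserved += len;
    return 0;
}

/* Rewrite [0, 2M) up to `attempts` times without sync and return how
 * many of those writes get their reservation. */
static int count_successful_rewrites(uint64_t limit, int attempts)
{
    struct toy_qgroup qg = { .limit = limit, .reserved = 0 };
    int ok = 0;

    for (int i = 0; i < attempts; i++) {
        if (naive_reserve(&qg, 2 * MIB) != 0)
            break;
        ok++;
    }
    return ok;
}
```

With a 16M limit, count_successful_rewrites(16 * MIB, 10) stops after 8 writes although only 2M of file data exists, which is in the same ballpark as the "about the 8th write" observed with the real reproducer.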

[[FIX]]
Reuse the existing io_tree facilities to record which ranges are already
reserved for qgroup.

Although the behavior of qgroup reserved space is quite similar to the
existing DELALLOC flag, fallocate doesn't go through the DELALLOC flag, so
we introduce a new extent flag, EXTENT_QGROUP_RESERVED, for our own
purpose, without interfering with any existing flag.

The new API itself is quite safe: a caller that reserves or frees a
range twice or more won't cause any problem, due to the nature of the
design.
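
The idea of range-aware reservation can be sketched in userspace C. This is not the io_tree code: a fixed array of per-block flags stands in for EXTENT_QGROUP_RESERVED, the 4K granularity and 32M toy file size are assumptions, and every toy_* name is invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK  4096ULL  /* tracking granularity (assumption) */
#define TOY_BLOCKS 8192     /* toy file size limit: 32 MiB */

struct toy_rsv_map {
    bool reserved[TOY_BLOCKS]; /* stand-in for EXTENT_QGROUP_RESERVED */
    uint64_t limit;            /* qgroup limit in bytes */
    uint64_t charged;          /* bytes currently charged */
};

static void toy_init(struct toy_rsv_map *m, uint64_t limit)
{
    memset(m, 0, sizeof(*m));
    m->limit = limit;
}

/* Reserve [start, start + len): only blocks not yet marked are charged.
 * Returns bytes newly charged, or -1 (nothing changed) if over limit. */
static int64_t toy_reserve(struct toy_rsv_map *m, uint64_t start, uint64_t len)
{
    uint64_t first = start / TOY_BLOCK;
    uint64_t end = len ? (start + len + TOY_BLOCK - 1) / TOY_BLOCK : first;
    uint64_t newly = 0;

    assert(end <= TOY_BLOCKS);
    for (uint64_t b = first; b < end; b++)
        if (!m->reserved[b])
            newly += TOY_BLOCK;
    if (m->charged + newly > m->limit)
        return -1;
    for (uint64_t b = first; b < end; b++)
        m->reserved[b] = true;
    m->charged += newly;
    return (int64_t)newly;
}

/* Free [start, start + len): only marked blocks are uncharged, so
 * freeing the same range again is a no-op. Returns bytes uncharged. */
static uint64_t toy_free(struct toy_rsv_map *m, uint64_t start, uint64_t len)
{
    uint64_t first = start / TOY_BLOCK;
    uint64_t end = len ? (start + len + TOY_BLOCK - 1) / TOY_BLOCK : first;
    uint64_t freed = 0;

    assert(end <= TOY_BLOCKS);
    for (uint64_t b = first; b < end; b++) {
        if (m->reserved[b]) {
            m->reserved[b] = false;
            freed += TOY_BLOCK;
        }
    }
    m->charged -= freed;
    return freed;
}

/* Demo: rewrite [0, 2M) n times against a 16M limit. */
static uint64_t toy_charged_after_rewrites(int n)
{
    struct toy_rsv_map m;

    toy_init(&m, 16 * 1024 * 1024);
    for (int i = 0; i < n; i++)
        if (toy_reserve(&m, 0, 2 * 1024 * 1024) < 0)
            break;
    return m.charged;
}

/* Demo: reserve once, free twice; what is still charged? */
static uint64_t toy_charged_after_double_free(void)
{
    struct toy_rsv_map m;

    toy_init(&m, 16 * 1024 * 1024);
    (void)toy_reserve(&m, 0, 2 * 1024 * 1024);
    (void)toy_free(&m, 0, 2 * 1024 * 1024);
    (void)toy_free(&m, 0, 2 * 1024 * 1024);
    return m.charged;
}
```

In this model, rewriting [0,2M) any number of times keeps exactly 2M charged, and a double free leaves the counter at zero instead of underflowing.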

[[PATCH STRUCTURE]]
As the patchset is a little huge, it can be split into different parts:
1) Accurate reserve space framework API(Patch 1 ~ 8)
   Use io_tree to implement the needed data reserve API.
   And slightly change the metadata reserve API

2) Apply needed hooks to related callers(Patch 9 ~ 18)
   The following functions need to be converted to use the new qgroup
   reserve API:
   btrfs_check_data_free_space()
   btrfs_free_reserved_data_space()
   btrfs_delalloc_reserve_space()
   btrfs_delalloc_release_space()

   And the following function needs to change its behavior for accurate
   qgroup reserve space:
   btrfs_fallocate()

   Also add ftrace support for new APIs in patch 17.

3) Minor enhancement and fix(Patch 19~21)
   Avoid unneeded page truncating (Patch 19)
   Fix a deadlock caused by locking io_tree with the io_tree lock held in
   clear_bit_hook() (Patch 20)
   And finally, make qgroup reserved space much more obvious for
   further debugging (Patch 21)

[[Changelog]]
v2:
  Add new handlers to avoid reserved space leaking for buffered write
  followed by a truncate:
    btrfs_invalidatepage()
    evict_inode_truncate_page()
  Add new handlers to avoid reserved space leaking in error handling
  routines:
    btrfs_free_reserved_data_space()
    btrfs_delalloc_release_space()

v3:
  Use io_tree to implement data reserve map, which hugely reduced the
  patchset size, from 1300+ lines net insert to 600+ lines net insert.
  Suggested-by: Josef Bacik

Qu Wenruo (21):
  btrfs: extent_io: Introduce needed structure for recoding set/clear
bits
  btrfs: extent_io: Introduce new function set_record_extent_bits
  btrfs: extent_io: Introduce new function clear_record_extent_bits()
  btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function
  btrfs: qgroup: Introduce functions to release/free qgroup reserve data
space
  btrfs: delayed_ref: Add new function to record reserved space into
delayed ref
  btrfs: delayed_ref: release and free qgroup reserved at proper timing
  btrfs: qgroup: Introduce new functions to reserve/free metadata
  btrfs: qgroup: Use new metadata reservation.
  btrfs: extent-tree: Add new version of btrfs_check_data_free_space and
btrfs_free_reserved_data_space.
  btrfs: extent-tree: Switch to new check_data_free_space and
free_reserved_data_space
  btrfs: extent-tree: Add new version of
btrfs_delalloc_reserve/release_space
  btrfs: extent-tree: Switch to new delalloc space reserve and release
  btrfs: qgroup: Cleanup old inaccurate facilities
  btrfs: qgroup: Add handler for NOCOW and inline
  btrfs: Add handler for invalidate page
  btrfs: qgroup: Add new trace point for qgroup data reserve
  btrfs: fallocate: Add support to accurate qgroup reserve
  btrfs: Avoid truncate tailing page if fallocate range doesn't exceed
inode size
  btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in
clear_bit_hook
  btrfs: qgroup: Check if qgroup reserved space leaked

 fs/btrfs/ctree.h |  14 ++-
 fs/btrfs/delayed-ref.c   |  29 +++
 fs/btrfs/delayed-ref.h   |  14 +++
 fs/btrfs/disk-io.c   |   1 +
 fs/btrfs/extent-tree.c   | 149 ++--
 fs/btrfs/extent_io.c | 121 +++---
 fs/btrfs/extent_io.h |  19 +
 fs/btrfs/file.c  | 193 +
 fs/btrfs/inode-map.c  

[PATCH v3 12/21] btrfs: extent-tree: Add new version of btrfs_delalloc_reserve/release_space

2015-10-12 Thread Qu Wenruo
Add new versions of the btrfs_delalloc_reserve_space() and
btrfs_delalloc_release_space() functions, which support accurate qgroup
reservation.

Signed-off-by: Qu Wenruo 
---
v2:
  Add new function btrfs_delalloc_release_space() to handle error case.
v3:
  None
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/extent-tree.c | 59 ++
 2 files changed, 61 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 19450a1..4221bfd 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3473,7 +3473,9 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
 int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
+int __btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
 void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
+void __btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f4b9db8..32455e0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5723,6 +5723,44 @@ void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes)
 }
 
 /**
+ * __btrfs_delalloc_reserve_space - reserve data and metadata space for
+ * delalloc
+ * @inode: inode we're writing to
+ * @start: start range we are writing to
+ * @len: how long the range we are writing to
+ *
+ * TODO: This function will finally replace old btrfs_delalloc_reserve_space()
+ *
+ * This will do the following things
+ *
+ * o reserve space in data space info for num bytes
+ *   and reserve precious corresponding qgroup space
+ *   (Done in check_data_free_space)
+ *
+ * o reserve space for metadata space, based on the number of outstanding
+ *   extents and how much csums will be needed
+ *   also reserve metadata space in a per root over-reserve method.
+ * o add to the inodes->delalloc_bytes
+ * o add it to the fs_info's delalloc inodes list.
+ *   (Above 3 all done in delalloc_reserve_metadata)
+ *
+ * Return 0 for success
+ * Return <0 for error(-ENOSPC or -EQUOT)
+ */
+int __btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len)
+{
+   int ret;
+
+   ret = __btrfs_check_data_free_space(inode, start, len);
+   if (ret < 0)
+   return ret;
+   ret = btrfs_delalloc_reserve_metadata(inode, len);
+   if (ret < 0)
+   __btrfs_free_reserved_data_space(inode, start, len);
+   return ret;
+}
+
+/**
  * btrfs_delalloc_reserve_space - reserve data and metadata space for delalloc
  * @inode: inode we're writing to
  * @num_bytes: the number of bytes we want to allocate
@@ -5755,6 +5793,27 @@ int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes)
 }
 
 /**
+ * __btrfs_delalloc_release_space - release data and metadata space for delalloc
+ * @inode: inode we're releasing space for
+ * @start: start position of the space already reserved
+ * @len: the len of the space already reserved
+ *
+ * This must be matched with a call to btrfs_delalloc_reserve_space.  This is
+ * called in the case that we don't need the metadata AND data reservations
+ * anymore.  So if there is an error or we insert an inline extent.
+ *
+ * This function will release the metadata space that was not used and will
+ * decrement ->delalloc_bytes and remove it from the fs_info delalloc_inodes
+ * list if there are no delalloc bytes left.
+ * Also it will handle the qgroup reserved space.
+ */
+void __btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len)
+{
+   btrfs_delalloc_release_metadata(inode, len);
+   __btrfs_free_reserved_data_space(inode, start, len);
+}
+
+/**
  * btrfs_delalloc_release_space - release data and metadata space for delalloc
  * @inode: inode we're releasing space for
  * @num_bytes: the number of bytes we want to free up
-- 
2.6.1
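
The reserve-then-roll-back shape of __btrfs_delalloc_reserve_space() in the patch above can be tried out in userspace with a toy model. This is only a sketch: struct toy_space and the toy_* functions are invented here, and the metadata requirement is passed in directly as meta_len, whereas the kernel derives it from the data length.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Two independent pools standing in for data and metadata space. */
struct toy_space {
    uint64_t data_free;
    uint64_t meta_free;
};

/* Take len bytes out of a pool, or fail without touching it. */
static int toy_reserve(uint64_t *pool, uint64_t len)
{
    if (*pool < len)
        return -ENOSPC;
    *pool -= len;
    return 0;
}

/* Reserve data first, then metadata; if the metadata step fails,
 * give the data reservation back so nothing is leaked. */
static int toy_delalloc_reserve(struct toy_space *s, uint64_t len,
                                uint64_t meta_len)
{
    int ret;

    assert(s != NULL);
    ret = toy_reserve(&s->data_free, len);
    if (ret < 0)
        return ret;
    ret = toy_reserve(&s->meta_free, meta_len);
    if (ret < 0)
        s->data_free += len; /* roll back the data reservation */
    return ret;
}

/* Demo: how much data space is left after one reservation attempt? */
static uint64_t toy_data_left_after(uint64_t data, uint64_t meta,
                                    uint64_t len, uint64_t meta_len)
{
    struct toy_space s = { .data_free = data, .meta_free = meta };

    (void)toy_delalloc_reserve(&s, len, meta_len);
    return s.data_free;
}
```

A failed metadata reservation leaves the data pool exactly as it was, which is the property the error path in the patch is there to provide.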



[PATCH v3 20/21] btrfs: qgroup: Avoid calling btrfs_free_reserved_data_space in clear_bit_hook

2015-10-12 Thread Qu Wenruo
In clear_bit_hook, qgroup reserved data is already handled quite well,
either released by finish_ordered_io or invalidatepage.

So calling btrfs_qgroup_free_data() here is completely meaningless, and
since btrfs_qgroup_free_data() will lock io_tree, it can't be called
with the io_tree lock held.

This patch adds a new function,
btrfs_free_reserved_data_space_noquota(), for clear_bit_hook() to silence
the lockdep warning.

Signed-off-by: Qu Wenruo 
---
v2:
  None
v3:
  Update commit message as now it will cause a deadlock instead of
  lockdep warning
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/extent-tree.c | 28 ++--
 fs/btrfs/inode.c   |  4 ++--
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f20b901..3970426 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3455,6 +3455,8 @@ enum btrfs_reserve_flush_enum {
 int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
+void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
+   u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 765f7e0..af221eb 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4070,10 +4070,12 @@ int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
  * Called if we need to clear a data reservation for this inode
  * Normally in a error case.
  *
- * This one will handle the per-indoe data rsv map for accurate reserved
- * space framework.
+ * This one will *NOT* use accurate qgroup reserved space API, just for case
+ * which we can't sleep and is sure it won't affect qgroup reserved space.
+ * Like clear_bit_hook().
  */
-void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+void btrfs_free_reserved_data_space_noquota(struct inode *inode, u64 start,
+   u64 len)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
struct btrfs_space_info *data_sinfo;
@@ -4083,13 +4085,6 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
  round_down(start, root->sectorsize);
start = round_down(start, root->sectorsize);
 
-   /*
-* Free any reserved qgroup data space first
-* As it will alloc memory, we can't do it with data sinfo
-* spinlock hold.
-*/
-   btrfs_qgroup_free_data(inode, start, len);
-
data_sinfo = root->fs_info->data_sinfo;
spin_lock(&data_sinfo->lock);
if (WARN_ON(data_sinfo->bytes_may_use < len))
@@ -4101,6 +4096,19 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
spin_unlock(&data_sinfo->lock);
 }
 
+/*
+ * Called if we need to clear a data reservation for this inode
+ * Normally in a error case.
+ *
+ * This one will handle the per-indoe data rsv map for accurate reserved
+ * space framework.
+ */
+void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+{
+   btrfs_free_reserved_data_space_noquota(inode, start, len);
+   btrfs_qgroup_free_data(inode, start, len);
+}
+
 static void force_metadata_allocation(struct btrfs_fs_info *info)
 {
struct list_head *head = &info->space_info;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index fee54b6..39c9191 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1773,8 +1773,8 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 
if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
&& do_list && !(state->state & EXTENT_NORESERVE))
-   btrfs_free_reserved_data_space(inode, state->start,
-  len);
+   btrfs_free_reserved_data_space_noquota(inode,
+   state->start, len);
 
__percpu_counter_add(&root->fs_info->delalloc_bytes, -len,
 root->fs_info->delalloc_batch);
-- 
2.6.1
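
The split into a _noquota core plus a wrapper can be modelled in userspace. This is a toy sketch, not kernel code: a bool stands in for the io_tree lock, and where a real non-recursive lock would hang, the toy returns -EDEADLK so the difference is observable. All toy_* names are invented for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

struct toy_inode {
    bool io_tree_locked;    /* models the io_tree lock being held */
    uint64_t bytes_may_use; /* models the data space_info counter */
    uint64_t qgroup_rsv;    /* models qgroup reserved bytes */
};

/* Takes the io_tree lock itself, so it must not be called with the
 * lock already held. The toy reports that case as -EDEADLK. */
static int toy_qgroup_free_data(struct toy_inode *in, uint64_t len)
{
    if (in->io_tree_locked)
        return -EDEADLK;
    in->io_tree_locked = true;
    in->qgroup_rsv -= (len < in->qgroup_rsv) ? len : in->qgroup_rsv;
    in->io_tree_locked = false;
    return 0;
}

/* Core part: touches only the space counter, safe under the lock. */
static void toy_free_data_noquota(struct toy_inode *in, uint64_t len)
{
    assert(in->bytes_may_use >= len);
    in->bytes_may_use -= len;
}

/* Wrapper for ordinary callers: core part plus the qgroup part. */
static int toy_free_data(struct toy_inode *in, uint64_t len)
{
    toy_free_data_noquota(in, len);
    return toy_qgroup_free_data(in, len);
}

/* Models clear_bit_hook(), which runs with the io_tree lock held. */
static int toy_clear_bit_hook(bool use_noquota)
{
    struct toy_inode in = {
        .io_tree_locked = true,
        .bytes_may_use = 4096,
        .qgroup_rsv = 0,
    };

    if (use_noquota) {
        toy_free_data_noquota(&in, 4096);
        return 0;
    }
    return toy_free_data(&in, 4096);
}
```

Calling the full wrapper from the hook hits the lock problem, while the _noquota variant does not, which mirrors the change to btrfs_clear_bit_hook() in the diff above.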



[PATCH v3 13/21] btrfs: extent-tree: Switch to new delalloc space reserve and release

2015-10-12 Thread Qu Wenruo
Use new __btrfs_delalloc_reserve_space() and
__btrfs_delalloc_release_space() to reserve and release space for
delalloc.

Signed-off-by: Qu Wenruo 
---
v2:
  Also use __btrfs_delalloc_release_space() function.
v3:
  None
---
 fs/btrfs/file.c  |  5 +++--
 fs/btrfs/inode-map.c |  6 +++---
 fs/btrfs/inode.c | 38 +++---
 fs/btrfs/ioctl.c | 14 +-
 4 files changed, 38 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 142b217..bf4d5fb 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1611,7 +1611,7 @@ again:
btrfs_delalloc_release_metadata(inode,
release_bytes);
else
-   btrfs_delalloc_release_space(inode,
+   __btrfs_delalloc_release_space(inode, pos,
 release_bytes);
}
 
@@ -1664,7 +1664,8 @@ again:
btrfs_end_write_no_snapshoting(root);
btrfs_delalloc_release_metadata(inode, release_bytes);
} else {
-   btrfs_delalloc_release_space(inode, release_bytes);
+   __btrfs_delalloc_release_space(inode, pos,
+  release_bytes);
}
}
 
diff --git a/fs/btrfs/inode-map.c b/fs/btrfs/inode-map.c
index d4a582a..78bc09c 100644
--- a/fs/btrfs/inode-map.c
+++ b/fs/btrfs/inode-map.c
@@ -488,17 +488,17 @@ again:
/* Just to make sure we have enough space */
prealloc += 8 * PAGE_CACHE_SIZE;
 
-   ret = btrfs_delalloc_reserve_space(inode, prealloc);
+   ret = __btrfs_delalloc_reserve_space(inode, 0, prealloc);
if (ret)
goto out_put;
 
ret = btrfs_prealloc_file_range_trans(inode, trans, 0, 0, prealloc,
  prealloc, prealloc, &alloc_hint);
if (ret) {
-   btrfs_delalloc_release_space(inode, prealloc);
+   __btrfs_delalloc_release_space(inode, 0, prealloc);
goto out_put;
}
-   btrfs_free_reserved_data_space(inode, prealloc);
+   __btrfs_free_reserved_data_space(inode, 0, prealloc);
 
ret = btrfs_write_out_ino_cache(root, trans, path, inode);
 out_put:
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index f5c2ffe..df3cff2 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -1766,7 +1766,8 @@ static void btrfs_clear_bit_hook(struct inode *inode,
 
if (root->root_key.objectid != BTRFS_DATA_RELOC_TREE_OBJECTID
&& do_list && !(state->state & EXTENT_NORESERVE))
-   btrfs_free_reserved_data_space(inode, len);
+   __btrfs_free_reserved_data_space(inode, state->start,
+len);
 
__percpu_counter_add(&root->fs_info->delalloc_bytes, -len,
 root->fs_info->delalloc_batch);
@@ -1985,7 +1986,8 @@ again:
goto again;
}
 
-   ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+   ret = __btrfs_delalloc_reserve_space(inode, page_start,
+PAGE_CACHE_SIZE);
if (ret) {
mapping_set_error(page->mapping, ret);
end_extent_writepage(page, ret, page_start, page_end);
@@ -4581,14 +4583,17 @@ int btrfs_truncate_page(struct inode *inode, loff_t 
from, loff_t len,
if ((offset & (blocksize - 1)) == 0 &&
(!len || ((len & (blocksize - 1)) == 0)))
goto out;
-   ret = btrfs_delalloc_reserve_space(inode, PAGE_CACHE_SIZE);
+   ret = __btrfs_delalloc_reserve_space(inode,
+   round_down(from, PAGE_CACHE_SIZE), PAGE_CACHE_SIZE);
if (ret)
goto out;
 
 again:
page = find_or_create_page(mapping, index, mask);
if (!page) {
-   btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+   __btrfs_delalloc_release_space(inode,
+   round_down(from, PAGE_CACHE_SIZE),
+   PAGE_CACHE_SIZE);
ret = -ENOMEM;
goto out;
}
@@ -4656,7 +4661,8 @@ again:
 
 out_unlock:
if (ret)
-   btrfs_delalloc_release_space(inode, PAGE_CACHE_SIZE);
+   __btrfs_delalloc_release_space(inode, page_start,
+  PAGE_CACHE_SIZE);
unlock_page(page);
page_cache_release(page);
 out:
@@ -7587,7 +7593,7 @@ unlock:
spin_unlock(&BTRFS_I(inode)->lock);
}
 
-   btrfs_free_reserved_data_space(inode, len);
+   __btrfs_free_reserved_data_space(inode, start, len);
  

[PATCH v3 02/21] btrfs: extent_io: Introduce new function set_record_extent_bits

2015-10-12 Thread Qu Wenruo
Introduce a new function, set_record_extent_bits(), which not only sets
the given bits but also records how many bytes changed, along with
detailed range info.

This is important for the later qgroup reserve framework.
The number of bytes will be used for the qgroup reservation, and the
detailed range info will be used for cleanup in the EDQUOT case.
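
The bookkeeping described above can be sketched in userspace C. This is only an illustration, not the kernel code: the names `range`, `changeset`, `state` and `set_bits_recorded()` and the fixed-size array are assumptions made for the example, whereas the patch itself records ranges in a ulist attached to `struct extent_changeset`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 16 /* hypothetical cap, the kernel uses a growable ulist */

struct range {
	uint64_t start;
	uint64_t end; /* inclusive, like extent_state->end */
};

struct changeset {
	uint64_t bytes_changed;
	size_t nr_ranges;
	struct range ranges[MAX_RANGES];
};

struct state {
	uint64_t start;
	uint64_t end; /* inclusive */
	unsigned bits;
};

/*
 * Set 'bits' on one state. If a changeset is given and the state did not
 * already have all of the bits, account the bytes and remember the range.
 */
void set_bits_recorded(struct state *st, unsigned bits, struct changeset *cs)
{
	if (cs && (st->bits & bits) != bits) {
		assert(cs->nr_ranges < MAX_RANGES);
		cs->bytes_changed += st->end - st->start + 1;
		cs->ranges[cs->nr_ranges].start = st->start;
		cs->ranges[cs->nr_ranges].end = st->end;
		cs->nr_ranges++;
	}
	st->bits |= bits;
}
```

Setting the same bit twice on the same state only accounts the bytes once, which is what lets a caller reserve exactly `bytes_changed` and later undo exactly the recorded ranges.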

Signed-off-by: Qu Wenruo 
---
v3:
  Newly introduced
---
 fs/btrfs/extent_io.c | 71 +++-
 fs/btrfs/extent_io.h |  3 +++
 2 files changed, 56 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 363726b..f5efaa6 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -131,6 +131,23 @@ struct extent_page_data {
unsigned int sync_io:1;
 };
 
+static void add_extent_changeset(struct extent_state *state, unsigned bits,
+struct extent_changeset *changeset,
+int set)
+{
+   int ret;
+
+   if (!changeset)
+   return;
+   if (set && (state->state & bits) == bits)
+   return;
+   changeset->bytes_changed += state->end - state->start + 1;
+   ret = ulist_add(changeset->range_changed, state->start, state->end,
+   GFP_ATOMIC);
+   /* ENOMEM */
+   BUG_ON(ret < 0);
+}
+
 static noinline void flush_write_bio(void *data);
 static inline struct btrfs_fs_info *
 tree_fs_info(struct extent_io_tree *tree)
@@ -410,7 +427,8 @@ static void clear_state_cb(struct extent_io_tree *tree,
 }
 
 static void set_state_bits(struct extent_io_tree *tree,
-  struct extent_state *state, unsigned *bits);
+  struct extent_state *state, unsigned *bits,
+  struct extent_changeset *changeset);
 
 /*
  * insert an extent_state struct into the tree.  'bits' are set on the
@@ -426,7 +444,7 @@ static int insert_state(struct extent_io_tree *tree,
struct extent_state *state, u64 start, u64 end,
struct rb_node ***p,
struct rb_node **parent,
-   unsigned *bits)
+   unsigned *bits, struct extent_changeset *changeset)
 {
struct rb_node *node;
 
@@ -436,7 +454,7 @@ static int insert_state(struct extent_io_tree *tree,
state->start = start;
state->end = end;
 
-   set_state_bits(tree, state, bits);
+   set_state_bits(tree, state, bits, changeset);
 
node = tree_insert(&tree->state, NULL, end, &state->rb_node, p, parent);
if (node) {
@@ -789,7 +807,7 @@ out:
 
 static void set_state_bits(struct extent_io_tree *tree,
   struct extent_state *state,
-  unsigned *bits)
+  unsigned *bits, struct extent_changeset *changeset)
 {
unsigned bits_to_set = *bits & ~EXTENT_CTLBITS;
 
@@ -798,6 +816,7 @@ static void set_state_bits(struct extent_io_tree *tree,
u64 range = state->end - state->start + 1;
tree->dirty_bytes += range;
}
+   add_extent_changeset(state, bits_to_set, changeset, 1);
state->state |= bits_to_set;
 }
 
@@ -835,7 +854,7 @@ static int __must_check
 __set_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
 unsigned bits, unsigned exclusive_bits,
 u64 *failed_start, struct extent_state **cached_state,
-gfp_t mask)
+gfp_t mask, struct extent_changeset *changeset)
 {
struct extent_state *state;
struct extent_state *prealloc = NULL;
@@ -873,7 +892,7 @@ again:
prealloc = alloc_extent_state_atomic(prealloc);
BUG_ON(!prealloc);
err = insert_state(tree, prealloc, start, end,
-  &p, &parent, &bits);
+  &p, &parent, &bits, changeset);
if (err)
extent_io_tree_panic(tree, err);
 
@@ -899,7 +918,7 @@ hit_next:
goto out;
}
 
-   set_state_bits(tree, state, &bits);
+   set_state_bits(tree, state, &bits, changeset);
cache_state(state, cached_state);
merge_state(tree, state);
if (last_end == (u64)-1)
@@ -945,7 +964,7 @@ hit_next:
if (err)
goto out;
if (state->end <= end) {
-   set_state_bits(tree, state, &bits);
+   set_state_bits(tree, state, &bits, changeset);
cache_state(state, cached_state);
merge_state(tree, state);
if (last_end == (u64)-1)
@@ -980,7 +999,7 @@ hit_next:
 * the later extent.
 */
err = insert_state(tree, prealloc, start, this_end,
-  NULL, NULL, &bits);
+  

[PATCH v3 16/21] btrfs: Add handler for invalidate page

2015-10-12 Thread Qu Wenruo
For btrfs_invalidatepage() and its variant evict_inode_truncate_pages(),
some pages never reach disk.
In that case, their reserved space will be neither released nor freed by
finish_ordered_io() or the delayed_ref handler.

So we must free their qgroup reserved space, or we will leak reserved
space again.

This patch therefore calls btrfs_qgroup_free_data() for
invalidatepage() and its variant evict_inode_truncate_pages().

Due to the nature of the new btrfs_qgroup_reserve/free_data(), reserved
space is only reserved or freed once, so for pages which are
already flushed to disk, their reserved space will be released and freed
by the delayed_ref handler.

Double free won't be a problem.
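
Why a second free is harmless can be shown with a small userspace model. This is a sketch under assumptions, not the kernel implementation: a 64-bit bitmap stands in for the EXTENT_QGROUP_RESERVED bit in the io_tree, one bit per page, and `rsv_map`, `rsv_reserve()` and `rsv_free()` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define NR_PAGES 64u /* hypothetical file size in pages for this model */

struct rsv_map {
	uint64_t reserved_bits;  /* bit i set: page i holds a reservation */
	uint64_t reserved_pages; /* pages currently charged to the quota */
};

/* Reserve [first, first + nr); only pages not yet reserved are charged. */
uint64_t rsv_reserve(struct rsv_map *m, unsigned first, unsigned nr)
{
	uint64_t changed = 0;

	assert(first <= NR_PAGES && nr <= NR_PAGES - first);
	for (unsigned i = first; i < first + nr; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if (!(m->reserved_bits & bit)) {
			m->reserved_bits |= bit;
			changed++;
		}
	}
	m->reserved_pages += changed;
	return changed;
}

/* Free [first, first + nr); only pages still reserved are uncharged. */
uint64_t rsv_free(struct rsv_map *m, unsigned first, unsigned nr)
{
	uint64_t changed = 0;

	assert(first <= NR_PAGES && nr <= NR_PAGES - first);
	for (unsigned i = first; i < first + nr; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if (m->reserved_bits & bit) {
			m->reserved_bits &= ~bit;
			changed++;
		}
	}
	m->reserved_pages -= changed;
	return changed;
}
```

Because the counter only moves by the number of bits that actually flipped, calling `rsv_free()` on a range that was already released returns 0 and leaves the counter untouched.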

Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
v3:
  None
---
 fs/btrfs/inode.c | 24 
 1 file changed, 24 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 2fe95f0..fee54b6 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5075,6 +5075,18 @@ static void evict_inode_truncate_pages(struct inode 
*inode)
spin_unlock(&io_tree->lock);
 
lock_extent_bits(io_tree, start, end, 0, &cached_state);
+
+   /*
+* If still has DELALLOC flag, the extent didn't reach disk,
+* and its reserved space won't be freed by delayed_ref.
+* So we need to free its reserved space here.
+* (Refer to comment in btrfs_invalidatepage, case 2)
+*
+* Note, end is the bytenr of last byte, so we need + 1 here.
+*/
+   if (state->state & EXTENT_DELALLOC)
+   btrfs_qgroup_free_data(inode, start, end - start + 1);
+
clear_extent_bit(io_tree, start, end,
 EXTENT_LOCKED | EXTENT_DIRTY |
 EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING |
@@ -8592,6 +8604,18 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
}
}
 
+   /*
+* Qgroup reserved space handler
+* Page here will be either
+* 1) Already written to disk
+*In this case, its reserved space is released from data rsv map
+*and will be freed by delayed_ref handler finally.
+*So even we call qgroup_free_data(), it won't decrease reserved
+*space.
+* 2) Not written to disk
+*This means the reserved space should be freed here.
+*/
+   btrfs_qgroup_free_data(inode, page_start, PAGE_CACHE_SIZE);
if (!inode_evicting) {
clear_extent_bit(tree, page_start, page_end,
 EXTENT_LOCKED | EXTENT_DIRTY |
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 15/21] btrfs: qgroup: Add handler for NOCOW and inline

2015-10-12 Thread Qu Wenruo
For the NOCOW and inline cases, no delayed_ref is created for
them, so we should free their reserved data space at the proper
time (finish_ordered_io for NOCOW and cow_file_range_inline for inline).

Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
v3:
  None
---
 fs/btrfs/extent-tree.c |  7 ++-
 fs/btrfs/inode.c   | 15 +++
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1dadbba..765f7e0 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4056,7 +4056,12 @@ int btrfs_check_data_free_space(struct inode *inode, u64 
start, u64 len)
if (ret < 0)
return ret;
 
-   /* Use new btrfs_qgroup_reserve_data to reserve precious data space */
+   /*
+* Use new btrfs_qgroup_reserve_data to reserve precious data space
+*
+* TODO: Find a good method to avoid reserve data space for NOCOW
+* range, but don't impact performance on quota disable case.
+*/
ret = btrfs_qgroup_reserve_data(inode, start, len);
return ret;
 }
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index ef0f8cd..2fe95f0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -310,6 +310,13 @@ static noinline int cow_file_range_inline(struct 
btrfs_root *root,
btrfs_delalloc_release_metadata(inode, end + 1 - start);
btrfs_drop_extent_cache(inode, start, aligned_end - 1, 0);
 out:
+   /*
+* Don't forget to free the reserved space, as for inlined extent
+* it won't count as data extent, free them directly here.
+* And at reserve time, it's always aligned to page size, so
+* just free one page here.
+*/
+   btrfs_qgroup_free_data(inode, 0, PAGE_CACHE_SIZE);
btrfs_free_path(path);
btrfs_end_transaction(trans, root);
return ret;
@@ -2832,6 +2839,14 @@ static int btrfs_finish_ordered_io(struct 
btrfs_ordered_extent *ordered_extent)
 
if (test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags)) {
BUG_ON(!list_empty(&ordered_extent->list)); /* Logic error */
+
+   /*
+* For mwrite(mmap + memset to write) case, we still reserve
+* space for NOCOW range.
+* As NOCOW won't cause a new delayed ref, just free the space
+*/
+   btrfs_qgroup_free_data(inode, ordered_extent->file_offset,
+  ordered_extent->len);
btrfs_ordered_update_i_size(inode, 0, ordered_extent);
if (nolock)
trans = btrfs_join_transaction_nolock(root);
-- 
2.6.1



[PATCH v3 08/21] btrfs: qgroup: Introduce new functions to reserve/free metadata

2015-10-12 Thread Qu Wenruo
Introduce new functions btrfs_qgroup_reserve/free_meta() to reserve/free
metadata reserved space.
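
As a rough illustration of the reserve/free pairing, here is a userspace sketch using C11 atomics. It is not the kernel code: `root_model`, its `limit` field and the `model_*` functions are hypothetical, and the limit check is deliberately simplified (the check-then-add sequence is not atomic as a whole, so treat this as a single-threaded model).

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

struct root_model {
	long limit;           /* hypothetical quota limit in bytes */
	atomic_long reserved; /* stands in for the qgroup reserved counter */
	atomic_long meta_rsv; /* stands in for root->qgroup_meta_rsv */
};

/* Reserve metadata space; on success remember it in the per-root counter. */
int model_reserve_meta(struct root_model *r, long num_bytes)
{
	assert(num_bytes >= 0);
	if (num_bytes == 0)
		return 0;
	if (atomic_load(&r->reserved) + num_bytes > r->limit)
		return -EDQUOT;
	atomic_fetch_add(&r->reserved, num_bytes);
	atomic_fetch_add(&r->meta_rsv, num_bytes);
	return 0;
}

/* Give back exactly num_bytes of a previous metadata reservation. */
void model_free_meta(struct root_model *r, long num_bytes)
{
	assert(atomic_load(&r->meta_rsv) >= num_bytes);
	atomic_fetch_sub(&r->meta_rsv, num_bytes);
	atomic_fetch_sub(&r->reserved, num_bytes);
}

/* Give back whatever is still recorded in the per-root counter. */
void model_free_meta_all(struct root_model *r)
{
	long reserved = atomic_exchange(&r->meta_rsv, 0);

	if (reserved)
		atomic_fetch_sub(&r->reserved, reserved);
}
```

Keeping a per-root counter is what makes a "free everything" call possible: the exchange-with-zero reads and clears the outstanding amount in one step.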

Signed-off-by: Qu Wenruo 
---
v2:
  None
v3:
  None
---
 fs/btrfs/ctree.h   |  3 +++
 fs/btrfs/disk-io.c |  1 +
 fs/btrfs/qgroup.c  | 40 
 fs/btrfs/qgroup.h  |  4 
 4 files changed, 48 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..ae86025 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1943,6 +1943,9 @@ struct btrfs_root {
int send_in_progress;
struct btrfs_subvolume_writers *subv_writers;
atomic_t will_be_snapshoted;
+
+   /* For qgroup metadata space reserve */
+   atomic_t qgroup_meta_rsv;
 };
 
 struct btrfs_ioctl_defrag_range_args {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 807f685..2b51705 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1259,6 +1259,7 @@ static void __setup_root(u32 nodesize, u32 sectorsize, 
u32 stripesize,
atomic_set(&root->orphan_inodes, 0);
atomic_set(&root->refs, 1);
atomic_set(&root->will_be_snapshoted, 0);
+   atomic_set(&root->qgroup_meta_rsv, 0);
root->log_transid = 0;
root->log_transid_committed = -1;
root->last_log_commit = 0;
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a2678f6..b5d1850 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2589,3 +2589,43 @@ int btrfs_qgroup_release_data(struct inode *inode, u64 
start, u64 len)
 {
return __btrfs_qgroup_release_data(inode, start, len, 0);
 }
+
+int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes)
+{
+   int ret;
+
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid) ||
+   num_bytes == 0)
+   return 0;
+
+   BUG_ON(num_bytes != round_down(num_bytes, root->nodesize));
+   ret = btrfs_qgroup_reserve(root, num_bytes);
+   if (ret < 0)
+   return ret;
+   atomic_add(num_bytes, &root->qgroup_meta_rsv);
+   return ret;
+}
+
+void btrfs_qgroup_free_meta_all(struct btrfs_root *root)
+{
+   int reserved;
+
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid))
+   return;
+
+   reserved = atomic_xchg(&root->qgroup_meta_rsv, 0);
+   if (reserved == 0)
+   return;
+   btrfs_qgroup_free(root, reserved);
+}
+
+void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes)
+{
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid))
+   return;
+
+   BUG_ON(num_bytes != round_down(num_bytes, root->nodesize));
+   WARN_ON(atomic_read(&root->qgroup_meta_rsv) < num_bytes);
+   atomic_sub(num_bytes, &root->qgroup_meta_rsv);
+   btrfs_qgroup_free(root, num_bytes);
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 80924ae..7d1c87c 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -101,4 +101,8 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info 
*fs_info, u64 qgroupid,
 int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len);
 int btrfs_qgroup_release_data(struct inode *inode, u64 start, u64 len);
 int btrfs_qgroup_free_data(struct inode *inode, u64 start, u64 len);
+
+int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes);
+void btrfs_qgroup_free_meta_all(struct btrfs_root *root);
+void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes);
 #endif /* __BTRFS_QGROUP__ */
-- 
2.6.1



[PATCH v3 18/21] btrfs: fallocate: Add support to accurate qgroup reserve

2015-10-12 Thread Qu Wenruo
Now fallocate does an accurate qgroup reserved space check, unlike the old
method, which always reserved the whole length of the range.

With this patch, fallocate will:
1) Iterate the desired range and mark it in the data rsv map
   Only ranges which are going to be allocated will be recorded in the
   data rsv map and have space reserved.
   Already allocated ranges (normal/prealloc extents) will be
   skipped.
   Also, record the marked ranges in a new list for later use.

2) If 1) succeeded, do the real file extent allocation.
   At file extent allocation time, the corresponding range will be
   removed from the data rsv map.
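
The "record the marked range into a new list" step in 1) can be sketched in userspace C. This is an assumption-laden illustration, not the patch itself: it uses a plain singly linked list with a tail pointer instead of the kernel's `list_head`, `malloc()` instead of `kmalloc()`, and the names `range_list`, `add_range()` and `free_ranges()` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct falloc_range {
	struct falloc_range *next;
	uint64_t start;
	uint64_t len;
};

struct range_list {
	struct falloc_range *head;
	struct falloc_range *tail;
};

/*
 * Append [start, start + len). Ranges arrive in ascending order, so only
 * the last entry needs to be checked for a possible merge.
 */
int add_range(struct range_list *l, uint64_t start, uint64_t len)
{
	struct falloc_range *r;

	assert(len > 0);
	if (l->tail && l->tail->start + l->tail->len == start) {
		l->tail->len += len; /* contiguous with the last range */
		return 0;
	}
	r = malloc(sizeof(*r));
	if (!r)
		return -1;
	r->next = NULL;
	r->start = start;
	r->len = len;
	if (l->tail)
		l->tail->next = r;
	else
		l->head = r;
	l->tail = r;
	return 0;
}

/* Release every recorded range, e.g. on the error path. */
void free_ranges(struct range_list *l)
{
	struct falloc_range *r = l->head;

	while (r) {
		struct falloc_range *next = r->next;

		free(r);
		r = next;
	}
	l->head = NULL;
	l->tail = NULL;
}
```

Merging contiguous ranges keeps the list short, so step 2) issues one allocation per hole rather than one per extent map lookup.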

Signed-off-by: Qu Wenruo 
---
v2:
  Fix comment typo
  Add missing cleanup for falloc list
v3:
  Fix a false error return: chunk allocation may return >0, which was
  incorrectly treated as an error.
---
 fs/btrfs/file.c | 161 
 1 file changed, 117 insertions(+), 44 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index c97b24f..35bfabf 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2545,17 +2545,61 @@ out_only_mutex:
return err;
 }
 
+/* Helper structure to record which range is already reserved */
+struct falloc_range {
+   struct list_head list;
+   u64 start;
+   u64 len;
+};
+
+/*
+ * Helper function to add falloc range
+ *
+ * Caller should have locked the larger range of extent containing
+ * [start, len)
+ */
+static int add_falloc_range(struct list_head *head, u64 start, u64 len)
+{
+   struct falloc_range *prev = NULL;
+   struct falloc_range *range = NULL;
+
+   if (list_empty(head))
+   goto insert;
+
+   /*
+* As fallocate iterate by bytenr order, we only need to check
+* the last range.
+*/
+   prev = list_entry(head->prev, struct falloc_range, list);
+   if (prev->start + prev->len == start) {
+   prev->len += len;
+   return 0;
+   }
+insert:
+   range = kmalloc(sizeof(*range), GFP_NOFS);
+   if (!range)
+   return -ENOMEM;
+   range->start = start;
+   range->len = len;
+   list_add_tail(&range->list, head);
+   return 0;
+}
+
 static long btrfs_fallocate(struct file *file, int mode,
loff_t offset, loff_t len)
 {
struct inode *inode = file_inode(file);
struct extent_state *cached_state = NULL;
+   struct falloc_range *range;
+   struct falloc_range *tmp;
+   struct list_head reserve_list;
u64 cur_offset;
u64 last_byte;
u64 alloc_start;
u64 alloc_end;
u64 alloc_hint = 0;
u64 locked_end;
+   u64 actual_end = 0;
struct extent_map *em;
int blocksize = BTRFS_I(inode)->root->sectorsize;
int ret;
@@ -2571,14 +2615,12 @@ static long btrfs_fallocate(struct file *file, int mode,
return btrfs_punch_hole(inode, offset, len);
 
/*
-* Make sure we have enough space before we do the
-* allocation.
-* XXX: The behavior must be changed to do accurate check first
-* and then check data reserved space.
+* Only trigger disk allocation, don't trigger qgroup reserve
+*
+* For qgroup space, it will be checked later.
 */
-   ret = btrfs_check_data_free_space(inode, alloc_start,
- alloc_end - alloc_start);
-   if (ret)
+   ret = btrfs_alloc_data_chunk_ondemand(inode, alloc_end - alloc_start);
+   if (ret < 0)
return ret;
 
mutex_lock(&inode->i_mutex);
@@ -2586,6 +2628,13 @@ static long btrfs_fallocate(struct file *file, int mode,
if (ret)
goto out;
 
+   /*
+* TODO: Move these two operations after we have checked
+* accurate reserved space, or fallocate can still fail but
+* with page truncated or size expanded.
+*
+* But that's a minor problem and won't do much harm BTW.
+*/
if (alloc_start > inode->i_size) {
ret = btrfs_cont_expand(inode, i_size_read(inode),
alloc_start);
@@ -2644,10 +2693,10 @@ static long btrfs_fallocate(struct file *file, int mode,
}
}
 
+   /* First, check if we exceed the qgroup limit */
+   INIT_LIST_HEAD(&reserve_list);
cur_offset = alloc_start;
while (1) {
-   u64 actual_end;
-
em = btrfs_get_extent(inode, NULL, 0, cur_offset,
  alloc_end - cur_offset, 0);
if (IS_ERR_OR_NULL(em)) {
@@ -2660,54 +2709,78 @@ static long btrfs_fallocate(struct file *file, int mode,
last_byte = min(extent_map_end(em), alloc_end);
actual_end = min_t(u64, extent_map_end(em), offset + len);
last_byte = ALIGN(last_byte, blocksize);
-
if 

[PATCH v3 21/21] btrfs: qgroup: Check if qgroup reserved space leaked

2015-10-12 Thread Qu Wenruo
Add check at btrfs_destroy_inode() time to detect qgroup reserved space
leak.

Signed-off-by: Qu Wenruo 
---
v3:
  Separate from old btrfs_qgroup_free_data_rsv_map().
---
 fs/btrfs/inode.c  |  1 +
 fs/btrfs/qgroup.c | 32 
 fs/btrfs/qgroup.h |  1 +
 3 files changed, 34 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 39c9191..15d6ee0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9054,6 +9054,7 @@ void btrfs_destroy_inode(struct inode *inode)
btrfs_put_ordered_extent(ordered);
}
}
+   btrfs_qgroup_check_reserved_leak(inode);
inode_tree_del(inode);
btrfs_drop_extent_cache(inode, 0, (u64)-1, 0);
 free:
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index b3b485a..f452c85 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2640,3 +2640,35 @@ void btrfs_qgroup_free_meta(struct btrfs_root *root, int 
num_bytes)
atomic_sub(num_bytes, &root->qgroup_meta_rsv);
qgroup_free(root, num_bytes);
 }
+
+/*
+ * Check qgroup reserved space leaking, normally at destroy inode
+ * time
+ */
+void btrfs_qgroup_check_reserved_leak(struct inode *inode)
+{
+   struct extent_changeset changeset;
+   struct ulist_node *unode;
+   struct ulist_iterator iter;
+   int ret;
+
+   changeset.bytes_changed = 0;
+   changeset.range_changed = ulist_alloc(GFP_NOFS);
+   if (WARN_ON(!changeset.range_changed))
+   return;
+
+   ret = clear_record_extent_bits(&BTRFS_I(inode)->io_tree, 0, (u64)-1,
+   EXTENT_QGROUP_RESERVED, GFP_NOFS, &changeset);
+
+   WARN_ON(ret < 0);
+   if (WARN_ON(changeset.bytes_changed)) {
+   ULIST_ITER_INIT(&iter);
+   while ((unode = ulist_next(changeset.range_changed, &iter))) {
+   btrfs_warn(BTRFS_I(inode)->root->fs_info,
+   "leaking qgroup reserved space, ino: %lu, start: %llu, end: %llu",
+   inode->i_ino, unode->val, unode->aux);
+   }
+   qgroup_free(BTRFS_I(inode)->root, changeset.bytes_changed);
+   }
+   ulist_free(changeset.range_changed);
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 686b60f..ecb2c14 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -105,4 +105,5 @@ int btrfs_qgroup_free_data(struct inode *inode, u64 start, 
u64 len);
 int btrfs_qgroup_reserve_meta(struct btrfs_root *root, int num_bytes);
 void btrfs_qgroup_free_meta_all(struct btrfs_root *root);
 void btrfs_qgroup_free_meta(struct btrfs_root *root, int num_bytes);
+void btrfs_qgroup_check_reserved_leak(struct inode *inode);
 #endif /* __BTRFS_QGROUP__ */
-- 
2.6.1



[PATCH v3 17/21] btrfs: qgroup: Add new trace point for qgroup data reserve

2015-10-12 Thread Qu Wenruo
Now each qgroup data reserve operation has its own ftrace event for
better debugging.

Signed-off-by: Qu Wenruo 
---
v2:
  Newly introduced
v3:
  None
---
 fs/btrfs/qgroup.c|  11 -
 fs/btrfs/qgroup.h|   8 +++
 include/trace/events/btrfs.h | 113 +++
 3 files changed, 130 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index 646a867..b3b485a 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2511,10 +2511,12 @@ int btrfs_qgroup_reserve_data(struct inode *inode, u64 
start, u64 len)
 
changeset.bytes_changed = 0;
changeset.range_changed = ulist_alloc(GFP_NOFS);
-
ret = set_record_extent_bits(&BTRFS_I(inode)->io_tree, start,
start + len -1, EXTENT_QGROUP_RESERVED, GFP_NOFS,
&changeset);
+   trace_btrfs_qgroup_reserve_data(inode, start, len,
+   changeset.bytes_changed,
+   QGROUP_RESERVE);
if (ret < 0)
goto cleanup;
ret = qgroup_reserve(root, changeset.bytes_changed);
@@ -2539,6 +2541,7 @@ static int __btrfs_qgroup_release_data(struct inode 
*inode, u64 start, u64 len,
   int free)
 {
struct extent_changeset changeset;
+   int trace_op = QGROUP_RELEASE;
int ret;
 
changeset.bytes_changed = 0;
@@ -2552,8 +2555,12 @@ static int __btrfs_qgroup_release_data(struct inode 
*inode, u64 start, u64 len,
if (ret < 0)
goto out;
 
-   if (free)
+   if (free) {
qgroup_free(BTRFS_I(inode)->root, changeset.bytes_changed);
+   trace_op = QGROUP_FREE;
+   }
+   trace_btrfs_qgroup_release_data(inode, start, len,
+   changeset.bytes_changed, trace_op);
 out:
ulist_free(changeset.range_changed);
return ret;
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index adb03da..686b60f 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -33,6 +33,13 @@ struct btrfs_qgroup_extent_record {
struct ulist *old_roots;
 };
 
+/*
+ * For qgroup event trace points only
+ */
+#define QGROUP_RESERVE (1<<0)
+#define QGROUP_RELEASE (1<<1)
+#define QGROUP_FREE    (1<<2)
+
 int btrfs_quota_enable(struct btrfs_trans_handle *trans,
   struct btrfs_fs_info *fs_info);
 int btrfs_quota_disable(struct btrfs_trans_handle *trans,
@@ -81,6 +88,7 @@ static inline void btrfs_qgroup_free_delayed_ref(struct 
btrfs_fs_info *fs_info,
 u64 ref_root, u64 num_bytes)
 {
btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
+   trace_btrfs_qgroup_free_delayed_ref(ref_root, num_bytes);
 }
 void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h
index 0b73af9..b4473da 100644
--- a/include/trace/events/btrfs.h
+++ b/include/trace/events/btrfs.h
@@ -1117,6 +1117,119 @@ DEFINE_EVENT(btrfs__workqueue_done, 
btrfs_workqueue_destroy,
TP_ARGS(wq)
 );
 
+DECLARE_EVENT_CLASS(btrfs__qgroup_data_map,
+
+   TP_PROTO(struct inode *inode, u64 free_reserved),
+
+   TP_ARGS(inode, free_reserved),
+
+   TP_STRUCT__entry(
+   __field(u64,            rootid          )
+   __field(unsigned long,  ino             )
+   __field(u64,            free_reserved   )
+   ),
+
+   TP_fast_assign(
+   __entry->rootid         =   BTRFS_I(inode)->root->objectid;
+   __entry->ino            =   inode->i_ino;
+   __entry->free_reserved  =   free_reserved;
+   ),
+
+   TP_printk("rootid=%llu, ino=%lu, free_reserved=%llu",
+ __entry->rootid, __entry->ino, __entry->free_reserved)
+);
+
+DEFINE_EVENT(btrfs__qgroup_data_map, btrfs_qgroup_init_data_rsv_map,
+
+   TP_PROTO(struct inode *inode, u64 free_reserved),
+
+   TP_ARGS(inode, free_reserved)
+);
+
+DEFINE_EVENT(btrfs__qgroup_data_map, btrfs_qgroup_free_data_rsv_map,
+
+   TP_PROTO(struct inode *inode, u64 free_reserved),
+
+   TP_ARGS(inode, free_reserved)
+);
+
+#define BTRFS_QGROUP_OPERATIONS             \
+   { QGROUP_RESERVE,   "reserve"   },      \
+   { QGROUP_RELEASE,   "release"   },      \
+   { QGROUP_FREE,      "free"      }
+
+DECLARE_EVENT_CLASS(btrfs__qgroup_rsv_data,
+
+   TP_PROTO(struct inode *inode, u64 start, u64 len, u64 reserved, int op),
+
+   TP_ARGS(inode, start, len, reserved, op),
+
+   TP_STRUCT__entry(
+   __field(u64,            rootid          )
+   __field(unsigned long,  ino             )
+   __field(u64,            start           )
+   

[PATCH v3 09/21] btrfs: qgroup: Use new metadata reservation.

2015-10-12 Thread Qu Wenruo
As we have the new metadata reservation functions, use them to replace
the old btrfs_qgroup_reserve() call for metadata.

Signed-off-by: Qu Wenruo 
---
v2:
  None
v3:
  None
---
 fs/btrfs/extent-tree.c | 14 ++
 fs/btrfs/transaction.c | 34 ++
 fs/btrfs/transaction.h |  1 -
 3 files changed, 12 insertions(+), 37 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4f6758b..22702bd 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -5345,7 +5345,7 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
if (root->fs_info->quota_enabled) {
/* One for parent inode, two for dir entries */
num_bytes = 3 * root->nodesize;
-   ret = btrfs_qgroup_reserve(root, num_bytes);
+   ret = btrfs_qgroup_reserve_meta(root, num_bytes);
if (ret)
return ret;
} else {
@@ -5363,10 +5363,8 @@ int btrfs_subvolume_reserve_metadata(struct btrfs_root 
*root,
if (ret == -ENOSPC && use_global_rsv)
ret = btrfs_block_rsv_migrate(global_rsv, rsv, num_bytes);
 
-   if (ret) {
-   if (*qgroup_reserved)
-   btrfs_qgroup_free(root, *qgroup_reserved);
-   }
+   if (ret && *qgroup_reserved)
+   btrfs_qgroup_free_meta(root, *qgroup_reserved);
 
return ret;
 }
@@ -5527,15 +5525,15 @@ int btrfs_delalloc_reserve_metadata(struct inode 
*inode, u64 num_bytes)
spin_unlock(&BTRFS_I(inode)->lock);
 
if (root->fs_info->quota_enabled) {
-   ret = btrfs_qgroup_reserve(root, nr_extents * root->nodesize);
+   ret = btrfs_qgroup_reserve_meta(root,
+   nr_extents * root->nodesize);
if (ret)
goto out_fail;
}
 
ret = reserve_metadata_bytes(root, block_rsv, to_reserve, flush);
if (unlikely(ret)) {
-   if (root->fs_info->quota_enabled)
-   btrfs_qgroup_free(root, nr_extents * root->nodesize);
+   btrfs_qgroup_free_meta(root, nr_extents * root->nodesize);
goto out_fail;
}
 
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 376191c..5ed06b8 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -478,13 +478,10 @@ start_transaction(struct btrfs_root *root, u64 num_items, 
unsigned int type,
 * the appropriate flushing if need be.
 */
if (num_items > 0 && root != root->fs_info->chunk_root) {
-   if (root->fs_info->quota_enabled &&
-   is_fstree(root->root_key.objectid)) {
-   qgroup_reserved = num_items * root->nodesize;
-   ret = btrfs_qgroup_reserve(root, qgroup_reserved);
-   if (ret)
-   return ERR_PTR(ret);
-   }
+   qgroup_reserved = num_items * root->nodesize;
+   ret = btrfs_qgroup_reserve_meta(root, qgroup_reserved);
+   if (ret)
+   return ERR_PTR(ret);
 
num_bytes = btrfs_calc_trans_metadata_size(root, num_items);
/*
@@ -553,7 +550,6 @@ again:
h->block_rsv = NULL;
h->orig_rsv = NULL;
h->aborted = 0;
-   h->qgroup_reserved = 0;
h->delayed_ref_elem.seq = 0;
h->type = type;
h->allocating_chunk = false;
@@ -579,7 +575,6 @@ again:
h->bytes_reserved = num_bytes;
h->reloc_reserved = reloc_reserved;
}
-   h->qgroup_reserved = qgroup_reserved;
 
 got_it:
btrfs_record_root_in_trans(h, root);
@@ -597,8 +592,7 @@ alloc_fail:
btrfs_block_rsv_release(root, &root->fs_info->trans_block_rsv,
num_bytes);
 reserve_fail:
-   if (qgroup_reserved)
-   btrfs_qgroup_free(root, qgroup_reserved);
+   btrfs_qgroup_free_meta(root, qgroup_reserved);
return ERR_PTR(ret);
 }
 
@@ -815,15 +809,6 @@ static int __btrfs_end_transaction(struct 
btrfs_trans_handle *trans,
must_run_delayed_refs = 2;
}
 
-   if (trans->qgroup_reserved) {
-   /*
-* the same root has to be passed here between start_transaction
-* and end_transaction. Subvolume quota depends on this.
-*/
-   btrfs_qgroup_free(trans->root, trans->qgroup_reserved);
-   trans->qgroup_reserved = 0;
-   }
-
btrfs_trans_release_metadata(trans, root);
trans->block_rsv = NULL;
 
@@ -1238,6 +1223,7 @@ static noinline int commit_fs_roots(struct 
btrfs_trans_handle *trans,
spin_lock(&fs_info->fs_roots_radix_lock);
if (err)
break;
+   

[PATCH v3 11/21] btrfs: extent-tree: Switch to new check_data_free_space and free_reserved_data_space

2015-10-12 Thread Qu Wenruo
Use new reserve/free for buffered write and inode cache.

For buffered write case, as nodatacow write won't increase quota account,
so unlike old behavior which does reserve before check nocow, now we
check nocow first and then only reserve data if we can't do nocow write.

Signed-off-by: Qu Wenruo 
---
v2:
  Add a call to the new free function too. Otherwise we leak reserved
  space when data reservation succeeds but metadata reservation fails.
v3:
  None
---
 fs/btrfs/extent-tree.c |  4 ++--
 fs/btrfs/file.c| 34 +-
 fs/btrfs/relocation.c  |  8 
 3 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0cd6baa..f4b9db8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3356,7 +3356,7 @@ again:
num_pages *= 16;
num_pages *= PAGE_CACHE_SIZE;
 
-   ret = btrfs_check_data_free_space(inode, num_pages, num_pages);
+   ret = __btrfs_check_data_free_space(inode, 0, num_pages);
if (ret)
goto out_put;
 
@@ -3365,7 +3365,7 @@ again:
  &alloc_hint);
if (!ret)
dcs = BTRFS_DC_SETUP;
-   btrfs_free_reserved_data_space(inode, num_pages);
+   __btrfs_free_reserved_data_space(inode, 0, num_pages);
 
 out_put:
iput(inode);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..142b217 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1510,12 +1510,17 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
}
 
reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
-   ret = btrfs_check_data_free_space(inode, reserve_bytes, 
write_bytes);
-   if (ret == -ENOSPC &&
-   (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
- BTRFS_INODE_PREALLOC))) {
+
+   if (BTRFS_I(inode)->flags & (BTRFS_INODE_NODATACOW |
+BTRFS_INODE_PREALLOC)) {
ret = check_can_nocow(inode, pos, &write_bytes);
+   if (ret < 0)
+   break;
if (ret > 0) {
+   /*
+* For nodata cow case, no need to reserve
+* data space.
+*/
only_release_metadata = true;
/*
 * our prealloc extent may be smaller than
@@ -1524,20 +1529,19 @@ static noinline ssize_t __btrfs_buffered_write(struct 
file *file,
num_pages = DIV_ROUND_UP(write_bytes + offset,
 PAGE_CACHE_SIZE);
reserve_bytes = num_pages << PAGE_CACHE_SHIFT;
-   ret = 0;
-   } else {
-   ret = -ENOSPC;
+   goto reserve_metadata;
}
}
-
-   if (ret)
+   ret = __btrfs_check_data_free_space(inode, pos, write_bytes);
+   if (ret < 0)
break;
 
+reserve_metadata:
ret = btrfs_delalloc_reserve_metadata(inode, reserve_bytes);
if (ret) {
if (!only_release_metadata)
-   btrfs_free_reserved_data_space(inode,
-  reserve_bytes);
+   __btrfs_free_reserved_data_space(inode, pos,
+write_bytes);
else
btrfs_end_write_no_snapshoting(root);
break;
@@ -2569,8 +2573,11 @@ static long btrfs_fallocate(struct file *file, int mode,
/*
 * Make sure we have enough space before we do the
 * allocation.
+* XXX: The behavior must be changed to do accurate check first
+* and then check data reserved space.
 */
-   ret = btrfs_check_data_free_space(inode, alloc_end - alloc_start, 
alloc_end - alloc_start);
+   ret = btrfs_check_data_free_space(inode, alloc_start,
+ alloc_end - alloc_start);
if (ret)
return ret;
 
@@ -2703,7 +2710,8 @@ static long btrfs_fallocate(struct file *file, int mode,
 out:
mutex_unlock(&inode->i_mutex);
/* Let go of our reservation. */
-   btrfs_free_reserved_data_space(inode, alloc_end - alloc_start);
+   __btrfs_free_reserved_data_space(inode, alloc_start,
+alloc_end - alloc_start);
return ret;
 }
 
diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c

[PATCH v3 14/21] btrfs: qgroup: Cleanup old inaccurate facilities

2015-10-12 Thread Qu Wenruo
Cleanup the old facilities which use old btrfs_qgroup_reserve() function
call, replace them with the newer version, and remove the "__" prefix in
them.

Also, make btrfs_qgroup_reserve/free() functions private, as they are
now only used inside the qgroup code.

Now, the whole btrfs qgroup is switched to use the new reserve facilities.

Signed-off-by: Qu Wenruo 
---
v2:
  Apply newly introduced functions too.
v3:
  None
---
 fs/btrfs/ctree.h   |  12 ++
 fs/btrfs/extent-tree.c | 109 +
 fs/btrfs/file.c|  15 ---
 fs/btrfs/inode-map.c   |   6 +--
 fs/btrfs/inode.c   |  34 +++
 fs/btrfs/ioctl.c   |   6 +--
 fs/btrfs/qgroup.c  |  18 
 fs/btrfs/qgroup.h  |   8 
 fs/btrfs/relocation.c  |   8 ++--
 9 files changed, 60 insertions(+), 156 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4221bfd..f20b901 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3452,11 +3452,9 @@ enum btrfs_reserve_flush_enum {
BTRFS_RESERVE_FLUSH_ALL,
 };
 
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 write_bytes);
-int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
+int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
 int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
-void __btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
+void btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
@@ -3472,10 +3470,8 @@ void btrfs_subvolume_release_metadata(struct btrfs_root *root,
  u64 qgroup_reserved);
 int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes);
 void btrfs_delalloc_release_metadata(struct inode *inode, u64 num_bytes);
-int btrfs_delalloc_reserve_space(struct inode *inode, u64 num_bytes);
-int __btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
-void btrfs_delalloc_release_space(struct inode *inode, u64 num_bytes);
-void __btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
+int btrfs_delalloc_reserve_space(struct inode *inode, u64 start, u64 len);
+void btrfs_delalloc_release_space(struct inode *inode, u64 start, u64 len);
 void btrfs_init_block_rsv(struct btrfs_block_rsv *rsv, unsigned short type);
 struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root,
  unsigned short type);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 32455e0..1dadbba 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3356,7 +3356,7 @@ again:
num_pages *= 16;
num_pages *= PAGE_CACHE_SIZE;
 
-   ret = __btrfs_check_data_free_space(inode, 0, num_pages);
+   ret = btrfs_check_data_free_space(inode, 0, num_pages);
if (ret)
goto out_put;
 
@@ -3365,7 +3365,7 @@ again:
  &alloc_hint);
if (!ret)
dcs = BTRFS_DC_SETUP;
-   __btrfs_free_reserved_data_space(inode, 0, num_pages);
+   btrfs_free_reserved_data_space(inode, 0, num_pages);
 
 out_put:
iput(inode);
@@ -4038,27 +4038,11 @@ commit_trans:
 }
 
 /*
- * This will check the space that the inode allocates from to make sure we have
- * enough space for bytes.
- */
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 write_bytes)
-{
-   struct btrfs_root *root = BTRFS_I(inode)->root;
-   int ret;
-
-   ret = btrfs_alloc_data_chunk_ondemand(inode, bytes);
-   if (ret < 0)
-   return ret;
-   ret = btrfs_qgroup_reserve(root, write_bytes);
-   return ret;
-}
-
-/*
  * New check_data_free_space() with ability for precious data reservation
  * Will replace old btrfs_check_data_free_space(), but for patch split,
  * add a new function first and then replace it.
  */
-int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
+int btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
int ret;
@@ -4078,33 +4062,13 @@ int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
 }
 
 /*
- * Called if we need to clear a data reservation for this inode.
- */
-void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
-{
-   struct btrfs_root *root = BTRFS_I(inode)->root;
-   struct btrfs_space_info *data_sinfo;
-
-   /* make sure bytes are sectorsize aligned */
-   bytes = ALIGN(bytes, root->sectorsize);
-
-   data_sinfo = root->fs_info->data_sinfo;
-   

[PATCH v3 10/21] btrfs: extent-tree: Add new version of btrfs_check_data_free_space and btrfs_free_reserved_data_space.

2015-10-12 Thread Qu Wenruo
Add new functions __btrfs_check_data_free_space() and
__btrfs_free_reserved_data_space() to work with new accurate qgroup
reserved space framework.

The new functions will replace the old btrfs_check_data_free_space() and
btrfs_free_reserved_data_space() respectively, but until all the changes
are done, let's just use the new names.

Also, export the internal function btrfs_alloc_data_chunk_ondemand(). Now
that qgroup reserve requires precise byte counts, some operations (like
fallocate) can't get the accurate number in advance.
But the data space info check and data chunk allocation don't need to be
that accurate, and can be called at the beginning.

So export it for later operations.
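
For readers following along, the range alignment that the new __btrfs_check_data_free_space() performs with round_down()/round_up() can be sketched in userspace. This is only an illustration and not part of the patch: struct aligned_range and align_range() are hypothetical names, and the sector size is passed in rather than read from root->sectorsize.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for an aligned byte range; not a kernel type. */
struct aligned_range {
	uint64_t start;
	uint64_t len;
};

/*
 * Expand [start, start + len) outward to sector boundaries, using the
 * same arithmetic as round_down(start) and round_up(start + len).
 * sectorsize must be a non-zero power of two.
 */
struct aligned_range align_range(uint64_t start, uint64_t len,
				 uint64_t sectorsize)
{
	struct aligned_range r;
	uint64_t mask = sectorsize - 1;
	uint64_t end;

	assert(sectorsize != 0 && (sectorsize & mask) == 0);

	end = (start + len + mask) & ~mask;	/* round_up(start + len) */
	r.start = start & ~mask;		/* round_down(start) */
	r.len = end - r.start;
	return r;
}
```

With a 4096-byte sector, a 200-byte range starting at offset 4000 straddles a sector boundary, so the aligned range covers two full sectors starting at offset 0.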

Signed-off-by: Qu Wenruo 
---
v2:
  Fix comment typo
  Add __btrfs_free_reserved_data_space() function, or we will leak
  reserved space in the EDQUOT error handling path.
v3:
  None
---
 fs/btrfs/ctree.h   |  3 ++
 fs/btrfs/extent-tree.c | 85 --
 2 files changed, 79 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ae86025..19450a1 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3453,7 +3453,10 @@ enum btrfs_reserve_flush_enum {
 };
 
 int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 write_bytes);
+int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len);
+int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes);
 void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes);
+void __btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len);
 void btrfs_trans_release_metadata(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
 void btrfs_trans_release_chunk_metadata(struct btrfs_trans_handle *trans);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 22702bd..0cd6baa 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3908,11 +3908,7 @@ u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data)
return ret;
 }
 
-/*
- * This will check the space that the inode allocates from to make sure we have
- * enough space for bytes.
- */
-int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 write_bytes)
+int btrfs_alloc_data_chunk_ondemand(struct inode *inode, u64 bytes)
 {
struct btrfs_space_info *data_sinfo;
struct btrfs_root *root = BTRFS_I(inode)->root;
@@ -4033,19 +4029,55 @@ commit_trans:
  data_sinfo->flags, bytes, 1);
return -ENOSPC;
}
-   ret = btrfs_qgroup_reserve(root, write_bytes);
-   if (ret)
-   goto out;
data_sinfo->bytes_may_use += bytes;
trace_btrfs_space_reservation(root->fs_info, "space_info",
  data_sinfo->flags, bytes, 1);
-out:
spin_unlock(&data_sinfo->lock);
 
return ret;
 }
 
 /*
+ * This will check the space that the inode allocates from to make sure we have
+ * enough space for bytes.
+ */
+int btrfs_check_data_free_space(struct inode *inode, u64 bytes, u64 write_bytes)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   int ret;
+
+   ret = btrfs_alloc_data_chunk_ondemand(inode, bytes);
+   if (ret < 0)
+   return ret;
+   ret = btrfs_qgroup_reserve(root, write_bytes);
+   return ret;
+}
+
+/*
+ * New check_data_free_space() with ability for precious data reservation
+ * Will replace old btrfs_check_data_free_space(), but for patch split,
+ * add a new function first and then replace it.
+ */
+int __btrfs_check_data_free_space(struct inode *inode, u64 start, u64 len)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   int ret;
+
+   /* align the range */
+   len = round_up(start + len, root->sectorsize) -
+ round_down(start, root->sectorsize);
+   start = round_down(start, root->sectorsize);
+
+   ret = btrfs_alloc_data_chunk_ondemand(inode, len);
+   if (ret < 0)
+   return ret;
+
+   /* Use new btrfs_qgroup_reserve_data to reserve precious data space */
+   ret = btrfs_qgroup_reserve_data(inode, start, len);
+   return ret;
+}
+
+/*
  * Called if we need to clear a data reservation for this inode.
  */
 void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
@@ -4065,6 +4097,41 @@ void btrfs_free_reserved_data_space(struct inode *inode, u64 bytes)
spin_unlock(&data_sinfo->lock);
 }
 
+/*
+ * Called if we need to clear a data reservation for this inode
+ * Normally in a error case.
+ *
+ * This one will handle the per-indoe data rsv map for accurate reserved
+ * space framework.
+ */
+void __btrfs_free_reserved_data_space(struct inode *inode, u64 start, u64 len)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct btrfs_space_info *data_sinfo;
+
+   /* Make sure the range is aligned to sectorsize */
+   len = 

[PATCH v3 03/21] btrfs: extent_io: Introduce new function clear_record_extent_bits()

2015-10-12 Thread Qu Wenruo
Introduce new function clear_record_extent_bits(), which will clear bits
for given range and record the details about which ranges are cleared
and how many bytes in total it changes.

This provides the basis for later qgroup reserve codes.
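
As a rough illustration of what "clear and record" means, here is a userspace sketch (not part of the patch). The toy_state and toy_changeset types and toy_clear_record() are hypothetical simplifications: the real code walks an rbtree of extent_state and stores ranges in a ulist, while this sketch uses flat arrays. It does mirror the early return the patch adds to add_extent_changeset(), so ranges that had none of the requested bits set are not counted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 8

/* Simplified stand-in for struct extent_state: one range plus its bits. */
struct toy_state {
	uint64_t start;
	uint64_t end;		/* inclusive, as in extent_io */
	unsigned state;
};

/* Simplified stand-in for struct extent_changeset (array, no ulist). */
struct toy_changeset {
	uint64_t bytes_changed;
	size_t nr_ranges;
	uint64_t range_start[MAX_RANGES];
	uint64_t range_end[MAX_RANGES];
};

/*
 * Clear 'bits' in every state and record only the states that had at
 * least one of those bits set before the clear.
 */
void toy_clear_record(struct toy_state *states, size_t nr, unsigned bits,
		      struct toy_changeset *cs)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		struct toy_state *s = &states[i];

		if ((s->state & bits) == 0)
			continue;	/* nothing to clear, nothing to record */

		assert(cs->nr_ranges < MAX_RANGES);
		cs->range_start[cs->nr_ranges] = s->start;
		cs->range_end[cs->nr_ranges] = s->end;
		cs->nr_ranges++;
		cs->bytes_changed += s->end - s->start + 1;
		s->state &= ~bits;
	}
}
```

Calling toy_clear_record() a second time with the same bits records nothing, which is the property the later qgroup code relies on to learn exactly how many bytes a clear operation changed.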

Signed-off-by: Qu Wenruo 
---
v3:
  Newly introduced
---
 fs/btrfs/extent_io.c | 50 +++---
 fs/btrfs/extent_io.h |  3 +++
 2 files changed, 42 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index f5efaa6..1c20f8be 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -141,6 +141,8 @@ static void add_extent_changeset(struct extent_state *state, unsigned bits,
return;
if (set && (state->state & bits) == bits)
return;
+   if (!set && (state->state & bits) == 0)
+   return;
changeset->bytes_changed += state->end - state->start + 1;
ret = ulist_add(changeset->range_changed, state->start, state->end,
GFP_ATOMIC);
@@ -529,7 +531,8 @@ static struct extent_state *next_state(struct extent_state *state)
  */
 static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
struct extent_state *state,
-   unsigned *bits, int wake)
+   unsigned *bits, int wake,
+   struct extent_changeset *changeset)
 {
struct extent_state *next;
unsigned bits_to_clear = *bits & ~EXTENT_CTLBITS;
@@ -540,6 +543,7 @@ static struct extent_state *clear_state_bit(struct extent_io_tree *tree,
tree->dirty_bytes -= range;
}
clear_state_cb(tree, state, bits);
+   add_extent_changeset(state, bits_to_clear, changeset, 0);
state->state &= ~bits_to_clear;
if (wake)
wake_up(&state->wq);
@@ -587,10 +591,10 @@ static void extent_io_tree_panic(struct extent_io_tree *tree, int err)
  *
  * This takes the tree lock, and returns 0 on success and < 0 on error.
  */
-int clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
-unsigned bits, int wake, int delete,
-struct extent_state **cached_state,
-gfp_t mask)
+static int __clear_extent_bit(struct extent_io_tree *tree, u64 start, u64 end,
+ unsigned bits, int wake, int delete,
+ struct extent_state **cached_state,
+ gfp_t mask, struct extent_changeset *changeset)
 {
struct extent_state *state;
struct extent_state *cached;
@@ -689,7 +693,8 @@ hit_next:
if (err)
goto out;
if (state->end <= end) {
-   state = clear_state_bit(tree, state, &bits, wake);
+   state = clear_state_bit(tree, state, &bits, wake,
+   changeset);
goto next;
}
goto search_again;
@@ -710,13 +715,13 @@ hit_next:
if (wake)
wake_up(&state->wq);
 
-   clear_state_bit(tree, prealloc, &bits, wake);
+   clear_state_bit(tree, prealloc, &bits, wake, changeset);
 
prealloc = NULL;
goto out;
}
 
-   state = clear_state_bit(tree, state, &bits, wake);
+   state = clear_state_bit(tree, state, &bits, wake, changeset);
 next:
if (last_end == (u64)-1)
goto out;
@@ -1151,7 +1156,7 @@ hit_next:
if (state->start == start && state->end <= end) {
set_state_bits(tree, state, &bits, NULL);
cache_state(state, cached_state);
-   state = clear_state_bit(tree, state, &clear_bits, 0);
+   state = clear_state_bit(tree, state, &clear_bits, 0, NULL);
if (last_end == (u64)-1)
goto out;
start = last_end + 1;
@@ -1192,7 +1197,8 @@ hit_next:
if (state->end <= end) {
set_state_bits(tree, state, &bits, NULL);
cache_state(state, cached_state);
-   state = clear_state_bit(tree, state, &clear_bits, 0);
+   state = clear_state_bit(tree, state, &clear_bits, 0,
+   NULL);
if (last_end == (u64)-1)
goto out;
start = last_end + 1;
@@ -1254,7 +1260,7 @@ hit_next:
 
set_state_bits(tree, prealloc, &bits, NULL);
cache_state(prealloc, cached_state);
-   clear_state_bit(tree, prealloc, &clear_bits, 0);
+   clear_state_bit(tree, prealloc, &clear_bits, 0, NULL);
prealloc = NULL;
goto out;
}
@@ -1309,6 +1315,14 @@ int set_record_extent_bits(struct 

[PATCH v3 06/21] btrfs: delayed_ref: Add new function to record reserved space into delayed ref

2015-10-12 Thread Qu Wenruo
Add a new function, btrfs_add_delayed_qgroup_reserve(), to record how
much space is reserved for an extent.

As btrfs only accounts qgroup at run_delayed_refs() time, a newly
allocated extent should keep its reserved space until then.

So add the needed function and related members to do it.
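
The bookkeeping described above can be sketched in userspace as follows. This is an illustration only, not the patch itself: toy_ref_head, toy_add_qgroup_reserve() and toy_run_ref_head() are hypothetical names, the head lookup is a linear scan instead of the rbtree search done by find_ref_head(), locking is omitted, and assert() stands in for the kernel's WARN_ON().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for btrfs_delayed_ref_head, keeping only the new members. */
struct toy_ref_head {
	uint64_t bytenr;
	uint64_t qgroup_ref_root;
	uint64_t qgroup_reserved;
};

/* Record the reservation on the head for 'bytenr'; -ENOENT if no head. */
int toy_add_qgroup_reserve(struct toy_ref_head *heads, size_t nr,
			   uint64_t ref_root, uint64_t bytenr,
			   uint64_t num_bytes)
{
	size_t i;

	for (i = 0; i < nr; i++) {
		if (heads[i].bytenr != bytenr)
			continue;
		/* A head is expected to carry at most one reservation. */
		assert(heads[i].qgroup_reserved == 0);
		heads[i].qgroup_ref_root = ref_root;
		heads[i].qgroup_reserved = num_bytes;
		return 0;
	}
	return -ENOENT;
}

/*
 * Model of run_delayed_refs() time: hand the recorded bytes back to the
 * caller (who would free them from the qgroup) and reset the head.
 */
uint64_t toy_run_ref_head(struct toy_ref_head *head)
{
	uint64_t freed = head->qgroup_reserved;

	head->qgroup_reserved = 0;
	head->qgroup_ref_root = 0;
	return freed;
}
```

The point of parking the byte count on the head is that the reservation survives from extent allocation until the delayed ref is actually processed.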

Signed-off-by: Qu Wenruo 
---
v2:
  None
v3:
  None
---
 fs/btrfs/delayed-ref.c | 29 +
 fs/btrfs/delayed-ref.h | 14 ++
 2 files changed, 43 insertions(+)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index ac3e81d..bd9b63b 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -476,6 +476,8 @@ add_delayed_ref_head(struct btrfs_fs_info *fs_info,
INIT_LIST_HEAD(&head_ref->ref_list);
head_ref->processing = 0;
head_ref->total_ref_mod = count_mod;
+   head_ref->qgroup_reserved = 0;
+   head_ref->qgroup_ref_root = 0;
 
/* Record qgroup extent info if provided */
if (qrecord) {
@@ -746,6 +748,33 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
return 0;
 }
 
+int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
+struct btrfs_trans_handle *trans,
+u64 ref_root, u64 bytenr, u64 num_bytes)
+{
+   struct btrfs_delayed_ref_root *delayed_refs;
+   struct btrfs_delayed_ref_head *ref_head;
+   int ret = 0;
+
+   if (!fs_info->quota_enabled || !is_fstree(ref_root))
+   return 0;
+
+   delayed_refs = &trans->transaction->delayed_refs;
+
+   spin_lock(&delayed_refs->lock);
+   ref_head = find_ref_head(&delayed_refs->href_root, bytenr, 0);
+   if (!ref_head) {
+   ret = -ENOENT;
+   goto out;
+   }
+   WARN_ON(ref_head->qgroup_reserved || ref_head->qgroup_ref_root);
+   ref_head->qgroup_ref_root = ref_root;
+   ref_head->qgroup_reserved = num_bytes;
+out:
+   spin_unlock(&delayed_refs->lock);
+   return ret;
+}
+
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index 13fb5e6..d4c41e2 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -113,6 +113,17 @@ struct btrfs_delayed_ref_head {
int total_ref_mod;
 
/*
+* For qgroup reserved space freeing.
+*
+* ref_root and reserved will be recorded after
+* BTRFS_ADD_DELAYED_EXTENT is called.
+* And will be used to free reserved qgroup space at
+* run_delayed_refs() time.
+*/
+   u64 qgroup_ref_root;
+   u64 qgroup_reserved;
+
+   /*
 * when a new extent is allocated, it is just reserved in memory
 * The actual extent isn't inserted into the extent allocation tree
 * until the delayed ref is processed.  must_insert_reserved is
@@ -242,6 +253,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
   u64 owner, u64 offset, int action,
   struct btrfs_delayed_extent_op *extent_op,
   int no_quota);
+int btrfs_add_delayed_qgroup_reserve(struct btrfs_fs_info *fs_info,
+struct btrfs_trans_handle *trans,
+u64 ref_root, u64 bytenr, u64 num_bytes);
 int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info,
struct btrfs_trans_handle *trans,
u64 bytenr, u64 num_bytes,
-- 
2.6.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v3 01/21] btrfs: extent_io: Introduce needed structure for recoding set/clear bits

2015-10-12 Thread Qu Wenruo
Add a new structure, extent_changeset, to record how many bytes are
changed in one set/clear_extent_bits() operation, along with the detailed
changed ranges.

This provides the needed facilities for the later qgroup reserve framework.

Signed-off-by: Qu Wenruo 
---
v3:
  Newly introduced, to reuse existing extent_io facilities.
---
 fs/btrfs/extent_io.h | 12 
 1 file changed, 12 insertions(+)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index c668f36..3107a6e 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -2,6 +2,7 @@
 #define __EXTENTIO__
 
 #include 
+#include "ulist.h"
 
 /* bits for the extent state */
 #define EXTENT_DIRTY   (1U << 0)
@@ -161,6 +162,17 @@ struct extent_buffer {
 #endif
 };
 
+/*
+ * Structure to record how many bytes and which ranges are set/cleared
+ */
+struct extent_changeset {
+   /* How many bytes are set/cleared in this operation */
+   u64 bytes_changed;
+
+   /* Changed ranges */
+   struct ulist *range_changed;
+};
+
 static inline void extent_set_compress_type(unsigned long *bio_flags,
int compress_type)
 {
-- 
2.6.1



[PATCH v3 07/21] btrfs: delayed_ref: release and free qgroup reserved at proper timing

2015-10-12 Thread Qu Wenruo
Qgroup reserved space needs to be released from the inode dirty map and
freed at different times:

1) Release when the metadata is written into the tree
After the corresponding metadata is written into the tree, any newer write
will be COWed (not including the NOCOW case yet).
So we must release its range from the inode dirty range map, or we will
forget to reserve the needed range, causing accounting to exceed the limit.

2) Free reserved bytes when the delayed ref is run
When delayed refs are run, qgroup accounting will follow soon and turn
the reserved bytes into rfer/excl numbers.
As run_delayed_refs and qgroup accounting are both done at
commit_transaction() time, it is safe to free the reserved space at
run_delayed_refs() time.

With this timing for releasing/freeing reserved space, we should be able to
resolve the long-standing qgroup reserved space leak problem.
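
The two steps above can be modelled with a few counters. The sketch below is a userspace illustration under simplifying assumptions, not kernel code: toy_qgroup_state and the toy_*() helpers are hypothetical, ranges are reduced to byte counts, and "release" is modelled as moving bytes from the inode map to the delayed ref without touching the qgroup's reserved total.

```c
#include <assert.h>
#include <stdint.h>

/* Toy counters; none of these are kernel structures. */
struct toy_qgroup_state {
	uint64_t reserved;	/* bytes the qgroup still counts as reserved */
	uint64_t inode_dirty;	/* bytes tracked in the inode dirty range map */
	uint64_t delayed_ref;	/* bytes parked on delayed ref heads */
	uint64_t rfer;		/* bytes already accounted as referenced */
};

/* Write path: reserve and remember the range in the inode map. */
void toy_reserve(struct toy_qgroup_state *s, uint64_t bytes)
{
	s->reserved += bytes;
	s->inode_dirty += bytes;
}

/*
 * Step 1 (release): the metadata has been written. Drop the range from
 * the inode map but keep the reservation, now owned by the delayed ref.
 */
int toy_release(struct toy_qgroup_state *s, uint64_t bytes)
{
	if (bytes > s->inode_dirty)
		return -1;
	s->inode_dirty -= bytes;
	s->delayed_ref += bytes;
	return 0;
}

/* Step 2 (free): the delayed ref is run and accounting takes over. */
int toy_free(struct toy_qgroup_state *s, uint64_t bytes)
{
	if (bytes > s->delayed_ref)
		return -1;
	assert(s->reserved >= bytes);
	s->delayed_ref -= bytes;
	s->reserved -= bytes;
	s->rfer += bytes;
	return 0;
}

/* No leak as long as every reserved byte has exactly one owner. */
int toy_balanced(const struct toy_qgroup_state *s)
{
	return s->reserved == s->inode_dirty + s->delayed_ref;
}
```

In this model a leak would show up as toy_balanced() returning 0, that is, reserved bytes that neither the inode map nor a delayed ref is still responsible for freeing.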

Signed-off-by: Qu Wenruo 
---
v2:
  Use a better wrapper function for delayed_ref reserved space release,
  as a direct call to btrfs_qgroup_free_ref() would make it hard to add a
  trace event.
v3:
  None
---
 fs/btrfs/extent-tree.c |  5 +
 fs/btrfs/inode.c   | 10 ++
 fs/btrfs/qgroup.c  |  5 ++---
 fs/btrfs/qgroup.h  | 18 +-
 4 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 601d7d4..4f6758b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2345,6 +2345,11 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
  node->num_bytes);
}
}
+
+   /* Also free its reserved qgroup space */
+   btrfs_qgroup_free_delayed_ref(root->fs_info,
+ head->qgroup_ref_root,
+ head->qgroup_reserved);
return ret;
}
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index b7e439b..f5c2ffe 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2112,6 +2112,16 @@ static int insert_reserved_file_extent(struct btrfs_trans_handle *trans,
ret = btrfs_alloc_reserved_file_extent(trans, root,
root->root_key.objectid,
btrfs_ino(inode), file_pos, &ins);
+   if (ret < 0)
+   goto out;
+   /*
+* Release the reserved range from inode dirty range map, and
+* move it to delayed ref codes, as now accounting only happens at
+* commit_transaction() time.
+*/
+   btrfs_qgroup_release_data(inode, file_pos, ram_bytes);
+   ret = btrfs_add_delayed_qgroup_reserve(root->fs_info, trans,
+   root->objectid, disk_bytenr, ram_bytes);
 out:
btrfs_free_path(path);
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index a86d9c6..a2678f6 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2111,14 +2111,13 @@ out:
return ret;
 }
 
-void btrfs_qgroup_free(struct btrfs_root *root, u64 num_bytes)
+void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
+  u64 ref_root, u64 num_bytes)
 {
struct btrfs_root *quota_root;
struct btrfs_qgroup *qgroup;
-   struct btrfs_fs_info *fs_info = root->fs_info;
struct ulist_node *unode;
struct ulist_iterator uiter;
-   u64 ref_root = root->root_key.objectid;
int ret = 0;
 
if (!is_fstree(ref_root))
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 564eb21..80924ae 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -72,7 +72,23 @@ int btrfs_qgroup_inherit(struct btrfs_trans_handle *trans,
 struct btrfs_fs_info *fs_info, u64 srcid, u64 objectid,
 struct btrfs_qgroup_inherit *inherit);
 int btrfs_qgroup_reserve(struct btrfs_root *root, u64 num_bytes);
-void btrfs_qgroup_free(struct btrfs_root *root, u64 num_bytes);
+void btrfs_qgroup_free_refroot(struct btrfs_fs_info *fs_info,
+  u64 ref_root, u64 num_bytes);
+static inline void btrfs_qgroup_free(struct btrfs_root *root, u64 num_bytes)
+{
+   return btrfs_qgroup_free_refroot(root->fs_info, root->objectid,
+num_bytes);
+}
+
+/*
+ * TODO: Add proper trace point for it, as btrfs_qgroup_free() is
+ * called by everywhere, can't provide good trace for delayed ref case.
+ */
+static inline void btrfs_qgroup_free_delayed_ref(struct btrfs_fs_info *fs_info,
+u64 ref_root, u64 num_bytes)
+{
+   btrfs_qgroup_free_refroot(fs_info, ref_root, num_bytes);
+}
 
 void assert_qgroups_uptodate(struct btrfs_trans_handle *trans);
 
-- 
2.6.1


[PATCH v3 04/21] btrfs: qgroup: Introduce btrfs_qgroup_reserve_data function

2015-10-12 Thread Qu Wenruo
Introduce a new function, btrfs_qgroup_reserve_data(), which uses the
io_tree to reserve qgroup space accurately and avoid leaking reserved space.
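
The key property is that reserving an already reserved range charges nothing extra. A minimal userspace sketch of that idea follows; it is not the patch's implementation. The toy_inode type and toy_reserve_data() are hypothetical, a per-sector flag array stands in for the EXTENT_QGROUP_RESERVED bit in the io_tree, and the error path (undoing the marks when the qgroup limit is hit) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SECTORSIZE 4096u
#define TOY_NR_SECTORS 64u

/* Toy per-inode map: one "reserved" flag per sector. */
struct toy_inode {
	bool reserved[TOY_NR_SECTORS];
	uint64_t qgroup_reserved;	/* bytes charged to the qgroup */
};

/*
 * Mark every sector touched by [start, start + len) as reserved and
 * charge only the sectors that were not marked before.
 * Returns the number of bytes newly charged.
 */
uint64_t toy_reserve_data(struct toy_inode *inode, uint64_t start,
			  uint64_t len)
{
	uint64_t first = start / TOY_SECTORSIZE;
	uint64_t last;
	uint64_t changed = 0;
	uint64_t i;

	if (len == 0)
		return 0;

	last = (start + len - 1) / TOY_SECTORSIZE;
	assert(last < TOY_NR_SECTORS);

	for (i = first; i <= last; i++) {
		if (inode->reserved[i])
			continue;	/* already reserved, do not charge twice */
		inode->reserved[i] = true;
		changed += TOY_SECTORSIZE;
	}
	inode->qgroup_reserved += changed;
	return changed;
}
```

Rewriting the same dirty range therefore leaves qgroup_reserved unchanged, which is how a range-based map avoids the double counting that a plain byte counter cannot detect.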

Signed-off-by: Qu Wenruo 
---
v2:
  Add needed parameter for later trace functions
v3:
  Use io_tree facilities instead of data_rsv_map facilities
---
 fs/btrfs/extent_io.h |  1 +
 fs/btrfs/qgroup.c| 49 +
 fs/btrfs/qgroup.h|  2 ++
 3 files changed, 52 insertions(+)

diff --git a/fs/btrfs/extent_io.h b/fs/btrfs/extent_io.h
index 51e1b71..f4c1ae1 100644
--- a/fs/btrfs/extent_io.h
+++ b/fs/btrfs/extent_io.h
@@ -19,6 +19,7 @@
 #define EXTENT_NEED_WAIT   (1U << 13)
 #define EXTENT_DAMAGED (1U << 14)
 #define EXTENT_NORESERVE   (1U << 15)
+#define EXTENT_QGROUP_RESERVED (1U << 16)
 #define EXTENT_IOBITS  (EXTENT_LOCKED | EXTENT_WRITEBACK)
 #define EXTENT_CTLBITS (EXTENT_DO_ACCOUNTING | EXTENT_FIRST_DELALLOC)
 
diff --git a/fs/btrfs/qgroup.c b/fs/btrfs/qgroup.c
index e9ace09..9ef8d73 100644
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@ -2481,3 +2481,52 @@ btrfs_qgroup_rescan_resume(struct btrfs_fs_info *fs_info)
btrfs_queue_work(fs_info->qgroup_rescan_workers,
 &fs_info->qgroup_rescan_work);
 }
+
+/*
+ * Reserve qgroup space for range [start, start + len).
+ *
+ * This function will either reserve space from related qgroups or doing
+ * nothing if the range is already reserved.
+ *
+ * Return 0 for successful reserve
+ * Return <0 for error (including -EQUOT)
+ *
+ * NOTE: this function may sleep for memory allocation.
+ */
+int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len)
+{
+   struct btrfs_root *root = BTRFS_I(inode)->root;
+   struct extent_changeset changeset;
+   struct ulist_node *unode;
+   struct ulist_iterator uiter;
+   int ret;
+
+   if (!root->fs_info->quota_enabled || !is_fstree(root->objectid) ||
+   len == 0)
+   return 0;
+
+   changeset.bytes_changed = 0;
+   changeset.range_changed = ulist_alloc(GFP_NOFS);
+
+   ret = set_record_extent_bits(&BTRFS_I(inode)->io_tree, start,
+   start + len -1, EXTENT_QGROUP_RESERVED, GFP_NOFS,
+   &changeset);
+   if (ret < 0)
+   goto cleanup;
+   ret = btrfs_qgroup_reserve(root, changeset.bytes_changed);
+   if (ret < 0)
+   goto cleanup;
+
+   ulist_free(changeset.range_changed);
+   return ret;
+
+cleanup:
+   /* cleanup already reserved ranges */
+   ULIST_ITER_INIT(&uiter);
+   while ((unode = ulist_next(changeset.range_changed, &uiter)))
+   clear_extent_bit(&BTRFS_I(inode)->io_tree, unode->val,
+unode->aux, EXTENT_QGROUP_RESERVED, 0, 0, NULL,
+GFP_NOFS);
+   ulist_free(changeset.range_changed);
+   return ret;
+}
diff --git a/fs/btrfs/qgroup.h b/fs/btrfs/qgroup.h
index 6387dcf..bd17cc2 100644
--- a/fs/btrfs/qgroup.h
+++ b/fs/btrfs/qgroup.h
@@ -81,4 +81,6 @@ int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid,
   u64 rfer, u64 excl);
 #endif
 
+/* New io_tree based accurate qgroup reserve API */
+int btrfs_qgroup_reserve_data(struct inode *inode, u64 start, u64 len);
 #endif /* __BTRFS_QGROUP__ */
-- 
2.6.1



Re: [PATCH v5 8/9] vfs: Add vfs_copy_file_range() support for pagecache copies

2015-10-12 Thread Trond Myklebust
On Mon, Oct 12, 2015 at 7:17 PM, Darrick J. Wong wrote:
> On Sun, Oct 11, 2015 at 07:22:03AM -0700, Christoph Hellwig wrote:
>> On Wed, Sep 30, 2015 at 01:26:52PM -0400, Anna Schumaker wrote:
>> > This allows us to have an in-kernel copy mechanism that avoids frequent
>> > switches between kernel and user space.  This is especially useful so
>> > NFSD can support server-side copies.
>> >
>> > I make pagecache copies configurable by adding three new (exclusive)
>> > flags:
>> > - COPY_FR_REFLINK tells vfs_copy_file_range() to only create a reflink.
>> > - COPY_FR_COPY does a full data copy, but may be filesystem accelerated.
>> > - COPY_FR_DEDUP creates a reflink, but only if the contents of both
>> >   ranges are identical.
>>
>> All but FR_COPY really should be a separate system call.  Clones (and
>> dedup as a special case of clones) are really a separate beast from file
>> copies.
>>
>> If I want to clone a file I either want it clone fully or fail, not copy
>> a certain amount.  That means that a) we need to return an error not
>> short "write", and b) locking implementations are important - we need to
>> prevent other applications from racing with our clone even if it is
>> large, while to get these semantics for the possible short returning
>> file copy will require a proper userland locking protocol. Last but not
>> least file copies need to be interruptible while clones should be not.
>> All this is already important for local file systems and even more
>> important for NFS exporting.
>>
>> So I'd suggest to drop this patch and just let your syscall handle
>> actual copies with all their horrors.  We can go with Peng's patches
>> to generalize the btrfs ioctls for clones for now which is what everyone
>> already uses anyway, and then add a separate sys_file_clone later.
>
> Hm.  Peng's patches only generalize the CLONE and CLONE_RANGE ioctls from
> btrfs, however they don't port over the (vastly different) EXTENT_SAME ioctl.
>
> What does everyone think about generalizing EXTENT_SAME?  The interface enables
> one to ask the kernel to dedupe multiple file ranges in a single call.  That's
> more complex than what I was proposing with COPY_FR_DEDUP(E), but I'm assuming
> that the extra complexity buys us the ability to ... multi-dedupe at the same
> time, with locks held on the source file?

How is this supposed to be implemented on something like NFS without
protocol changes?

Trond