[PATCH RESEND] btrfs: unlock i_mutex after attempting to delete subvolume during send

2015-04-10 Thread Omar Sandoval
Whenever the check for a send in progress introduced in commit
521e0546c970 ("btrfs: protect snapshots from deleting during send") is
hit, we return without unlocking inode->i_mutex. This is easy to see
with lockdep enabled:

[  +0.59] 
[  +0.28] [ BUG: lock held when returning to user space! ]
[  +0.29] 4.0.0-rc5-00096-g3c435c1 #93 Not tainted
[  +0.26] 
[  +0.29] btrfs/211 is leaving the kernel with locks still held!
[  +0.29] 1 lock held by btrfs/211:
[  +0.23]  #0:  (&type->i_mutex_dir_key){+.+.+.}, at: [8135b8df] btrfs_ioctl_snap_destroy+0x2df/0x7a0

Make sure we unlock it in the error path.

Reviewed-by: Filipe Manana fdman...@suse.com
Reviewed-by: David Sterba dste...@suse.cz
Cc: sta...@vger.kernel.org
Signed-off-by: Omar Sandoval osan...@osandov.com
---
Just resending this with Filipe's and David's Reviewed-bys and Cc-ing
stable.

 fs/btrfs/ioctl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 74609b9..9fde01f 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2403,7 +2403,7 @@ static noinline int btrfs_ioctl_snap_destroy(struct file *file,
 			"Attempt to delete subvolume %llu during send",
 			dest->root_key.objectid);
 		err = -EPERM;
-		goto out_dput;
+		goto out_unlock_inode;
 	}
 
 	d_invalidate(dentry);
@@ -2498,6 +2498,7 @@ out_up_write:
 				root_flags & ~BTRFS_ROOT_SUBVOL_DEAD);
 		spin_unlock(&dest->root_item_lock);
 	}
+out_unlock_inode:
 	mutex_unlock(&inode->i_mutex);
 	if (!err) {
 		shrink_dcache_sb(root->fs_info->sb);
-- 
2.3.5

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
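
For readers skimming the diff above: the bug class is an early error exit that bypasses the unlock. Below is a minimal userspace sketch of the corrected control flow, not the kernel code. The names fake_mutex, fake_lock(), fake_unlock() and snap_destroy_sketch() are made up for illustration, and the counter only stands in for inode->i_mutex.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for inode->i_mutex: just counts lock depth. */
struct fake_mutex {
	int held;
};

static void fake_lock(struct fake_mutex *m)   { m->held++; }
static void fake_unlock(struct fake_mutex *m) { m->held--; }

/*
 * Sketch of the fixed flow: every exit taken after fake_lock() passes
 * through the out_unlock label, so the lock depth is back to zero on
 * return whether or not a send is in progress.
 */
static int snap_destroy_sketch(struct fake_mutex *i_mutex, int send_in_progress)
{
	int err = 0;

	fake_lock(i_mutex);

	if (send_in_progress) {
		err = -EPERM;
		goto out_unlock;	/* the buggy flow skipped the unlock here */
	}

	/* ... the real work of deleting the subvolume would go here ... */

out_unlock:
	fake_unlock(i_mutex);
	return err;
}
```

Calling snap_destroy_sketch() with send_in_progress set returns -EPERM and leaves the counter at zero, which is the property lockdep was complaining about.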


[PATCH RFC 2/3] x86: add sys_copy_file_range to syscall tables

2015-04-10 Thread Zach Brown
Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown z...@redhat.com
---
 arch/x86/syscalls/syscall_32.tbl | 1 +
 arch/x86/syscalls/syscall_64.tbl | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index b3560ec..88d0025 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
 356	i386	memfd_create		sys_memfd_create
 357	i386	bpf			sys_bpf
 358	i386	execveat		sys_execveat			stub32_execveat
+359	i386	copy_file_range	sys_copy_file_range
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 8d656fb..81802c5 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
 320	common	kexec_file_load		sys_kexec_file_load
 321	common	bpf			sys_bpf
 322	64	execveat		stub_execveat
+323	common	copy_file_range	sys_copy_file_range
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
-- 
2.1.0



[PATCH RFC 0/3] simple copy offloading system call

2015-04-10 Thread Zach Brown
Hello everyone!

Here's my current attempt at the most basic system call interface for
offloading copying between files.  The system call and vfs function
are relatively light wrappers around the file_operation method that
does the heavy lifting.

There was interest at LSF in getting the basic infrastructure merged
before worrying about adding behavioural flags and more complicated
implementations.  This series only offers a refactoring of the btrfs
clone ioctl as an example of an implementation of the file
copy_file_range method.

I've added support for copy_file_range() to xfs_io in xfsprogs and
have the start of an xfstest that tests the system call.  I'll send
those to fstests@.

So how does this look?

Do we want to merge this and let the NFS and block XCOPY patches add
their changes when they're ready?

- z
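
For anyone who wants to poke at the interface from userspace, here is a rough sketch of a caller, not part of the series. It assumes a kernel and headers that provide SYS_copy_file_range with the six-argument signature from patch 1/3, and it invokes the call through syscall(2). The helpers try_copy_file_range() and copy_range_sketch() are hypothetical names; the pread()/pwrite() fallback is my own addition for kernels or file systems that refuse the call.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper; reports ENOSYS when the headers do not know the syscall. */
static long try_copy_file_range(int fd_in, int64_t *off_in,
				int fd_out, int64_t *off_out, size_t len)
{
#ifdef SYS_copy_file_range
	return syscall(SYS_copy_file_range, fd_in, off_in, fd_out, off_out,
		       len, 0U);
#else
	(void)fd_in; (void)off_in; (void)fd_out; (void)off_out; (void)len;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Copy up to len bytes from fd_in at off_in to fd_out at off_out.
 * Loops on partial success; if the offload call fails, the remainder is
 * copied with pread()/pwrite(). Returns bytes copied, or -1 on error.
 */
static ssize_t copy_range_sketch(int fd_in, off_t off_in,
				 int fd_out, off_t off_out, size_t len)
{
	size_t done = 0;
	char buf[4096];

	while (done < len) {
		int64_t in = (int64_t)off_in + (int64_t)done;
		int64_t out = (int64_t)off_out + (int64_t)done;
		long n = try_copy_file_range(fd_in, &in, fd_out, &out,
					     len - done);
		if (n < 0)
			break;			/* fall back below */
		if (n == 0)
			return (ssize_t)done;	/* end of input */
		done += (size_t)n;
	}

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = pread(fd_in, buf, want, off_in + (off_t)done);
		if (n < 0)
			return -1;
		if (n == 0)
			break;			/* end of input */
		if (pwrite(fd_out, buf, (size_t)n, off_out + (off_t)done) != n)
			return -1;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

The loop on partial success mirrors the comment in patch 1/3 that the call "specifically allows return partial success"; a caller that needs the whole range has to keep calling until it is done.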



[PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-10 Thread Zach Brown
Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data.  There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file).  This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown z...@redhat.com
---
 fs/read_write.c   | 129 ++
 include/linux/fs.h|   3 +
 include/uapi/asm-generic/unistd.h |   4 +-
 kernel/sys_ni.c   |   1 +
 4 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 8e1b687..c65ce1d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -17,6 +17,7 @@
 #include <linux/pagemap.h>
 #include <linux/splice.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include "internal.h"
 
 #include <asm/uaccess.h>
@@ -1424,3 +1425,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
 	return do_sendfile(out_fd, in_fd, NULL, count, 0);
 }
 #endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success.  When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			    struct file *file_out, loff_t pos_out,
+			    size_t len, int flags)
+{
+	struct inode *inode_in;
+	struct inode *inode_out;
+	ssize_t ret;
+
+	if (flags)
+		return -EINVAL;
+
+	if (len == 0)
+		return 0;
+
+	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */
+	ret = rw_verify_area(READ, file_in, &pos_in, len);
+	if (ret >= 0)
+		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+	if (ret < 0)
+		return ret;
+
+	if (!(file_in->f_mode & FMODE_READ) ||
+	    !(file_out->f_mode & FMODE_WRITE) ||
+	    (file_out->f_flags & O_APPEND) ||
+	    !file_in->f_op || !file_in->f_op->copy_file_range)
+		return -EINVAL;
+
+	inode_in = file_inode(file_in);
+	inode_out = file_inode(file_out);
+
+	/* make sure offsets don't wrap and the input is inside i_size */
+	if (pos_in + len < pos_in || pos_out + len < pos_out ||
+	    pos_in + len > i_size_read(inode_in))
+		return -EINVAL;
+
+	/* this could be relaxed once a method supports cross-fs copies */
+	if (inode_in->i_sb != inode_out->i_sb ||
+	    file_in->f_path.mnt != file_out->f_path.mnt)
+		return -EXDEV;
+
+	/* forbid ranges in the same file */
+	if (inode_in == inode_out)
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file_out);
+	if (ret)
+		return ret;
+
+	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+					     len, flags);
+	if (ret > 0) {
+		fsnotify_access(file_in);
+		add_rchar(current, ret);
+		fsnotify_modify(file_out);
+		add_wchar(current, ret);
+	}
+	inc_syscr(current);
+	inc_syscw(current);
+
+	mnt_drop_write_file(file_out);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+		int, fd_out, loff_t __user *, off_out,
+		size_t, len, unsigned int, flags)
+{
+	loff_t pos_in;
+	loff_t pos_out;
+	struct fd f_in;
+	struct fd f_out;
+	ssize_t ret;
+
+	f_in = fdget(fd_in);
+	f_out = fdget(fd_out);
+	if (!f_in.file || !f_out.file) {
+		ret = -EBADF;
+		goto out;
+	}
+
+	ret = -EFAULT;
+	if (off_in) {
+		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_in = f_in.file->f_pos;
+	}
+
+	if (off_out) {
+		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+			goto out;
+	} else {
+		pos_out = f_out.file->f_pos;
+	}
+
+	ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+				  flags);
+	if (ret > 0) {
+		pos_in += ret;
+		pos_out += ret;
+
+  

[PATCH RFC 3/3] btrfs: add .copy_file_range file operation

2015-04-10 Thread Zach Brown
This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function.  It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown z...@redhat.com
---
 fs/btrfs/ctree.h |  3 ++
 fs/btrfs/file.c  |  1 +
 fs/btrfs/ioctl.c | 91 
 3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index f9c89ca..f7cfa26 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3958,6 +3958,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 		      loff_t pos, size_t write_bytes,
 		      struct extent_state **cached);
 int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, int flags);
 
 /* tree-defrag.c */
 int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 30982bb..49989899 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2820,6 +2820,7 @@ const struct file_operations btrfs_file_operations = {
 #ifdef CONFIG_COMPAT
 	.compat_ioctl	= btrfs_ioctl,
 #endif
+	.copy_file_range = btrfs_copy_file_range,
 };
 
 void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 74609b9..0eb008e 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3537,17 +3537,16 @@ out:
 	return ret;
 }
 
-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
-				       u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+					u64 off, u64 olen, u64 destoff)
 {
 	struct inode *inode = file_inode(file);
+	struct inode *src = file_inode(file_src);
 	struct btrfs_root *root = BTRFS_I(inode)->root;
-	struct fd src_file;
-	struct inode *src;
 	int ret;
 	u64 len = olen;
 	u64 bs = root->fs_info->sb->s_blocksize;
-	int same_inode = 0;
+	int same_inode = src == inode;
 
 	/*
 	 * TODO:
@@ -3560,49 +3559,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
 	 *   be either compressed or non-compressed.
 	 */
 
-	/* the destination must be opened for writing */
-	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
-		return -EINVAL;
-
 	if (btrfs_root_readonly(root))
 		return -EROFS;
 
-	ret = mnt_want_write_file(file);
-	if (ret)
-		return ret;
-
-	src_file = fdget(srcfd);
-	if (!src_file.file) {
-		ret = -EBADF;
-		goto out_drop_write;
-	}
-
-	ret = -EXDEV;
-	if (src_file.file->f_path.mnt != file->f_path.mnt)
-		goto out_fput;
-
-	src = file_inode(src_file.file);
-
-	ret = -EINVAL;
-	if (src == inode)
-		same_inode = 1;
-
-	/* the src must be open for reading */
-	if (!(src_file.file->f_mode & FMODE_READ))
-		goto out_fput;
+	if (file_src->f_path.mnt != file->f_path.mnt ||
+	    src->i_sb != inode->i_sb)
+		return -EXDEV;
 
 	/* don't make the dst file partly checksummed */
 	if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
 	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
-		goto out_fput;
+		return -EINVAL;
 
-	ret = -EISDIR;
 	if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
-		goto out_fput;
-
-	ret = -EXDEV;
-	if (src->i_sb != inode->i_sb)
-		goto out_fput;
+		return -EISDIR;
 
 	if (!same_inode) {
 		if (inode < src) {
@@ -3690,6 +3660,49 @@ out_unlock:
 	} else {
 		mutex_unlock(&src->i_mutex);
 	}
+	return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+			      struct file *file_out, loff_t pos_out,
+			      size_t len, int flags)
+{
+	ssize_t ret;
+
+	ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+	if (ret == 0)
+		ret = len;
+	return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+				       u64 off, u64 olen, u64 destoff)
+{
+	struct fd src_file;
+	int ret;
+
+	/* the destination must be opened for writing */
+	if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+		return -EINVAL;
+
+	ret = mnt_want_write_file(file);
+   if 

btrfs balance {,meta}data to raid5 not working?

2015-04-10 Thread Piotr Szymaniak
Hi,

I tried today to balance two drive btrfs raid1 to two drive btrfs raid5
without luck:

~ # btrfs balance start -dconvert=raid5 -mconvert=raid5 /mnt/cdrom/
ERROR: error during balancing '/mnt/cdrom/' - Invalid argument
There may be more info in syslog - try dmesg | tail
~ # btrfs balance start -mconvert=raid5 /mnt/cdrom/
ERROR: error during balancing '/mnt/cdrom/' - Invalid argument
There may be more info in syslog - try dmesg | tail
~ # btrfs balance start -dconvert=raid5 /mnt/cdrom/
ERROR: error during balancing '/mnt/cdrom/' - Invalid argument
There may be more info in syslog - try dmesg | tail
~ # dmesg | tail -3
[57073.050249] BTRFS error (device sdd): unable to start balance with
target data profile 128
[57079.674386] BTRFS error (device sdd): unable to start balance with
target metadata profile 128
[57082.754136] BTRFS error (device sdd): unable to start balance with
target data profile 128

Linux 3.19.3
btrfs-progs v3.19.1
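
A note on the "profile 128" in the dmesg lines above: the number is a block group flag value. The sketch below decodes it, assuming the flag bit positions from the kernel's btrfs headers (RAID5 is bit 7). profile_name() is a made-up helper for illustration, not something btrfs-progs provides.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Block group profile bits, as laid out in the kernel's btrfs headers. */
#define BTRFS_BLOCK_GROUP_RAID0		(1ULL << 3)
#define BTRFS_BLOCK_GROUP_RAID1		(1ULL << 4)
#define BTRFS_BLOCK_GROUP_DUP		(1ULL << 5)
#define BTRFS_BLOCK_GROUP_RAID10	(1ULL << 6)
#define BTRFS_BLOCK_GROUP_RAID5		(1ULL << 7)
#define BTRFS_BLOCK_GROUP_RAID6		(1ULL << 8)

/* Hypothetical helper: map a single profile bit to a readable name. */
static const char *profile_name(uint64_t profile)
{
	switch (profile) {
	case BTRFS_BLOCK_GROUP_RAID0:	return "raid0";
	case BTRFS_BLOCK_GROUP_RAID1:	return "raid1";
	case BTRFS_BLOCK_GROUP_DUP:	return "dup";
	case BTRFS_BLOCK_GROUP_RAID10:	return "raid10";
	case BTRFS_BLOCK_GROUP_RAID5:	return "raid5";
	case BTRFS_BLOCK_GROUP_RAID6:	return "raid6";
	default:			return "unknown";
	}
}
```

So profile_name(128) yields "raid5": the kernel is rejecting the raid5 target itself, for both data and metadata.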


Piotr Szymaniak.
-- 
I don't believe they'd send you there, not in this country, where killers
get a slap on the wrist and, after two years of watching color TV in
prison, are let back out on the streets so they can kill again.
  -- Stephen King, Apt Pupil




Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-10 Thread Trond Myklebust
Hi Zach,

On Fri, Apr 10, 2015 at 6:00 PM, Zach Brown z...@redhat.com wrote:
> Add a copy_file_range() system call for offloading copies between
> regular files.
>
> This gives an interface to underlying layers of the storage stack which
> can copy without reading and writing all the data.  There are a few
> candidates that should support copy offloading in the nearer term:
>
> - btrfs shares extent references with its clone ioctl
> - NFS has patches to add a COPY command which copies on the server
> - SCSI has a family of XCOPY commands which copy in the device
>
> This system call avoids the complexity of also accelerating the creation
> of the destination file by operating on an existing destination file
> descriptor, not a path.
>
> Currently the high level vfs entry point limits copy offloading to files
> on the same mount and super (and not in the same file).  This can be
> relaxed if we get implementations which can copy between file systems
> safely.
>
> Signed-off-by: Zach Brown z...@redhat.com
> ---
>  fs/read_write.c                   | 129 ++
>  include/linux/fs.h                |   3 +
>  include/uapi/asm-generic/unistd.h |   4 +-
>  kernel/sys_ni.c                   |   1 +
>  4 files changed, 136 insertions(+), 1 deletion(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 8e1b687..c65ce1d 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -17,6 +17,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/splice.h>
>  #include <linux/compat.h>
> +#include <linux/mount.h>
>  #include "internal.h"
>
>  #include <asm/uaccess.h>
> @@ -1424,3 +1425,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
>  	return do_sendfile(out_fd, in_fd, NULL, count, 0);
>  }
>  #endif
> +
> +/*
> + * copy_file_range() differs from regular file read and write in that it
> + * specifically allows return partial success.  When it does so is up to
> + * the copy_file_range method.
> + */
> +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> +			    struct file *file_out, loff_t pos_out,
> +			    size_t len, int flags)

I'm going to repeat a gripe with this interface. I really don't think
we should treat copy_file_range() as taking a size_t length, since
that is not sufficient to do a full file copy on 32-bit systems w/ LFS
support.

Could we perhaps instead of a length, define a 'pos_in_start' and a
'pos_in_end' offset (with the latter being -1 for a full-file copy)
and then return an 'loff_t' value stating where the copy ended?

Note that both btrfs and NFSv4.2 allow for 64-bit lengths, so this
interface would be closer to what is already in use anyway.

Cheers
  Trond
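
To put a rough number on the concern above, here is a small sketch that counts how many calls a full-file copy would need when each call can report at most a 32-bit ssize_t worth of progress. The cap of INT32_MAX per call is an assumption about an ILP32 build, and min_calls() is a hypothetical helper, not part of the proposed interface.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed per-call cap: the largest value a 32-bit ssize_t can return. */
#define MAX_PER_CALL_32	((uint64_t)INT32_MAX)

/* Minimum number of calls needed to cover file_size bytes. */
static uint64_t min_calls(uint64_t file_size, uint64_t max_per_call)
{
	if (file_size == 0)
		return 0;
	/* Ceiling division written so it cannot overflow. */
	return (file_size - 1) / max_per_call + 1;
}
```

Under that assumption a 1 TiB file, min_calls(1ULL << 40, MAX_PER_CALL_32), needs 513 calls even if the file system could clone the whole thing in one operation.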


Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-10 Thread Andreas Dilger
On Apr 10, 2015, at 4:00 PM, Zach Brown z...@redhat.com wrote:
 
> Add a copy_file_range() system call for offloading copies between
> regular files.
>
> This gives an interface to underlying layers of the storage stack which
> can copy without reading and writing all the data.  There are a few
> candidates that should support copy offloading in the nearer term:
>
> - btrfs shares extent references with its clone ioctl
> - NFS has patches to add a COPY command which copies on the server
> - SCSI has a family of XCOPY commands which copy in the device
>
> This system call avoids the complexity of also accelerating the creation
> of the destination file by operating on an existing destination file
> descriptor, not a path.
>
> Currently the high level vfs entry point limits copy offloading to files
> on the same mount and super (and not in the same file).  This can be
> relaxed if we get implementations which can copy between file systems
> safely.
>
> Signed-off-by: Zach Brown z...@redhat.com
> ---
>  fs/read_write.c                   | 129 ++
>  include/linux/fs.h                |   3 +
>  include/uapi/asm-generic/unistd.h |   4 +-
>  kernel/sys_ni.c                   |   1 +
>  4 files changed, 136 insertions(+), 1 deletion(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 8e1b687..c65ce1d 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -17,6 +17,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/splice.h>
>  #include <linux/compat.h>
> +#include <linux/mount.h>
>  #include "internal.h"
>
>  #include <asm/uaccess.h>
> @@ -1424,3 +1425,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
>  	return do_sendfile(out_fd, in_fd, NULL, count, 0);
>  }
>  #endif
> +
> +/*
> + * copy_file_range() differs from regular file read and write in that it
> + * specifically allows return partial success.  When it does so is up to
> + * the copy_file_range method.
> + */
> +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> +			    struct file *file_out, loff_t pos_out,
> +			    size_t len, int flags)

Minor nit - flags should be unsigned int to match the syscall.

> +{
> +	struct inode *inode_in;
> +	struct inode *inode_out;
> +	ssize_t ret;
> +
> +	if (flags)
> +		return -EINVAL;
> +
> +	if (len == 0)
> +		return 0;
> +
> +	/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT  */

This says ssize_t, but the len parameter is size_t...

> +	ret = rw_verify_area(READ, file_in, &pos_in, len);
> +	if (ret >= 0)
> +		ret = rw_verify_area(WRITE, file_out, &pos_out, len);
> +	if (ret < 0)
> +		return ret;
> +
> +	if (!(file_in->f_mode & FMODE_READ) ||
> +	    !(file_out->f_mode & FMODE_WRITE) ||
> +	    (file_out->f_flags & O_APPEND) ||
> +	    !file_in->f_op || !file_in->f_op->copy_file_range)
> +		return -EINVAL;
> +
> +	inode_in = file_inode(file_in);
> +	inode_out = file_inode(file_out);
> +
> +	/* make sure offsets don't wrap and the input is inside i_size */
> +	if (pos_in + len < pos_in || pos_out + len < pos_out ||
> +	    pos_in + len > i_size_read(inode_in))
> +		return -EINVAL;
> +
> +	/* this could be relaxed once a method supports cross-fs copies */
> +	if (inode_in->i_sb != inode_out->i_sb ||
> +	    file_in->f_path.mnt != file_out->f_path.mnt)
> +		return -EXDEV;
> +
> +	/* forbid ranges in the same file */
> +	if (inode_in == inode_out)
> +		return -EINVAL;
> +
> +	ret = mnt_want_write_file(file_out);
> +	if (ret)
> +		return ret;
> +
> +	ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
> +					     len, flags);
> +	if (ret > 0) {
> +		fsnotify_access(file_in);
> +		add_rchar(current, ret);
> +		fsnotify_modify(file_out);
> +		add_wchar(current, ret);
> +	}
> +	inc_syscr(current);
> +	inc_syscw(current);
> +
> +	mnt_drop_write_file(file_out);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(vfs_copy_file_range);
> +
> +SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
> +		int, fd_out, loff_t __user *, off_out,
> +		size_t, len, unsigned int, flags)
> +{
> +	loff_t pos_in;
> +	loff_t pos_out;
> +	struct fd f_in;
> +	struct fd f_out;
> +	ssize_t ret;
> +
> +	f_in = fdget(fd_in);
> +	f_out = fdget(fd_out);
> +	if (!f_in.file || !f_out.file) {
> +		ret = -EBADF;
> +		goto out;
> +	}
> +
> +	ret = -EFAULT;
> +	if (off_in) {
> +		if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
> +			goto out;
> +	} else {
> +		pos_in = f_in.file->f_pos;
> +	}
> +
> +	if (off_out) {
> +		if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
> +			goto out;
> +	} else {
> +		pos_out = f_out.file->f_pos;
> +	}
> +
> +	ret = 

Re: [PATCH RFC 1/3] vfs: add copy_file_range syscall and vfs helper

2015-04-10 Thread Zach Brown
On Fri, Apr 10, 2015 at 06:36:41PM -0400, Trond Myklebust wrote:
> On Fri, Apr 10, 2015 at 6:00 PM, Zach Brown z...@redhat.com wrote:
>
> > +
> > +/*
> > + * copy_file_range() differs from regular file read and write in that it
> > + * specifically allows return partial success.  When it does so is up to
> > + * the copy_file_range method.
> > + */
> > +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > +			    struct file *file_out, loff_t pos_out,
> > +			    size_t len, int flags)
>
> I'm going to repeat a gripe with this interface. I really don't think
> we should treat copy_file_range() as taking a size_t length, since
> that is not sufficient to do a full file copy on 32-bit systems w/ LFS
> support.

*nod*.  The length type is limited by the syscall return type and the
arbitrary desire to mimic read/write.

I sympathize with wanting to copy giant files with operations that don't
scale with file size because files can be enormous but sparse.

> Could we perhaps instead of a length, define a 'pos_in_start' and a
> 'pos_in_end' offset (with the latter being -1 for a full-file copy)
> and then return an 'loff_t' value stating where the copy ended?

Well, the resulting offset will be set if the caller provided it.  So
they could already be getting the copied length from that.  But they
might not specify the offsets.  Maybe they're just using the results to
total up a completion indicator.

Maybe we could make the length a pointer like the offsets that's set to
the copied length on return.

This all seems pretty gross.  Does anyone else have a vote?

(And I'll argue strongly against creating magical offset values that
change behaviour.  If we want to ignore arguments and get the length
from the source file we'd add a flag to do so.)

> Note that both btrfs and NFSv4.2 allow for 64-bit lengths, so this
> interface would be closer to what is already in use anyway.

Yeah, btrfs doesn't allow partial progress.  It returns 0 on success.
We could also do that but people have expressed an interest in returning
partial progress.

- z
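
As an illustration of the two return conventions discussed above, here is a minimal sketch of adapting an all-or-nothing operation (0 on success, negative errno on failure) to a bytes-copied return, which is the shape of the btrfs_copy_file_range() wrapper in patch 3/3. The clone_fn type and the clone_ok()/clone_fail() stubs are hypothetical stand-ins, not btrfs code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical all-or-nothing clone: 0 on success, -errno on failure. */
typedef int (*clone_fn)(size_t len);

/*
 * Adapter: an all-or-nothing success is reported as "len bytes copied";
 * a failure is passed through unchanged. Partial progress never occurs.
 */
static ssize_t copy_range_from_clone(clone_fn clone, size_t len)
{
	int ret = clone(len);

	if (ret == 0)
		return (ssize_t)len;
	return ret;
}

/* Stubs standing in for a successful and a failing clone. */
static int clone_ok(size_t len)   { (void)len; return 0; }
static int clone_fail(size_t len) { (void)len; return -EINVAL; }
```

A caller written for partial progress still works against such an adapter; it simply never sees a short count.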


Re: btrfs balance {,meta}data to raid5 not working?

2015-04-10 Thread Duncan
Piotr Szymaniak posted on Sat, 11 Apr 2015 00:10:49 +0200 as excerpted:

> Hi,
>
> I tried today to balance two drive btrfs raid1 to two drive btrfs raid5
> without luck: [snipped]
>
> Linux 3.19.3
> btrfs-progs v3.19.1

Two points:

1) There is (was?) a known bug with balance-conversion in (near-)current 
btrfs.  It was broken for a time, and I'm not sure it is fixed yet.  I'm 
also not sure whether it was a user-side or kernel-side issue, tho I 
believe the culprit commit has been traced and posted, so the answer 
should be on the back-list if nobody else replies here with more specific 
info.

Which presents a problem, since fully working raid5 support is so new.  
But for conversion, I /think/ you can use somewhat older versions and do 
the conversion, then use current versions that better handle problems for 
actual operation.  If I only knew which part, userspace or kernelspace, 
you have to use an old version of...

But you could try the latest 4.0-rc7+ kernel and see if it works with 
that, yet.

2) You specify two drives[1] and an intended conversion to raid5.  
Normally/traditionally, raid5 needs three devices to function undegraded, 
altho technically, two-device raid5 is possible; it's just effectively a 
slow raid1.  There has been some discussion around whether btrfs should 
enable two-device raid5 or not, but regardless of whether it's actually
/possible/, why would you /want/ it?

2a) If your intention is to keep it two devices, just continue using 
raid1, particularly with btrfs where raid1 mode is MUCH more mature and 
tested than raid5 mode.

2b) If instead your intention was to convert it to raid5 before upgrading 
it to three devices, just add the third device first, then do the balance-
conversion.  It'll save quite some time over effectively doing the 
balance-conversion twice.
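
To make the "slow raid1" remark in point 2 concrete, here is a back-of-the-envelope sketch of usable capacity. It assumes equal-size devices and ignores metadata overhead and chunk allocation details; raid1_usable() and raid5_usable() are made-up helpers, not anything btrfs exposes.

```c
#include <assert.h>
#include <stdint.h>

/* btrfs raid1 keeps two copies of every block, whatever the device count. */
static uint64_t raid1_usable(unsigned ndev, uint64_t dev_size)
{
	return ndev < 2 ? 0 : (uint64_t)ndev * dev_size / 2;
}

/* raid5 spends one device's worth of space on parity. */
static uint64_t raid5_usable(unsigned ndev, uint64_t dev_size)
{
	return ndev < 2 ? 0 : (uint64_t)(ndev - 1) * dev_size;
}
```

With two 1000 GB devices both functions give 1000, so raid5 gains no space over raid1 there; the difference only appears at three devices, where raid5 gives 2000 against 1500 for raid1.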

---
[1] Disks/drives/devices.  In a modern world of SSDs and virtual devices, 
a block device may well be neither a disk nor an actual drive. (Does SSD 
refer to a solid state /device/, or a solid state /drive/; it's certainly 
not a /disk/?  Either way, a virtual device may not in fact be a drive of 
any sort at all, while still being a device.)  I guess I'm not alone 
among experienced users and sysadmins of an earlier era, who find 
themselves now trying to retrain themselves to use the more accurate 
generic term in most contexts...

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
