Re: [RFC v0 0/4] sys_copy_range() rough draft

2013-05-14 Thread Dave Chinner
On Tue, May 14, 2013 at 03:04:40PM -0700, Zach Brown wrote:
> On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote:
> > On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> > > I'm going to keep hacking away at this.  My next step is to get ext4
> > > supporting .copy_range, probably with a quick hack to copy the
> > > contents of bios.  Hopefully that'll give enough time to also integrate
> > > review feedback.
> > 
> > Wouldn't the easiest "support all filesystems" hack just be to add
> > a destination offset parameter to do_splice_direct() and call that
> > when the filesystem doesn't supply a ->copy_range method? i.e. use
> > the mechanisms we already have for copying from one file to another
> > via the page cache as efficiently as possible?
> 
> Probably; and this in-kernel buffered fallback is particularly desirable
> for nfsd when the exported fs doesn't provide .copy_range.  Having nfsd
> service the COPY op is still a significant win over having the client
> move the data back and forth over the wire.

Sure. That's kind of what I was thinking to make it easy to test and
have widespread support up front.

> But in that quote above I was talking about implementing .copy_range in
> ext4 as though it could use XCOPY today.  I'd like to get a feel for how
> bad it's going to be to juggle the bio XCOPY IO with unwritten extent
> conversion, RMW with overlapping existing blocks, i_size advancing, etc.
> (It's so much like O_DIRECT that I'm already crying a little.)

Toss anything that is hard back to the page cache path. Overlapping
blocks, partial blocks and so on can be handled by the slow path
without making the offload path complex.

Make the offload do the simple stuff fast - the mapping and
completion callbacks should be no different to the direct IO bits we
have now, and if you only handle filesystem block aligned ranges in
the offload (rather than sector alignment) most of the grot that DIO
code has to handle goes away.
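
A minimal sketch of that split, assuming a hypothetical helper named
split_range() that is not part of the posted patches. It carves a copy
into an unaligned head, a filesystem-block-aligned middle that an
offload could take, and an unaligned tail; the head and tail would go
back to the page cache path. It also assumes the source and destination
offsets must sit at the same position within a block for any offload to
happen, and it ignores offset overflow:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: how one copy range divides up. */
struct range_split {
    uint64_t head_len;    /* unaligned bytes before the middle (slow path) */
    uint64_t aligned_off; /* source offset where the aligned middle starts */
    uint64_t aligned_len; /* middle length, a multiple of blksize (offload) */
    uint64_t tail_len;    /* unaligned bytes after the middle (slow path) */
};

/* blksize must be a power of two, e.g. 4096. */
static struct range_split split_range(uint64_t src_off, uint64_t dst_off,
                                      uint64_t len, uint64_t blksize)
{
    struct range_split s = { len, src_off, 0, 0 };
    uint64_t mask, end, first, last;

    assert(blksize != 0 && (blksize & (blksize - 1)) == 0);
    mask = blksize - 1;

    /* Different in-block offsets: no block lines up, all slow path. */
    if ((src_off ^ dst_off) & mask)
        return s;

    end = src_off + len;
    first = (src_off + mask) & ~mask; /* round start up to a block */
    last = end & ~mask;               /* round end down to a block */

    /* Range does not cover a whole block: all slow path. */
    if (first >= last)
        return s;

    s.head_len = first - src_off;
    s.aligned_off = first;
    s.aligned_len = last - first;
    s.tail_len = end - last;
    return s;
}
```

With 4096-byte blocks, a 10000-byte copy starting at source offset 100
would give a 3996-byte head, one aligned 4096-byte block, and a
1908-byte tail.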

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC v0 0/4] sys_copy_range() rough draft

2013-05-14 Thread Zach Brown
On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote:
> On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> > I'm going to keep hacking away at this.  My next step is to get ext4
> > supporting .copy_range, probably with a quick hack to copy the
> > contents of bios.  Hopefully that'll give enough time to also integrate
> > review feedback.
> 
> Wouldn't the easiest "support all filesystems" hack just be to add
> a destination offset parameter to do_splice_direct() and call that
> when the filesystem doesn't supply a ->copy_range method? i.e. use
> the mechanisms we already have for copying from one file to another
> via the page cache as efficiently as possible?

Probably; and this in-kernel buffered fallback is particularly desirable
for nfsd when the exported fs doesn't provide .copy_range.  Having nfsd
service the COPY op is still a significant win over having the client
move the data back and forth over the wire.

But in that quote above I was talking about implementing .copy_range in
ext4 as though it could use XCOPY today.  I'd like to get a feel for how
bad it's going to be to juggle the bio XCOPY IO with unwritten extent
conversion, RMW with overlapping existing blocks, i_size advancing, etc.
(It's so much like O_DIRECT that I'm already crying a little.)

- z


Re: [RFC v0 0/4] sys_copy_range() rough draft

2013-05-14 Thread Dave Chinner
On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> We've been talking about implementing some form of bulk data copy
> offloading for a while now.  BTRFS and OCFS2 implement forms of copy
> offloading with ioctls, NFS 4.2 will include a byte-granular COPY
> operation, and the SCSI XCOPY command is being implemented now that
> Windows can issue it.
> 
> In the past we've discussed promoting the ocfs2 reflink ioctl into a
> system call that would create a new file and implicitly copy the
> source data into the new file:
> https://lkml.org/lkml/2009/9/14/481
> 
> These draft patches take the simpler approach of only copying data
> between existing files.  The patches 1) make a system call out of the
> btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with
> the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY
> op, and 4) serve the COPY op in nfsd by calling the .copy_range method
> again.
> 
> The nfs patch is an untested hack.  I'm happy to beat it into shape
> but I'll need some guidance.
> 
> I'd like strong review feedback on the interfaces, here are some
> possible topics:
> 
> a) Hopefully being able to specify a portion of the data to copy will
> avoid *huge* syscall latencies and the motivation for new async
> semantics.
> 
> b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy
> from the start offset to the end of the file.  Does anyone have a
> strong feeling about this?  I'm leaning towards not bothering with it
> in the syscall interface.
> 
> c) I chose to return partial progress in the ssize_t return code.  This
> limits the length of the range and the size_t count argument can be too
> large and return errors, much like other io syscalls.  This seemed
> less awful than some extra argument with a pointer to a status value.
> 
> d) I'm dreading mentioning a vector of ranges to copy in one syscall
> because I don't want to think about overlapping ranges and file systems
> that use range locks -- xfs for now, but more if Jan gets his way.

XFS doesn't use range locks (yet).

> I'd rather that we get some experience with this simpler syscall before
> taking on that headache.
> 
> I'm sure I'm forgetting some other details.
> 
> I'm going to keep hacking away at this.  My next step is to get ext4
> supporting .copy_range, probably with a quick hack to copy the
> contents of bios.  Hopefully that'll give enough time to also integrate
> review feedback.

Wouldn't the easiest "support all filesystems" hack just be to add
a destination offset parameter to do_splice_direct() and call that
when the filesystem doesn't supply a ->copy_range method? i.e. use
the mechanisms we already have for copying from one file to another
via the page cache as efficiently as possible?
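
A rough userspace model of that dispatch, as a sketch only. The names
struct fs_ops, buffered_copy() and copy_range_dispatch() are
hypothetical, and the pread()/pwrite() loop merely stands in for the
in-kernel do_splice_direct() path, which this code does not call:

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a filesystem's method table. */
struct fs_ops {
    ssize_t (*copy_range)(int fd_in, off_t off_in,
                          int fd_out, off_t off_out, size_t len);
};

/* Generic fallback: copy through a bounce buffer.  Returns the bytes
 * copied, or -1 if nothing was copied before an error. */
static ssize_t buffered_copy(int fd_in, off_t off_in,
                             int fd_out, off_t off_out, size_t len)
{
    char buf[4096];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
        ssize_t r = pread(fd_in, buf, want, off_in + (off_t)done);
        ssize_t put = 0;

        if (r < 0)
            return done ? (ssize_t)done : -1;
        if (r == 0)
            break; /* end of source file */

        while (put < r) {
            ssize_t w = pwrite(fd_out, buf + put, (size_t)(r - put),
                               off_out + (off_t)done + put);
            if (w <= 0)
                return (done + (size_t)put) ? (ssize_t)(done + (size_t)put) : -1;
            put += w;
        }
        done += (size_t)r;
    }
    return (ssize_t)done;
}

/* Use the filesystem's method when it supplies one, else fall back. */
static ssize_t copy_range_dispatch(const struct fs_ops *ops,
                                   int fd_in, off_t off_in,
                                   int fd_out, off_t off_out, size_t len)
{
    if (ops && ops->copy_range)
        return ops->copy_range(fd_in, off_in, fd_out, off_out, len);
    return buffered_copy(fd_in, off_in, fd_out, off_out, len);
}
```

The point of the shape is that callers never need to know which path
ran; every filesystem gets a working copy, and only the ones with a
native method get the fast one.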

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


[RFC v0 0/4] sys_copy_range() rough draft

2013-05-14 Thread Zach Brown
We've been talking about implementing some form of bulk data copy
offloading for a while now.  BTRFS and OCFS2 implement forms of copy
offloading with ioctls, NFS 4.2 will include a byte-granular COPY
operation, and the SCSI XCOPY command is being implemented now that
Windows can issue it.

In the past we've discussed promoting the ocfs2 reflink ioctl into a
system call that would create a new file and implicitly copy the
source data into the new file:
https://lkml.org/lkml/2009/9/14/481

These draft patches take the simpler approach of only copying data
between existing files.  The patches 1) make a system call out of the
btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with
the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY
op, and 4) serve the COPY op in nfsd by calling the .copy_range method
again.

The nfs patch is an untested hack.  I'm happy to beat it into shape
but I'll need some guidance.

I'd like strong review feedback on the interfaces, here are some
possible topics:

a) Hopefully being able to specify a portion of the data to copy will
avoid *huge* syscall latencies and the motivation for new async
semantics.

b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy
from the start offset to the end of the file.  Does anyone have a
strong feeling about this?  I'm leaning towards not bothering with it
in the syscall interface.

c) I chose to return partial progress in the ssize_t return code.  This
limits the length of the range and the size_t count argument can be too
large and return errors, much like other io syscalls.  This seemed
less awful than some extra argument with a pointer to a status value.

d) I'm dreading mentioning a vector of ranges to copy in one syscall
because I don't want to think about overlapping ranges and file systems
that use range locks -- xfs for now, but more if Jan gets his way.
I'd rather that we get some experience with this simpler syscall before
taking on that headache.
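
To illustrate point c), here is a sketch of how a caller might cope
with partial progress reported through an ssize_t return.  The copy_fn
signature is an assumption, since the draft's actual prototype isn't
shown in this message, and stub_copy() is a made-up stand-in that
pretends to copy at most 4096 bytes per call:

```c
#include <stddef.h>
#include <sys/types.h>

/* Assumed shape of the copy primitive; hypothetical, for illustration. */
typedef ssize_t (*copy_fn)(int fd_in, off_t off_in,
                           int fd_out, off_t off_out, size_t count);

/* Keep calling until count bytes are copied, the primitive returns 0,
 * or it fails.  Returns the bytes copied, or the error if none were. */
static ssize_t copy_all(copy_fn copy, int fd_in, off_t off_in,
                        int fd_out, off_t off_out, size_t count)
{
    size_t done = 0;

    while (done < count) {
        ssize_t n = copy(fd_in, off_in + (off_t)done,
                         fd_out, off_out + (off_t)done, count - done);
        if (n < 0)
            return done ? (ssize_t)done : n;
        if (n == 0)
            break; /* no further progress possible */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Illustrative stub: reports a short copy of at most 4096 bytes. */
static int stub_calls;

static ssize_t stub_copy(int fd_in, off_t off_in,
                         int fd_out, off_t off_out, size_t count)
{
    (void)fd_in; (void)off_in; (void)fd_out; (void)off_out;
    stub_calls++;
    return (ssize_t)(count < 4096 ? count : 4096);
}
```

Calling copy_all(stub_copy, -1, 0, -1, 0, 10000) would take three
calls to the stub (4096 + 4096 + 1808 bytes), the same resubmit
pattern callers already use for short read() and write() returns.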

I'm sure I'm forgetting some other details.

I'm going to keep hacking away at this.  My next step is to get ext4
supporting .copy_range, probably with a quick hack to copy the
contents of bios.  Hopefully that'll give enough time to also integrate
review feedback.

Thoughts?

- z

