Re: [RFC v0 0/4] sys_copy_range() rough draft
On Tue, May 14, 2013 at 03:04:40PM -0700, Zach Brown wrote:
> On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote:
> > On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> > > I'm going to keep hacking away at this. My next step is to get ext4
> > > supporting .copy_range, probably with a quick hack to copy the
> > > contents of bios. Hopefully that'll give enough time to also integrate
> > > review feedback.
> >
> > Wouldn't the easiest "support all filesystems" hack just be to add
> > a destination offset parameter to do_splice_direct() and call that
> > when the filesystem doesn't supply a ->copy_range method? i.e. use
> > the mechanisms we already have for copying from one file to another
> > via the page cache as efficiently as possible?
>
> Probably; and this in-kernel buffered fallback is particularly desirable
> for nfsd when the exported fs doesn't provide .copy_range. Having nfsd
> service the COPY op is still a significant win over having the client
> move the data back and forth over the wire.

Sure. That's kind of what I was thinking to make it easy to test and
have widespread support up front.

> But in that quote above I was talking about implementing .copy_range in
> ext4 as though it could use XCOPY today. I'd like to get a feel for how
> bad it's going to be to juggle the bio XCOPY IO with unwritten extent
> conversion, RMW with overlapping existing blocks, i_size advancing, etc.
> (It's so much like O_DIRECT that I'm already crying a little.)

Toss anything that is hard back to the page cache path. Overlapping
blocks, partial blocks and so on can be handled by the slow path without
making the offload path complex. Make the offload do the simple
stuff fast - the mapping and completion callbacks should be no
different to the direct IO bits we have now, and if you only handle
filesystem block aligned ranges in the offload (rather than sector
alignment) most of the grot that DIO code has to handle goes away.

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
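
The block-alignment suggestion in the message above can be illustrated with a little arithmetic. What follows is only a userspace sketch, not code from the patches: `struct range_split` and `split_range()` are hypothetical names, the block size is assumed to be a power of two, and `src_off + len` is assumed not to overflow. It carves a byte range into an unaligned head, a run of whole filesystem blocks that an offload path could take, and an unaligned tail; the head and tail would go to the page cache path. If source and destination sit at different offsets within a block, nothing is treated as offloadable. Overlap between source and destination in the same file is not checked here and would also belong on the slow path.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type (not from the patches): how many bytes of a
 * copy go to the slow path before and after the offloadable middle. */
struct range_split {
    uint64_t head_len;  /* unaligned bytes before the first whole block */
    uint64_t mid_len;   /* whole blocks only; 0 if nothing can be offloaded */
    uint64_t tail_len;  /* unaligned bytes after the last whole block */
};

/* Split [src_off, src_off + len) for a copy to dst_off, given block size
 * bs. Assumes bs is a power of two and src_off + len does not overflow. */
static struct range_split split_range(uint64_t src_off, uint64_t dst_off,
                                      uint64_t len, uint64_t bs)
{
    struct range_split s = { len, 0, 0 };  /* default: all slow path */

    assert(bs != 0 && (bs & (bs - 1)) == 0);

    /* If source and destination start at different offsets within a
     * block, no block of the source maps onto a whole destination block. */
    if ((src_off & (bs - 1)) != (dst_off & (bs - 1)))
        return s;

    uint64_t end = src_off + len;
    uint64_t first = (src_off + bs - 1) & ~(bs - 1);  /* round start up */
    uint64_t last = end & ~(bs - 1);                  /* round end down */

    if (first >= last)
        return s;  /* the range contains no whole block */

    s.head_len = first - src_off;
    s.mid_len = last - first;
    s.tail_len = end - last;
    return s;
}
```

With 4096-byte blocks, a 10000-byte copy starting at source offset 1000 and destination offset 5096 splits into a 3096-byte head, one 4096-byte block and a 2808-byte tail.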
Re: [RFC v0 0/4] sys_copy_range() rough draft
On Wed, May 15, 2013 at 07:42:51AM +1000, Dave Chinner wrote:
> On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> > I'm going to keep hacking away at this. My next step is to get ext4
> > supporting .copy_range, probably with a quick hack to copy the
> > contents of bios. Hopefully that'll give enough time to also integrate
> > review feedback.
>
> Wouldn't the easiest "support all filesystems" hack just be to add
> a destination offset parameter to do_splice_direct() and call that
> when the filesystem doesn't supply a ->copy_range method? i.e. use
> the mechanisms we already have for copying from one file to another
> via the page cache as efficiently as possible?

Probably; and this in-kernel buffered fallback is particularly desirable
for nfsd when the exported fs doesn't provide .copy_range. Having nfsd
service the COPY op is still a significant win over having the client
move the data back and forth over the wire.

But in that quote above I was talking about implementing .copy_range in
ext4 as though it could use XCOPY today. I'd like to get a feel for how
bad it's going to be to juggle the bio XCOPY IO with unwritten extent
conversion, RMW with overlapping existing blocks, i_size advancing, etc.
(It's so much like O_DIRECT that I'm already crying a little.)

- z
Re: [RFC v0 0/4] sys_copy_range() rough draft
On Tue, May 14, 2013 at 02:15:22PM -0700, Zach Brown wrote:
> We've been talking about implementing some form of bulk data copy
> offloading for a while now. BTRFS and OCFS2 implement forms of copy
> offloading with ioctls, NFS 4.2 will include a byte-granular COPY
> operation, and the SCSI XCOPY command is being implemented now that
> Windows can issue it.
>
> In the past we've discussed promoting the ocfs2 reflink ioctl into a
> system call that would create a new file and implicitly copy the
> source data into the new file:
>
>   https://lkml.org/lkml/2009/9/14/481
>
> These draft patches take the simpler approach of only copying data
> between existing files. The patches 1) make a system call out of the
> btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with
> the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY
> op, and 4) serve the COPY op in nfsd by calling the .copy_range method
> again.
>
> The nfs patch is an untested hack. I'm happy to beat it into shape
> but I'll need some guidance.
>
> I'd like strong review feedback on the interfaces. Here are some
> possible topics:
>
> a) Hopefully being able to specify a portion of the data to copy will
> avoid *huge* syscall latencies and the motivation for new async
> semantics.
>
> b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy
> from the start offset to the end of the file. Does anyone have a
> strong feeling about this? I'm leaning towards not bothering with it
> in the syscall interface.
>
> c) I chose to return partial progress in the ssize_t return code. This
> limits the length of the range, and a size_t count argument that is too
> large returns an error, much like other io syscalls. This seemed
> less awful than some extra argument with a pointer to a status value.
>
> d) I'm dreading mentioning a vector of ranges to copy in one syscall
> because I don't want to think about overlapping ranges and file systems
> that use range locks -- xfs for now, but more if Jan gets his way.

XFS doesn't use range locks (yet).

> I'd rather that we get some experience with this simpler syscall before
> taking on that headache.
>
> I'm sure I'm forgetting some other details.
>
> I'm going to keep hacking away at this. My next step is to get ext4
> supporting .copy_range, probably with a quick hack to copy the
> contents of bios. Hopefully that'll give enough time to also integrate
> review feedback.

Wouldn't the easiest "support all filesystems" hack just be to add
a destination offset parameter to do_splice_direct() and call that
when the filesystem doesn't supply a ->copy_range method? i.e. use
the mechanisms we already have for copying from one file to another
via the page cache as efficiently as possible?

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
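
The fallback proposed in the message above (use the filesystem's method when it exists, otherwise copy through the page cache) can be modelled in userspace. This is only a sketch under stated assumptions: `copy_range_fn`, `buffered_copy()` and `do_copy_range()` are hypothetical names, the method signature is a guess and not the one in the patches, and `pread`/`pwrite` through a bounce buffer stand in for what do_splice_direct() with a destination offset would do inside the kernel.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-filesystem hook modelled on the .copy_range method
 * discussed in the thread; the real signature may differ. */
typedef ssize_t (*copy_range_fn)(int fd_in, off_t off_in,
                                 int fd_out, off_t off_out, size_t len);

/* Userspace stand-in for the buffered fallback: bounce the data through
 * a small buffer. Returns the bytes copied (possibly fewer than len),
 * or -1 if an error occurred before anything was copied. */
static ssize_t buffered_copy(int fd_in, off_t off_in,
                             int fd_out, off_t off_out, size_t len)
{
    char buf[4096];
    size_t done = 0;

    while (done < len) {
        size_t want = len - done;
        if (want > sizeof(buf))
            want = sizeof(buf);

        ssize_t r = pread(fd_in, buf, want, off_in + (off_t)done);
        if (r < 0 && errno == EINTR)
            continue;
        if (r < 0)
            return done ? (ssize_t)done : -1;
        if (r == 0)
            break;  /* end of the source file */

        size_t put = 0;
        while (put < (size_t)r) {
            ssize_t w = pwrite(fd_out, buf + put, (size_t)r - put,
                               off_out + (off_t)(done + put));
            if (w < 0 && errno == EINTR)
                continue;
            if (w <= 0)
                return (done + put) ? (ssize_t)(done + put) : -1;
            put += (size_t)w;
        }
        done += put;
    }
    return (ssize_t)done;
}

/* Prefer the filesystem's own method; fall back to the buffered copy
 * when the filesystem does not supply one (fs_method == NULL). */
static ssize_t do_copy_range(copy_range_fn fs_method,
                             int fd_in, off_t off_in,
                             int fd_out, off_t off_out, size_t len)
{
    if (fs_method)
        return fs_method(fd_in, off_in, fd_out, off_out, len);
    return buffered_copy(fd_in, off_in, fd_out, off_out, len);
}
```

Calling `do_copy_range(NULL, ...)` exercises the fallback alone, which is the property the suggestion is after: every filesystem gets a working copy path up front, and offload becomes an optimisation added per filesystem.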
[RFC v0 0/4] sys_copy_range() rough draft
We've been talking about implementing some form of bulk data copy
offloading for a while now. BTRFS and OCFS2 implement forms of copy
offloading with ioctls, NFS 4.2 will include a byte-granular COPY
operation, and the SCSI XCOPY command is being implemented now that
Windows can issue it.

In the past we've discussed promoting the ocfs2 reflink ioctl into a
system call that would create a new file and implicitly copy the
source data into the new file:

  https://lkml.org/lkml/2009/9/14/481

These draft patches take the simpler approach of only copying data
between existing files. The patches 1) make a system call out of the
btrfs CLONE_RANGE ioctl, 2) implement the btrfs .copy_range method with
the ioctl's guts, 3) implement the nfs .copy_range by sending a COPY
op, and 4) serve the COPY op in nfsd by calling the .copy_range method
again.

The nfs patch is an untested hack. I'm happy to beat it into shape
but I'll need some guidance.

I'd like strong review feedback on the interfaces. Here are some
possible topics:

a) Hopefully being able to specify a portion of the data to copy will
avoid *huge* syscall latencies and the motivation for new async
semantics.

b) The BTRFS ioctl and nfs COPY let you specify a count of 0 to copy
from the start offset to the end of the file. Does anyone have a
strong feeling about this? I'm leaning towards not bothering with it
in the syscall interface.

c) I chose to return partial progress in the ssize_t return code. This
limits the length of the range, and a size_t count argument that is too
large returns an error, much like other io syscalls. This seemed
less awful than some extra argument with a pointer to a status value.

d) I'm dreading mentioning a vector of ranges to copy in one syscall
because I don't want to think about overlapping ranges and file systems
that use range locks -- xfs for now, but more if Jan gets his way.
I'd rather that we get some experience with this simpler syscall before
taking on that headache.

I'm sure I'm forgetting some other details.

I'm going to keep hacking away at this. My next step is to get ext4
supporting .copy_range, probably with a quick hack to copy the
contents of bios. Hopefully that'll give enough time to also integrate
review feedback.

Thoughts?

- z