Re: [RFC] extending splice for copy offloading

2013-12-18 Thread Anna Schumaker
On 12/18/2013 12:10 PM, Zach Brown wrote:
> On Wed, Dec 18, 2013 at 04:41:26AM -0800, Christoph Hellwig wrote:
>> On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote:
>>> When I first started on this stuff I followed the lead of previous
>>> work and added a new syscall for the copy operation:
>>>
>>> https://lkml.org/lkml/2013/5/14/618
>>>
>>> Towards the end of that thread Eric Wong asked why we didn't just
>>> extend splice.  I immediately replied with some dumb dismissive
>>> answer.  Once I sat down and looked at it, though, it does make a
>>> lot of sense.  So good job, Eric.  +10 Dummie points for me.
>>>
>>> Extending splice avoids all the noise of adding a new syscall and
>>> naturally falls back to buffered copying as that's what the direct
>>> splice path does for sendfile() today.
>> Given the convolute mess that the splice code already is I'd rather
>> prefer not overloading it even further.
> I agree after trying to weave the copy offloading API into the splice
> interface.  There are also weird cases that we haven't really discussed
> so far (preserving unwritten allocations between the copied files?) that
> would muddy the waters even further.
>
> The further the APIs drift from each other, the more I'm prefering
> giving copy offloading its own clean syscall.  Even if the argument
> types superficially match the splice() ABI.
>
>> We can still fall back to the splice code as a fallback if no option
>> is provided as a last resort, but I think making the splice code handle
>> even more totally different cases is the wrong direction.
> I'm with you.  I'll have another version out sometime after the US
> holiday break.. say in a few weeks?

That'll work for me, I'll update my NFS code once your new patches are out.

Anna

>
> - z



Re: [RFC] extending splice for copy offloading

2013-12-18 Thread Zach Brown
On Wed, Dec 18, 2013 at 04:41:26AM -0800, Christoph Hellwig wrote:
> On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote:
> > When I first started on this stuff I followed the lead of previous
> > work and added a new syscall for the copy operation:
> > 
> > https://lkml.org/lkml/2013/5/14/618
> > 
> > Towards the end of that thread Eric Wong asked why we didn't just
> > extend splice.  I immediately replied with some dumb dismissive
> > answer.  Once I sat down and looked at it, though, it does make a
> > lot of sense.  So good job, Eric.  +10 Dummie points for me.
> > 
> > Extending splice avoids all the noise of adding a new syscall and
> > naturally falls back to buffered copying as that's what the direct
> > splice path does for sendfile() today.
> 
> Given the convolute mess that the splice code already is I'd rather
> prefer not overloading it even further.

I agree after trying to weave the copy offloading API into the splice
interface.  There are also weird cases that we haven't really discussed
so far (preserving unwritten allocations between the copied files?) that
would muddy the waters even further.

The further the APIs drift from each other, the more I prefer giving
copy offloading its own clean syscall, even if the argument types
superficially match the splice() ABI.

> We can still fall back to the splice code as a fallback if no option
> is provided as a last resort, but I think making the splice code handle
> even more totally different cases is the wrong direction.

I'm with you.  I'll have another version out sometime after the US
holiday break.. say in a few weeks?

- z


Re: [RFC] extending splice for copy offloading

2013-12-18 Thread Christoph Hellwig
On Wed, Sep 11, 2013 at 10:06:47AM -0700, Zach Brown wrote:
> When I first started on this stuff I followed the lead of previous
> work and added a new syscall for the copy operation:
> 
> https://lkml.org/lkml/2013/5/14/618
> 
> Towards the end of that thread Eric Wong asked why we didn't just
> extend splice.  I immediately replied with some dumb dismissive
> answer.  Once I sat down and looked at it, though, it does make a
> lot of sense.  So good job, Eric.  +10 Dummie points for me.
> 
> Extending splice avoids all the noise of adding a new syscall and
> naturally falls back to buffered copying as that's what the direct
> splice path does for sendfile() today.

Given the convoluted mess that the splice code already is, I'd rather
not overload it even further.

Instead I'd first split out the sendfile code path, which already works
differently in practice, and then generalize it into a copy-chunk
syscall using the same code path.

We can still fall back to the splice code as a fallback if no option
is provided as a last resort, but I think making the splice code handle
even more totally different cases is the wrong direction.
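
To make the shape of such a copy-chunk interface concrete, here is a minimal
user-space sketch (not from this thread; the name copy_chunk and its signature
are illustrative only) showing the buffered-copy fallback semantics being
discussed:

/* Hypothetical helper: copy up to "len" bytes from in_fd to out_fd at the
 * given offsets and return the number of bytes actually copied.  A real
 * copy-chunk syscall could offload this; the sketch only shows the calling
 * convention and the buffered fallback. */
#include <sys/types.h>
#include <unistd.h>

static ssize_t copy_chunk(int in_fd, off_t *in_off, int out_fd, off_t *out_off,
                          size_t len)
{
        char buf[128 * 1024];
        size_t done = 0;

        while (done < len) {
                size_t want = len - done;
                ssize_t rd, wr;

                if (want > sizeof(buf))
                        want = sizeof(buf);

                rd = pread(in_fd, buf, want, *in_off);
                if (rd == 0)
                        break;                  /* EOF */
                if (rd < 0)
                        return done ? (ssize_t)done : -1;

                wr = pwrite(out_fd, buf, rd, *out_off);
                if (wr < 0)
                        return done ? (ssize_t)done : -1;

                *in_off += wr;
                *out_off += wr;
                done += wr;

                if (wr < rd)                    /* short write: report partial count */
                        break;
        }
        return done;
}

A real syscall would do the chunking in the kernel; the point of the sketch is
only the calling convention: offsets passed by pointer and a (possibly partial)
byte count returned.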



Re: [RFC] extending splice for copy offloading

2013-10-06 Thread Rob Landley

On 09/26/2013 01:06:41 PM, Miklos Szeredi wrote:
On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields wrote:

> On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
>> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown wrote:
>> >> A client-side copy will be slower, but I guess it does have the
>> >> advantage that the application can track progress to some degree, and
>> >> abort it fairly quickly without leaving the file in a totally undefined
>> >> state--and both might be useful if the copy's not a simple constant-time
>> >> operation.
>> >
>> > I suppose, but can't the app achieve a nice middle ground by copying the
>> > file in smaller syscalls?  Avoid bulk data motion back to the client,
>> > but still get notification every, I dunno, few hundred meg?
>>
>> Yes.  And if "cp" could just be switched from a read+write syscall
>> pair to a single splice syscall using the same buffer size.
>
> Will the various magic fs-specific copy operations become inefficient
> when the range copied is too small?

We could treat splice-copy operations just like write operations (can
be buffered, coalesced, synced).

But I'm not sure it's worth the effort; 99% of the use of this
interface will be copying whole files.


My "patch" implementation (in busybox and toybox) hits a point where it  
wants to copy the rest of the file, once there are no more hunks to  
apply. This is not copying a whole file. A similar thing happens with  
tail when you use the +N syntax to skip start instead of end lines. I  
can see sed doing a similar thing when told to operate on line ranges...


Not sure your 99% holds up here.

Rob


Re: [RFC] extending splice for copy offloading

2013-10-02 Thread David Lang

On Wed, 2 Oct 2013, Jan Kara wrote:

> On Tue 01-10-13 12:58:17, Zach Brown wrote:
>>> - app calls splice(from, 0, to, 0, SIZE_MAX)
>>>  1) VFS calls ->direct_splice(from, 0, to, 0, SIZE_MAX)
>>> 1.a) fs reflinks the whole file in a jiffy and returns the size of the file
>>> 1 b) fs does copy offload of, say, 64MB and returns 64M
>>>  2) VFS does page copy of, say, 1MB and returns 1MB
>>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
>>
>> (It's not SIZE_MAX.  It's MAX_RW_COUNT.  INT_MAX with some
>> PAGE_CACHE_SIZE rounding noise.  For fear of weird corners of fs code
>> paths that still use int, one assumes.)
>>
>>> The point is: the app is always doing the same (incrementing offset
>>> with the return value from splice) and the kernel can decide what is
>>> the best size it can service within a single uninterruptible syscall.
>>>
>>> Wouldn't that work?
>>
>> It seems like it should, if people are willing to allow splice() to
>> return partial counts.  Quite a lot of IO syscalls technically do return
>> partial counts today if you try to write > MAX_RW_COUNT :).
>
> Yes. Also POSIX says that application must handle such case for read &
> write. But in practice programmers are lazy.
>
>> But returning partial counts on the order of a handful of megs that the
>> file systems make up as the point of diminishing returns is another
>> thing entirely.  I can imagine people being anxious about that.
>>
>> I guess we'll find out!
>
> Return 4 KB once in a while to screw up buggy applications from the
> start :-p

or at least have a debugging option early on that does this so people can
use it to find such buggy apps.


David Lang


Re: [RFC] extending splice for copy offloading

2013-10-02 Thread Jan Kara
On Tue 01-10-13 12:58:17, Zach Brown wrote:
> > - app calls splice(from, 0, to, 0, SIZE_MAX)
> >  1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
> > 1.a) fs reflinks the whole file in a jiffy and returns the size of the 
> > file
> > 1 b) fs does copy offload of, say, 64MB and returns 64M
> >  2) VFS does page copy of, say, 1MB and returns 1MB
> > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
> 
> (It's not SIZE_MAX.  It's MAX_RW_COUNT.  INT_MAX with some
> PAGE_CACHE_SIZE rounding noise.  For fear of weird corners of fs code
> paths that still use int, one assumes.)
> 
> > The point is: the app is always doing the same (incrementing offset
> > with the return value from splice) and the kernel can decide what is
> > the best size it can service within a single uninterruptible syscall.
> > 
> > Wouldn't that work?
> 
> It seems like it should, if people are willing to allow splice() to
> return partial counts.  Quite a lot of IO syscalls technically do return
> partial counts today if you try to write > MAX_RW_COUNT :).
  Yes. Also POSIX says that application must handle such case for read &
write. But in practice programmers are lazy.

> But returning partial counts on the order of a handful of megs that the
> file systems make up as the point of diminishing returns is another
> thing entirely.  I can imagine people being anxious about that.
> 
> I guess we'll find out! 
  Return 4 KB once in a while to screw up buggy applications from the
start :-p

Honza
-- 
Jan Kara 
SUSE Labs, CR
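
As an aside, the partial-count handling Jan alludes to looks roughly like the
following in a careful application today (a minimal sketch, not from the
thread); a splice()-based copy loop would have to do the same if partial
counts are allowed:

/* Minimal sketch of handling short writes, as POSIX already requires for
 * write(). */
#include <unistd.h>
#include <errno.h>

static int write_all(int fd, const char *buf, size_t len)
{
        while (len > 0) {
                ssize_t n = write(fd, buf, len);

                if (n < 0) {
                        if (errno == EINTR)
                                continue;       /* interrupted: retry */
                        return -1;              /* real error */
                }
                if (n == 0) {
                        errno = EIO;            /* should not happen on files */
                        return -1;
                }
                buf += n;                       /* short write: advance and retry */
                len -= n;
        }
        return 0;
}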


Re: [RFC] extending splice for copy offloading

2013-10-01 Thread Zach Brown
> - app calls splice(from, 0, to, 0, SIZE_MAX)
>  1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
> 1.a) fs reflinks the whole file in a jiffy and returns the size of the 
> file
> 1 b) fs does copy offload of, say, 64MB and returns 64M
>  2) VFS does page copy of, say, 1MB and returns 1MB
> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset

(It's not SIZE_MAX.  It's MAX_RW_COUNT.  INT_MAX with some
PAGE_CACHE_SIZE rounding noise.  For fear of weird corners of fs code
paths that still use int, one assumes.)

> The point is: the app is always doing the same (incrementing offset
> with the return value from splice) and the kernel can decide what is
> the best size it can service within a single uninterruptible syscall.
> 
> Wouldn't that work?

It seems like it should, if people are willing to allow splice() to
return partial counts.  Quite a lot of IO syscalls technically do return
partial counts today if you try to write > MAX_RW_COUNT :).

But returning partial counts on the order of a handful of megs that the
file systems make up as the point of diminishing returns is another
thing entirely.  I can imagine people being anxious about that.

I guess we'll find out! 

- z
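
For reference, the cap Zach describes can be reproduced in a few lines; the
4096-byte PAGE_CACHE_SIZE below is an assumption (it is architecture
dependent), and the rounding mirrors his description rather than quoting the
kernel source:

/* Rough user-space illustration of the per-syscall I/O cap: INT_MAX rounded
 * down to the page cache size. */
#include <limits.h>
#include <stdio.h>

#define PAGE_CACHE_SIZE 4096UL
#define MAX_RW_COUNT    ((unsigned long)INT_MAX & ~(PAGE_CACHE_SIZE - 1))

int main(void)
{
        printf("MAX_RW_COUNT = %lu bytes (about %lu MiB)\n",
               MAX_RW_COUNT, MAX_RW_COUNT >> 20);
        return 0;
}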


Re: [RFC] extending splice for copy offloading

2013-10-01 Thread J. Bruce Fields
On Mon, Sep 30, 2013 at 05:46:38PM +0200, Miklos Szeredi wrote:
> On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler  wrote:
> > The way the array based offload (and some software side reflink works) is
> > not a byte by byte copy. We cannot assume that a valid count can be returned
> > or that such a count would be an indication of a sequential segment of good
> > data.  The whole thing would normally have to be reissued.
> >
> > To make that a true assumption, you would have to mandate that in each of
> > the specifications (and sw targets)...
> 
> You're missing my point.
> 
>  - user issues SIZE_MAX splice request
>  - fs issues *64M* (or whatever) request to offload
>  - when that completes *fully* then we return 64M to userspace
>  - if it completes partially, then we return an error to userspace
> 
> Again, wouldn't that work?

So if implementations fall into two categories:

- "instant": latency is on the order of a single IO.

- "slow": latency is seconds or minutes, but still faster than a
  normal copy.  (See Anna's NFS server implementation that does
  an ordinary copy internally.)

Then to me it still seems simplest to design only for the "instant"
case.

But if we want to add some minimal help for the "slow" case then
Miklos's proposal looks fine: the application doesn't have to know which
case it's dealing with ahead of time--it always just submits the largest
range it knows about--but a "slow" implementation isn't forced to leave
the application waiting in one syscall for minutes with no indication
what's going on.

--b.


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 16:08 -0400, Ric Wheeler wrote:
> On 09/30/2013 04:00 PM, Bernd Schubert wrote:
> > pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
> > interface? And userspace needs to address all of them differently? 
> 
> The NFS and SCSI groups have each defined a standard which Zach's proposal 
> abstracts into a common user API.
> 
> Distributed file systems tend to be rather unique and do not have similar 
> standard bodies, but a lot of them could hide server specific implementations 
> under the current proposed interfaces.
> 
> What is not a good idea is to drag out the core, simple copy offload 
> discussion 
> for another 5 years to pull in every odd use case :)

Agreed. The whole idea of a common system call interface should be to
allow us to abstract away the underlying storage and filesystem
architectures. If filesystem developers also want a way to expose that
underlying architecture to applications in order to enable further
optimisations, then that belongs in a separate discussion.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 22:00 +0200, Bernd Schubert wrote:
> On 09/30/2013 09:34 PM, Myklebust, Trond wrote:
> > On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:
> >> On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
> >>> On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
>  On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> > On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> >> It would be nice if there would be way if the file system would get a
> >> hint that the target file is supposed to be copy of another file. That
> >> way distributed file systems could also create the target-file with the
> >> correct meta-information (same storage targets as in-file has).
> >> Well, if we cannot agree on that, file system with a custom protocol at
> >> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
> >> sure if this would work for pNFS, though.
> >
> > splice() does not create new files. What you appear to be asking for
> > lies way outside the scope of that system call interface.
> >
> 
>  Sorry I know, definitely outside the scope of splice, but in the context
>  of offloaded file copies. So the question is, what is the best way to
>  address/discuss that?
> >>>
> >>> Why does it need to be addressed in the first place?
> >>
> >> An offloaded copy is still not efficient if different storage
> >> servers/targets used by from-file and to-file.
> >
> > So?
> 
> mds1: orig-file
> oss1/target1: orig-chunk1
> 
> mds1: target-file
> ossN/targetN: target-chunk1
> 
> clientN: Performs the copy
> 
> Ideally, orig-chunk1 and target-chunk1 are on the same server and same 
> target. Copy offload then even could done from the underlying fs, 
> similiar as local splice.
> If different ossN servers are used copies still have to be done over 
> network by these storage servers, although the client only would need to 
> initiate the copy. Still faster, but also not ideal.
> 
> >
> >>>
> >>> What is preventing an application from retrieving and setting this
> >>> information using standard libc functions such as fstat()+open(), and
> >>> supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
> >>> where appropriate?
> >>>
> >>
> >> At a minimum this requires network and metadata overhead. And while I'm
> >> working on FhGFS now, I still wonder what other file system need to do -
> >> for example Lustre pre-allocates storage-target files on creating a
> >> file, so file layout changes mean even more overhead there.
> >
> > The problem you are describing is limited to a narrow set of storage
> > architectures. If copy offload using splice() doesn't make sense for
> > those architectures, then don't implement it for them.
> 
> But it _does_ make sense. The file system just needs a hint that a 
> splice copy is going to come up.

Just wait for the splice() system call. How is this any different from
write()?

> > You might be able to provide ioctls() to do these special hinted file
> > creations for those filesystems that need it, but the vast majority
> > don't, and you shouldn't enforce it on them.
> 
> And exactly for that we need a standard - it does not make sense if each 
> and every distributed file system implements its own 
> ioctl/libattr/libacl interface for that.
> 
> >
> >> Anyway, if we could agree on to use libattr or libacl to teach the file
> >> system about the upcoming splice call I would be fine.
> >
> > libattr and libacl are generic libraries that exist to manipulate xattrs
> > and acls. They do not need to contain Lustre-specific code.
> >
> 
> pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
> interface? And userspace needs to address all of them differently?
>
> I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry, 
> didn't find a better name yet), which would take in-file-path and 
> out-file-path and allow the file system to create out-file-path with the 
> same meta-layout as in-file-path. And it would need some flags, such as 
> AUTO (file system decides if it makes sense to do a local copy) and 
> FORCE (always try a local copy).

splice() is not a whole-file copy operation; it's a byte range copy. How
does the above help other than in the whole-file case?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 04:00 PM, Bernd Schubert wrote:
pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
interface? And userspace needs to address all of them differently? 


The NFS and SCSI groups have each defined a standard which Zach's proposal 
abstracts into a common user API.


Distributed file systems tend to be rather unique and do not have similar 
standard bodies, but a lot of them could hide server specific implementations 
under the current proposed interfaces.


What is not a good idea is to drag out the core, simple copy offload discussion 
for another 5 years to pull in every odd use case :)


ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 09:34 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:

On 09/30/2013 08:02 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there would be way if the file system would get a
hint that the target file is supposed to be copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, file system with a custom protocol at
least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?


Why does it need to be addressed in the first place?


An offloaded copy is still not efficient if different storage
servers/targets used by from-file and to-file.


So?


mds1: orig-file
oss1/target1: orig-chunk1

mds1: target-file
ossN/targetN: target-chunk1

clientN: Performs the copy

Ideally, orig-chunk1 and target-chunk1 are on the same server and same 
target. Copy offload could then even be done by the underlying fs, 
similar to a local splice.
If different ossN servers are used, copies still have to be done over the 
network by these storage servers, although the client would only need to 
initiate the copy. Still faster, but also not ideal.






What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?



At a minimum this requires network and metadata overhead. And while I'm
working on FhGFS now, I still wonder what other file system need to do -
for example Lustre pre-allocates storage-target files on creating a
file, so file layout changes mean even more overhead there.


The problem you are describing is limited to a narrow set of storage
architectures. If copy offload using splice() doesn't make sense for
those architectures, then don't implement it for them.


But it _does_ make sense. The file system just needs a hint that a 
splice copy is going to come up.



You might be able to provide ioctls() to do these special hinted file
creations for those filesystems that need it, but the vast majority
don't, and you shouldn't enforce it on them.


And exactly for that we need a standard - it does not make sense if each 
and every distributed file system implements its own 
ioctl/libattr/libacl interface for that.





Anyway, if we could agree on to use libattr or libacl to teach the file
system about the upcoming splice call I would be fine.


libattr and libacl are generic libraries that exist to manipulate xattrs
and acls. They do not need to contain Lustre-specific code.



pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
interface? And userspace needs to address all of them differently?


I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry, 
didn't find a better name yet), which would take in-file-path and 
out-file-path and allow the file system to create out-file-path with the 
same meta-layout as in-file-path. And it would need some flags, such as 
AUTO (file system decides if it makes sense to do a local copy) and 
FORCE (always try a local copy).



Thanks,
Bernd
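
To make the proposal concrete, a purely hypothetical sketch follows; no
SPLICE_META_COPY ioctl exists, and the request number, structure layout, and
flag names are invented here for illustration only:

/* Hypothetical hinting interface along the lines Bernd describes above.
 * Nothing here exists in the kernel; all names and numbers are invented. */
#include <linux/ioctl.h>
#include <linux/types.h>
#include <linux/limits.h>

/* File system decides whether a co-located copy is worthwhile. */
#define SPLICE_META_AUTO        0x01
/* Always try to place the target on the same storage targets. */
#define SPLICE_META_FORCE       0x02

struct splice_meta_copy {
        char    in_path[PATH_MAX];      /* existing source file */
        char    out_path[PATH_MAX];     /* target to create with same layout */
        __u32   flags;                  /* SPLICE_META_AUTO or SPLICE_META_FORCE */
};

#define SPLICE_META_COPY        _IOW('f', 0xa5, struct splice_meta_copy)

The point of the sketch is only that the hint is given before the data copy,
so the file system can place the target file on the same storage targets as
the source.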


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:
> On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
> > On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
> >> On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> >>> On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
>  It would be nice if there would be way if the file system would get a
>  hint that the target file is supposed to be copy of another file. That
>  way distributed file systems could also create the target-file with the
>  correct meta-information (same storage targets as in-file has).
>  Well, if we cannot agree on that, file system with a custom protocol at
>  least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
>  sure if this would work for pNFS, though.
> >>>
> >>> splice() does not create new files. What you appear to be asking for
> >>> lies way outside the scope of that system call interface.
> >>>
> >>
> >> Sorry I know, definitely outside the scope of splice, but in the context
> >> of offloaded file copies. So the question is, what is the best way to
> >> address/discuss that?
> >
> > Why does it need to be addressed in the first place?
> 
> An offloaded copy is still not efficient if different storage 
> servers/targets used by from-file and to-file.

So? 

> >
> > What is preventing an application from retrieving and setting this
> > information using standard libc functions such as fstat()+open(), and
> > supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
> > where appropriate?
> >
> 
> At a minimum this requires network and metadata overhead. And while I'm 
> working on FhGFS now, I still wonder what other file system need to do - 
> for example Lustre pre-allocates storage-target files on creating a 
> file, so file layout changes mean even more overhead there.

The problem you are describing is limited to a narrow set of storage
architectures. If copy offload using splice() doesn't make sense for
those architectures, then don't implement it for them.
You might be able to provide ioctls() to do these special hinted file
creations for those filesystems that need it, but the vast majority
don't, and you shouldn't enforce it on them.

> Anyway, if we could agree on to use libattr or libacl to teach the file 
> system about the upcoming splice call I would be fine.

libattr and libacl are generic libraries that exist to manipulate xattrs
and acls. They do not need to contain Lustre-specific code.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com

Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 08:02 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there would be way if the file system would get a
hint that the target file is supposed to be copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, file system with a custom protocol at
least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?


Why does it need to be addressed in the first place?


An offloaded copy is still not efficient if different storage 
servers/targets used by from-file and to-file.




What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?



At a minimum this requires network and metadata overhead. And while I'm 
working on FhGFS now, I still wonder what other file systems need to do - 
for example Lustre pre-allocates storage-target files on creating a 
file, so file layout changes mean even more overhead there.
Anyway, if we could agree on using libattr or libacl to teach the file 
system about the upcoming splice call, I would be fine. Metadata overhead 
is probably negligible for large files.





Thanks,
Bernd



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
> On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
> > On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> >> It would be nice if there would be way if the file system would get a
> >> hint that the target file is supposed to be copy of another file. That
> >> way distributed file systems could also create the target-file with the
> >> correct meta-information (same storage targets as in-file has).
> >> Well, if we cannot agree on that, file system with a custom protocol at
> >> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
> >> sure if this would work for pNFS, though.
> >
> > splice() does not create new files. What you appear to be asking for
> > lies way outside the scope of that system call interface.
> >
> 
> Sorry I know, definitely outside the scope of splice, but in the context 
> of offloaded file copies. So the question is, what is the best way to 
> address/discuss that?

Why does it need to be addressed in the first place?

What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there would be way if the file system would get a
hint that the target file is supposed to be copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, file system with a custom protocol at
least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context 
of offloaded file copies. So the question is, what is the best way to 
address/discuss that?


Thanks,
Bernd


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
> It would be nice if there would be way if the file system would get a 
> hint that the target file is supposed to be copy of another file. That 
> way distributed file systems could also create the target-file with the 
> correct meta-information (same storage targets as in-file has).
> Well, if we cannot agree on that, file system with a custom protocol at 
> least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not 
> sure if this would work for pNFS, though.

splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com

Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 06:31 PM, Miklos Szeredi wrote:

Here's an example "cp" app using direct splice (and without fallback to
non-splice, which is obviously required unless the kernel is known to support
direct splice).

Untested, but trivial enough...

The important part is, I think, that the app must not assume that the kernel can
complete the request in one go.

Thanks,
Miklos


#define _GNU_SOURCE

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <err.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT (0x10)  /* neither splice fd is a pipe */
#endif

int main(int argc, char *argv[])
{
        struct stat stbuf;
        int in_fd;
        int out_fd;
        int res;
        off_t off = 0;

        if (argc != 3)
                errx(1, "usage: %s from to", argv[0]);

        in_fd = open(argv[1], O_RDONLY);
        if (in_fd == -1)
                err(1, "opening %s", argv[1]);

        res = fstat(in_fd, &stbuf);
        if (res == -1)
                err(1, "fstat");

        out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
        if (out_fd == -1)
                err(1, "opening %s", argv[2]);

        do {
                off_t in_off = off, out_off = off;
                ssize_t rres;

                /* Ask for everything; the kernel may service only part of it. */
                rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
                              SPLICE_F_DIRECT);
                if (rres == -1)
                        err(1, "splice");
                if (rres == 0)
                        break;

                /* Advance by however much the kernel actually copied. */
                off += rres;
        } while (off < stbuf.st_size);

        res = close(in_fd);
        if (res == -1)
                err(1, "close");

        res = fsync(out_fd);
        if (res == -1)
                err(1, "fsync");

        res = close(out_fd);
        if (res == -1)
                err(1, "close");

        return 0;
}



It would be nice if there were a way for the file system to get a 
hint that the target file is supposed to be a copy of another file. That 
way distributed file systems could also create the target file with the 
correct meta-information (same storage targets as the in-file has).
Well, if we cannot agree on that, a file system with a custom protocol can 
at least detect a copy from 0 to SSIZE_MAX and then reset the metadata. I'm 
not sure if this would work for pNFS, though.



Bernd





Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
Here's an example "cp" app using direct splice (and without fallback to
non-splice, which is obviously required unless the kernel is known to support
direct splice).

Untested, but trivial enough...

The important part is, I think, that the app must not assume that the kernel can
complete the request in one go.

Thanks,
Miklos


#define _GNU_SOURCE

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <err.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT (0x10)  /* neither splice fd is a pipe */
#endif

int main(int argc, char *argv[])
{
        struct stat stbuf;
        int in_fd;
        int out_fd;
        int res;
        off_t off = 0;

        if (argc != 3)
                errx(1, "usage: %s from to", argv[0]);

        in_fd = open(argv[1], O_RDONLY);
        if (in_fd == -1)
                err(1, "opening %s", argv[1]);

        res = fstat(in_fd, &stbuf);
        if (res == -1)
                err(1, "fstat");

        out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
        if (out_fd == -1)
                err(1, "opening %s", argv[2]);

        do {
                off_t in_off = off, out_off = off;
                ssize_t rres;

                /* Ask for everything; the kernel may service only part of it. */
                rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
                              SPLICE_F_DIRECT);
                if (rres == -1)
                        err(1, "splice");
                if (rres == 0)
                        break;

                /* Advance by however much the kernel actually copied. */
                off += rres;
        } while (off < stbuf.st_size);

        res = close(in_fd);
        if (res == -1)
                err(1, "close");

        res = fsync(out_fd);
        if (res == -1)
                err(1, "fsync");

        res = close(out_fd);
        if (res == -1)
                err(1, "close");

        return 0;
}


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
On Mon, Sep 30, 2013 at 4:49 PM, Ric Wheeler  wrote:
> On 09/30/2013 10:46 AM, Miklos Szeredi wrote:
>>
>> On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler  wrote:
>>>
>>> The way the array based offload (and some software side reflink works) is
>>> not a byte by byte copy. We cannot assume that a valid count can be
>>> returned
>>> or that such a count would be an indication of a sequential segment of
>>> good
>>> data.  The whole thing would normally have to be reissued.
>>>
>>> To make that a true assumption, you would have to mandate that in each of
>>> the specifications (and sw targets)...
>>
>> You're missing my point.
>>
>>   - user issues SIZE_MAX splice request
>>   - fs issues *64M* (or whatever) request to offload
>>   - when that completes *fully* then we return 64M to userspace
>>   - if it completes partially, then we return an error to userspace
>>
>> Again, wouldn't that work?
>>
>> Thanks,
>> Miklos
>
>
> Yes, if you send a copy offload command and it works, you can assume that it
> worked fully. It would be pretty interesting if that were not true :)
>
> If it fails, we cannot assume anything about partial completion.

Sure, that was my understanding from the start.  Maybe I wasn't
precise enough in my explanation.

Thanks,
Miklos


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:46 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler  wrote:

The way the array based offload (and some software side reflink works) is
not a byte by byte copy. We cannot assume that a valid count can be returned
or that such a count would be an indication of a sequential segment of good
data.  The whole thing would normally have to be reissued.

To make that a true assumption, you would have to mandate that in each of
the specifications (and sw targets)...

You're missing my point.

  - user issues SIZE_MAX splice request
  - fs issues *64M* (or whatever) request to offload
  - when that completes *fully* then we return 64M to userspace
  - if it completes partially, then we return an error to userspace

Again, wouldn't that work?

Thanks,
Miklos


Yes, if you send a copy offload command and it works, you can assume that it 
worked fully. It would be pretty interesting if that were not true :)


If it fails, we cannot assume anything about partial completion.

Ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
On Mon, Sep 30, 2013 at 4:41 PM, Ric Wheeler  wrote:
> The way the array based offload (and some software side reflink works) is
> not a byte by byte copy. We cannot assume that a valid count can be returned
> or that such a count would be an indication of a sequential segment of good
> data.  The whole thing would normally have to be reissued.
>
> To make that a true assumption, you would have to mandate that in each of
> the specifications (and sw targets)...

You're missing my point.

 - user issues SIZE_MAX splice request
 - fs issues *64M* (or whatever) request to offload
 - when that completes *fully* then we return 64M to userspace
 - if it completes partially, then we return an error to userspace

Again, wouldn't that work?

Thanks,
Miklos


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:38 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler  wrote:

On 09/30/2013 10:24 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler  wrote:

On 09/30/2013 10:51 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields 
wrote:

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be
restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.


The app really doesn't want to care about that.  And it doesn't want
to care about restartability, etc..  It's something the *kernel* has
to care about.   You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.


You cannot rely on a short count. That implies that an offloaded copy
starts
at byte 0 and the short count first bytes are all valid.

Huh?

- app calls splice(from, 0, to, 0, SIZE_MAX)
   1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
  1.a) fs reflinks the whole file in a jiffy and returns the size of
the file
  1 b) fs does copy offload of, say, 64MB and returns 64M
   2) VFS does page copy of, say, 1MB and returns 1MB
- app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
...

The point is: the app is always doing the same (incrementing offset
with the return value from splice) and the kernel can decide what is
the best size it can service within a single uninterruptible syscall.

Wouldn't that work?


No.

Keep in mind that the offload operation in (1) might fail partially. The
target file (the copy) is allocated, the question is what ranges have valid
data.

You are talking about case 1.a, right?  So if the offload copy 0-64MB
fails partially, we return failure from splice, yet some of the copy
did succeed.  Is that the problem?  Why?

Thanks,
Miklos


The way the array based offload (and some software side reflink works) is not a 
byte by byte copy. We cannot assume that a valid count can be returned or that 
such a count would be an indication of a sequential segment of good data.  The 
whole thing would normally have to be reissued.


To make that a true assumption, you would have to mandate that in each of the 
specifications (and sw targets)...


ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
On Mon, Sep 30, 2013 at 4:28 PM, Ric Wheeler  wrote:
> On 09/30/2013 10:24 AM, Miklos Szeredi wrote:
>>
>> On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler  wrote:
>>>
>>> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:

 On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields 
 wrote:
>>
>> My other worry is about interruptibility/restartability.  Ideas?
>>
>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>> Can the page cache copy be made restartable?   Or should splice() be
>> allowed to return a short count?  What happens on (non-reflink) remote
>> copies and huge request sizes?
>
> If I were writing an application that required copies to be
> restartable,
> I'd probably use the largest possible range in the reflink case but
> break the copy into smaller chunks in the splice case.
>
 The app really doesn't want to care about that.  And it doesn't want
 to care about restartability, etc..  It's something the *kernel* has
 to care about.   You just can't have uninterruptible syscalls that
 sleep for a "long" time, otherwise first you'll just have annoyed
 users pressing ^C in vain; then, if the sleep is even longer, warnings
 about task sleeping too long.

 One idea is letting splice() return a short count, and so the app can
 safely issue SIZE_MAX requests and the kernel can decide if it can
 copy the whole file in one go or if it wants to do it in smaller
 chunks.

>>> You cannot rely on a short count. That implies that an offloaded copy
>>> starts
>>> at byte 0 and the short count first bytes are all valid.
>>
>> Huh?
>>
>> - app calls splice(from, 0, to, 0, SIZE_MAX)
>>   1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
>>  1.a) fs reflinks the whole file in a jiffy and returns the size of
>> the file
>>  1 b) fs does copy offload of, say, 64MB and returns 64M
>>   2) VFS does page copy of, say, 1MB and returns 1MB
>> - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
>> ...
>>
>> The point is: the app is always doing the same (incrementing offset
>> with the return value from splice) and the kernel can decide what is
>> the best size it can service within a single uninterruptible syscall.
>>
>> Wouldn't that work?
>>

>
> No.
>
> Keep in mind that the offload operation in (1) might fail partially. The
> target file (the copy) is allocated, the question is what ranges have valid
> data.

You are talking about case 1.a, right?  So if the offload copy 0-64MB
fails partially, we return failure from splice, yet some of the copy
did succeed.  Is that the problem?  Why?

Thanks,
Miklos


RE: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
> -Original Message-
> From: Ric Wheeler [mailto:rwhee...@redhat.com]
> Sent: Monday, September 30, 2013 10:29 AM
> To: Miklos Szeredi
> Cc: J. Bruce Fields; Myklebust, Trond; Zach Brown; Anna Schumaker; Kernel
> Mailing List; Linux-Fsdevel; linux-...@vger.kernel.org; Schumaker, Bryan;
> Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong
> Subject: Re: [RFC] extending splice for copy offloading
> 
> On 09/30/2013 10:24 AM, Miklos Szeredi wrote:
> > On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler 
> wrote:
> >> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
> >>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields
> >>> 
> >>> wrote:
> >>>>> My other worry is about interruptibility/restartability.  Ideas?
> >>>>>
> >>>>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
> >>>>> Can the page cache copy be made restartable?   Or should splice() be
> >>>>> allowed to return a short count?  What happens on (non-reflink)
> >>>>> remote copies and huge request sizes?
> >>>> If I were writing an application that required copies to be
> >>>> restartable, I'd probably use the largest possible range in the
> >>>> reflink case but break the copy into smaller chunks in the splice case.
> >>>>
> >>> The app really doesn't want to care about that.  And it doesn't want
> >>> to care about restartability, etc..  It's something the *kernel* has
> >>> to care about.   You just can't have uninterruptible syscalls that
> >>> sleep for a "long" time, otherwise first you'll just have annoyed
> >>> users pressing ^C in vain; then, if the sleep is even longer,
> >>> warnings about task sleeping too long.
> >>>
> >>> One idea is letting splice() return a short count, and so the app
> >>> can safely issue SIZE_MAX requests and the kernel can decide if it
> >>> can copy the whole file in one go or if it wants to do it in smaller
> >>> chunks.
> >>>
> >> You cannot rely on a short count. That implies that an offloaded copy
> >> starts at byte 0 and the short count first bytes are all valid.
> > Huh?
> >
> > - app calls splice(from, 0, to, 0, SIZE_MAX)
> >   1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
> >  1.a) fs reflinks the whole file in a jiffy and returns the size of the 
> > file
> >  1 b) fs does copy offload of, say, 64MB and returns 64M
> >   2) VFS does page copy of, say, 1MB and returns 1MB
> > - app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
> > ...
> >
> > The point is: the app is always doing the same (incrementing offset
> > with the return value from splice) and the kernel can decide what is
> > the best size it can service within a single uninterruptible syscall.
> >
> > Wouldn't that work?
> >
> > Thanks,
> > Miklos
> 
> No.
> 
> Keep in mind that the offload operation in (1) might fail partially. The 
> target
> file (the copy) is allocated, the question is what ranges have valid data.
> 
> I don't see that (2) is interesting or really needed to be done in the kernel.
> If nothing else, it tends to confuse the discussion
> 

Anna's figures, that were presented at Plumber's, show that (2) is still worth 
doing on the _server_ for the case of NFS.

Cheers
  Trond


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:24 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler  wrote:

On 09/30/2013 10:51 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields 
wrote:

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.


The app really doesn't want to care about that.  And it doesn't want
to care about restartability, etc..  It's something the *kernel* has
to care about.   You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.


You cannot rely on a short count. That implies that an offloaded copy starts
at byte 0 and the short count first bytes are all valid.

Huh?

- app calls splice(from, 0, to, 0, SIZE_MAX)
  1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
 1.a) fs reflinks the whole file in a jiffy and returns the size of the file
 1 b) fs does copy offload of, say, 64MB and returns 64M
  2) VFS does page copy of, say, 1MB and returns 1MB
- app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
...

The point is: the app is always doing the same (incrementing offset
with the return value from splice) and the kernel can decide what is
the best size it can service within a single uninterruptible syscall.

Wouldn't that work?

Thanks,
Miklos


No.

Keep in mind that the offload operation in (1) might fail partially. The target 
file (the copy) is allocated, the question is what ranges have valid data.


I don't see that (2) is interesting or really needed to be done in the kernel. 
If nothing else, it tends to confuse the discussion


ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
On Mon, Sep 30, 2013 at 4:52 PM, Ric Wheeler  wrote:
> On 09/30/2013 10:51 AM, Miklos Szeredi wrote:
>>
>> On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields 
>> wrote:

 My other worry is about interruptibility/restartability.  Ideas?

 What happens on splice(from, to, 4G) and it's a non-reflink copy?
 Can the page cache copy be made restartable?   Or should splice() be
 allowed to return a short count?  What happens on (non-reflink) remote
 copies and huge request sizes?
>>>
>>> If I were writing an application that required copies to be restartable,
>>> I'd probably use the largest possible range in the reflink case but
>>> break the copy into smaller chunks in the splice case.
>>>
>> The app really doesn't want to care about that.  And it doesn't want
>> to care about restartability, etc..  It's something the *kernel* has
>> to care about.   You just can't have uninterruptible syscalls that
>> sleep for a "long" time, otherwise first you'll just have annoyed
>> users pressing ^C in vain; then, if the sleep is even longer, warnings
>> about task sleeping too long.
>>
>> One idea is letting splice() return a short count, and so the app can
>> safely issue SIZE_MAX requests and the kernel can decide if it can
>> copy the whole file in one go or if it wants to do it in smaller
>> chunks.
>>

>
> You cannot rely on a short count. That implies that an offloaded copy starts
> at byte 0 and the short count first bytes are all valid.

Huh?

- app calls splice(from, 0, to, 0, SIZE_MAX)
 1) VFS calls ->direct_splice(from, 0,  to, 0, SIZE_MAX)
1.a) fs reflinks the whole file in a jiffy and returns the size of the file
1 b) fs does copy offload of, say, 64MB and returns 64M
 2) VFS does page copy of, say, 1MB and returns 1MB
- app calls splice(from, X, to, X, SIZE_MAX) where X is the new offset
...

The point is: the app is always doing the same (incrementing offset
with the return value from splice) and the kernel can decide what is
the best size it can service within a single uninterruptible syscall.

Wouldn't that work?

Thanks,
Miklos


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:51 AM, Miklos Szeredi wrote:

On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields  wrote:

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.


The app really doesn't want to care about that.  And it doesn't want
to care about restartability, etc..  It's something the *kernel* has
to care about.   You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.

Thanks,
Miklos


You cannot rely on a short count. That implies that an offloaded copy starts at 
byte 0 and the short count first bytes are all valid.


I don't believe that is in fact required by all (any?) versions of the spec :)

Best just to fail and restart the whole operation.

Ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
On Mon, Sep 30, 2013 at 4:34 PM, J. Bruce Fields  wrote:
>> My other worry is about interruptibility/restartability.  Ideas?
>>
>> What happens on splice(from, to, 4G) and it's a non-reflink copy?
>> Can the page cache copy be made restartable?   Or should splice() be
>> allowed to return a short count?  What happens on (non-reflink) remote
>> copies and huge request sizes?
>
> If I were writing an application that required copies to be restartable,
> I'd probably use the largest possible range in the reflink case but
> break the copy into smaller chunks in the splice case.
>

The app really doesn't want to care about that.  And it doesn't want
to care about restartability, etc..  It's something the *kernel* has
to care about.   You just can't have uninterruptible syscalls that
sleep for a "long" time, otherwise first you'll just have annoyed
users pressing ^C in vain; then, if the sleep is even longer, warnings
about task sleeping too long.

One idea is letting splice() return a short count, and so the app can
safely issue SIZE_MAX requests and the kernel can decide if it can
copy the whole file in one go or if it wants to do it in smaller
chunks.

Thanks,
Miklos


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 10:34 AM, J. Bruce Fields wrote:

On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote:

On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler  wrote:


I don't see the safety argument very compelling either.  There are real
semantic differences, however: ENOSPC on a write to a
(apparently) already allocated block.  That could be a bit unexpected.
Do we
need a fallocate extension to deal with shared blocks?

The above has been the case for all enterprise storage arrays ever since
the invention of snapshots. The NFSv4.2 spec does allow you to set a
per-file attribute that causes the storage server to always preallocate
enough buffers to guarantee that you can rewrite the entire file, however
the fact that we've lived without it for said 20 years leads me to believe
that demand for it is going to be limited. I haven't put it top of the list
of features we care to implement...

Cheers,
 Trond


I agree - this has been common behaviour for a very long time in the array
space. Even without an array,  this is the same as overwriting a block in
btrfs or any file system with a read-write LVM snapshot.

Okay, I'm convinced.

So I suggest

  - mount(..., MNT_REFLINK): *allow* splice to reflink.  If this is not
set, fall back to page cache copy.
  - splice(... SPLICE_REFLINK):  fail non-reflink copy.  With this app
can force reflink.

Both are trivial to implement and make sure that no backward
incompatibility surprises happen.

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.

For that reason I don't like the idea of a mount option--the choice is
something that the application probably wants to make (or at least to
know about).

The NFS COPY operation, as specified in current drafts, allows for
asynchronous copies but leaves the state of the file undefined in the
case of an aborted COPY.  I worry that agreeing on standard behavior in
the case of an abort might be difficult.

--b.


I think that this is still confusing - reflink and array copy offload should not 
be differentiated.  In effect, they should often be the same order of magnitude 
in performance and possibly even use the same or very similar techniques (just 
on different sides of the initiator/target transaction!).


It is much simpler to let the application fail if the offload (or reflink) is 
not supported and fall back to a traditional copy.  Then you always send the 
largest possible offload operation and do whatever you do now if that fails.


thanks!

Ric
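
A sketch of that pattern from the application side: issue the largest possible
offload request, and if it fails (or is not supported), restart the whole copy
with ordinary read/write.  SPLICE_F_DIRECT is the flag proposed in this thread,
not an existing interface, and the value used below is only the proposal's
placeholder:

#define _GNU_SOURCE
#include <fcntl.h>              /* splice() */
#include <unistd.h>             /* read(), write(), lseek(), ftruncate() */

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT (0x10)  /* proposed: neither splice fd is a pipe */
#endif

static int copy_file(int in_fd, int out_fd, off_t size)
{
        off_t in_off = 0, out_off = 0;
        ssize_t n;

        /* One big request: if the whole thing offloads, we are done. */
        n = splice(in_fd, &in_off, out_fd, &out_off, size, SPLICE_F_DIRECT);
        if (n == size)
                return 0;

        /* Offload failed or unsupported: restart the copy with plain I/O. */
        if (lseek(in_fd, 0, SEEK_SET) == -1 || ftruncate(out_fd, 0) == -1 ||
            lseek(out_fd, 0, SEEK_SET) == -1)
                return -1;

        for (;;) {
                char buf[64 * 1024];
                ssize_t r = read(in_fd, buf, sizeof(buf));
                ssize_t w = 0;

                if (r == 0)
                        return 0;       /* end of input */
                if (r < 0)
                        return -1;
                while (w < r) {
                        ssize_t m = write(out_fd, buf + w, r - w);

                        if (m < 0)
                                return -1;
                        w += m;
                }
        }
}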



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread J. Bruce Fields
On Mon, Sep 30, 2013 at 02:20:30PM +0200, Miklos Szeredi wrote:
> On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler  wrote:
> 
> >>> I don't see the safety argument very compelling either.  There are real
> >>> semantic differences, however: ENOSPC on a write to a
> >>> (apparently) already allocated block.  That could be a bit unexpected.
> >>> Do we
> >>> need a fallocate extension to deal with shared blocks?
> >>
> >> The above has been the case for all enterprise storage arrays ever since
> >> the invention of snapshots. The NFSv4.2 spec does allow you to set a
> >> per-file attribute that causes the storage server to always preallocate
> >> enough buffers to guarantee that you can rewrite the entire file, however
> >> the fact that we've lived without it for said 20 years leads me to believe
> >> that demand for it is going to be limited. I haven't put it top of the list
> >> of features we care to implement...
> >>
> >> Cheers,
> >> Trond
> >
> >
> > I agree - this has been common behaviour for a very long time in the array
> > space. Even without an array,  this is the same as overwriting a block in
> > btrfs or any file system with a read-write LVM snapshot.
> 
> Okay, I'm convinced.
> 
> So I suggest
> 
>  - mount(..., MNT_REFLINK): *allow* splice to reflink.  If this is not
> set, fall back to page cache copy.
>  - splice(... SPLICE_REFLINK):  fail non-reflink copy.  With this app
> can force reflink.
> 
> Both are trivial to implement and make sure that no backward
> incompatibility surprises happen.
> 
> My other worry is about interruptibility/restartability.  Ideas?
> 
> What happens on splice(from, to, 4G) and it's a non-reflink copy?
> Can the page cache copy be made restartable?   Or should splice() be
> allowed to return a short count?  What happens on (non-reflink) remote
> copies and huge request sizes?

If I were writing an application that required copies to be restartable,
I'd probably use the largest possible range in the reflink case but
break the copy into smaller chunks in the splice case.

For that reason I don't like the idea of a mount option--the choice is
something that the application probably wants to make (or at least to
know about).

The NFS COPY operation, as specified in current drafts, allows for
asynchronous copies but leaves the state of the file undefined in the
case of an aborted COPY.  I worry that agreeing on standard behavior in
the case of an abort might be difficult.

--b.


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
On Sat, Sep 28, 2013 at 11:20 PM, Ric Wheeler  wrote:

>>> I don't see the safety argument very compelling either.  There are real
>>> semantic differences, however: ENOSPC on a write to a
>>> (apparently) already allocated block.  That could be a bit unexpected.
>>> Do we
>>> need a fallocate extension to deal with shared blocks?
>>
>> The above has been the case for all enterprise storage arrays ever since
>> the invention of snapshots. The NFSv4.2 spec does allow you to set a
>> per-file attribute that causes the storage server to always preallocate
>> enough buffers to guarantee that you can rewrite the entire file, however
>> the fact that we've lived without it for said 20 years leads me to believe
>> that demand for it is going to be limited. I haven't put it top of the list
>> of features we care to implement...
>>
>> Cheers,
>> Trond
>
>
> I agree - this has been common behaviour for a very long time in the array
> space. Even without an array,  this is the same as overwriting a block in
> btrfs or any file system with a read-write LVM snapshot.

Okay, I'm convinced.

So I suggest

 - mount(..., MNT_REFLINK): *allow* splice to reflink.  If this is not
set, fall back to page cache copy.
 - splice(... SPLICE_REFLINK):  fail non-reflink copy.  With this app
can force reflink.

Both are trivial to implement and make sure that no backward
incompatibility surprises happen.

My other worry is about interruptibility/restartability.  Ideas?

What happens on splice(from, to, 4G) and it's a non-reflink copy?
Can the page cache copy be made restartable?   Or should splice() be
allowed to return a short count?  What happens on (non-reflink) remote
copies and huge request sizes?

Thanks,
Miklos
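
A sketch of what the SPLICE_REFLINK half of the suggestion above might look
like from the application side.  The flag name, its value and the errno for
"filesystem cannot reflink" are all guesses here (only SPLICE_F_DIRECT has a
proposed value in this thread):

#ifndef SPLICE_F_REFLINK
#define SPLICE_F_REFLINK (0x20) /* made-up value for the proposed flag */
#endif

/* Ask for a reflink only; any copy that would actually move data must fail. */
static int try_reflink(int in_fd, int out_fd, off_t len)
{
        off_t in_off = 0, out_off = 0;
        ssize_t n = splice(in_fd, &in_off, out_fd, &out_off, len,
                           SPLICE_F_DIRECT | SPLICE_F_REFLINK);

        if (n == len)
                return 0;       /* destination now shares blocks with the source */

        /* e.g. EOPNOTSUPP: the app decides whether a data-moving copy is OK. */
        return -1;
}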




Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Miklos Szeredi
Here's an example cp app using direct splice (and without fallback to
non-splice, which is obviously required unless the kernel is known to support
direct splice).

Untested, but trivial enough...

The important part is, I think, that the app must not assume that the kernel can
complete the request in one go.

Thanks,
Miklos


#define _GNU_SOURCE

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <limits.h>
#include <sys/stat.h>
#include <err.h>

#ifndef SPLICE_F_DIRECT
#define SPLICE_F_DIRECT (0x10)  /* neither splice fd is a pipe */
#endif

int main(int argc, char *argv[])
{
        struct stat stbuf;
        int in_fd;
        int out_fd;
        int res;
        off_t off = 0;

        if (argc != 3)
                errx(1, "usage: %s from to", argv[0]);

        in_fd = open(argv[1], O_RDONLY);
        if (in_fd == -1)
                err(1, "opening %s", argv[1]);

        res = fstat(in_fd, &stbuf);
        if (res == -1)
                err(1, "fstat");

        out_fd = open(argv[2], O_CREAT | O_WRONLY | O_TRUNC, stbuf.st_mode);
        if (out_fd == -1)
                err(1, "opening %s", argv[2]);

        do {
                off_t in_off = off, out_off = off;
                ssize_t rres;

                /* Ask for everything; accept whatever the kernel serviced. */
                rres = splice(in_fd, &in_off, out_fd, &out_off, SSIZE_MAX,
                              SPLICE_F_DIRECT);
                if (rres == -1)
                        err(1, "splice");
                if (rres == 0)
                        break;

                /* Resume from wherever the previous call stopped. */
                off += rres;
        } while (off < stbuf.st_size);

        res = close(in_fd);
        if (res == -1)
                err(1, "close");

        res = fsync(out_fd);
        if (res == -1)
                err(1, "fsync");

        res = close(out_fd);
        if (res == -1)
                err(1, "close");

        return 0;
}


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 06:31 PM, Miklos Szeredi wrote:

Here's an example cp app using direct splice (and without fallback to
non-splice, which is obviously required unless the kernel is known to support
direct splice).

Untested, but trivial enough...

The important part is, I think, that the app must not assume that the kernel can
complete the request in one go.

Thanks,
Miklos


[Miklos's example program from the previous message, quoted in full; snipped]

It would be nice if there were a way for the file system to get a 
hint that the target file is supposed to be a copy of another file. That 
way distributed file systems could also create the target file with the 
correct meta-information (the same storage targets as the in-file has).
Well, if we cannot agree on that, a file system with a custom protocol at 
least can detect a copy from 0 to SSIZE_MAX and then reset the metadata. I'm not 
sure whether this would work for pNFS, though.



Bernd





Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
 It would be nice if there would be way if the file system would get a 
 hint that the target file is supposed to be copy of another file. That 
 way distributed file systems could also create the target-file with the 
 correct meta-information (same storage targets as in-file has).
 Well, if we cannot agree on that, file system with a custom protocol at 
 least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not 
 sure if this would work for pNFS, though.

splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com

Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there would be way if the file system would get a
hint that the target file is supposed to be copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, file system with a custom protocol at
least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context 
of offloaded file copies. So the question is, what is the best way to 
address/discuss that?


Thanks,
Bernd


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
 On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
  On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
  It would be nice if there would be way if the file system would get a
  hint that the target file is supposed to be copy of another file. That
  way distributed file systems could also create the target-file with the
  correct meta-information (same storage targets as in-file has).
  Well, if we cannot agree on that, file system with a custom protocol at
  least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
  sure if this would work for pNFS, though.
 
  splice() does not create new files. What you appear to be asking for
  lies way outside the scope of that system call interface.
 
 
 Sorry I know, definitely outside the scope of splice, but in the context 
 of offloaded file copies. So the question is, what is the best way to 
 address/discuss that?

Why does it need to be addressed in the first place?

What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?
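
A rough userspace sketch of the approach just described, assuming libacl
(and optionally libattr) is available; error handling is omitted and
open_copy_target() is an invented helper name:

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/acl.h>            /* libacl: link with -lacl */

int open_copy_target(int src_fd, const char *dst_path)
{
        struct stat st;
        acl_t acl;
        int dst_fd;

        fstat(src_fd, &st);     /* pick up the source file's mode */
        dst_fd = open(dst_path, O_CREAT | O_WRONLY | O_TRUNC, st.st_mode);

        acl = acl_get_fd(src_fd);       /* copy the access ACL, if any */
        if (acl != NULL) {
                acl_set_fd(dst_fd, acl);
                acl_free(acl);
        }
        /* extended attributes could be copied the same way with
         * attr_getf()/attr_setf() from libattr */
        return dst_fd;
}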

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 08:02 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there would be way if the file system would get a
hint that the target file is supposed to be copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, file system with a custom protocol at
least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?


Why does it need to be addressed in the first place?


An offloaded copy is still not efficient if different storage 
servers/targets are used by the from-file and the to-file.




What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?



At a minimum this requires network and metadata overhead. And while I'm 
working on FhGFS now, I still wonder what other file systems need to do - 
for example Lustre pre-allocates storage-target files on creating a 
file, so file layout changes mean even more overhead there.
Anyway, if we could agree on using libattr or libacl to teach the file 
system about the upcoming splice call, I would be fine. Metadata overhead 
is probably negligible for large files.





Thanks,
Bernd



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:
 On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
  On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
  On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
  On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
  It would be nice if there would be way if the file system would get a
  hint that the target file is supposed to be copy of another file. That
  way distributed file systems could also create the target-file with the
  correct meta-information (same storage targets as in-file has).
  Well, if we cannot agree on that, file system with a custom protocol at
  least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
  sure if this would work for pNFS, though.
 
  splice() does not create new files. What you appear to be asking for
  lies way outside the scope of that system call interface.
 
 
  Sorry I know, definitely outside the scope of splice, but in the context
  of offloaded file copies. So the question is, what is the best way to
  address/discuss that?
 
  Why does it need to be addressed in the first place?
 
 An offloaded copy is still not efficient if different storage 
 servers/targets used by from-file and to-file.

So? 

 
  What is preventing an application from retrieving and setting this
  information using standard libc functions such as fstat()+open(), and
  supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
  where appropriate?
 
 
 At a minimum this requires network and metadata overhead. And while I'm 
 working on FhGFS now, I still wonder what other file system need to do - 
 for example Lustre pre-allocates storage-target files on creating a 
 file, so file layout changes mean even more overhead there.

The problem you are describing is limited to a narrow set of storage
architectures. If copy offload using splice() doesn't make sense for
those architectures, then don't implement it for them.
You might be able to provide ioctls() to do these special hinted file
creations for those filesystems that need it, but the vast majority
don't, and you shouldn't enforce it on them.

 Anyway, if we could agree on to use libattr or libacl to teach the file 
 system about the upcoming splice call I would be fine.

libattr and libacl are generic libraries that exist to manipulate xattrs
and acls. They do not need to contain Lustre-specific code.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com

Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Bernd Schubert

On 09/30/2013 09:34 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:

On 09/30/2013 08:02 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:

On 09/30/2013 07:44 PM, Myklebust, Trond wrote:

On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:

It would be nice if there would be way if the file system would get a
hint that the target file is supposed to be copy of another file. That
way distributed file systems could also create the target-file with the
correct meta-information (same storage targets as in-file has).
Well, if we cannot agree on that, file system with a custom protocol at
least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
sure if this would work for pNFS, though.


splice() does not create new files. What you appear to be asking for
lies way outside the scope of that system call interface.



Sorry I know, definitely outside the scope of splice, but in the context
of offloaded file copies. So the question is, what is the best way to
address/discuss that?


Why does it need to be addressed in the first place?


An offloaded copy is still not efficient if different storage
servers/targets used by from-file and to-file.


So?


mds1: orig-file
oss1/target1: orig-chunk1

mds1: target-file
ossN/targetN: target-chunk1

clientN: Performs the copy

Ideally, orig-chunk1 and target-chunk1 are on the same server and the same 
target. Copy offload could then even be done by the underlying fs, 
similar to a local splice.
If different ossN servers are used, copies still have to be done over the 
network by these storage servers, although the client would only need to 
initiate the copy. Still faster, but also not ideal.






What is preventing an application from retrieving and setting this
information using standard libc functions such as fstat()+open(), and
supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
where appropriate?



At a minimum this requires network and metadata overhead. And while I'm
working on FhGFS now, I still wonder what other file system need to do -
for example Lustre pre-allocates storage-target files on creating a
file, so file layout changes mean even more overhead there.


The problem you are describing is limited to a narrow set of storage
architectures. If copy offload using splice() doesn't make sense for
those architectures, then don't implement it for them.


But it _does_ make sense. The file system just needs a hint that a 
splice copy is going to come up.



You might be able to provide ioctls() to do these special hinted file
creations for those filesystems that need it, but the vast majority
don't, and you shouldn't enforce it on them.


And exactly for that we need a standard - it does not make sense if each 
and every distributed file system implements its own 
ioctl/libattr/libacl interface for that.





Anyway, if we could agree on to use libattr or libacl to teach the file
system about the upcoming splice call I would be fine.


libattr and libacl are generic libraries that exist to manipulate xattrs
and acls. They do not need to contain Lustre-specific code.



pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
interface? And userspace needs to address all of them differently?


I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry, 
didn't find a better name yet), which would take in-file-path and 
out-file-path and allow the file system to create out-file-path with the 
same meta-layout as in-file-path. And it would need some flags, such as 
AUTO (file system decides if it makes sense to do a local copy) and 
FORCE (always try a local copy).
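
A purely hypothetical sketch of the interface suggested above; neither the
ioctl nor the flag names exist anywhere, they only illustrate the idea:

#include <sys/ioctl.h>
#include <linux/types.h>

struct splice_meta_copy {
        const char *in_path;    /* existing file whose layout should be mirrored */
        const char *out_path;   /* file to be created with the same meta-layout */
        __u32 flags;
};

#define SMC_AUTO        0x1     /* fs decides whether a co-located copy makes sense */
#define SMC_FORCE       0x2     /* always try to place the copy locally */

/* invented ioctl number, for illustration only */
#define SPLICE_META_COPY        _IOW('S', 0x42, struct splice_meta_copy)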



Thanks,
Bernd


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Ric Wheeler

On 09/30/2013 04:00 PM, Bernd Schubert wrote:
pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
interface? And userspace needs to address all of them differently? 


The NFS and SCSI groups have each defined a standard which Zach's proposal 
abstracts into a common user API.


Distributed file systems tend to be rather unique and do not have similar 
standards bodies, but a lot of them could hide server-specific implementations 
under the currently proposed interfaces.


What is not a good idea is to drag out the core, simple copy offload discussion 
for another 5 years to pull in every odd use case :)


ric



Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 22:00 +0200, Bernd Schubert wrote:
 On 09/30/2013 09:34 PM, Myklebust, Trond wrote:
  On Mon, 2013-09-30 at 20:49 +0200, Bernd Schubert wrote:
  On 09/30/2013 08:02 PM, Myklebust, Trond wrote:
  On Mon, 2013-09-30 at 19:48 +0200, Bernd Schubert wrote:
  On 09/30/2013 07:44 PM, Myklebust, Trond wrote:
  On Mon, 2013-09-30 at 19:17 +0200, Bernd Schubert wrote:
  It would be nice if there would be way if the file system would get a
  hint that the target file is supposed to be copy of another file. That
  way distributed file systems could also create the target-file with the
  correct meta-information (same storage targets as in-file has).
  Well, if we cannot agree on that, file system with a custom protocol at
  least can detect from 0 to SSIZE_MAX and then reset metadata. I'm not
  sure if this would work for pNFS, though.
 
  splice() does not create new files. What you appear to be asking for
  lies way outside the scope of that system call interface.
 
 
  Sorry I know, definitely outside the scope of splice, but in the context
  of offloaded file copies. So the question is, what is the best way to
  address/discuss that?
 
  Why does it need to be addressed in the first place?
 
  An offloaded copy is still not efficient if different storage
  servers/targets used by from-file and to-file.
 
  So?
 
 mds1: orig-file
 oss1/target1: orig-chunk1
 
 mds1: target-file
 ossN/targetN: target-chunk1
 
 clientN: Performs the copy
 
 Ideally, orig-chunk1 and target-chunk1 are on the same server and same 
 target. Copy offload then even could done from the underlying fs, 
 similiar as local splice.
 If different ossN servers are used copies still have to be done over 
 network by these storage servers, although the client only would need to 
 initiate the copy. Still faster, but also not ideal.
 
 
 
  What is preventing an application from retrieving and setting this
  information using standard libc functions such as fstat()+open(), and
  supplemented with libattr attr_setf/getf(), and libacl acl_get_fd/set_fd
  where appropriate?
 
 
  At a minimum this requires network and metadata overhead. And while I'm
  working on FhGFS now, I still wonder what other file system need to do -
  for example Lustre pre-allocates storage-target files on creating a
  file, so file layout changes mean even more overhead there.
 
  The problem you are describing is limited to a narrow set of storage
  architectures. If copy offload using splice() doesn't make sense for
  those architectures, then don't implement it for them.
 
 But it _does_ make sense. The file system just needs a hint that a 
 splice copy is going to come up.

Just wait for the splice() system call. How is this any different from
write()?

  You might be able to provide ioctls() to do these special hinted file
  creations for those filesystems that need it, but the vast majority
  don't, and you shouldn't enforce it on them.
 
 And exactly for that we need a standard - it does not make sense if each 
 and every distributed file system implements its own 
 ioctl/libattr/libacl interface for that.
 
 
  Anyway, if we could agree on to use libattr or libacl to teach the file
  system about the upcoming splice call I would be fine.
 
  libattr and libacl are generic libraries that exist to manipulate xattrs
  and acls. They do not need to contain Lustre-specific code.
 
 
 pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
 interface? And userspace needs to address all of them differently?

 I'm just asking for something like a vfs ioctl SPLICE_META_COPY (sorry, 
 didn't find a better name yet), which would take in-file-path and 
 out-file-path and allow the file system to create out-file-path with the 
 same meta-layout as in-file-path. And it would need some flags, such as 
 AUTO (file system decides if it makes sense to do a local copy) and 
 FORCE (always try a local copy).

splice() is not a whole-file copy operation; it's a byte range copy. How
does the above help other than in the whole-file case?

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: [RFC] extending splice for copy offloading

2013-09-30 Thread Myklebust, Trond
On Mon, 2013-09-30 at 16:08 -0400, Ric Wheeler wrote:
 On 09/30/2013 04:00 PM, Bernd Schubert wrote:
  pNFS, FhGFS, Lustre, Ceph, etc., all of them shall implement their own 
  interface? And userspace needs to address all of them differently? 
 
 The NFS and SCSI groups have each defined a standard which Zach's proposal 
 abstracts into a common user API.
 
 Distributed file systems tend to be rather unique and do not have similar 
 standard bodies, but a lot of them could hide server specific implementations 
 under the current proposed interfaces.
 
 What is not a good idea is to drag out the core, simple copy offload 
 discussion 
 for another 5 years to pull in every odd use case :)

Agreed. The whole idea of a common system call interface should be to
allow us to abstract away the underlying storage and filesystem
architectures. If filesystem developers also want a way to expose that
underlying architecture to applications in order to enable further
optimisations, then that belongs in a separate discussion.

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
trond.mykleb...@netapp.com
www.netapp.com


Re: [RFC] extending splice for copy offloading

2013-09-28 Thread Ric Wheeler

On 09/28/2013 11:20 AM, Myklebust, Trond wrote:

-Original Message-
From: Miklos Szeredi [mailto:mik...@szeredi.hu]
Sent: Saturday, September 28, 2013 12:50 AM
To: Zach Brown
Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux-
Fsdevel; linux-...@vger.kernel.org; Myklebust, Trond; Schumaker, Bryan;
Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong
Subject: Re: [RFC] extending splice for copy offloading

On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown  wrote:

Also, I don't get the first option above at all.  The argument is
that it's safer to have more copies?  How much safety does another
copy on the same disk really give you?  Do systems that do dedup
provide interfaces to turn it off per-file?

I don't see the safety argument very compelling either.  There are real
semantic differences, however: ENOSPC on a write to a
(apparently) already allocated block.  That could be a bit unexpected.  Do we
need a fallocate extension to deal with shared blocks?

The above has been the case for all enterprise storage arrays ever since the 
invention of snapshots. The NFSv4.2 spec does allow you to set a per-file 
attribute that causes the storage server to always preallocate enough buffers 
to guarantee that you can rewrite the entire file, however the fact that we've 
lived without it for said 20 years leads me to believe that demand for it is 
going to be limited. I haven't put it top of the list of features we care to 
implement...

Cheers,
Trond


I agree - this has been common behaviour for a very long time in the array 
space. Even without an array,  this is the same as overwriting a block in btrfs 
or any file system with a read-write LVM snapshot.


Regards,

Ric



RE: [RFC] extending splice for copy offloading

2013-09-28 Thread Myklebust, Trond
> -Original Message-
> From: Miklos Szeredi [mailto:mik...@szeredi.hu]
> Sent: Saturday, September 28, 2013 12:50 AM
> To: Zach Brown
> Cc: J. Bruce Fields; Ric Wheeler; Anna Schumaker; Kernel Mailing List; Linux-
> Fsdevel; linux-...@vger.kernel.org; Myklebust, Trond; Schumaker, Bryan;
> Martin K. Petersen; Jens Axboe; Mark Fasheh; Joel Becker; Eric Wong
> Subject: Re: [RFC] extending splice for copy offloading
> 
> On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown  wrote:
> >> Also, I don't get the first option above at all.  The argument is
> >> that it's safer to have more copies?  How much safety does another
> >> copy on the same disk really give you?  Do systems that do dedup
> >> provide interfaces to turn it off per-file?
> 
> I don't see the safety argument very compelling either.  There are real
> semantic differences, however: ENOSPC on a write to a
> (apparently) already allocated block.  That could be a bit unexpected.  Do we
> need a fallocate extension to deal with shared blocks?

The above has been the case for all enterprise storage arrays ever since the 
invention of snapshots. The NFSv4.2 spec does allow you to set a per-file 
attribute that causes the storage server to always preallocate enough buffers 
to guarantee that you can rewrite the entire file, however the fact that we've 
lived without it for said 20 years leads me to believe that demand for it is 
going to be limited. I haven't put it top of the list of features we care to 
implement...

Cheers,
   Trond


Re: [RFC] extending splice for copy offloading

2013-09-27 Thread Miklos Szeredi
On Fri, Sep 27, 2013 at 10:50 PM, Zach Brown  wrote:
>> Also, I don't get the first option above at all.  The argument is that
>> it's safer to have more copies?  How much safety does another copy on
>> the same disk really give you?  Do systems that do dedup provide
>> interfaces to turn it off per-file?

I don't see the safety argument very compelling either.  There are
real semantic differences, however: ENOSPC on a write to a
(apparently) already allocated block.  That could be a bit
unexpected.  Do we need a fallocate extension to deal with shared
blocks?

Thanks,
Miklos
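
A small sketch of the workaround an application could attempt today: request a
full allocation over the just-copied range with plain fallocate(). Whether a
given filesystem actually unshares reflinked extents for this (and so removes
the ENOSPC-on-overwrite surprise) is exactly the open question above;
reserve_after_copy() is an invented helper name:

#define _GNU_SOURCE
#include <fcntl.h>

static int reserve_after_copy(int fd, off_t offset, off_t len)
{
        /* mode 0: allocate (and keep) space so that later overwrites
         * should not fail with ENOSPC */
        return fallocate(fd, 0, offset, len);
}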


Re: [RFC] extending splice for copy offloading

2013-09-27 Thread Zach Brown

> > >Sure.  So we'd have:
> > >
> > >- no flag default that forbids knowingly copying with shared references
> > >   so that it will be used by default by people who feel strongly about
> > >   their assumptions about independent write durability.
> > >
> > >- a flag that allows shared references for people who would otherwise
> > >   use the file system shared reference ioctls (ocfs2 reflink, btrfs
> > >   clone) but would like it to also do server-side read/write copies
> > >   over nfs without additional intervention.
> > >
> > >- a flag that requires shared references for callers who don't want
> > >   giant copies to take forever if they aren't instant.  (The qemu guys
> > >   asked for this at Plumbers.)
> 
> Why not implement only the last flag as the first step?  It seems
> like the simplest one.  So I think that would mean:
> 
>   - no worrying about cancelling, etc.
>   - apps should be told to pass the entire range at once (normally
> the whole file).
>   - The NFS server probably shouldn't do the internal copy loop by
> default.
> 
> We can't prevent some storage system from implementing a high-latency
> copy operation, but we can refuse to provide them any help (providing no
> progress reports or easy way to cancel) and then they can deal with the
> complaints from their users.

I can see where you're going with that, yeah.

It'd make less sense as a splice extension, then, perhaps.  It'd be more
like a generic entry point for the existing ioctls.  Maybe even just
defining the semantics of a common ioctl.

Hmm.

> Also, I don't get the first option above at all.  The argument is that
> it's safer to have more copies?  How much safety does another copy on
> the same disk really give you?  Do systems that do dedup provide
> interfaces to turn it off per-file?

Yeah, got me.  It's certainly nonsense on a lot of FTL logging
implementations (which are making their way into SMR drives in the
future).

> But I understand that Zach's tired of the woodshedding and I could live
> with the above I guess

No, it's fine.  At least people are expressing some interest in the
interface!  That's a marked improvement over the state of things in the
past.

- z
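
For illustration, the three behaviours quoted above could be spelled as flags
roughly like this; the names and the copy_range() entry point are invented,
since the actual syscall/ioctl is still being discussed:

/* default (no flag): a full data copy, no knowingly shared references */
#define COPY_FLAG_ALLOW_REFLINK         0x1     /* shared references permitted if cheaper */
#define COPY_FLAG_REQUIRE_REFLINK       0x2     /* fail rather than fall back to a slow data copy */

/* e.g. the qemu-style caller mentioned above might ask for:
 *
 *      copy_range(in_fd, 0, out_fd, 0, len, COPY_FLAG_REQUIRE_REFLINK);
 */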


Re: [RFC] extending splice for copy offloading

2013-09-27 Thread J. Bruce Fields
On Thu, Sep 26, 2013 at 05:26:39PM -0400, Ric Wheeler wrote:
> On 09/26/2013 02:55 PM, Zach Brown wrote:
> >On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> >>On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:
> A client-side copy will be slower, but I guess it does have the
> advantage that the application can track progress to some degree, and
> abort it fairly quickly without leaving the file in a totally undefined
> state--and both might be useful if the copy's not a simple constant-time
> operation.
> >>>I suppose, but can't the app achieve a nice middle ground by copying the
> >>>file in smaller syscalls?  Avoid bulk data motion back to the client,
> >>>but still get notification every, I dunno, few hundred meg?
> >>Yes.  And if "cp"  could just be switched from a read+write syscall
> >>pair to a single splice syscall using the same buffer size.  And then
> >>the user would only notice that things got faster in case of server
> >>side copy.  No problems with long blocking times (at least not much
> >>worse than it was).
> >Hmm, yes, that would be a nice outcome.
> >
> >>However "cp" doesn't do reflinking by default, it has a switch for
> >>that.  If we just want "cp" and the like to use splice without fearing
> >>side effects then by default we should try to be as close to
> >>read+write behavior as possible.  No?
> >I guess?  I don't find requiring --reflink hugely compelling.  But there
> >it is.
> >
> >>That's what I'm really
> >>worrying about when you want to wire up splice to reflink by default.
> >>I do think there should be a flag for that.  And if on the block level
> >>some magic happens, so be it.  It's not the fs deverloper's worry any
> >>more ;)
> >Sure.  So we'd have:
> >
> >- no flag default that forbids knowingly copying with shared references
> >   so that it will be used by default by people who feel strongly about
> >   their assumptions about independent write durability.
> >
> >- a flag that allows shared references for people who would otherwise
> >   use the file system shared reference ioctls (ocfs2 reflink, btrfs
> >   clone) but would like it to also do server-side read/write copies
> >   over nfs without additional intervention.
> >
> >- a flag that requires shared references for callers who don't want
> >   giant copies to take forever if they aren't instant.  (The qemu guys
> >   asked for this at Plumbers.)

Why not implement only the last flag as the first step?  It seems
like the simplest one.  So I think that would mean:

- no worrying about cancelling, etc.
- apps should be told to pass the entire range at once (normally
  the whole file).
- The NFS server probably shouldn't do the internal copy loop by
  default.

We can't prevent some storage system from implementing a high-latency
copy operation, but we can refuse to provide them any help (providing no
progress reports or easy way to cancel) and then they can deal with the
complaints from their users.

Also, I don't get the first option above at all.  The argument is that
it's safer to have more copies?  How much safety does another copy on
the same disk really give you?  Do systems that do dedup provide
interfaces to turn it off per-file?

> This last flag should not prevent a remote target device (NFS or
> SCSI array) copy from working though since they often do reflink
> like operations inside of the remote target device

In fact maybe that's the only case to care about on the first pass.

But I understand that Zach's tired of the woodshedding and I could live
with the above I guess

--b.


Re: [RFC] extending splice for copy offloading

2013-09-27 Thread Miklos Szeredi
On Fri, Sep 27, 2013 at 4:00 PM, Ric Wheeler  wrote:

> I think that you are an order of magnitude off here in thinking about the
> scale of the operations.
>
> An enabled, synchronize copy offload to an array (or one that turns into a
> reflink locally) is effectively the cost of the call itself. Let's say no
> slower than one IO to a S-ATA disk (10ms?) as a pessimistic guess.
> Realistically, that call is much faster than that worst case number.
>
> Copying any substantial amount of data - like the target workload of VM
> images or media files - would be hundreds of MB's per copy and that would
> take seconds or minutes.

Will a single splice-copy operation be interruptible/restartable?  If
not, how should apps size one request so that it doesn't take too much
time?  Even for slow devices (usb stick)?  If it will be restartable,
how?   Can remote copy be done with this?  Over a high latency
network?

Those are the questions I'm worried about.

>
> We should really work on getting the basic mechanism working and robust
> without any complications, then we can look at real, measured performance
> and see if there is any justification for adding complexity.

Go for that.  But don't forget that at the end of the day actual apps
will need to be converted, like file managers and "dd" and "cp", and we
definitely don't want a userspace library to have to figure out how
the copy is done most efficiently; it's something for the kernel to
figure out.

Thanks,
Miklos


Re: [RFC] extending splice for copy offloading

2013-09-27 Thread Ric Wheeler

On 09/27/2013 12:47 AM, Miklos Szeredi wrote:

On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler  wrote:

On 09/26/2013 03:53 PM, Miklos Szeredi wrote:

On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown  wrote:


But I'm not sure it's worth the effort; 99% of the use of this
interface will be copying whole files.  And for that perhaps we need a
different API, one which has been discussed some time ago:
asynchronous copyfile() returns immediately with a pollable event
descriptor indicating copy progress, and some way to cancel the copy.
And that can internally rely on ->direct_splice(), with appropriate
algorithms for determine the optimal  chunk size.

And perhaps we don't.  Perhaps we can provide this much simpler
data-plane interface that works well enough for most everyone and can
avoid going down the async rat hole, yet again.

I think either buffering or async is needed to get good performance
without too much complexity in the app (which is not good).  Buffering
works quite well for regular I/O, so maybe it's the way to go here as
well.

Thanks,
Miklos


Buffering  misses the whole point of the copy offload - the idea is *not* to
read or write the actual data in the most interesting cases which offload
the operation to a smart target device or file system.

I meant buffering the COPY, not the data.  Doing the COPY
synchronously will always incur a performance penalty, the amount
depending on the latency, which can be significant with networking.

We think of write(2) as a synchronous interface, because that's the
appearance we get from all that hard work the page cache and delayed
writeback code does to make an asynchronous operation look as if it
was synchronous.  So from a userspace API perspective a sync interface
is nice, but inside we almost always have async interfaces to do the
actual work.

Thanks,
Miklos


I think that you are an order of magnitude off here in thinking about the scale 
of the operations.


An enabled, synchronized copy offload to an array (or one that turns into a 
reflink locally) is effectively the cost of the call itself. Let's say no slower 
than one IO to an S-ATA disk (10ms?) as a pessimistic guess. Realistically, that 
call is much faster than that worst-case number.


Copying any substantial amount of data - like the target workload of VM images 
or media files - would be hundreds of MB's per copy and that would take seconds 
or minutes.


We should really work on getting the basic mechanism working and robust without 
any complications, then we can look at real, measured performance and see if 
there is any justification for adding complexity.


thanks!

Ric





Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Miklos Szeredi
On Thu, Sep 26, 2013 at 11:23 PM, Ric Wheeler  wrote:
> On 09/26/2013 03:53 PM, Miklos Szeredi wrote:
>>
>> On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown  wrote:
>>
 But I'm not sure it's worth the effort; 99% of the use of this
 interface will be copying whole files.  And for that perhaps we need a
 different API, one which has been discussed some time ago:
 asynchronous copyfile() returns immediately with a pollable event
 descriptor indicating copy progress, and some way to cancel the copy.
 And that can internally rely on ->direct_splice(), with appropriate
 algorithms for determine the optimal  chunk size.
>>>
>>> And perhaps we don't.  Perhaps we can provide this much simpler
>>> data-plane interface that works well enough for most everyone and can
>>> avoid going down the async rat hole, yet again.
>>
>> I think either buffering or async is needed to get good performance
>> without too much complexity in the app (which is not good).  Buffering
>> works quite well for regular I/O, so maybe it's the way to go here as
>> well.
>>
>> Thanks,
>> Miklos
>>
>
> Buffering  misses the whole point of the copy offload - the idea is *not* to
> read or write the actual data in the most interesting cases which offload
> the operation to a smart target device or file system.

I meant buffering the COPY, not the data.  Doing the COPY
synchronously will always incur a performance penalty, the amount
depending on the latency, which can be significant with networking.

We think of write(2) as a synchronous interface, because that's the
appearance we get from all that hard work the page cache and delayed
writeback code does to make an asynchronous operation look as if it
was synchronous.  So from a userspace API perspective a sync interface
is nice, but inside we almost always have async interfaces to do the
actual work.

Thanks,
Miklos


>
> Regards,
>
> Ric
>


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Ric Wheeler

On 09/26/2013 02:55 PM, Zach Brown wrote:

On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:

On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:

A client-side copy will be slower, but I guess it does have the
advantage that the application can track progress to some degree, and
abort it fairly quickly without leaving the file in a totally undefined
state--and both might be useful if the copy's not a simple constant-time
operation.

I suppose, but can't the app achieve a nice middle ground by copying the
file in smaller syscalls?  Avoid bulk data motion back to the client,
but still get notification every, I dunno, few hundred meg?

Yes.  And if "cp"  could just be switched from a read+write syscall
pair to a single splice syscall using the same buffer size.  And then
the user would only notice that things got faster in case of server
side copy.  No problems with long blocking times (at least not much
worse than it was).

Hmm, yes, that would be a nice outcome.


However "cp" doesn't do reflinking by default, it has a switch for
that.  If we just want "cp" and the like to use splice without fearing
side effects then by default we should try to be as close to
read+write behavior as possible.  No?

I guess?  I don't find requiring --reflink hugely compelling.  But there
it is.


That's what I'm really
worrying about when you want to wire up splice to reflink by default.
I do think there should be a flag for that.  And if on the block level
some magic happens, so be it.  It's not the fs deverloper's worry any
more ;)

Sure.  So we'd have:

- no flag default that forbids knowingly copying with shared references
   so that it will be used by default by people who feel strongly about
   their assumptions about independent write durability.

- a flag that allows shared references for people who would otherwise
   use the file system shared reference ioctls (ocfs2 reflink, btrfs
   clone) but would like it to also do server-side read/write copies
   over nfs without additional intervention.

- a flag that requires shared references for callers who don't want
   giant copies to take forever if they aren't instant.  (The qemu guys
   asked for this at Plumbers.)

I think I can live with that.

- z


This last flag should not prevent a remote target device (NFS or SCSI array)
copy from working, though, since they often do reflink-like operations inside of
the remote target device.


ric


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Ric Wheeler

On 09/26/2013 03:53 PM, Miklos Szeredi wrote:

On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown  wrote:


But I'm not sure it's worth the effort; 99% of the use of this
interface will be copying whole files.  And for that perhaps we need a
different API, one which has been discussed some time ago:
asynchronous copyfile() returns immediately with a pollable event
descriptor indicating copy progress, and some way to cancel the copy.
And that can internally rely on ->direct_splice(), with appropriate
algorithms for determining the optimal chunk size.

And perhaps we don't.  Perhaps we can provide this much simpler
data-plane interface that works well enough for most everyone and can
avoid going down the async rat hole, yet again.

I think either buffering or async is needed to get good performance
without too much complexity in the app (which is not good).  Buffering
works quite well for regular I/O, so maybe it's the way to go here as
well.

Thanks,
Miklos



Buffering misses the whole point of the copy offload - the idea is *not* to
read or write the actual data in the most interesting cases, which offload the
operation to a smart target device or file system.


Regards,

Ric

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Miklos Szeredi
On Thu, Sep 26, 2013 at 9:06 PM, Zach Brown  wrote:

>> But I'm not sure it's worth the effort; 99% of the use of this
>> interface will be copying whole files.  And for that perhaps we need a
>> different API, one which has been discussed some time ago:
>> asynchronous copyfile() returns immediately with a pollable event
>> descriptor indicating copy progress, and some way to cancel the copy.
>> And that can internally rely on ->direct_splice(), with appropriate
>> algorithms for determine the optimal  chunk size.
>
> And perhaps we don't.  Perhaps we can provide this much simpler
> data-plane interface that works well enough for most everyone and can
> avoid going down the async rat hole, yet again.

I think either buffering or async is needed to get good performance
without too much complexity in the app (which is not good).  Buffering
works quite well for regular I/O, so maybe it's the way to go here as
well.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Zach Brown
On Thu, Sep 26, 2013 at 08:06:41PM +0200, Miklos Szeredi wrote:
> On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields  wrote:
> > On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> >> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:
> >> >> A client-side copy will be slower, but I guess it does have the
> >> >> advantage that the application can track progress to some degree, and
> >> >> abort it fairly quickly without leaving the file in a totally undefined
> >> >> state--and both might be useful if the copy's not a simple constant-time
> >> >> operation.
> >> >
> >> > I suppose, but can't the app achieve a nice middle ground by copying the
> >> > file in smaller syscalls?  Avoid bulk data motion back to the client,
> >> > but still get notification every, I dunno, few hundred meg?
> >>
> >> Yes.  And if "cp"  could just be switched from a read+write syscall
> >> pair to a single splice syscall using the same buffer size.
> >
> > Will the various magic fs-specific copy operations become inefficient
> > when the range copied is too small?
> 
> We could treat splice-copy operations just like write operations (can
> be buffered, coalesced, synced).
> 
> But I'm not sure it's worth the effort; 99% of the use of this
> interface will be copying whole files.  And for that perhaps we need a
> different API, one which has been discussed some time ago:
> asynchronous copyfile() returns immediately with a pollable event
> descriptor indicating copy progress, and some way to cancel the copy.
> And that can internally rely on ->direct_splice(), with appropriate
> algorithms for determining the optimal chunk size.

And perhaps we don't.  Perhaps we can provide this much simpler
data-plane interface that works well enough for most everyone and can
avoid going down the async rat hole, yet again.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Zach Brown
On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:
> >> A client-side copy will be slower, but I guess it does have the
> >> advantage that the application can track progress to some degree, and
> >> abort it fairly quickly without leaving the file in a totally undefined
> >> state--and both might be useful if the copy's not a simple constant-time
> >> operation.
> >
> > I suppose, but can't the app achieve a nice middle ground by copying the
> > file in smaller syscalls?  Avoid bulk data motion back to the client,
> > but still get notification every, I dunno, few hundred meg?
> 
> Yes.  And if "cp"  could just be switched from a read+write syscall
> pair to a single splice syscall using the same buffer size.  And then
> the user would only notice that things got faster in case of server
> side copy.  No problems with long blocking times (at least not much
> worse than it was).

Hmm, yes, that would be a nice outcome.

> However "cp" doesn't do reflinking by default, it has a switch for
> that.  If we just want "cp" and the like to use splice without fearing
> side effects then by default we should try to be as close to
> read+write behavior as possible.  No?

I guess?  I don't find requiring --reflink hugely compelling.  But there
it is.

> That's what I'm really
> worrying about when you want to wire up splice to reflink by default.
> I do think there should be a flag for that.  And if on the block level
> some magic happens, so be it.  It's not the fs developer's worry any
> more ;)

Sure.  So we'd have:

- no flag default that forbids knowingly copying with shared references
  so that it will be used by default by people who feel strongly about
  their assumptions about independent write durability.

- a flag that allows shared references for people who would otherwise
  use the file system shared reference ioctls (ocfs2 reflink, btrfs
  clone) but would like it to also do server-side read/write copies
  over nfs without additional intervention.

- a flag that requires shared references for callers who don't want
  giant copies to take forever if they aren't instant.  (The qemu guys
  asked for this at Plumbers.)

I think I can live with that.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
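
For illustration, a rough userspace sketch of the three modes described
above.  The copy_splice() wrapper and the flag names are hypothetical
(nothing was settled in this thread); only the default / allow-reflink /
require-reflink split is taken from the message.

/* Hypothetical flags and wrapper illustrating the three modes; nothing
 * here is an actual kernel interface. */
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>

#define COPY_ALLOW_REFLINK    (1u << 0)  /* may share extents if the fs can   */
#define COPY_REQUIRE_REFLINK  (1u << 1)  /* fail instead of doing a slow copy */

/* Stub standing in for the proposed fd-to-fd splice/copy syscall. */
static ssize_t copy_splice(int fd_in, off_t *off_in, int fd_out,
                           off_t *off_out, size_t len, unsigned int flags)
{
        (void)fd_in; (void)off_in; (void)fd_out; (void)off_out;
        if (flags & COPY_REQUIRE_REFLINK)
                return -EOPNOTSUPP;     /* pretend this fs can't share refs */
        return (ssize_t)len;            /* pretend the copy happened */
}

int main(void)
{
        /* 1. Default: no shared references, for callers who rely on fully
         *    independent copies of the data. */
        copy_splice(0, NULL, 1, NULL, 4096, 0);

        /* 2. Opt in to reflink-style sharing, but still allow a server-side
         *    read/write copy (e.g. over NFS) when sharing isn't possible. */
        copy_splice(0, NULL, 1, NULL, 4096, COPY_ALLOW_REFLINK);

        /* 3. Demand a cheap shared-reference copy; a huge copy that would
         *    not be (nearly) instant should fail instead. */
        if (copy_splice(0, NULL, 1, NULL, 4096, COPY_REQUIRE_REFLINK) < 0)
                fprintf(stderr, "no cheap copy available\n");
        return 0;
}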


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Miklos Szeredi
On Thu, Sep 26, 2013 at 5:34 PM, J. Bruce Fields  wrote:
> On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
>> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:
>> >> A client-side copy will be slower, but I guess it does have the
>> >> advantage that the application can track progress to some degree, and
>> >> abort it fairly quickly without leaving the file in a totally undefined
>> >> state--and both might be useful if the copy's not a simple constant-time
>> >> operation.
>> >
>> > I suppose, but can't the app achieve a nice middle ground by copying the
>> > file in smaller syscalls?  Avoid bulk data motion back to the client,
>> > but still get notification every, I dunno, few hundred meg?
>>
>> Yes.  And if "cp"  could just be switched from a read+write syscall
>> pair to a single splice syscall using the same buffer size.
>
> Will the various magic fs-specific copy operations become inefficient
> when the range copied is too small?

We could treat splice-copy operations just like write operations (can
be buffered, coalesced, synced).

But I'm not sure it's worth the effort; 99% of the use of this
interface will be copying whole files.  And for that perhaps we need a
different API, one which has been discussed some time ago:
asynchronous copyfile() returns immediately with a pollable event
descriptor indicating copy progress, and some way to cancel the copy.
And that can internally rely on ->direct_splice(), with appropriate
algorithms for determining the optimal chunk size.

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
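
A rough sketch of how the asynchronous copyfile() idea above might be
consumed from userspace.  copyfile_async() and its progress record are
hypothetical, and the stub below completes instantly just so the loop is
runnable; the point is only the pollable-descriptor shape of the interface:
poll for progress, read progress records, close to cancel.

#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct copy_progress {
        uint64_t bytes_copied;
        uint64_t bytes_total;
};

/* Stub: a real implementation would kick off the copy and stream progress
 * records into the returned descriptor as work completes. */
static int copyfile_async(const char *src, const char *dst)
{
        int pipefd[2];
        struct copy_progress done = { 1024 * 1024, 1024 * 1024 };

        (void)src; (void)dst;
        if (pipe(pipefd) < 0)
                return -1;
        if (write(pipefd[1], &done, sizeof(done)) != sizeof(done)) {
                close(pipefd[0]);
                close(pipefd[1]);
                return -1;
        }
        close(pipefd[1]);
        return pipefd[0];
}

int main(void)
{
        int fd = copyfile_async("src", "dst");
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        struct copy_progress p = { 0, 1 };

        if (fd < 0)
                return 1;
        while (p.bytes_copied < p.bytes_total &&
               poll(&pfd, 1, -1) > 0 &&
               read(fd, &p, sizeof(p)) == sizeof(p))
                printf("copied %llu of %llu bytes\n",
                       (unsigned long long)p.bytes_copied,
                       (unsigned long long)p.bytes_total);

        close(fd);      /* closing early would be the cancel path */
        return 0;
}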


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Ric Wheeler

On 09/26/2013 11:34 AM, J. Bruce Fields wrote:

On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:

On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:

A client-side copy will be slower, but I guess it does have the
advantage that the application can track progress to some degree, and
abort it fairly quickly without leaving the file in a totally undefined
state--and both might be useful if the copy's not a simple constant-time
operation.

I suppose, but can't the app achieve a nice middle ground by copying the
file in smaller syscalls?  Avoid bulk data motion back to the client,
but still get notification every, I dunno, few hundred meg?

Yes.  And if "cp"  could just be switched from a read+write syscall
pair to a single splice syscall using the same buffer size.

Will the various magic fs-specific copy operations become inefficient
when the range copied is too small?

(Totally naive question, as I have no idea how they really work.)

--b.


I think that is not really possible to tell when we invoke it. How long it takes
is very much dependent on the target device (or file system, etc.). It could be
as simple as a reflink copying in a smallish amount of metadata, or it could fall
back to a full byte-by-byte copy.  Also note that speed is not the only impact here;
some of the mechanisms actually do not consume more space (they just increment
shared data references).


It would probably make more sense to send it off to the target device and have 
it return an error when not appropriate (then the app can fall back to the
old-fashioned copy).


ric




And then
the user would only notice that things got faster in case of server
side copy.  No problems with long blocking times (at least not much
worse than it was).

However "cp" doesn't do reflinking by default, it has a switch for
that.  If we just want "cp" and the like to use splice without fearing
side effects then by default we should try to be as close to
read+write behavior as possible.  No?   That's what I'm really
worrying about when you want to wire up splice to reflink by default.
I do think there should be a flag for that.  And if on the block level
some magic happens, so be it.  It's not the fs developer's worry any
more ;)

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
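
A sketch of the fallback pattern described above: try the offload, and when
the target device or file system reports that it cannot do it, fall back to
an ordinary read/write loop.  try_copy_offload() is a hypothetical stand-in
for the eventual syscall; only the fallback loop uses existing interfaces.

#include <errno.h>
#include <unistd.h>

/* Hypothetical offload call; this stub always reports "not supported". */
static int try_copy_offload(int fd_in, int fd_out, size_t len)
{
        (void)fd_in; (void)fd_out; (void)len;
        errno = EOPNOTSUPP;             /* pretend the target can't do it */
        return -1;
}

/* Plain read/write copy, used when the offload is refused. */
static int copy_fallback(int fd_in, int fd_out)
{
        char buf[64 * 1024];
        ssize_t n;

        while ((n = read(fd_in, buf, sizeof(buf))) > 0) {
                ssize_t off = 0;
                while (off < n) {
                        ssize_t w = write(fd_out, buf + off, n - off);
                        if (w < 0)
                                return -1;
                        off += w;
                }
        }
        return n < 0 ? -1 : 0;
}

int copy_file(int fd_in, int fd_out, size_t len)
{
        if (try_copy_offload(fd_in, fd_out, len) == 0)
                return 0;
        if (errno != EOPNOTSUPP)
                return -1;
        /* Offload not available on this device/fs: do it the old way. */
        return copy_fallback(fd_in, fd_out);
}

int main(void)
{
        /* Behaves like a tiny cat: offload refused, fallback copies stdin. */
        return copy_file(0, 1, 0) ? 1 : 0;
}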


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread J. Bruce Fields
On Thu, Sep 26, 2013 at 10:58:05AM +0200, Miklos Szeredi wrote:
> On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:
> >> A client-side copy will be slower, but I guess it does have the
> >> advantage that the application can track progress to some degree, and
> >> abort it fairly quickly without leaving the file in a totally undefined
> >> state--and both might be useful if the copy's not a simple constant-time
> >> operation.
> >
> > I suppose, but can't the app achieve a nice middle ground by copying the
> > file in smaller syscalls?  Avoid bulk data motion back to the client,
> > but still get notification every, I dunno, few hundred meg?
> 
> Yes.  And if "cp"  could just be switched from a read+write syscall
> pair to a single splice syscall using the same buffer size.

Will the various magic fs-specific copy operations become inefficient
when the range copied is too small?

(Totally naive question, as I have no idea how they really work.)

--b.

> And then
> the user would only notice that things got faster in case of server
> side copy.  No problems with long blocking times (at least not much
> worse than it was).
> 
> However "cp" doesn't do reflinking by default, it has a switch for
> that.  If we just want "cp" and the like to use splice without fearing
> side effects then by default we should try to be as close to
> read+write behavior as possible.  No?   That's what I'm really
> worrying about when you want to wire up splice to reflink by default.
> I do think there should be a flag for that.  And if on the block level
> some magic happens, so be it.  It's not the fs developer's worry any
> more ;)
> 
> Thanks,
> Miklos
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] extending splice for copy offloading

2013-09-26 Thread Miklos Szeredi
On Wed, Sep 25, 2013 at 11:07 PM, Zach Brown  wrote:
>> A client-side copy will be slower, but I guess it does have the
>> advantage that the application can track progress to some degree, and
>> abort it fairly quickly without leaving the file in a totally undefined
>> state--and both might be useful if the copy's not a simple constant-time
>> operation.
>
> I suppose, but can't the app achieve a nice middle ground by copying the
> file in smaller syscalls?  Avoid bulk data motion back to the client,
> but still get notification every, I dunno, few hundred meg?

Yes.  And if "cp"  could just be switched from a read+write syscall
pair to a single splice syscall using the same buffer size.  And then
the user would only notice that things got faster in case of server
side copy.  No problems with long blocking times (at least not much
worse than it was).

However "cp" doesn't do reflinking by default, it has a switch for
that.  If we just want "cp" and the like to use splice without fearing
side effects then by default we should try to be as close to
read+write behavior as possible.  No?   That's what I'm really
worrying about when you want to wire up splice to reflink by default.
I do think there should be a flag for that.  And if on the block level
some magic happens, so be it.  It's not the fs developer's worry any
more ;)

Thanks,
Miklos
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
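
A sketch of the chunked approach: cp keeps its existing loop shape and swaps
each read()+write() pair for one offloaded copy of the same chunk, so
progress reporting and early abort still work between chunks.  splice_copy()
is a hypothetical stand-in for the proposed fd-to-fd splice extension
(today's splice(2) still requires a pipe on one side).

#include <stdio.h>
#include <sys/types.h>

#define CHUNK (128u << 20)     /* same idea as cp's buffer, just bigger */

/* Hypothetical fd-to-fd copy; pretend it offloads and returns bytes copied. */
static ssize_t splice_copy(int fd_in, off_t off, int fd_out, size_t len)
{
        (void)fd_in; (void)fd_out; (void)off;
        return (ssize_t)len;
}

int copy_with_progress(int fd_in, int fd_out, off_t total)
{
        off_t done = 0;

        while (done < total) {
                size_t want = (total - done > (off_t)CHUNK)
                                ? CHUNK : (size_t)(total - done);
                ssize_t n = splice_copy(fd_in, done, fd_out, want);

                if (n <= 0)
                        return -1;      /* caller may fall back to read+write */
                done += n;
                fprintf(stderr, "\rcopied %lld / %lld bytes",
                        (long long)done, (long long)total);
        }
        fputc('\n', stderr);
        return 0;
}

int main(void)
{
        /* With the stub this just walks four chunks and prints progress. */
        return copy_with_progress(0, 1, (off_t)CHUNK * 4) ? 1 : 0;
}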


Re: [RFC] extending splice for copy offloading

2013-09-25 Thread Zach Brown
> A client-side copy will be slower, but I guess it does have the
> advantage that the application can track progress to some degree, and
> abort it fairly quickly without leaving the file in a totally undefined
> state--and both might be useful if the copy's not a simple constant-time
> operation.

I suppose, but can't the app achieve a nice middle ground by copying the
file in smaller syscalls?  Avoid bulk data motion back to the client,
but still get notification every, I dunno, few hundred meg?

> So maybe a way to pass your NONBLOCKy flag to the server would be
> useful?

Maybe, but maybe it also just won't be used in practice.  I'm to the
point where I'd rather we get the stupidest possible thing out there so
that we can learn from actual use of the interface.

- z
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [RFC] extending splice for copy offloading

2013-09-25 Thread J. Bruce Fields
On Wed, Sep 25, 2013 at 12:06:20PM -0700, Zach Brown wrote:
> On Wed, Sep 25, 2013 at 03:02:29PM -0400, Anna Schumaker wrote:
> > On Wed, Sep 25, 2013 at 2:38 PM, Zach Brown  wrote:
> > >
> > > Hrmph.  I had composed a reply to you during Plumbers but.. something
> > > happened to it :).  Here's another try now that I'm back.
> > >
> > >> > Some things to talk about:
> > >> > - I really don't care about the naming here.  If you do, holler.
> > >> > - We might want different flags for file-to-file splicing and 
> > >> > acceleration
> > >>
> > >> Yes, I think "copy" and "reflink" needs to be differentiated.
> > >
> > > I initially agreed but I'm not so sure now.  The problem is that we
> > > can't know whether the acceleration is copying or not.  XCOPY on some
> > > array may well do some shared referencing tricks.  The nfs COPY op can
> > > have a server use btrfs reflink, or ext* and XCOPY, or .. who knows.  At
> > > some point we have to admit that we have no way to determine the
> > > relative durability of writes.  Storage can do a lot to make writes more
> > > or less fragile that we have no visibility of.  SSD FTLs can log a bunch
> > > of unrelated sectors on to one flash failure domain.
> > >
> > > And if such a flag couldn't *actually* guarantee anything for a bunch of
> > > storage topologies, well, let's not bother with it.
> > >
> > > The only flag I'm in favour of now is one that has splice return rather
> > > than falling back to manual page cache reads and writes.  It's more like
> > > O_NONBLOCK than any kind of data durability hint.
> > 
> > For reference, I'm planning to have the NFS server do the fallback
> > when it copies since any local copy will be faster than a read and
> > write over the network.
> 
> Agreed, this is definitely the reasonable thing to do.

A client-side copy will be slower, but I guess it does have the
advantage that the application can track progress to some degree, and
abort it fairly quickly without leaving the file in a totally undefined
state--and both might be useful if the copy's not a simple constant-time
operation.

So maybe a way to pass your NONBLOCKy flag to the server would be
useful?

FWIW the protocol doesn't seem frozen yet, so I assume we could still
add an extra flag field if you think it would be worthwhile.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

