Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Mahesh Siddheshwar
Roch wrote:
 Filesystems might have some blocksize and alignment constraints
 conditioning their ability to loan up buffers (for writes). 
 If that is so, we could use an API to query the FS about
 those values. For a copy-on-write, variable-block-size
 filesystem, that natural blocksize might also depend on the
 vnode being targeted.
Yes. The provider can fail the VOP_REQZCBUF() call if it determines
that it is inefficient to take the zero-copy path. Depending on the
provider implementation, the request may need to be blocksize aligned.
In such cases, the consumer could use the VFSNAME_STATVFS() call to
determine the 'f_bsize' value. But as you note, certain implementations
may have different values for individual files. In such cases, if the
VOP_REQZCBUF() call fails, the consumer then uses the traditional
non zero-copy path.
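
A minimal user-space sketch of the fallback described above. It is only an illustration under stated assumptions: prov_t, prov_reqzcbuf(), prov_retzcbuf(), consumer_io() and the ENOTSUP error value are hypothetical stand-ins, not the real VOP_REQZCBUF()/VOP_RETZCBUF() interfaces, and the provider's decision is reduced to a single flag.

```c
#include <assert.h>
#include <errno.h>

/* Which path the consumer ended up taking (hypothetical type). */
typedef enum { PATH_ZERO_COPY, PATH_COPY } io_path_t;

/* Hypothetical provider state; a real provider is a filesystem. */
typedef struct prov {
	int will_loan;	/* provider judges zero-copy worthwhile */
	int loaned;	/* buffers currently on loan */
} prov_t;

/* Stand-in for VOP_REQZCBUF(): the provider may refuse the loan. */
static int
prov_reqzcbuf(prov_t *p)
{
	if (!p->will_loan)
		return (ENOTSUP);	/* assumed error value */
	p->loaned++;
	return (0);
}

/* Stand-in for VOP_RETZCBUF(): hand the loaned buffer back. */
static void
prov_retzcbuf(prov_t *p)
{
	assert(p->loaned > 0);
	p->loaned--;
}

/* Consumer: try zero-copy first, fall back to the copying path. */
static io_path_t
consumer_io(prov_t *p)
{
	if (prov_reqzcbuf(p) != 0)
		return (PATH_COPY);	/* traditional non zero-copy path */

	/* ... the read or write on the loaned buffer would happen here ... */
	prov_retzcbuf(p);		/* loaned buffers are always returned */
	return (PATH_ZERO_COPY);
}
```

The consumer never needs to know why the request failed; any failure simply selects the copying path.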

An additional API to find such constraints/requirements may
be useful in the future, but is out of scope for this project.  However, the
project team will open an RFE for this issue and put you on the
interest list.
 Do we know if ZFS will ever be able to
 loan up buffers for writes that are not aligned full records ?
   
No, not planned currently. It has to be block size aligned.
Also note that currently, from an implementation perspective,
zero-copy WRITEs are efficient only in the case of network-based
filesystems like NFS over RDMA transports.
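
As a rough user-space illustration of the alignment constraint discussed above, a consumer-side check could look like the sketch below. The helper names fs_block_size() and is_block_aligned() are hypothetical; statvfs() is the POSIX userland analogue of the kernel-side statvfs call mentioned earlier, and, as noted, a per-file record size may differ from the filesystem-wide f_bsize.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Return the filesystem block size for path, or 0 on error. */
static unsigned long
fs_block_size(const char *path)
{
	struct statvfs sv;

	if (statvfs(path, &sv) != 0)
		return (0);
	return (sv.f_bsize);
}

/* True when [off, off + len) covers only whole, aligned blocks. */
static bool
is_block_aligned(uint64_t off, uint64_t len, uint64_t bsize)
{
	if (bsize == 0 || len == 0)
		return (false);
	return (off % bsize == 0 && len % bsize == 0);
}
```

A consumer might use such a check only as a hint; the provider's answer to the buffer request remains authoritative.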

Mahesh
 -r

 Rich.Brown at Sun.COM writes:
   I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
   This case proposes new interfaces to support copy reduction in the I/O path
   especially for file sharing services.
   
   Minor binding is requested.
   
   This times out on Wednesday, 16 September, 2009.
   
   
   Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
   This information is Copyright 2009 Sun Microsystems
   1. Introduction
   1.1. Project/Component Working Name:
   Copy Reduction Interfaces
   1.2. Name of Document Author/Supplier:
   Author:  Mahesh Siddheshwar, Chunli Zhang
   1.3  Date of This Document:
  09 September, 2009
   4. Technical Description
   
== Introduction/Background ==
   
Zero-copy (copy avoidance) is essentially buffer sharing
among multiple modules that pass data between the modules. 
This proposal avoids the data copy in the READ/WRITE path 
of filesystems, by providing a mechanism to share data buffers
between the modules. It is intended to be used by network file
sharing services like NFS, CIFS or others.
   
Although the buffer sharing can be achieved through a few different
solutions, any such solution must work with File Event Monitors
(FEM monitors)[1] installed on the files. The solution must
allow the underlying filesystem to maintain any existing file 
range locking in the filesystem.

The proposed solution provides extensions to the existing VOP
interface to request and return buffers from a filesystem. The 
buffers are then used with existing VOP_READ/VOP_WRITE calls with
minimal changes.
   
   
== Proposed Changes ==
   
VOP Extensions for Zero-Copy Support

   
a. Extended struct uio, xuio_t
   
 The following proposes an extensible uio structure that can be extended
 for multiple purposes.  For example, an immediate extension, xu_zc, is to
 be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned
 zero-copy buffers, as well as to be passed to the existing
 VOP_READ/VOP_WRITE calls for normal read/write operations.  Another example
 of extension, xu_aio, is intended to replace uioa_t for async I/O.
   
 This new structure, xuio_t, contains the following:
   
 - the existing uio structure (embedded) as the first member
 - additional fields to support extensibility
 - a union of all the defined extensions
   
 The following uio_extflag is added to indicate that an uio structure is
 indeed an xuio_t:
   
 #define  UIO_XUIO    0x004   /* Structure is xuio_t */
   
 The following uio_extflag will be removed after uioa_t has been
 converted to xuio_t:
   
 #define  UIO_ASYNC   0x002   /* Structure is xuio_t */
   
 The project team has commitment from the networking team to remove
 the current use of uioa_t and use the proposed extensions (CR 6880095).
   
 The definition of xuio_t is:
   
 typedef struct xuio {
   uio_t xu_uio;  /* Embedded UIO structure */
   
   /* Extended uio fields */
   enum xuio_type xu_type;/* What kind of uio structure? */
   
   union {
   
  /* Async I/O Support */
  struct {
   uint32_t xu_a_state;   /* state of async i/o */
   ssize_t xu_a_mbytes;   /* bytes that have been uioamove()ed */
   uioa_page_t *xu_a_lcur;/* pointer into uioa_locked

Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Mahesh Siddheshwar
Rick Matthews wrote:
 Are there instances where an assigned zero-copy buffer could be orphaned?
No. The consumer must release the buffers through VOP_RETZCBUF().

Mahesh

 If so, should there be a recovery list associated with this addition?
 Perhaps off the designated vnode.

 This comment shouldn't block fast-track approval. Just a question.
 -- 
 Rick

 On 09/ 9/09 04:02 PM, Rich.Brown at Sun.COM wrote:
 I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
 This case proposes new interfaces to support copy reduction in the I/O path
 especially for file sharing services.

 Minor binding is requested.

 This times out on Wednesday, 16 September, 2009.


 Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
 This information is Copyright 2009 Sun Microsystems
 1. Introduction
 1.1. Project/Component Working Name:
  Copy Reduction Interfaces
 1.2. Name of Document Author/Supplier:
  Author:  Mahesh Siddheshwar, Chunli Zhang
 1.3  Date of This Document:
 09 September, 2009
 4. Technical Description

  == Introduction/Background ==

  Zero-copy (copy avoidance) is essentially buffer sharing
  among multiple modules that pass data between the modules.
  This proposal avoids the data copy in the READ/WRITE path
  of filesystems, by providing a mechanism to share data buffers
  between the modules. It is intended to be used by network file
  sharing services like NFS, CIFS or others.

  Although the buffer sharing can be achieved through a few different
  solutions, any such solution must work with File Event Monitors
  (FEM monitors)[1] installed on the files. The solution must
  allow the underlying filesystem to maintain any existing file
  range locking in the filesystem.
  
  The proposed solution provides extensions to the existing VOP
  interface to request and return buffers from a filesystem. The 
  buffers are then used with existing VOP_READ/VOP_WRITE calls with
  minimal changes.


  == Proposed Changes ==

  VOP Extensions for Zero-Copy Support
  

  a. Extended struct uio, xuio_t

   The following proposes an extensible uio structure that can be extended
   for multiple purposes.  For example, an immediate extension, xu_zc, is to
   be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass
   loaned zero-copy buffers, as well as to be passed to the existing
   VOP_READ/VOP_WRITE calls for normal read/write operations.  Another
   example of extension, xu_aio, is intended to replace uioa_t for async I/O.

   This new structure, xuio_t, contains the following:

   - the existing uio structure (embedded) as the first member
   - additional fields to support extensibility
   - a union of all the defined extensions

   The following uio_extflag is added to indicate that an uio structure is
   indeed an xuio_t:

   #define  UIO_XUIO    0x004   /* Structure is xuio_t */

   The following uio_extflag will be removed after uioa_t has been
   converted to xuio_t:

   #define  UIO_ASYNC   0x002   /* Structure is xuio_t */

   The project team has commitment from the networking team to remove
   the current use of uioa_t and use the proposed extensions (CR 6880095).

   The definition of xuio_t is:

   typedef struct xuio {
     uio_t xu_uio;              /* Embedded UIO structure */

     /* Extended uio fields */
     enum xuio_type xu_type;    /* What kind of uio structure? */

     union {

       /* Async I/O Support */
       struct {
         uint32_t xu_a_state;       /* state of async i/o */
         ssize_t xu_a_mbytes;       /* bytes that have been uioamove()ed */
         uioa_page_t *xu_a_lcur;    /* pointer into uioa_locked[] */
         void **xu_a_lppp;          /* pointer into lcur->uioa_ppp[] */
         void *xu_a_hwst[4];        /* opaque hardware state */
         uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov locked pages */
       } xu_aio;

       /* Zero Copy Support */
       struct {
         enum uio_rw xu_zc_rw;      /* the use of the buffer */
         void *xu_zc_priv;          /* fs specific */
       } xu_zc;

     } xu_ext;
   } xuio_t;

   where xu_type is currently defined as:

   typedef enum xuio_type {
     UIOTYPE_ASYNCIO,
     UIOTYPE_ZEROCOPY
   } xuio_type_t;

   New uio extensions can be added by defining a new xuio_type_t, and
   adding a new member to the xu_ext union.
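
Because the uio structure is embedded as the first member, code that receives a plain uio pointer can test the UIO_XUIO flag and then view the same pointer as the extended structure. The sketch below illustrates that idea in user-space C. The mock_* types and the helpers uio_to_xuio() and uio_is_zerocopy() are hypothetical, simplified stand-ins (only the fields used here are modelled), not the real kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

#define	UIO_XUIO	0x004	/* flag value as given in the spec text */

/* Simplified stand-in for uio_t; only the extended flags field. */
typedef struct mock_uio {
	int uio_extflg;
} mock_uio_t;

typedef enum mock_xuio_type {
	UIOTYPE_ASYNCIO,
	UIOTYPE_ZEROCOPY
} mock_xuio_type_t;

/* Simplified stand-in for xuio_t; the async extension is omitted. */
typedef struct mock_xuio {
	mock_uio_t xu_uio;		/* must stay the first member */
	mock_xuio_type_t xu_type;	/* which extension is in use */
	union {
		struct {
			int xu_zc_rw;		/* the use of the buffer */
			void *xu_zc_priv;	/* fs specific */
		} xu_zc;
	} xu_ext;
} mock_xuio_t;

/* Return the enclosing xuio if uio is flagged as one, else NULL. */
static mock_xuio_t *
uio_to_xuio(mock_uio_t *uio)
{
	if ((uio->uio_extflg & UIO_XUIO) == 0)
		return (NULL);
	/* Valid because xu_uio sits at offset 0 of mock_xuio_t. */
	return ((mock_xuio_t *)uio);
}

/* True when uio is an xuio carrying the zero-copy extension. */
static int
uio_is_zerocopy(mock_uio_t *uio)
{
	mock_xuio_t *x = uio_to_xuio(uio);

	return (x != NULL && x->xu_type == UIOTYPE_ZEROCOPY);
}
```

Checking xu_type before touching xu_ext matters because the union members share storage; a new extension would add one enum value and one union member.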

  b. Requesting zero-copy buffers

 #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
 fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

 int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
 caller_context_t *);
  
 This function requests buffers associated with file vp in
 preparation for a subsequent zero copy read or write. The extended
 uio_t -- xuio_t is used to pass the parameters and results. Only the
 following fields of xuio_t

Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-11 Thread Mahesh Siddheshwar
Roland Mainz wrote:
 Mahesh Siddheshwar wrote:
   
 Roland Mainz wrote:
 
 Does it make sense to have a |xu_flags| field here for future
 enhancements?
   
 If future enhancements are needed to extend xuio_t,  a new xuio_type
 can be defined and extended that way.  For extensions not specific
 to xuio, there also exists uio_extflg in the uio_t.  Without a particular
 purpose an additional flag seems unnecessary for zero-copy right now.
 

 Right now... yes. But Unix has a little (IMO) ugly tradition of not
 adding such flag fields and instead swamping the headers with many many
 variations of one interface over time which could be avoided by
 having a flags field as argument (that's a generic issue).
   
[snip]
  b. Requesting zero-copy buffers

 #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
 fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

 int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
 caller_context_t *);

 AFAIK the prototype should have a flags field to allow future
 changes/extensions without adding another VOP_*-hook ...
   
 Roland, if the extensions/changes are for the purpose of
 copy reduction/buffer sharing,  we don't need to add
 additional VOP_* routines. The current xuio_t extension is
 defined just for that.
 

 Erm... the idea of having a flags field in |fop_reqzcbuf()| was to allow
 slight modifications in behaviour - for example in the future there
 could be flags which describe where (in a NUMA system) the buffer memory
 resides (e.g. near the calling thread, near a point which is optimal for
 all consumers, or near the hardware which fills the buffer etc.),
 whether it should be in the L2 cache or not etc. etc.
   
 
Roland, I agree with Nico on the drawbacks of adding undefined flags for 
future use.
 
You make two suggestions:
 1) addition of a flag to xuio_t for future use
 
 2) addition of a flag to VOP_REQZCBUF() for future use.
 
It's the project team's opinion that these flags are not needed for
this spec, for the following reasons:
 
For 1) the existence of uio_extflg in uio_t and the possibility of
extending xuio_t through additional xuio_type's, make an
additional 'xu_flags' flag an overhead that can be avoided.
 
For 2) we don't have a specific purpose for the flag right now.
If there is a need for additional flags or arguments in future,
the VOP routine can be easily extended, as has been done in
the recent past with several extensions for existing VOP routines
(PSARC 2007/244, 2007/218, 2007/227, 2009/387).
The existence of strong type checking for vnode/vfs operations
through PSARC 2007/124 makes it easier to catch an interface mismatch.
 
Thanks,
Mahesh


Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-11 Thread Mahesh Siddheshwar
johansen at sun.com wrote:
 On Wed, Sep 09, 2009 at 04:02:15PM -0500, Rich.Brown at sun.com wrote:
   
  == Introduction/Background ==

  Zero-copy (copy avoidance) is essentially buffer sharing
  among multiple modules that pass data between the modules. 
  This proposal avoids the data copy in the READ/WRITE path 
  of filesystems, by providing a mechanism to share data buffers
  between the modules. It is intended to be used by network file
  sharing services like NFS, CIFS or others.

  Although the buffer sharing can be achieved through a few different
  solutions, any such solution must work with File Event Monitors
  (FEM monitors)[1] installed on the files. The solution must
  allow the underlying filesystem to maintain any existing file 
  range locking in the filesystem.
  
  The proposed solution provides extensions to the existing VOP
  interface to request and return buffers from a filesystem. The 
  buffers are then used with existing VOP_READ/VOP_WRITE calls with
  minimal changes.


  == Proposed Changes ==
 
 ...

   
  == Using the New VOP Interfaces for Zero-copy ==

  VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
  VOP_READ() or VOP_WRITE() to implement zero-copy read or write. 

  a. Read

 In a normal read, the consumer allocates the data buffer and passes it to
 VOP_READ().  The provider initiates the I/O, and copies the data from its
 own cache buffer to the consumer supplied buffer.

 To avoid the copy (initiating a zero-copy read), the consumer
 first calls VOP_REQZCBUF() to inform the provider to prepare to
 loan out its cache buffer.  It then calls VOP_READ().  After the
 call returns, the consumer has direct access to the cache buffer
 loaned out by the provider.  After processing the data, the
 consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
 the provider.
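
 The read sequence above can be modelled with a small user-space sketch. Everything in it is a hypothetical stand-in (rd_file_t, rd_req_t, rd_reqzcbuf(), rd_read(), rd_retzcbuf()); it only counts copies to contrast the normal path with the loaned-buffer path and does not reflect how any real provider is implemented.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider: one cache buffer plus bookkeeping. */
typedef struct rd_file {
	char cache[16];		/* provider's cache buffer */
	int loans;		/* outstanding loaned buffers */
	int copies;		/* data copies performed so far */
} rd_file_t;

/* Hypothetical consumer request. */
typedef struct rd_req {
	int zc;			/* set once a loan was requested */
	char own[16];		/* consumer buffer for the normal path */
	const char *data;	/* where the consumer finds the data */
} rd_req_t;

/* Stand-in for VOP_REQZCBUF(): prepare the next read for a loan. */
static int
rd_reqzcbuf(rd_file_t *f, rd_req_t *r)
{
	(void) f;
	r->zc = 1;
	return (0);
}

/* Stand-in for VOP_READ(): loan the cache buffer or copy out of it. */
static void
rd_read(rd_file_t *f, rd_req_t *r)
{
	if (r->zc) {
		r->data = f->cache;	/* direct access, no copy */
		f->loans++;
	} else {
		memcpy(r->own, f->cache, sizeof (r->own));
		f->copies++;
		r->data = r->own;
	}
}

/* Stand-in for VOP_RETZCBUF(): return the loaned cache buffer. */
static void
rd_retzcbuf(rd_file_t *f, rd_req_t *r)
{
	assert(r->zc && f->loans > 0);
	f->loans--;
	r->data = NULL;
	r->zc = 0;
}
```

 In the zero-copy case the consumer reads through r->data until it calls rd_retzcbuf(), after which the pointer must no longer be used.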
 
 ...

   
  b. Write

 In a normal write, the consumer allocates the data buffer, loads the data,
 and passes the buffer to VOP_WRITE().  The provider copies the data from
 the consumer supplied buffer to its own cache buffer, and starts the I/O.

 To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
 grab a cache buffer from the provider.  It loads the data directly to
 the loaned cache buffer, and calls VOP_WRITE().  After the call returns,
 the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
 the provider.
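
 A matching user-space sketch for the write sequence, again built from hypothetical stand-ins (wr_file_t, wr_reqzcbuf(), wr_write(), wr_retzcbuf()). It assumes the provider can recognise its own loaned buffer by address, which is a simplification made for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical provider: one cache buffer plus bookkeeping. */
typedef struct wr_file {
	char cache[16];		/* provider's cache buffer */
	int loans;		/* outstanding loaned buffers */
	int copies;		/* data copies performed so far */
} wr_file_t;

/* Stand-in for VOP_REQZCBUF(): loan the cache buffer for filling. */
static char *
wr_reqzcbuf(wr_file_t *f)
{
	f->loans++;
	return (f->cache);
}

/* Stand-in for VOP_WRITE(): copy only if src is not the loaned buffer. */
static void
wr_write(wr_file_t *f, const char *src, size_t len)
{
	assert(len <= sizeof (f->cache));
	if (src != f->cache) {
		memcpy(f->cache, src, len);	/* normal path: one copy */
		f->copies++;
	}
	/* ... the provider would start the I/O here ... */
}

/* Stand-in for VOP_RETZCBUF(): return the loaned cache buffer. */
static void
wr_retzcbuf(wr_file_t *f)
{
	assert(f->loans > 0);
	f->loans--;
}
```

 The consumer fills the loaned buffer in place (for a file server, typically straight from the network), so the provider-side copy disappears.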
 

 Just for clarification: this interface only affects pages mapped in the
 kernel, correct?  I'm trying to understand if this is just for reducing
 the number of in-kernel copies, or if this is a userland - kernel
 zero-copy interface.

   
That is correct. This interface is to prevent in-kernel copies and allow
buffer sharing between kernel modules (that can be used by in-kernel
services like NFS or CIFS). The spec does not define any userland - kernel
zero-copy interface.

Thanks,
Mahesh
 Thanks,

 -j
   



Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-10 Thread Mahesh Siddheshwar
Hi Roland,

Roland Mainz wrote:
[snip]
 How do you handle sparse files, e.g. files with one or more holes ?

   
Sparse files are not handled any differently in VOP_READ/VOP_WRITE calls
when using the zero-copy interface. Modules that want to seek/skip holes
can use the _FIO_SEEK_DATA/_FIO_SEEK_HOLE commands
of VOP_IOCTL to do so.
  == Proposed Changes ==

  VOP Extensions for Zero-Copy Support
  

  a. Extended struct uio, xuio_t
 
 [snip]
   
   The project team has commitment from the networking team to remove
   the current use of uioa_t and use the proposed extensions (CR 6880095).

   The definition of xuio_t is:

   typedef struct xuio {
 uio_t xu_uio;   /* Embedded UIO structure */

 /* Extended uio fields */
 enum xuio_type xu_type; /* What kind of uio structure? */

 union {

 /* Async I/O Support */
 struct {
 uint32_t xu_a_state;/* state of async i/o */
 uint32_t xu_a_state;/* state of async i/o */
 ssize_t xu_a_mbytes;/* bytes that have been uioamove()ed */
 uioa_page_t *xu_a_lcur; /* pointer into uioa_locked[] */
 void **xu_a_lppp;   /* pointer into lcur->uioa_ppp[] */
 void *xu_a_hwst[4]; /* opaque hardware state */
 uioa_page_t xu_a_locked[UIOA_IOV_MAX];   /* Per iov locked pages */
 } xu_aio;

 /* Zero Copy Support */
 struct {
 enum uio_rw xu_zc_rw;   /* the use of the buffer */
 void *xu_zc_priv;   /* fs specific */
 

 Does it make sense to have a |xu_flags| field here for future
 enhancements?
   
If future enhancements are needed to extend xuio_t,  a new xuio_type
can be defined and extended that way.  For extensions not specific
to xuio, there also exists uio_extflg in the uio_t.  Without a particular
purpose an additional flag seems unnecessary for zero-copy right now.

Please note that there is a typo in the spec, in the definition
for struct xu_aio. The below line is printed twice:

uint32_t xu_a_state;/* state of async i/o */

A corrected final spec will be posted in the case directory.

 } xu_zc;

 } xu_ext;
   } xuio_t;

   where xu_type is currently defined as:

   typedef enum xuio_type {
 UIOTYPE_ASYNCIO,
 UIOTYPE_ZEROCOPY
   } xuio_type_t;

   New uio extensions can be added by defining a new xuio_type_t, and adding a
   new member to the xu_ext union.

  b. Requesting zero-copy buffers

 #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
 fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

 int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
 caller_context_t *);
 

 AFAIK the prototype should have a flags field to allow future
 changes/extensions without adding another VOP_*-hook ...

   
Roland, if the extensions/changes are for the purpose of
copy reduction/buffer sharing,  we don't need to add
additional VOP_* routines. The current xuio_t extension is
defined just for that.

Thanks,
Mahesh 
 

 Bye,
 Roland

   



Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-09 Thread Mahesh Siddheshwar
Garrett D'Amore wrote:
 I've not had time to go over all this yet, but do we really believe 
 this kind of change is fast track appropriate?  I have a feeling that 
 this is a significant enough core change with implications for a 
 variety of project teams, that maybe this one ought to be a full 
 case.  I'd be a bit uncomfortable allowing this one to just time out 
 with a single +1, which is the normal rule for fast tracks.

 Am I alone in this particular concern?

 Are there any implications for unbundled 3rd party filesystems?
Not unless the 3rd party filesystem wants to support this optional
feature. This is covered in section (d) of the spec. The intermediate fop
routines handle it correctly.

Regards,
Mahesh
- Garrett


 Rich.Brown at Sun.COM wrote:
 I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
 This case proposes new interfaces to support copy reduction in the I/O path
 especially for file sharing services.

 Minor binding is requested.

 This times out on Wednesday, 16 September, 2009.


 Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
 This information is Copyright 2009 Sun Microsystems
 1. Introduction
 1.1. Project/Component Working Name:
  Copy Reduction Interfaces
 1.2. Name of Document Author/Supplier:
  Author:  Mahesh Siddheshwar, Chunli Zhang
 1.3  Date of This Document:
 09 September, 2009
 4. Technical Description

  == Introduction/Background ==

  Zero-copy (copy avoidance) is essentially buffer sharing
  among multiple modules that pass data between the modules.
  This proposal avoids the data copy in the READ/WRITE path
  of filesystems, by providing a mechanism to share data buffers
  between the modules. It is intended to be used by network file
  sharing services like NFS, CIFS or others.

  Although the buffer sharing can be achieved through a few different
  solutions, any such solution must work with File Event Monitors
  (FEM monitors)[1] installed on the files. The solution must
  allow the underlying filesystem to maintain any existing file
  range locking in the filesystem.
  
  The proposed solution provides extensions to the existing VOP
  interface to request and return buffers from a filesystem. The 
  buffers are then used with existing VOP_READ/VOP_WRITE calls with
  minimal changes.


  == Proposed Changes ==

  VOP Extensions for Zero-Copy Support
  

  a. Extended struct uio, xuio_t

   The following proposes an extensible uio structure that can be extended
   for multiple purposes.  For example, an immediate extension, xu_zc, is to
   be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass
   loaned zero-copy buffers, as well as to be passed to the existing
   VOP_READ/VOP_WRITE calls for normal read/write operations.  Another
   example of extension, xu_aio, is intended to replace uioa_t for async I/O.

   This new structure, xuio_t, contains the following:

   - the existing uio structure (embedded) as the first member
   - additional fields to support extensibility
   - a union of all the defined extensions

   The following uio_extflag is added to indicate that an uio structure is
   indeed an xuio_t:

   #define  UIO_XUIO    0x004   /* Structure is xuio_t */

   The following uio_extflag will be removed after uioa_t has been
   converted to xuio_t:

   #define  UIO_ASYNC   0x002   /* Structure is xuio_t */

   The project team has commitment from the networking team to remove
   the current use of uioa_t and use the proposed extensions (CR 6880095).

   The definition of xuio_t is:

   typedef struct xuio {
     uio_t xu_uio;              /* Embedded UIO structure */

     /* Extended uio fields */
     enum xuio_type xu_type;    /* What kind of uio structure? */

     union {

       /* Async I/O Support */
       struct {
         uint32_t xu_a_state;       /* state of async i/o */
         ssize_t xu_a_mbytes;       /* bytes that have been uioamove()ed */
         uioa_page_t *xu_a_lcur;    /* pointer into uioa_locked[] */
         void **xu_a_lppp;          /* pointer into lcur->uioa_ppp[] */
         void *xu_a_hwst[4];        /* opaque hardware state */
         uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov locked pages */
       } xu_aio;

       /* Zero Copy Support */
       struct {
         enum uio_rw xu_zc_rw;      /* the use of the buffer */
         void *xu_zc_priv;          /* fs specific */
       } xu_zc;

     } xu_ext;
   } xuio_t;

   where xu_type is currently defined as:

   typedef enum xuio_type {
     UIOTYPE_ASYNCIO,
     UIOTYPE_ZEROCOPY
   } xuio_type_t;

   New uio extensions can be added by defining a new xuio_type_t, and
   adding a new member to the xu_ext union.

  b. Requesting zero-copy buffers

 #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
 fop_reqzcbuf(vp, rwflag