Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Roch

Filesystems might have some blocksize and alignment constraints
conditioning their ability to loan out buffers (for writes).
If that is so, we could use an API to query the FS about
those values. For a copy-on-write, variable-block-size
filesystem, that natural blocksize might also depend on the
vnode being targeted. Do we know if ZFS will ever be able to
loan out buffers for writes that are not aligned full records?

-r

Rich.Brown at Sun.COM writes:
  I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
  This case proposes new interfaces to support copy reduction in the I/O path
  especially for file sharing services.
  
  Minor binding is requested.
  
  This times out on Wednesday, 16 September, 2009.
  
  
  Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
  This information is Copyright 2009 Sun Microsystems
  1. Introduction
  1.1. Project/Component Working Name:
Copy Reduction Interfaces
  1.2. Name of Document Author/Supplier:
Author:  Mahesh Siddheshwar, Chunli Zhang
  1.3  Date of This Document:
   09 September, 2009
  4. Technical Description
  
   == Introduction/Background ==
  
   Zero-copy (copy avoidance) is essentially buffer sharing
   among multiple modules that pass data between the modules. 
   This proposal avoids the data copy in the READ/WRITE path 
   of filesystems, by providing a mechanism to share data buffers
   between the modules. It is intended to be used by network file
   sharing services like NFS, CIFS or others.
  
   Although the buffer sharing can be achieved through a few different
   solutions, any such solution must work with File Event Monitors
   (FEM monitors)[1] installed on the files. The solution must
   allow the underlying filesystem to maintain any existing file 
   range locking in the filesystem.
   
   The proposed solution provides extensions to the existing VOP
   interface to request and return buffers from a filesystem. The 
   buffers are then used with existing VOP_READ/VOP_WRITE calls with
   minimal changes.
  
  
   == Proposed Changes ==
  
   VOP Extensions for Zero-Copy Support
   
  
   a. Extended struct uio, xuio_t
  
The following proposes an extensible uio structure that can be extended for
multiple purposes.  For example, an immediate extension, xu_zc, is to be
used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned
zero-copy buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE
calls for normal read/write operations.  Another example of extension,
xu_aio, is intended to replace uioa_t for async I/O.
  
This new structure, xuio_t, contains the following:
  
- the existing uio structure (embedded) as the first member
- additional fields to support extensibility
- a union of all the defined extensions
  
The following uio_extflag is added to indicate that a uio structure is
indeed an xuio_t:
  
#define UIO_XUIO    0x004   /* Structure is xuio_t */
  
The following uio_extflag will be removed after uioa_t has been converted
to xuio_t:
  
#define UIO_ASYNC   0x002   /* Structure is xuio_t */
  
The project team has commitment from the networking team to remove
the current use of uioa_t and use the proposed extensions (CR 6880095).
  
The definition of xuio_t is:
  
typedef struct xuio {
  uio_t xu_uio;            /* Embedded UIO structure */

  /* Extended uio fields */
  enum xuio_type xu_type;  /* What kind of uio structure? */

  union {

    /* Async I/O Support */
    struct {
      uint32_t xu_a_state;     /* state of async i/o */
      ssize_t xu_a_mbytes;     /* bytes that have been uioamove()ed */
      uioa_page_t *xu_a_lcur;  /* pointer into uioa_locked[] */
      void **xu_a_lppp;        /* pointer into lcur->uioa_ppp[] */
      void *xu_a_hwst[4];      /* opaque hardware state */
      uioa_page_t xu_a_locked[UIOA_IOV_MAX];  /* Per iov locked pages */
    } xu_aio;

    /* Zero Copy Support */
    struct {
      enum uio_rw xu_zc_rw;    /* the use of the buffer */
      void *xu_zc_priv;        /* fs specific */
    } xu_zc;

  } xu_ext;
} xuio_t;
  
where xu_type is currently defined as:
  
typedef enum xuio_type {
  UIOTYPE_ASYNCIO,
  UIOTYPE_ZEROCOPY
} xuio_type_t;
  
New uio extensions can be added by defining a new xuio_type_t, and adding a
new member to the xu_ext union.
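
To illustrate how the UIO_XUIO flag and the "embedded as the first member"
layout work together, here is a minimal user-space sketch. The types are
trimmed mock-ups rather than the real kernel headers, the flag field name
uio_extflg is an assumption, and the helper uio_to_xuio() is hypothetical;
nothing here is taken from the project's implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Mock-up of uio_t; only the fields this sketch needs (assumed names). */
typedef struct uio {
	int	uio_extflg;	/* extended flags, e.g. UIO_XUIO */
	long	uio_resid;	/* residual byte count */
} uio_t;

#define	UIO_XUIO	0x004	/* Structure is xuio_t */

enum uio_rw { UIO_READ, UIO_WRITE };
typedef enum xuio_type { UIOTYPE_ASYNCIO, UIOTYPE_ZEROCOPY } xuio_type_t;

/* Trimmed xuio_t: the async I/O extension is left out for brevity. */
typedef struct xuio {
	uio_t xu_uio;			/* must remain the first member */
	enum xuio_type xu_type;
	union {
		struct {
			enum uio_rw xu_zc_rw;
			void *xu_zc_priv;
		} xu_zc;
	} xu_ext;
} xuio_t;

/*
 * Hypothetical helper: return the enclosing xuio_t if the flag says
 * there is one, or NULL for a plain uio_t.
 */
static xuio_t *
uio_to_xuio(uio_t *uio)
{
	assert(uio != NULL);
	if ((uio->uio_extflg & UIO_XUIO) == 0)
		return (NULL);
	/* Valid only because xu_uio sits at offset 0 of xuio_t. */
	return ((xuio_t *)uio);
}
```

A VOP_READ/VOP_WRITE implementation that receives a uio_t pointer could use
a check of this kind to decide whether the extended fields may be touched.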
  
   b. Requesting zero-copy buffers
  
  #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
  fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
  
  int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
   caller_context_t *);
   
  This function requests buffers 

Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Rick Matthews
Are there instances where an assigned zero-copy buffer could be orphaned?
If so, should there be a recovery list associated with this addition?
Perhaps off the designated vnode.

This comment shouldn't block fast-track approval. Just a question.
--
Rick


Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Mahesh Siddheshwar
Roch wrote:
 Filesystems might have some blocksize and alignment constraints
 conditioning their ability to loan up buffers (for writes). 
 If that is so, we could use an API to query the FS about
 those values. For a copy-on-write, variable-block-size
 filesystem, that natural blocksize might also depend on the
 vnode being targeted.
Yes. The provider can fail the VOP_REQZCBUF() call if it determines
that it is inefficient to take the zero-copy path. Depending on the
provider implementation, this could depend on blocksize alignment. In such
cases, the consumer could use the VFSNAME_STATVFS() call to determine the
'f_bsize' value.  But as you note, certain implementations may have
different values for individual files. In such cases, if the VOP_REQZCBUF()
call fails, the consumer falls back to the traditional non-zero-copy path.

An additional API to find such constraints/requirements may
be useful in the future, but is out of scope for this project.  However, the
project team will open an RFE for this issue and put you on the
interest list.
 Do we know if ZFS will ever be able to
 loan out buffers for writes that are not aligned full records?
   
No, not planned currently. It has to be block size aligned.
Also note that currently, from an implementation perspective,
zero-copy WRITEs are efficient only in the case of network-based
filesystems like NFS over RDMA transports.
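
To make the alignment constraint concrete, a consumer-side pre-check might
look like the sketch below. The helper name zc_write_is_aligned() is
hypothetical, and the block size is assumed to be a power of two obtained
elsewhere (for example from 'f_bsize'); the spec itself defines no such
query interface, and a provider may still fail VOP_REQZCBUF() for other
reasons.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical pre-check: true only if both the file offset and the
 * length are multiples of blksz, i.e. the write covers whole blocks.
 * blksz is assumed to be a non-zero power of two.
 */
static bool
zc_write_is_aligned(uint64_t offset, uint64_t length, uint64_t blksz)
{
	assert(blksz != 0 && (blksz & (blksz - 1)) == 0);
	if (length == 0)
		return (false);
	/* Any low bit set in either value means a partial block. */
	return (((offset | length) & (blksz - 1)) == 0);
}
```

A consumer could skip the VOP_REQZCBUF() attempt when this returns false and
go straight to the traditional copy path.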

Mahesh
 -r


Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Mahesh Siddheshwar
Rick Matthews wrote:
 Are there instances where an assigned zero-copy buffer could be orphaned?
No. The consumer must release the buffers through VOP_RETZCBUF().

Mahesh

 If so, should there be a recovery list associated with this addition?
 Perhaps off the designated vnode.

 This comment shouldn't block fast-track approval. Just a question.
 -- 
 Rick


Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Roch

My issues have been resolved. Thanks Mahesh.

-r

Rich.Brown at Sun.COM writes:

  
   b. Requesting zero-copy buffers
  
  #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
  fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)
  
  int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
   caller_context_t *);
   
  This function requests buffers associated with file vp in preparation
  for a subsequent zero copy read or write. The extended uio_t, xuio_t,
  is used to pass the parameters and results. Only the following fields
  of xuio_t are relevant to this call.
   
  uiozcp->xu_uio.uio_resid: used by the caller to specify the total length
   of the buffer.
  
  uiozcp->xu_uio.uio_loffset: 

Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-16 Thread Rich Brown
This case was approved at today's PSARC meeting.

I put an updated final_spec.txt in the case directory
which corrects a typo that Mahesh found.

On behalf of the team, thank you for your time and assistance
on this case.

Rich


Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-11 Thread johan...@sun.com
On Wed, Sep 09, 2009 at 04:02:15PM -0500, Rich.Brown at sun.com wrote:
  == Introduction/Background ==
 
  Zero-copy (copy avoidance) is essentially buffer sharing
  among multiple modules that pass data between the modules. 
  This proposal avoids the data copy in the READ/WRITE path 
  of filesystems, by providing a mechanism to share data buffers
  between the modules. It is intended to be used by network file
  sharing services like NFS, CIFS or others.
 
  Although the buffer sharing can be achieved through a few different
  solutions, any such solution must work with File Event Monitors
  (FEM monitors)[1] installed on the files. The solution must
  allow the underlying filesystem to maintain any existing file 
  range locking in the filesystem.
  
  The proposed solution provides extensions to the existing VOP
  interface to request and return buffers from a filesystem. The 
  buffers are then used with existing VOP_READ/VOP_WRITE calls with
  minimal changes.
 
 
  == Proposed Changes ==
...

  == Using the New VOP Interfaces for Zero-copy ==
 
  VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with
  VOP_READ() or VOP_WRITE() to implement zero-copy read or write. 
 
  a. Read
 
 In a normal read, the consumer allocates the data buffer and passes it to
 VOP_READ().  The provider initiates the I/O, and copies the data from its
 own cache buffer to the consumer supplied buffer.
 
 To avoid the copy (initiating a zero-copy read), the consumer
 first calls VOP_REQZCBUF() to inform the provider to prepare to
 loan out its cache buffer.  It then calls VOP_READ().  After the
 call returns, the consumer has direct access to the cache buffer
 loaned out by the provider.  After processing the data, the
 consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
 the provider.
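
 The read sequence above can be sketched in user space as follows. This
 is only an illustration of the call ordering: the mock_* types and
 functions are hypothetical stand-ins for the vnode, the xuio_t, and the
 VOP_REQZCBUF()/VOP_READ()/VOP_RETZCBUF() calls, and do not reflect the
 real signatures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mock provider: one cache buffer that can be loaned out. */
typedef struct mock_vnode {
	char	cache[16];	/* provider's cache buffer */
	int	loaned;		/* number of outstanding loans */
} mock_vnode_t;

/* Mock of the extended uio: carries the loaned buffer back to the caller. */
typedef struct mock_xuio {
	const char *buf;	/* loaned buffer, set by the read */
	size_t	len;
	int	zc_ready;	/* set by the request call */
} mock_xuio_t;

/* Stand-in for VOP_REQZCBUF(): provider prepares to loan its buffer. */
static int
mock_reqzcbuf(mock_vnode_t *vp, mock_xuio_t *x)
{
	(void) vp;
	x->zc_ready = 1;
	return (0);
}

/* Stand-in for VOP_READ(): hands out the cache buffer instead of copying. */
static int
mock_read(mock_vnode_t *vp, mock_xuio_t *x)
{
	if (!x->zc_ready)
		return (-1);
	x->buf = vp->cache;
	x->len = sizeof (vp->cache);
	vp->loaned++;
	return (0);
}

/* Stand-in for VOP_RETZCBUF(): give the loaned buffer back. */
static void
mock_retzcbuf(mock_vnode_t *vp, mock_xuio_t *x)
{
	if (x->buf != NULL)
		vp->loaned--;
	x->buf = NULL;
	x->len = 0;
	x->zc_ready = 0;
}

/* Consumer: request, read, process the data in place, then return. */
static int
zc_read_checksum(mock_vnode_t *vp, unsigned *sum)
{
	mock_xuio_t x = { NULL, 0, 0 };
	int err;

	if ((err = mock_reqzcbuf(vp, &x)) != 0)
		return (err);
	if ((err = mock_read(vp, &x)) == 0) {
		*sum = 0;
		for (size_t i = 0; i < x.len; i++)
			*sum += (unsigned char)x.buf[i];
	}
	mock_retzcbuf(vp, &x);	/* always return what was requested */
	return (err);
}
```

 The point of the sketch is that the consumer touches the provider's
 buffer directly and that every successful request is paired with a
 return.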
...

  b. Write
 
 In a normal write, the consumer allocates the data buffer, loads the data,
 and passes the buffer to VOP_WRITE().  The provider copies the data from
 the consumer supplied buffer to its own cache buffer, and starts the I/O.
 
 To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to
 grab a cache buffer from the provider.  It loads the data directly to
 the loaned cache buffer, and calls VOP_WRITE().  After the call returns,
 the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to
 the provider.
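
 A comparable sketch for the write sequence, including the fallback to a
 normal write when the request fails (as described elsewhere in this
 thread), is shown below. All mock_* names and MOCK_BLKSZ are
 hypothetical, and the "full block only" loan policy is just an assumed
 provider behaviour for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define	MOCK_BLKSZ	8	/* assumed provider block size */

typedef struct mock_vnode {
	char	cache[MOCK_BLKSZ];	/* provider's cache buffer */
	int	loaned;			/* outstanding loans */
	int	copies;			/* writes that took the copy path */
} mock_vnode_t;

/* Stand-in for VOP_REQZCBUF() for a write: loans only full blocks. */
static char *
mock_reqzcbuf(mock_vnode_t *vp, size_t len)
{
	if (len != MOCK_BLKSZ)
		return (NULL);
	vp->loaned++;
	return (vp->cache);
}

/* Stand-in for VOP_RETZCBUF(). */
static void
mock_retzcbuf(mock_vnode_t *vp)
{
	vp->loaned--;
}

/* Stand-in for a normal VOP_WRITE(): provider copies from the caller. */
static void
mock_write_copy(mock_vnode_t *vp, const char *src, size_t len)
{
	memcpy(vp->cache, src, len);
	vp->copies++;
}

/* Stand-in for a zero-copy VOP_WRITE(): data is already in the cache. */
static void
mock_write_zc(mock_vnode_t *vp)
{
	(void) vp;
}

static int
consumer_write(mock_vnode_t *vp, const char *data, size_t len)
{
	char *zcbuf;

	if (len > MOCK_BLKSZ)
		return (EINVAL);
	zcbuf = mock_reqzcbuf(vp, len);
	if (zcbuf == NULL) {
		/* Request failed: fall back to the traditional path. */
		mock_write_copy(vp, data, len);
		return (0);
	}
	/* Stands in for loading data (e.g. from the network) in place. */
	memcpy(zcbuf, data, len);
	mock_write_zc(vp);
	mock_retzcbuf(vp);
	return (0);
}
```

 In a real consumer the data would arrive in the loaned buffer directly;
 the memcpy() here only simulates that load.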

Just for clarification: this interface only affects pages mapped in the
kernel, correct?  I'm trying to understand if this is just for reducing
the number of in-kernel copies, or if this is a userland - kernel
zero-copy interface.


Thanks,

-j


Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-11 Thread Mahesh Siddheshwar
johansen at sun.com wrote:
 Just for clarification: this interface only affects pages mapped in the
 kernel, correct?  I'm trying to understand if this is just for reducing
 the number of in-kernel copies, or if this is a userland - kernel
 zero-copy interface.

   
That is correct. This interface is to prevent in-kernel copies and allow
buffer sharing between kernel modules (that can be used by in-kernel
services like NFS or CIFS). The spec does not define any userland - kernel
zero-copy interface.

Thanks,
Mahesh
 Thanks,

 -j
   



Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-10 Thread Rich Brown

On 09/09/09 17:08, Garrett D'Amore wrote:
 I've not had time to go over all this yet, but do we really believe this 
 kind of change is fast track appropriate?  I have a feeling that this is 
 a significant enough core change with implications for a variety of 
 project teams, that maybe this one ought to be a full case.  I'd be a  
 bit uncomfortable allowing this one to just time out with a single +1, 
 which is the normal rule for fast tracks.
 
 Am I alone in this particular concern?
 
 Are there any implications for unbundled 3rd party filesystems?
 
- Garrett
 

Garrett,

Perhaps this will help.

As the sponsor, I asked myself the same question.  This seemed similar
in scope to another fast-track:  PSARC/2007/315 (Extensible Attribute
Interfaces).  One could argue that this proposal is smaller in scope
and impact since there are no user level interfaces involved.

Here's what I considered during my review of the project (which might
help make the proposal a bit more digestible):


- The proposal extends the uio_t structure in a way that includes
   (and cleans up) the existing uioa_t (asynchronous uio) structure
   and adds a zero copy feature.  With all due respect to the original
   implementors of uioa_t, this proposal seemed like a cleaner, more
   flexible solution to extending the functionality of the uio_t structure.

   This is roughly equivalent to the way that the vattr_t structure was
   extended with the xvattr_t structure in PSARC/2007/315.

- This does not change the way that the existing VOP_READ/VOP_WRITE
   implementations work, in the same way that the addition of xvattr_t
   didn't change the way existing VOP_GETATTR/VOP_SETATTR implementations
   work.

   Only those file systems that explicitly choose to participate
   in the extensions need to change their VOP_READ and VOP_WRITE
   implementations to handle the xuio_t structure.  The current use
   of the uio_t structure will still work.

- Also, only those file systems that explicitly choose to participate
   in the zero-copy feature need to implement VOP_REQZCBUF and
   VOP_RETZCBUF.  For file systems that do not implement these
   interfaces, the two operations automatically default to fs_nosys()
   without any effort by the file system implementor, thanks to the
   vnode/vfs operation registration mechanism (introduced in
   PSARC/2001/679).
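
To make the last point concrete, here is a rough userland model of
"register what you implement, everything else defaults to a not-supported
routine".  This is only a sketch: op_def_t, op_table_t, model_make_ops()
and model_nosys() are hypothetical stand-ins invented for illustration,
not the real vnode/vfs registration interfaces, and returning ENOSYS from
the default routine is an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not the real registration API. */
typedef int (*op_fn_t)(void *vp);

enum { OP_READ, OP_WRITE, OP_REQZCBUF, OP_RETZCBUF, OP_COUNT };

typedef struct op_def {
	int	od_op;		/* which operation is being supplied */
	op_fn_t	od_fn;		/* the file system's implementation */
} op_def_t;

typedef struct op_table {
	op_fn_t	ot_ops[OP_COUNT];
} op_table_t;

/* Default for any operation the file system did not register. */
int
model_nosys(void *vp)
{
	(void) vp;
	return (ENOSYS);
}

/* Fill every slot with the default, then overlay what was supplied. */
void
model_make_ops(op_table_t *tbl, const op_def_t *defs, size_t ndefs)
{
	for (size_t i = 0; i < OP_COUNT; i++)
		tbl->ot_ops[i] = model_nosys;

	for (size_t i = 0; i < ndefs; i++) {
		assert(defs[i].od_op >= 0 && defs[i].od_op < OP_COUNT);
		tbl->ot_ops[defs[i].od_op] = defs[i].od_fn;
	}
}

/* A file system that only knows about plain read and write. */
int
legacy_read(void *vp)
{
	(void) vp;
	return (0);
}

int
legacy_write(void *vp)
{
	(void) vp;
	return (0);
}
```

A file system that registers only legacy_read and legacy_write ends up
with the zero-copy slots pointing at the default routine, so a caller
probing for zero-copy support gets an error back and can fall back to the
ordinary copy path.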

Just to be clear, unbundled (Sun and 3rd party) file systems won't
notice any difference.

 Rich


Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-09 Thread rich.br...@sun.com
I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang.
This case proposes new interfaces to support copy reduction in the I/O path
especially for file sharing services.

Minor binding is requested.

This times out on Wednesday, 16 September, 2009.


Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI
This information is Copyright 2009 Sun Microsystems
1. Introduction
1.1. Project/Component Working Name:
 Copy Reduction Interfaces
1.2. Name of Document Author/Supplier:
 Author:  Mahesh Siddheshwar, Chunli Zhang
1.3  Date of This Document:
09 September, 2009
4. Technical Description

 == Introduction/Background ==

 Zero-copy (copy avoidance) is essentially buffer sharing
 among multiple modules that pass data between the modules. 
 This proposal avoids the data copy in the READ/WRITE path 
 of filesystems, by providing a mechanism to share data buffers
 between the modules. It is intended to be used by network file
 sharing services like NFS, CIFS or others.

 Although the buffer sharing can be achieved through a few different
 solutions, any such solution must work with File Event Monitors
 (FEM monitors)[1] installed on the files. The solution must
 allow the underlying filesystem to maintain any existing file 
 range locking in the filesystem.
 
 The proposed solution provides extensions to the existing VOP
 interface to request and return buffers from a filesystem. The 
 buffers are then used with existing VOP_READ/VOP_WRITE calls with
 minimal changes.


 == Proposed Changes ==

 VOP Extensions for Zero-Copy Support
 

 a. Extended struct uio, xuio_t

  The following proposes an extensible uio structure that can be extended for
  multiple purposes.  For example, an immediate extension, xu_zc, is to be 
  used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned
  zero-copy buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE
  calls for normal read/write operations.  Another example of extension,
  xu_aio, is intended to replace uioa_t for async I/O.

  This new structure, xuio_t, contains the following:

  - the existing uio structure (embedded) as the first member
  - additional fields to support extensibility
  - a union of all the defined extensions

  The following uio_extflag is added to indicate that an uio structure is
  indeed an xuio_t:

  #define   UIO_XUIO    0x004   /* Structure is xuio_t */

  The following uio_extflag will be removed after uioa_t has been converted 
  to xuio_t:

  #define   UIO_ASYNC   0x002   /* Structure is xuio_t */

  The project team has a commitment from the networking team to remove
  the current use of uioa_t and use the proposed extensions (CR 6880095).

  The definition of xuio_t is:

  typedef struct xuio {
      uio_t xu_uio;               /* Embedded UIO structure */

      /* Extended uio fields */
      enum xuio_type xu_type;     /* What kind of uio structure? */

      union {

          /* Async I/O Support */
          struct {
              uint32_t xu_a_state;    /* state of async i/o */
              ssize_t xu_a_mbytes;    /* bytes that have been uioamove()ed */
              uioa_page_t *xu_a_lcur; /* pointer into uioa_locked[] */
              void **xu_a_lppp;       /* pointer into lcur->uioa_ppp[] */
              void *xu_a_hwst[4];     /* opaque hardware state */
              uioa_page_t xu_a_locked[UIOA_IOV_MAX];  /* Per iov locked pages */
          } xu_aio;

          /* Zero Copy Support */
          struct {
              enum uio_rw xu_zc_rw;   /* the use of the buffer */
              void *xu_zc_priv;       /* fs specific */
          } xu_zc;

      } xu_ext;
  } xuio_t;

  where xu_type is currently defined as:

  typedef enum xuio_type {
      UIOTYPE_ASYNCIO,
      UIOTYPE_ZEROCOPY
  } xuio_type_t;

  New uio extensions can be added by defining a new xuio_type_t, and adding a
  new member to the xu_ext union.
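
  As an illustration of how a consumer might tell the two structures
  apart, the sketch below checks the UIO_XUIO flag before treating a
  uio_t pointer as an xuio_t.  It is a userland model only: the uio_t
  here is cut down to three fields, the flag word is named uio_extflag
  after the wording above (an assumption about the real field name), and
  the helpers uio_to_xuio() and uio_is_zerocopy() are hypothetical, not
  part of the proposal.

```c
#include <assert.h>
#include <stddef.h>

#define	UIO_XUIO	0x004	/* Structure is xuio_t (value from the proposal) */

/* Cut-down stand-in for uio_t; only what this sketch needs. */
typedef struct uio {
	long long	uio_loffset;	/* file offset */
	long		uio_resid;	/* residual count */
	unsigned int	uio_extflag;	/* extended flags; field name assumed */
} uio_t;

enum uio_rw { UIO_READ, UIO_WRITE };

typedef enum xuio_type {
	UIOTYPE_ASYNCIO,
	UIOTYPE_ZEROCOPY
} xuio_type_t;

/* Only the zero-copy arm of the union is modelled here. */
typedef struct xuio {
	uio_t		xu_uio;		/* Embedded UIO structure */
	enum xuio_type	xu_type;	/* What kind of uio structure? */
	union {
		struct {
			enum uio_rw	xu_zc_rw;	/* the use of the buffer */
			void		*xu_zc_priv;	/* fs specific */
		} xu_zc;
	} xu_ext;
} xuio_t;

/* The cast below is only valid because the uio_t is the first member. */
static_assert(offsetof(xuio_t, xu_uio) == 0, "uio_t must come first");

/* Hypothetical helper: return the enclosing xuio_t, or NULL for a plain uio_t. */
xuio_t *
uio_to_xuio(uio_t *uio)
{
	if ((uio->uio_extflag & UIO_XUIO) == 0)
		return (NULL);
	return ((xuio_t *)uio);
}

/* Hypothetical helper: true only for an xuio_t carrying the zero-copy extension. */
int
uio_is_zerocopy(uio_t *uio)
{
	xuio_t *xuio = uio_to_xuio(uio);

	return (xuio != NULL && xuio->xu_type == UIOTYPE_ZEROCOPY);
}
```

  Code that never sets UIO_XUIO keeps passing a plain uio_t and is never
  cast, which is why embedding the uio_t as the first member matters.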

 b. Requesting zero-copy buffers

  #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
      fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

  int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
      caller_context_t *);
 
  This function requests buffers associated with file vp in preparation for a
  subsequent zero-copy read or write. The extended uio_t, xuio_t, is used
  to pass the parameters and results. Only the following fields of xuio_t are
  relevant to this call.
 
  uiozcp->xu_uio.uio_resid: Used by the caller to specify the total length
   of the buffer.

  uiozcp->xu_uio.uio_loffset: Used by the caller to indicate the file offset
   it would like the buffers to be associated with. A value of -1
   indicates that the provider returns buffers that are not associated
   with a particular offset.  These are defined to be anonymous buffers.
 Anonymous buffers may be used 

Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-09 Thread Garrett D'Amore
I've not had time to go over all this yet, but do we really believe this 
kind of change is fast track appropriate?  I have a feeling that this is 
a significant enough core change with implications for a variety of 
project teams, that maybe this one ought to be a full case.  I'd be a  
bit uncomfortable allowing this one to just time out with a single +1, 
which is the normal rule for fast tracks.

Am I alone in this particular concern?

Are there any implications for unbundled 3rd party filesystems?

- Garrett



Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]

2009-09-09 Thread Mahesh Siddheshwar
Garrett D'Amore wrote:
 I've not had time to go over all this yet, but do we really believe 
 this kind of change is fast track appropriate?  I have a feeling that 
 this is a significant enough core change with implications for a 
 variety of project teams, that maybe this one ought to be a full 
 case.  I'd be a  bit uncomfortable allowing this one to just time out 
 with a single +1, which is the normal rule for fast tracks.

 Am I alone in this particular concern?

 Are there any implications for unbundled 3rd party filesystems?

Not unless the 3rd party filesystem wants to support this optional
feature. This is covered in section (d) of the spec. The intermediate fop
routines handle it correctly.

Regards,
Mahesh
- Garrett

