Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Filesystems might have some blocksize and alignment constraints conditioning their ability to loan up buffers (for writes). If that is so, we could use an API to query the FS about those values. For a copy on write variable block size filesystem, that natural blocksize might also depend on the vnode being targeted.

Do we know if ZFS will ever be able to loan up buffers for writes that are not aligned full records?

-r
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Are there instances where an assigned zero-copy buffer could be orphaned? If so, should there be a recovery list associated with this addition? Perhaps off the designated vnode.

This comment shouldn't block fast-track approval. Just a question.

-- Rick
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Roch wrote:
> Filesystems might have some blocksize and alignment constraints conditioning their ability to loan up buffers (for writes). If that is so, we could use an API to query the FS about those values. For a copy on write variable block size filesystem, that natural blocksize might also depend on the vnode being targeted.

Yes. The provider can fail the VOP_REQZCBUF() call if it determines that it is inefficient to take the zero-copy path. Depending on the provider implementation, the requirement could be blocksize alignment. In such cases, the consumer could use the VFSNAME_STATVFS() call to determine the 'f_bsize' value. But as you note, certain implementations may have different values for individual files. In such cases, if VOP_REQZCBUF() fails, the consumer then uses the traditional non zero-copy path.

An additional API to find such constraints/requirements may be useful in future, but is out of scope for this project. However, the project team will open an RFE for this issue and put you on the interest list.

> Do we know if ZFS will ever be able to loan up buffers for writes that are not aligned full records?

No, not planned currently. It has to be block size aligned. Also note that currently, from an implementation perspective, zero-copy WRITEs are efficient only in the case of network-based filesystems like NFS over RDMA transports.

Mahesh
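The fallback described in Mahesh's reply (try VOP_REQZCBUF(), and use the traditional copying path if it fails) can be sketched in user space. This is only an illustration under stated assumptions, not kernel code: `mock_file_t`, `mock_reqzcbuf()`, `choose_write_path()` and the exact alignment rule are hypothetical stand-ins for a provider's policy. Only the control flow comes from the thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a file whose provider has a natural block size. */
typedef struct mock_file {
	size_t	mf_bsize;	/* natural block size for this file */
} mock_file_t;

typedef enum { PATH_ZERO_COPY, PATH_COPY } io_path_t;

/*
 * Hypothetical stand-in for VOP_REQZCBUF() on the write side.  The mock
 * provider refuses the loan (returns an errno value) unless both the offset
 * and the length are multiples of the file's block size.
 */
int
mock_reqzcbuf(const mock_file_t *fp, long long off, size_t len)
{
	if (fp->mf_bsize == 0 || len == 0 || off < 0)
		return (EINVAL);
	if ((unsigned long long)off % fp->mf_bsize != 0 ||
	    len % fp->mf_bsize != 0)
		return (EINVAL);
	return (0);
}

/* Consumer side: try the zero-copy path first, otherwise copy as before. */
io_path_t
choose_write_path(const mock_file_t *fp, long long off, size_t len)
{
	assert(fp != NULL);
	if (mock_reqzcbuf(fp, off, len) == 0)
		return (PATH_ZERO_COPY);
	return (PATH_COPY);	/* traditional non zero-copy path */
}
```

With a 128K block size in this mock, an aligned full-record write takes the zero-copy path, while a write at offset 4096 or of length 1000 falls back to the copying path. The consumer never needs to know the constraint in advance, which is why a separate query API could be deferred.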
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Rick Matthews wrote:
> Are there instances where an assigned zero-copy buffer could be orphaned?

No. The consumer must release the buffers through VOP_RETZCBUF().

Mahesh

> If so, should there be a recovery list associated with this addition? Perhaps off the designated vnode.
>
> This comment shouldn't block fast-track approval. Just a question.
>
> -- Rick
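The rule that the consumer must release every loaned buffer through VOP_RETZCBUF() can be illustrated with a small user-space sketch. Everything named here (`mock_provider_t`, `mock_reqzcbuf()`, `mock_write()`, `mock_retzcbuf()`, `consumer_write()`) is a hypothetical stand-in, and the loan counter is an assumption added only to make the pairing visible. The point is that the return call sits on the single exit path after a successful request, including when the write fails.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical provider-side bookkeeping: number of buffers on loan. */
typedef struct mock_provider {
	int	mp_outstanding;
} mock_provider_t;

/* Stand-in for VOP_REQZCBUF() on the write side: the loan begins. */
int
mock_reqzcbuf(mock_provider_t *mp)
{
	mp->mp_outstanding++;
	return (0);
}

/* Stand-in for VOP_RETZCBUF(): the loan ends. */
void
mock_retzcbuf(mock_provider_t *mp)
{
	assert(mp->mp_outstanding > 0);
	mp->mp_outstanding--;
}

/* Stand-in for VOP_WRITE(); fail_write simulates an I/O error. */
int
mock_write(int fail_write)
{
	return (fail_write ? -1 : 0);
}

/*
 * Consumer write path.  Once the request has succeeded, every outcome of
 * the write reaches the same return call, so no loan is left behind.
 */
int
consumer_write(mock_provider_t *mp, int fail_write)
{
	int err;

	if ((err = mock_reqzcbuf(mp)) != 0)
		return (err);		/* nothing was loaned */

	err = mock_write(fail_write);

	mock_retzcbuf(mp);		/* always paired with the request */
	return (err);
}
```

In this mock the outstanding-loan count is zero after both a successful and a failed write.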
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
My issues have been resolved. Thanks Mahesh.

-r
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
This case was approved at today's PSARC meeting. I put an updated final_spec.txt in the case directory which corrects a typo that Mahesh found. On behalf of the team, thank you for your time and assistance on this case. Rich
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
On Wed, Sep 09, 2009 at 04:02:15PM -0500, Rich.Brown at sun.com wrote:
> == Introduction/Background ==
>
> Zero-copy (copy avoidance) is essentially buffer sharing among multiple modules that pass data between the modules. This proposal avoids the data copy in the READ/WRITE path of filesystems, by providing a mechanism to share data buffers between the modules. It is intended to be used by network file sharing services like NFS, CIFS or others.
>
> Although the buffer sharing can be achieved through a few different solutions, any such solution must work with File Event Monitors (FEM monitors)[1] installed on the files. The solution must allow the underlying filesystem to maintain any existing file range locking in the filesystem.
>
> The proposed solution provides extensions to the existing VOP interface to request and return buffers from a filesystem. The buffers are then used with existing VOP_READ/VOP_WRITE calls with minimal changes.
>
> == Proposed Changes ==
> ...
>
> == Using the New VOP Interfaces for Zero-copy ==
>
> VOP_REQZCBUF()/VOP_RETZCBUF() are expected to be used in conjunction with VOP_READ() or VOP_WRITE() to implement zero-copy read or write.
>
> a. Read
>
> In a normal read, the consumer allocates the data buffer and passes it to VOP_READ(). The provider initiates the I/O, and copies the data from its own cache buffer to the consumer supplied buffer.
>
> To avoid the copy (initiating a zero-copy read), the consumer first calls VOP_REQZCBUF() to inform the provider to prepare to loan out its cache buffer. It then calls VOP_READ(). After the call returns, the consumer has direct access to the cache buffer loaned out by the provider. After processing the data, the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to the provider.
> ...
>
> b. Write
>
> In a normal write, the consumer allocates the data buffer, loads the data, and passes the buffer to VOP_WRITE(). The provider copies the data from the consumer supplied buffer to its own cache buffer, and starts the I/O.
>
> To initiate a zero-copy write, the consumer first calls VOP_REQZCBUF() to grab a cache buffer from the provider. It loads the data directly to the loaned cache buffer, and calls VOP_WRITE(). After the call returns, the consumer calls VOP_RETZCBUF() to return the loaned cache buffer to the provider.

Just for clarification: this interface only affects pages mapped in the kernel, correct? I'm trying to understand if this is just for reducing the number of in-kernel copies, or if this is a userland - kernel zero-copy interface.

Thanks,

-j
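The zero-copy read sequence quoted from the proposal (request, read, use the data in place, return) can be sketched as a user-space mock. This is an illustration only: `mock_provider_t`, `mock_xuio_t` and the three `mock_*` functions are hypothetical and far simpler than the proposed VOP_REQZCBUF()/VOP_READ()/VOP_RETZCBUF() and xuio_t, and the loan counter is an assumption added to make the hand-back visible.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define	MOCK_CACHE_SIZE	64

/* Hypothetical provider: one cache buffer plus a count of outstanding loans. */
typedef struct mock_provider {
	char	mp_cache[MOCK_CACHE_SIZE];
	int	mp_loans;
} mock_provider_t;

/* Hypothetical, much-reduced stand-in for the zero-copy part of xuio_t. */
typedef struct mock_xuio {
	const char	*mx_base;	/* loaned buffer, set by the read */
	size_t		mx_len;		/* valid bytes in the loaned buffer */
	int		mx_prepared;	/* set by the request step */
} mock_xuio_t;

/* Stand-in for VOP_REQZCBUF(): tell the provider a loan is coming. */
int
mock_reqzcbuf(mock_provider_t *mp, mock_xuio_t *xp)
{
	(void) mp;
	xp->mx_base = NULL;
	xp->mx_len = 0;
	xp->mx_prepared = 1;
	return (0);
}

/* Stand-in for VOP_READ(): hand out the cache buffer instead of copying. */
int
mock_read(mock_provider_t *mp, mock_xuio_t *xp)
{
	if (!xp->mx_prepared)
		return (-1);
	xp->mx_base = mp->mp_cache;
	xp->mx_len = strlen(mp->mp_cache);
	mp->mp_loans++;
	return (0);
}

/* Stand-in for VOP_RETZCBUF(): give the loaned buffer back. */
void
mock_retzcbuf(mock_provider_t *mp, mock_xuio_t *xp)
{
	if (xp->mx_base != NULL)
		mp->mp_loans--;
	xp->mx_base = NULL;
	xp->mx_len = 0;
	xp->mx_prepared = 0;
}

/*
 * Consumer: request, read, look at the data in place, return the buffer.
 * Returns the number of bytes seen, or -1 on failure.  *seen records the
 * address that was read from; it is only for inspection by the caller and
 * must not be dereferenced after the buffer has been returned.
 */
long
zero_copy_read(mock_provider_t *mp, const char **seen)
{
	mock_xuio_t x;
	long n;

	assert(mp != NULL && seen != NULL);
	if (mock_reqzcbuf(mp, &x) != 0)
		return (-1);
	if (mock_read(mp, &x) != 0) {
		mock_retzcbuf(mp, &x);
		return (-1);
	}
	*seen = x.mx_base;	/* provider memory, no consumer copy */
	n = (long)x.mx_len;
	mock_retzcbuf(mp, &x);
	return (n);
}
```

In the mock, the address the consumer reads from is the provider's own cache buffer, and the loan count is back to zero once the call returns.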
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
johansen at sun.com wrote:
> Just for clarification: this interface only affects pages mapped in the kernel, correct? I'm trying to understand if this is just for reducing the number of in-kernel copies, or if this is a userland - kernel zero-copy interface.

That is correct. This interface is to prevent in-kernel copies and allow buffer sharing between kernel modules (that can be used by in-kernel services like NFS or CIFS). The spec does not define any userland - kernel zero-copy interface.

Thanks,
Mahesh

> Thanks,
>
> -j
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
On 09/09/09 17:08, Garrett D'Amore wrote:
> I've not had time to go over all this yet, but do we really believe this kind of change is fast track appropriate? I have a feeling that this is a significant enough core change with implications for a variety of project teams, that maybe this one ought to be a full case.
>
> I'd be a bit uncomfortable allowing this one to just time out with a single +1, which is the normal rule for fast tracks.
>
> Am I alone in this particular concern? Are there any implications for unbundled 3rd party filesystems?
>
> - Garrett

Garrett,

Perhaps this will help. As the sponsor, I asked myself the same question. This seemed similar in scope to another fast-track: PSARC/2007/315 (Extensible Attribute Interfaces). One could argue that this proposal is smaller in scope and impact since there are no user level interfaces involved.

Here's what I considered during my review of the project (which might help make the proposal a bit more digestible):

- The proposal extends the uio_t structure in a way that includes (and cleans up) the existing uioa_t (asynchronous uio) structure and adds a zero copy feature. With all due respect to the original implementors of uioa_t, this proposal seemed like a cleaner, more flexible solution to extending the functionality of the uio_t structure. This is roughly equivalent to the way that the vattr_t structure was extended with the xvattr_t structure in PSARC/2007/315.

- This does not change the way the existing VOP_READ/VOP_WRITE implementations work, just as the addition of xvattr_t didn't change the way existing VOP_GETATTR/VOP_SETATTR implementations work. Only those file systems that explicitly choose to participate in the extensions need to change their VOP_READ and VOP_WRITE implementations to handle the xuio_t structure. The current use of the uio_t structure will still work.

- Also, only those file systems that explicitly choose to participate in the zero-copy feature need to implement VOP_REQZCBUF and VOP_RETZCBUF. File systems that do not implement these interfaces will automatically default to fs_nosys() without any effort by the file system implementor, thanks to the vnode/vfs operation registration mechanism (introduced in PSARC/2001/679).

Just to be clear, unbundled (Sun and 3rd party) file systems won't notice any difference.

Rich
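Two points in Rich's reply can be sketched in user space: a participating read routine recognizes the extended structure through a flag and a cast that relies on the uio being the first member, and an operation that a file system never registers falls back to a "not supported" default. This is an illustration under stated assumptions, not the kernel implementation: `mock_uio_t`, `mock_xuio_t`, `mock_vnodeops_t`, `fs_read()`, `mock_nosys()` and `mock_register()` are hypothetical. The 0x004 flag value is taken from the proposal, and returning ENOSYS mirrors fs_nosys().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	MOCK_UIO_XUIO	0x004	/* same value as the proposed UIO_XUIO */

/* Reduced stand-ins for uio_t and xuio_t; only the layout idea is kept. */
typedef struct mock_uio {
	int	uio_extflg;
	long	uio_resid;
} mock_uio_t;

typedef struct mock_xuio {
	mock_uio_t	xu_uio;		/* must stay the first member */
	void		*xu_zc_priv;
} mock_xuio_t;

/*
 * Read routine of a participating file system.  Returns 1 when it was
 * handed an extended structure and 0 for a plain one, which it would
 * process exactly as before.
 */
int
fs_read(mock_uio_t *uiop)
{
	assert(offsetof(mock_xuio_t, xu_uio) == 0);
	if (uiop->uio_extflg & MOCK_UIO_XUIO) {
		/* Valid because xu_uio is the first member of mock_xuio_t. */
		mock_xuio_t *xp = (mock_xuio_t *)uiop;
		(void) xp->xu_zc_priv;
		return (1);
	}
	return (0);
}

typedef int (*reqzcbuf_fn)(void);

/* Hypothetical operations vector with a single optional entry. */
typedef struct mock_vnodeops {
	reqzcbuf_fn	vop_reqzcbuf;
} mock_vnodeops_t;

/* Default operation: report "not supported". */
int
mock_nosys(void)
{
	return (ENOSYS);
}

/* Registration fills in any operation the file system left unset. */
void
mock_register(mock_vnodeops_t *ops)
{
	if (ops->vop_reqzcbuf == NULL)
		ops->vop_reqzcbuf = mock_nosys;
}
```

A file system that sets nothing ends up with the default entry, so a consumer's request simply fails and it continues on the ordinary copying path.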
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
I'm sponsoring this case on behalf of Mahesh Siddheshwar and Chunli Zhang. This case proposes new interfaces to support copy reduction in the I/O path especially for file sharing services. Minor binding is requested. This times out on Wednesday, 16 September, 2009. Template Version: @(#)sac_nextcase 1.68 02/23/09 SMI This information is Copyright 2009 Sun Microsystems 1. Introduction 1.1. Project/Component Working Name: Copy Reduction Interfaces 1.2. Name of Document Author/Supplier: Author: Mahesh Siddheshwar, Chunli Zhang 1.3 Date of This Document: 09 September, 2009 4. Technical Description == Introduction/Background == Zero-copy (copy avoidance) is essentially buffer sharing among multiple modules that pass data between the modules. This proposal avoids the data copy in the READ/WRITE path of filesystems, by providing a mechanism to share data buffers between the modules. It is intended to be used by network file sharing services like NFS, CIFS or others. Although the buffer sharing can be achieved through a few different solutions, any such solution must work with File Event Monitors (FEM monitors)[1] installed on the files. The solution must allow the underlying filesystem to maintain any existing file range locking in the filesystem. The proposed solution provides extensions to the existing VOP interface to request and return buffers from a filesystem. The buffers are then used with existing VOP_READ/VOP_WRITE calls with minimal changes. == Proposed Changes == VOP Extensions for Zero-Copy Support a. Extended struct uio, xuio_t The following proposes an extensible uio structure that can be extended for multiple purposes. For example, an immediate extension, xu_zc, is to be used by the proposed VOP_REQZCBUF/VOP_RETZCBUF interfaces to pass loaned zero-copy buffers, as well as to be passed to the existing VOP_READ/VOP_WRITE calls for normal read/write operations. Another example of extension, xu_aio, is intended to replace uioa_t for async I/O. 
This new structure, xuio_t, contains the following:

    - the existing uio structure (embedded) as the first member
    - additional fields to support extensibility
    - a union of all the defined extensions

The following uio_extflag is added to indicate that a uio structure is indeed an xuio_t:

    #define UIO_XUIO    0x004   /* Structure is xuio_t */

The following uio_extflag will be removed after uioa_t has been converted to xuio_t:

    #define UIO_ASYNC   0x002   /* Structure is uioa_t */

The project team has a commitment from the networking team to remove the current use of uioa_t and use the proposed extensions (CR 6880095).

The definition of xuio_t is:

    typedef struct xuio {
        uio_t xu_uio;               /* Embedded UIO structure */

        /* Extended uio fields */
        enum xuio_type xu_type;     /* What kind of uio structure? */

        union {
            /* Async I/O Support */
            struct {
                uint32_t xu_a_state;    /* state of async i/o */
                ssize_t xu_a_mbytes;    /* bytes that have been uioamove()ed */
                uioa_page_t *xu_a_lcur; /* pointer into uioa_locked[] */
                void **xu_a_lppp;       /* pointer into lcur->uioa_ppp[] */
                void *xu_a_hwst[4];     /* opaque hardware state */
                uioa_page_t xu_a_locked[UIOA_IOV_MAX]; /* Per iov locked pages */
            } xu_aio;

            /* Zero Copy Support */
            struct {
                enum uio_rw xu_zc_rw;   /* the use of the buffer */
                void *xu_zc_priv;       /* fs specific */
            } xu_zc;
        } xu_ext;
    } xuio_t;

where xu_type is currently defined as:

    typedef enum xuio_type {
        UIOTYPE_ASYNCIO,
        UIOTYPE_ZEROCOPY
    } xuio_type_t;

New uio extensions can be added by defining a new xuio_type_t value and adding a new member to the xu_ext union.

b. Requesting zero-copy buffers

    #define VOP_REQZCBUF(vp, rwflag, uiozcp, cr, ct) \
            fop_reqzcbuf(vp, rwflag, uiozcp, cr, ct)

    int fop_reqzcbuf(vnode_t *, enum uio_rw, xuio_t *, cred_t *,
        caller_context_t *);

This function requests buffers associated with file vp in preparation for a subsequent zero copy read or write. The extended uio_t, xuio_t, is used to pass the parameters and results.
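To make the role of xuio_t more concrete, here is a minimal user-space sketch, not the actual kernel code. The uio_t below is a cut-down stand-in with only three fields (the field name uio_extflg is assumed), the async extension is omitted, and the helpers xuio_init_zc() and uio_to_zc_xuio() are hypothetical names invented for illustration. It shows how a consumer might mark a structure as an xuio_t, and how code that only receives a uio_t pointer could check UIO_XUIO before treating it as the extended form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel types; NOT the real definitions. */
enum uio_rw { UIO_READ, UIO_WRITE };

#define UIO_XUIO 0x004 /* Structure is xuio_t */

typedef struct uio {
	int64_t  uio_loffset; /* file offset */
	uint16_t uio_extflg;  /* extended flags (assumed field name) */
	int64_t  uio_resid;   /* residual count */
} uio_t;

typedef enum xuio_type { UIOTYPE_ASYNCIO, UIOTYPE_ZEROCOPY } xuio_type_t;

typedef struct xuio {
	uio_t xu_uio;           /* embedded uio; must remain the first member */
	enum xuio_type xu_type; /* which extension is in use */
	union {
		struct {
			enum uio_rw xu_zc_rw; /* the use of the buffer */
			void *xu_zc_priv;     /* fs specific */
		} xu_zc;
	} xu_ext;
} xuio_t;

/* Hypothetical helper: fill in an xuio_t for a zero-copy buffer request. */
static void
xuio_init_zc(xuio_t *xuio, enum uio_rw rw, int64_t off, int64_t len)
{
	xuio->xu_uio.uio_loffset = off;
	xuio->xu_uio.uio_resid = len;
	xuio->xu_uio.uio_extflg = UIO_XUIO; /* advertise the extended form */
	xuio->xu_type = UIOTYPE_ZEROCOPY;
	xuio->xu_ext.xu_zc.xu_zc_rw = rw;
	xuio->xu_ext.xu_zc.xu_zc_priv = NULL;
}

/*
 * Hypothetical helper: return the enclosing xuio_t if this uio is a
 * zero-copy xuio_t, or NULL for a plain uio_t or another extension.
 * The cast is valid only because xu_uio is the first member.
 */
static xuio_t *
uio_to_zc_xuio(uio_t *uio)
{
	xuio_t *xuio;

	if ((uio->uio_extflg & UIO_XUIO) == 0)
		return (NULL);
	xuio = (xuio_t *)uio;
	if (xuio->xu_type != UIOTYPE_ZEROCOPY)
		return (NULL);
	return (xuio);
}
```

In this sketch a caller would run xuio_init_zc() on its own xuio_t and hand &xuio.xu_uio to code that expects a uio_t; that code can then use uio_to_zc_xuio() to decide whether the zero-copy fields are present.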
Only the following fields of xuio_t are relevant to this call.

    uiozcp->xu_uio.uio_resid: used by the caller to specify the total length of the buffer.

    uiozcp->xu_uio.uio_loffset: used by the caller to indicate the file offset it would like the buffers to be associated with. A value of -1 indicates that the provider returns buffers that are not associated with a particular offset. These are defined to be anonymous buffers. Anonymous buffers may be used
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
I've not had time to go over all this yet, but do we really believe this kind of change is fast track appropriate? I have a feeling that this is a significant enough core change with implications for a variety of project teams, that maybe this one ought to be a full case. I'd be a bit uncomfortable allowing this one to just time out with a single +1, which is the normal rule for fast tracks.

Am I alone in this particular concern?

Are there any implications for unbundled 3rd party filesystems?

- Garrett
Copy Reduction Interfaces [PSARC/2009/478 FastTrack timeout 09/16/2009]
Garrett D'Amore wrote:
> I've not had time to go over all this yet, but do we really believe this kind of change is fast track appropriate? I have a feeling that this is a significant enough core change with implications for a variety of project teams, that maybe this one ought to be a full case. I'd be a bit uncomfortable allowing this one to just time out with a single +1, which is the normal rule for fast tracks.
>
> Am I alone in this particular concern?
>
> Are there any implications for unbundled 3rd party filesystems?

Not unless the 3rd party filesystem wants to support this optional feature. This is covered in section (d) of the spec. The intermediate fop routines handle it correctly.

Regards,
Mahesh