RE: A question regarding "multiple SGL"
> > Hi Robert,
>
> Hey Robert, Christoph,
>
> > please explain your use case that isn't handled. The one and only
> > reason to set MSDBD to 1 is to make the code a lot simpler given
> > that there is no real use case for supporting more.
> >
> > RDMA uses memory registrations to register large and possibly
> > discontiguous data regions for a single rkey, aka a single SGL
> > descriptor in NVMe terms. There would be two reasons to support
> > multiple SGL descriptors: a) to support a larger I/O size than
> > supported by a single MR, or b) to support a data region format
> > not mappable by a single MR.
> >
> > iSER only supports a single rkey (or stag in IETF terminology) and
> > has been doing fine on a) and mostly fine on b). There are a few
> > possible data layouts not supported by the traditional IB/iWarp FR
> > WRs, but the limit is in fact exactly the same as imposed by the
> > NVMe PRPs used for PCIe NVMe devices, so the Linux block layer has
> > support to not generate them. Also, with modern Mellanox IB/RoCE
> > hardware we can actually register completely arbitrary SGLs. iSER
> > already supports this registration mode with a trivial code
> > addition, but for NVMe we haven't had a pressing need yet.
>
> Good explanation :)
>
> The IO transfer size is a bit more pressing on some devices (e.g.
> cxgb3/4) where the number of pages per MR can indeed be lower than
> a reasonable transfer size (Steve can correct me if I'm wrong).

Currently, cxgb4 supports 128KB REG_MR operations on a host with a 4K
page size, via a max MR page list depth of 32. Soon it will be bumped
up from 32 to 128 and life will be better...

> However, if there is a real demand for this we'll happily accept
> patches :)
>
> Just a note: having this feature in place can bring unexpected
> behavior depending on how we implement it:
> - If we can use multiple MRs per IO (for multiple SGLs), we can
>   either prepare for the worst case and allocate enough MRs to
>   satisfy the various IO patterns. This will be much heavier in
>   terms of resource allocation and can limit the scalability of the
>   host driver.
> - Or we can implement a shared MR pool with a reasonable number of
>   MRs. In this case each IO can consume one or more MRs at the
>   expense of other IOs, and we may need to requeue the IO later,
>   when enough MRs become available to satisfy it. This can yield
>   unexpected performance gaps for some workloads.

I would like to see the storage protocols deal with lack of resources
for the worst case. This allows much smaller resource usage for both
MRs and SQ slots, at the expense of adding flow-control logic to deal
with a lack of available MRs and/or SQ slots to process the next IO.
I think it can be implemented efficiently such that, when in
flow-control mode, the code drives new IO submissions off of SQ
completions, which free up SQ slots and most likely MRs from the QP's
MR pool.

Steve.
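(For scale, a max MR page-list depth of 32 with 4K pages is 32 x 4KB =
128KB per MR, which is where the figure above comes from.)

A minimal userspace model of the flow-control scheme Steve describes,
for illustration only: the resource counts, the one-MR/one-SQE-per-IO
assumption, and every name here are made up, not kernel APIs. An IO
submits only if a send-queue slot and an MR are free; otherwise it is
deferred and retried from the completion path, where resources are
returned.

#include <stdbool.h>
#include <stdio.h>

#define SQ_DEPTH 4      /* send queue slots on the QP */
#define MR_POOL  2      /* MRs in the QP's pool */

static int free_sqes = SQ_DEPTH;
static int free_mrs  = MR_POOL;
static int deferred;    /* IOs waiting for resources */

/* Claim one IO's worth of resources (single-threaded toy);
 * fails rather than oversubscribing the QP. */
static bool claim(int sqes, int mrs)
{
        if (free_sqes < sqes || free_mrs < mrs)
                return false;
        free_sqes -= sqes;
        free_mrs  -= mrs;
        return true;
}

static void submit_io(void)
{
        if (!claim(1, 1))
                deferred++;     /* flow-control mode: requeue */
}

/* A send completion frees an SQ slot and (typically) the IO's MR;
 * new submissions are driven off this path, as Steve suggests. */
static void on_send_completion(void)
{
        free_sqes += 1;
        free_mrs  += 1;
        while (deferred > 0 && claim(1, 1))
                deferred--;
}

int main(void)
{
        for (int i = 0; i < 6; i++)     /* oversubscribe the pool */
                submit_io();
        printf("in flight: %d, deferred: %d\n",
               MR_POOL - free_mrs, deferred);
        on_send_completion();           /* one IO completes */
        printf("in flight: %d, deferred: %d\n",
               MR_POOL - free_mrs, deferred);
        return 0;
}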
Re: A question regarding "multiple SGL"
> Hi Robert,

Hey Robert, Christoph,

> please explain your use case that isn't handled. The one and only
> reason to set MSDBD to 1 is to make the code a lot simpler given
> that there is no real use case for supporting more.
>
> RDMA uses memory registrations to register large and possibly
> discontiguous data regions for a single rkey, aka a single SGL
> descriptor in NVMe terms. There would be two reasons to support
> multiple SGL descriptors: a) to support a larger I/O size than
> supported by a single MR, or b) to support a data region format not
> mappable by a single MR.
>
> iSER only supports a single rkey (or stag in IETF terminology) and
> has been doing fine on a) and mostly fine on b). There are a few
> possible data layouts not supported by the traditional IB/iWarp FR
> WRs, but the limit is in fact exactly the same as imposed by the
> NVMe PRPs used for PCIe NVMe devices, so the Linux block layer has
> support to not generate them. Also, with modern Mellanox IB/RoCE
> hardware we can actually register completely arbitrary SGLs. iSER
> already supports this registration mode with a trivial code
> addition, but for NVMe we haven't had a pressing need yet.

Good explanation :)

The IO transfer size is a bit more pressing on some devices (e.g.
cxgb3/4) where the number of pages per MR can indeed be lower than a
reasonable transfer size (Steve can correct me if I'm wrong).

However, if there is a real demand for this we'll happily accept
patches :)

Just a note: having this feature in place can bring unexpected
behavior depending on how we implement it:
- If we can use multiple MRs per IO (for multiple SGLs), we can either
  prepare for the worst case and allocate enough MRs to satisfy the
  various IO patterns. This will be much heavier in terms of resource
  allocation and can limit the scalability of the host driver.
- Or we can implement a shared MR pool with a reasonable number of
  MRs. In this case each IO can consume one or more MRs at the expense
  of other IOs, and we may need to requeue the IO later, when enough
  MRs become available to satisfy it. This can yield unexpected
  performance gaps for some workloads.

Cheers,
Sagi.
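To put rough numbers on the trade-off in the two bullets above, a toy
calculation comparing worst-case pre-allocation against a shared pool;
all figures are made-up examples, not measurements:

#include <stdio.h>

int main(void)
{
        int nr_queues   = 8;    /* IO queues on the controller */
        int queue_depth = 128;  /* commands in flight per queue */
        int mrs_per_io  = 4;    /* worst case with multiple SGLs */

        /* Option 1: pre-allocate the worst case for every slot. */
        int worst_case = nr_queues * queue_depth * mrs_per_io;

        /* Option 2: a shared pool sized for the common case of one
         * MR per IO; IOs needing more must wait or be requeued. */
        int shared_pool = nr_queues * queue_depth;

        printf("worst-case MRs: %d, shared pool: %d\n",
               worst_case, shared_pool);      /* 4096 vs 1024 */
        return 0;
}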
Re: A question regarding "multiple SGL"
Hi Robert,

please explain your use case that isn't handled. The one and only
reason to set MSDBD to 1 is to make the code a lot simpler given that
there is no real use case for supporting more.

RDMA uses memory registrations to register large and possibly
discontiguous data regions for a single rkey, aka a single SGL
descriptor in NVMe terms. There would be two reasons to support
multiple SGL descriptors: a) to support a larger I/O size than
supported by a single MR, or b) to support a data region format not
mappable by a single MR.

iSER only supports a single rkey (or stag in IETF terminology) and has
been doing fine on a) and mostly fine on b). There are a few possible
data layouts not supported by the traditional IB/iWarp FR WRs, but the
limit is in fact exactly the same as imposed by the NVMe PRPs used for
PCIe NVMe devices, so the Linux block layer has support to not
generate them. Also, with modern Mellanox IB/RoCE hardware we can
actually register completely arbitrary SGLs. iSER already supports
this registration mode with a trivial code addition, but for NVMe we
haven't had a pressing need yet.
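For reference, the "single SGL descriptor" in question is the NVMe
keyed SGL data block descriptor: one remote address, one 24-bit
length, and one 4-byte key (the rkey), so a single MR covering a
discontiguous region still needs only one descriptor. A standalone
sketch of its layout per the NVMe over Fabrics spec, using stdint
types in place of the kernel's __le64/__u8:

#include <stdint.h>

struct keyed_sgl_desc {
        uint64_t addr;          /* remote address, little endian */
        uint8_t  length[3];     /* 24-bit transfer length */
        uint8_t  key[4];        /* RDMA rkey covering the region */
        uint8_t  type;          /* SGL descriptor type/subtype */
};

/* The descriptor occupies exactly 16 bytes in a capsule. */
_Static_assert(sizeof(struct keyed_sgl_desc) == 16,
               "keyed SGL descriptor must be 16 bytes");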
Re: A question regarding "multiple SGL"
Hi Christoph,

Thanks, got it. Could you please do me a favor and let me know the
background of why we ONLY support "MSDBD == 1"? I am NOT trying to
resist or oppose anything, I just want to know the reason. You know,
it is a little weird to me, as "MSDBD == 1" does not fulfill all the
use cases depicted in the spec.

Best,
Robert Qiuxin

Huawei Technologies Co., Ltd.
Bantian, Longgang District, Shenzhen 518129, P.R. China
http://www.huawei.com

-----Original Message-----
From: Christoph Hellwig [mailto:h...@lst.de]
Sent: October 27, 2016, 14:41
To: 鑫愿
Cc: Bart Van Assche; Jens Axboe; linux-bl...@vger.kernel.org; James
Bottomley; Martin K. Petersen; Mike Snitzer;
linux-r...@vger.kernel.org; Ming Lei; linux-n...@lists.infradead.org;
Keith Busch; Doug Ledford; linux-scsi@vger.kernel.org; Laurence
Oberman; Christoph Hellwig; Tiger zhao; Qiuxin (robert)
Subject: Re: A question regarding "multiple SGL"

Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule". The
NVMe over Fabrics specification allows a controller to advertise how
many SGL descriptors it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the
maximum number of (Keyed) SGL Data Block descriptors that a host is
allowed to place in a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid. Similarly, a host is free
to choose any number of SGL descriptors between 0 (only for commands
that don't transfer data) and the limit imposed by the controller
using the MSDBD field.

There are no plans to support an MSDBD value larger than 1 in the
Linux NVMe target, and there are no plans to ever submit commands with
multiple SGLs from the host driver either.

Cheers,
Christoph
Re: A question regarding "multiple SGL"
Hi Robert,

There is no feature called "Multiple SGL in one NVMe capsule". The
NVMe over Fabrics specification allows a controller to advertise how
many SGL descriptors it supports using the MSDBD Identify field:

"Maximum SGL Data Block Descriptors (MSDBD): This field indicates the
maximum number of (Keyed) SGL Data Block descriptors that a host is
allowed to place in a capsule. A value of 0h indicates no limit."

Setting this value to 1 is perfectly valid. Similarly, a host is free
to choose any number of SGL descriptors between 0 (only for commands
that don't transfer data) and the limit imposed by the controller
using the MSDBD field.

There are no plans to support an MSDBD value larger than 1 in the
Linux NVMe target, and there are no plans to ever submit commands with
multiple SGLs from the host driver either.

Cheers,
Christoph
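A minimal sketch of both sides of the MSDBD contract described above;
the struct and function names are illustrative, not the actual kernel
symbols. The target advertises its per-capsule descriptor limit in
Identify, and the host clamps what it generates to that limit:

#include <stdint.h>
#include <stdio.h>

struct id_ctrl {
        uint8_t msdbd;  /* max SGL data block descriptors, 0 = no limit */
};

/* Target side: each capsule may carry exactly one keyed SGL data
 * block descriptor, mirroring what the Linux NVMe target advertises. */
static void target_fill_identify(struct id_ctrl *id)
{
        id->msdbd = 1;
}

/* Host side: never place more descriptors in a capsule than the
 * controller advertised (0 means unlimited). */
static int host_max_descriptors(const struct id_ctrl *id, int wanted)
{
        if (id->msdbd && wanted > id->msdbd)
                return id->msdbd;
        return wanted;
}

int main(void)
{
        struct id_ctrl id;

        target_fill_identify(&id);
        printf("descriptors used: %d\n",
               host_max_descriptors(&id, 4));  /* prints 1 */
        return 0;
}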