Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Suganath, > Also, Making all PRP buffer may or may not need FW changes (assuming > it is possible.), we may end up into multiple FW version check. I don't understand how submitting an I/O that is guaranteed to honor the constraints of the target NVMe drive could possibly cause problems for the controller firmware. Quite the contrary, it's the best case scenario. > Since this is main IO path and current driver is following H/W > limitation, we should avoid any changes in this area until and unless > change is universal acceptable in FW (for all type of work load). This is why you need to involve the Linux community early in the design process and not when your implementation is complete. We could have told you right away what the correct approach would be for your Linux driver. And that said approach works for products from other vendors so we see no compelling reason to deviate from it. As evidenced by Broadcom disowning the legacy mpt and megaraid drivers, I will be stuck maintaining this mpt3sas code for a decade or more. Long after Broadcom has ended official support and moved on to different ASICs and programming interfaces. Consequently, I am very heavily biased towards solutions that leverage the shared interfaces provided by the kernel and that don't have special cases and workarounds inside the driver. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Martin, On Fri, Sep 15, 2017 at 6:37 AM, Martin K. Petersen wrote: > > Suganath, > >> Is there any update on the submitted mpt3sas patches. > > We are waiting for you to report back your findings on PRP vs. SGL. We are working on this, since there is h/w dependent, we are in discussion with H/W & F/W team and doing experiments. If there is no impact, and if SGL translation has to be removed, this change has to go through some phase of testing, before we post it to upstream, since that is not inline with H/W requirement. The hardware translation of IEEE SGL to NVMe PRPs has limitation. We have added the below comment in patch 3 as well: if a command cannot be translated by hardware then it will go to firmware and the firmware needs to translate it. And this will have a performance reduction. To avoid that driver proactively checks whether the translation will be done in hardware or not, if not then driver try to translate inside the driver Current code posted to upstream is inline with hardware requirements and well tested internally. SGL vs NVMe PRP building in driver is small sanity check for decision making and it is not going to change in long run. Also, Making all PRP buffer may or may not need FW changes (assuming it is possible.), we may end up into multiple FW version check. Since this is main IO path and current driver is following H/W limitation, we should avoid any changes in this area until and unless change is universal acceptable in FW (for all type of work load). Hope this clarifies. > > -- > Martin K. Petersen Oracle Linux Engineering Thanks, Suganath Prabu S
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Suganath, > Is there any update on the submitted mpt3sas patches. We are waiting for you to report back your findings on PRP vs. SGL. -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Martin, Is there any update on the submitted mpt3sas patches. Thanks, Suganath Prabu S On Fri, Sep 1, 2017 at 2:09 PM, Suganath Prabu Subramani wrote: > Hi Martin, > > On Fri, Sep 1, 2017 at 8:52 AM, Martin K. Petersen > wrote: >> >> Hi Suganath, >> >>> Let me explain - NVME device fast path is possible in two ways. IEEE >>> SGL and PRP SGL. Due to h/w constraint we choose IEEE SGL only for >>> smaller IO size. Both above is true h/w Fast Path and no firmware >>> involvement. >> >>> Agree with you. We are planning to see if we can keep only simple Fast >>> Path using only PRP. >> >> That would be great, thank you! >> >>> Currently there is no performance issue for UNMAP translation in FW. >> >> Good! >> And yet patch 4 circumvents that statement by adding support for encapsulated commands to bypass the FW translation... >>> >>> This path is not due to performance reason. User wants to interact >>> with NVME drive in native NVME command for management. >> >> Patch 4 states: >> >> "This encapsulated NVMe command is used by applications to send direct >> NVMe commands to NVMe drives or for handling unmap where the translation >> at controller/firmware level is having performance issues." >> >> -- > > The statement in description of patch 4 is added by mistake, We ll > correct the description and re sending that. > >> Martin K. Petersen Oracle Linux Engineering > > Thanks, > Suganath Prabu S
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Martin, On Fri, Sep 1, 2017 at 8:52 AM, Martin K. Petersen wrote: > > Hi Suganath, > >> Let me explain - NVME device fast path is possible in two ways. IEEE >> SGL and PRP SGL. Due to h/w constraint we choose IEEE SGL only for >> smaller IO size. Both above is true h/w Fast Path and no firmware >> involvement. > >> Agree with you. We are planning to see if we can keep only simple Fast >> Path using only PRP. > > That would be great, thank you! > >> Currently there is no performance issue for UNMAP translation in FW. > > Good! > >>> And yet patch 4 circumvents that statement by adding support for >>> encapsulated commands to bypass the FW translation... >> >> This path is not due to performance reason. User wants to interact >> with NVME drive in native NVME command for management. > > Patch 4 states: > > "This encapsulated NVMe command is used by applications to send direct > NVMe commands to NVMe drives or for handling unmap where the translation > at controller/firmware level is having performance issues." > > -- The statement in description of patch 4 is added by mistake, We ll correct the description and re sending that. > Martin K. Petersen Oracle Linux Engineering Thanks, Suganath Prabu S
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Suganath, > Let me explain - NVME device fast path is possible in two ways. IEEE > SGL and PRP SGL. Due to h/w constraint we choose IEEE SGL only for > smaller IO size. Both above is true h/w Fast Path and no firmware > involvement. > Agree with you. We are planning to see if we can keep only simple Fast > Path using only PRP. That would be great, thank you! > Currently there is no performance issue for UNMAP translation in FW. Good! >> And yet patch 4 circumvents that statement by adding support for >> encapsulated commands to bypass the FW translation... > > This path is not due to performance reason. User wants to interact > with NVME drive in native NVME command for management. Patch 4 states: "This encapsulated NVMe command is used by applications to send direct NVMe commands to NVMe drives or for handling unmap where the translation at controller/firmware level is having performance issues." -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Martin, Replied inline. Thanks, Suganath Prabu S On Thu, Aug 31, 2017 at 8:35 AM, Martin K. Petersen wrote: > > Hi Suganath, > >> Theoretically we want to use h/w capability (to translate IEEE to PRP) >> for smaller IO size to leverage h/w capability. > > Nobody says we have to use the capability just because the hardware has > it. > > Unlike some other operating systems, Linux will only submit I/Os to the > driver that conform to the reported underlying constraints of the > hardware. In general, h/w constraints are handled. What we missed is Fast Path h/w which is not exposed to OS. > I fail to understand how letting the HBA firmware translate an > SGL to a PRP for a subset of I/Os could do anything but add latency. Let me explain - NVME device fast path is possible in two ways. IEEE SGL and PRP SGL. Due to h/w constraint we choose IEEE SGL only for smaller IO size. Both above is true h/w Fast Path and no firmware involvement. > Plus complexity in the hot path of the driver. Agree with you. We are planning to see if we can keep only simple Fast Path using only PRP. It will take some time to finalize as we have to engage h/w and f/w team. BTW - This area is h/w dependent and we do not see further changes in this area. > >> - If the unmap translation in firmware is slow, why don't you translate >> WRITE SAME/w UNMAP set to DSM DEALLOCATE without requiring >> applications to do encapsulated passthrough? > >> => As of now, current FW supports UNMAP command but not WRITE_SAME for >> NVME drive. We did some experiment to convert UMAP command in driver, >> but that is not really giving any performance improvement. > > It is imperative that the common use case, Linux' discard > infrastructure, is working correctly and is as performant as any > application-driven passthrough workaround. > > Unlike SCSI-to-SATA translation you have the benefit of a 1:1 mapping > between UNMAP and DEALLOCATE. I'm not even sure why there would be a > significant performance penalty in the firmware? I agree. Currently there is no performance issue for UNMAP translation in FW. >> We would like to continue with UNMAP (and all other non-read/write >> commands) to be handled in FW. > > And yet patch 4 circumvents that statement by adding support for > encapsulated commands to bypass the FW translation... This path is not due to performance reason. User wants to interact with NVME drive in native NVME command for management. > > -- > Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Suganath, > Theoretically we want to use h/w capability (to translate IEEE to PRP) > for smaller IO size to leverage h/w capability. Nobody says we have to use the capability just because the hardware has it. Unlike some other operating systems, Linux will only submit I/Os to the driver that conform to the reported underlying constraints of the hardware. I fail to understand how letting the HBA firmware translate an SGL to a PRP for a subset of I/Os could do anything but add latency. Plus complexity in the hot path of the driver. > - If the unmap translation in firmware is slow, why don't you translate > WRITE SAME/w UNMAP set to DSM DEALLOCATE without requiring > applications to do encapsulated passthrough? > => As of now, current FW supports UNMAP command but not WRITE_SAME for > NVME drive. We did some experiment to convert UMAP command in driver, > but that is not really giving any performance improvement. It is imperative that the common use case, Linux' discard infrastructure, is working correctly and is as performant as any application-driven passthrough workaround. Unlike SCSI-to-SATA translation you have the benefit of a 1:1 mapping between UNMAP and DEALLOCATE. I'm not even sure why there would be a significant performance penalty in the firmware? > We would like to continue with UNMAP (and all other non-read/write > commands) to be handled in FW. And yet patch 4 circumvents that statement by adding support for encapsulated commands to bypass the FW translation... -- Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Hi Martin, Replied in line. - I don't understand why you go through all these hoops to decide whether to use PRPs or IEEE scatterlists. If the firmware translation is slow, why even bother with the SG format in the first place? Set the max I/O size to match MDTS and you're done. => We will set MDTS value as max hw sectors using blk_queue_max_hw_sectors(). As of now, we see correct MDTS value is being set to block layer via VPD page 0xb0 (Block limits VPD page ) response from FW for NVME device. => I will remove MDTS checks in IO path. - What's the benefit of using SG for regular I/O commands? => Broadcom's IT Tri-mode HBA hardware has a capability of translating IEEE SGLs to PRP's only up to ~4 page block size. If the IO block size is greater than that (along with other condition described code base_is_prp_possible), driver has to frame the PRP's to avoid FW intervention. Both the case is a fast path, but for smaller IO (up to 20K) size will frame IEEE SGL and large IO size will frame PRP format SGL. Theoretically we want to use h/w capability (to translate IEEE to PRP) for smaller IO size to leverage h/w capability. We are investigating if at all we can send all PRP and avoid checks in driver, but that exercise may take time as we have many different opinions. We prefer to use existing code as it is stable and in-line with h/w requirement. - If the unmap translation in firmware is slow, why don't you translate WRITE SAME/w UNMAP set to DSM DEALLOCATE without requiring applications to do encapsulated passthrough? => As of now, current FW supports UNMAP command but not WRITE_SAME for NVME drive. We did some experiment to convert UMAP command in driver, but that is not really giving any performance improvement. We would like to continue with UNMAP (and all other non-read/write commands) to be handled in FW. - Also make sure you attribute your patches correctly (From: root ). And you don't need that long CC: list. Just send the patch series to linux-s...@vger.kernel.org. => I will fix this type of issue going forward Thanks, Suganath Prabu S On Wed, Aug 23, 2017 at 7:48 AM, Martin K. Petersen wrote: > > Suganath, > >> mpt3sas: SGL to PRP Translation for I/Os to NVMe devices > > I'm still confused about this patch. > > - I don't understand why you go through all these hoops to decide >whether to use PRPs or IEEE scatterlists. If the firmware translation >is slow, why even bother with the SG format in the first place? Set >the max I/O size to match MDTS and you're done. > > - What's the benefit of using SG for regular I/O commands? > > - If the unmap translation in firmware is slow, why don't you translate >WRITE SAME/w UNMAP set to DSM DEALLOCATE without requiring >applications to do encapsulated passthrough? > > Also make sure you attribute your patches correctly (From: root > ). And you don't need that > long CC: list. Just send the patch series to linux-s...@vger.kernel.org. > > Thanks! > > -- > Martin K. Petersen Oracle Linux Engineering
Re: [PATCH v4 00/14] mpt3sas driver NVMe support:
Suganath, > mpt3sas: SGL to PRP Translation for I/Os to NVMe devices I'm still confused about this patch. - I don't understand why you go through all these hoops to decide whether to use PRPs or IEEE scatterlists. If the firmware translation is slow, why even bother with the SG format in the first place? Set the max I/O size to match MDTS and you're done. - What's the benefit of using SG for regular I/O commands? - If the unmap translation in firmware is slow, why don't you translate WRITE SAME/w UNMAP set to DSM DEALLOCATE without requiring applications to do encapsulated passthrough? Also make sure you attribute your patches correctly (From: root ). And you don't need that long CC: list. Just send the patch series to linux-s...@vger.kernel.org. Thanks! -- Martin K. Petersen Oracle Linux Engineering