Re: NVMe over Fabrics target implementation
On Wed, Jun 08, 2016 at 09:36:15PM -0700, Nicholas A. Bellinger wrote:
> The configfs ABI should not dictate a single backend use-case.

And it doesn't.  I actually had a file backend implemented to benchmark it
against the loopback driver.  It needed absolutely zero new configfs
interface.  And if we at some point want different backends using different
attributes we can trivially add them using configfs_register_group.

> Along with having common code and existing configfs
> ABI, we also get a proper starting point for target-core
> features that span across endpoints, and are defined for
> both scsi and nvme.  PR APTPL immediately comes to mind.

PRs are a useful feature on the roadmap.  However, we need a separate
pluggable backend anyway for distributed backends like RBD or Bart's DLM
implementation.  Also, the current LIO PR implementation will need a lot of
work to be usable for NVMe while actually following the spec in all its
details and to be power safe.  The right way to go here is a PR API that
allows different backends, and the existing LIO one might be one of them
after it gets the needed attention.
--
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: NVMe over Fabrics target implementation
On Wed, 2016-06-08 at 16:12 +0300, Sagi Grimberg wrote:
> >> *) Extensible to multiple types of backend drivers.
> >>
> >> nvme-target needs a way to absorb new backend drivers, that
> >> does not affect existing configfs group layout or attributes.
> >>
> >> Looking at the nvmet/configfs layout as-is, there are no multiple
> >> backend types defined, nor a way to control backend feature bits
> >> exposed to nvme namespaces at runtime.
>
> Hey Nic,
>
> As for different types of backends, I still don't see a big justification
> for adding the LIO backends pscsi (as it doesn't make sense),
> ramdisk (we have brd), or file (losetup).

The configfs ABI should not dictate a single backend use-case.  In the
target-core ecosystem today, there are just as many people using FILEIO
atop local file-systems as there are using IBLOCK and submit_bio().  As
mentioned, target_core_iblock.c has absorbed the io-cmd.c improvements, so
existing scsi target drivers can benefit too.  Plus, having interested
folks focusing on a single set of backends for FILEIO + IBLOCK means both
scsi and nvme target drivers benefit from further improvements.

As we've already got both a target backend configfs ABI and a user
ecosystem using /sys/kernel/config/target/core/, it's a straightforward way
to share common code, while still allowing scsi and nvme to function using
their own independent fabric configfs ABI layouts.

Along with having common code and an existing configfs ABI, we also get a
proper starting point for target-core features that span across endpoints
and are defined for both scsi and nvme.  PR APTPL immediately comes to
mind.  Namely, the ability of one backend device to interact with both scsi
target luns and nvme target namespaces, as well as different backends
across scsi and nvme exports.

> What kind of feature bits would you want to expose at runtime?

As for feature bits, basically everything currently reported by ID_NS, or
to be reported by it in the future.  There is T10-PI, and another example
is copy-offload support, once the NVMe spec gets that far.  The main point
is that we should be able to add new feature bits to common code in the
target-core backend configfs ABI, without having to change the individual
scsi or nvme configfs ABIs.

> > And that's very much intentional.  We have a very well working block
> > layer which we're going to use, no need to reinvent it.  The block
> > layer supports NVMe pass-through just fine in case we'll need it,
> > as I spent the last year preparing it for that.
>
> >> Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
> >> to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?
>
> > Because it keeps the code simple.  If you had actually participated
> > on our development list you might have seen that until not too long
> > ago we had very fine-grained locks here.  In the end Armen convinced
> > me that it's easier to maintain if we don't bother with fine-grained
> > locking outside the fast path, especially as it significantly simplifies
> > the discovery implementation.  If it ever turns out to be an
> > issue we can change it easily, as the implementation is well encapsulated.
>
> We did change that, and Nic is raising a valid point in terms of having
> a global mutex around all the ports.  If the requirement of nvme
> subsystems and ports configuration is that it should happen fast enough
> and scale to the numbers that Nic is referring to, we'll need to change
> that back.
>
> Having said that, I'm not sure this is a real hard requirement for RDMA
> and FC in the mid-term, because from what I've seen, the workloads Nic
> is referring to are more typical for iscsi/tcp, where connections are
> cheaper and you need more of them to saturate a high-speed interconnect,
> so we'll probably see this when we have nvme over tcp working.

Yes.  Further, my objections to the proposed nvmet configfs ABI are:

  - Doesn't support multiple backend types.
  - Doesn't provide a way to control backend feature bits separate from
    fabric layout.
  - Doesn't provide a starting point for target features that span both
    scsi and nvme.
  - Doesn't allow for concurrent parallel configfs create + delete
    operations of subsystem NQNs across ports and host acls.
  - Global synchronization of nvmet_fabric_ops->add_port().
Re: NVMe over Fabrics target implementation
On Wed, 2016-06-08 at 14:19 +0200, Christoph Hellwig wrote:
> On Tue, Jun 07, 2016 at 10:21:41PM -0700, Nicholas A. Bellinger wrote:
> > *) Extensible to multiple types of backend drivers.
> >
> > nvme-target needs a way to absorb new backend drivers, that
> > does not affect existing configfs group layout or attributes.
> >
> > Looking at the nvmet/configfs layout as-is, there are no multiple
> > backend types defined, nor a way to control backend feature bits
> > exposed to nvme namespaces at runtime.
>
> And that's very much intentional.  We have a very well working block
> layer which we're going to use, no need to reinvent it.  The block
> layer supports NVMe pass-through just fine in case we'll need it,
> as I spent the last year preparing it for that.
>
> > Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
> > to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?
>
> Because it keeps the code simple.  If you had actually participated
> on our development list you might have seen that until not too long
> ago we had very fine-grained locks here.  In the end Armen convinced
> me that it's easier to maintain if we don't bother with fine-grained
> locking outside the fast path, especially as it significantly simplifies
> the discovery implementation.  If it ever turns out to be an
> issue we can change it easily, as the implementation is well encapsulated.

Well, it's rather obvious from the design limitations highlighted earlier
that the configfs model for nvmet was an afterthought.

You'll recall a prototype of nvmet using target_core_fabric_configfs.c,
which allowed nvmet to work out-of-the-box with LIO userspace, and fit
directly into target_core_pr.c and other existing target-core features
that scsi and nvme share.

Since our discussions at LSF, I've taken the point that it makes sense for
nvmet to have its own configfs layout using common configfs backends, which
is reflected in the nvmet/configfs-ng RFC from Monday.

However, I completely disagree with you that the scsi (specifically iscsi)
and nvme models are somehow incompatible from a target subsystem configfs
perspective.  They are not.

Point being, if you took the exact same three top-level configfs groups in
the nvmet implementation, using subsystem configfs symlinks into ports and
hosts, and did the same for iscsi by renaming them to:

   /sys/kernel/config/iscsit/
       subsystems -> iqns
       ports -> network portals
       hosts -> acls

you would have the exact same type of scale limitations with iscsi that
have already been highlighted.  There is nothing iscsi- or nvme-specific
about these limitations.  The point is that it's not a question of whether
this configfs model is required for a nvme target design; it's a bad model
for any configfs consumer design to follow.

To repeat: any time there are global locks and sequential lookups across
multiple top-level configfs groups, you're doing configfs wrong.  The whole
point of configfs is to allow the vfs to reference-count data structures
using parent/child relationships, and let configfs give you the reference
counting for free.  And getting that reference counting for free in
configfs is what allows existing target-core backends + endpoints to scale
creation, operation and deletion independently of each other.

Any medium or larger service-provider or hosting environment requires the
ability of its target mode storage control plane to function independently
of individual tenants.  The limitations as-is of the nvmet/configfs design
make it rather useless for these environments, and any type of upstream
commitment to a target mode configfs based ABI needs to be able to, at
least, scale to what we've already got in target_core_fabric_configfs.c.
Otherwise, it will eventually be thrown out for something that both fits
the needs of today's requirements and can grow into the future while never
breaking user-space ABI compatibility.
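[Editor's note: the "reference counting for free" point can be seen from the shell, since configfs/the vfs refuse to tear down a parent while a child item still pins it, with no driver-level global lock involved. A sketch against the stock LIO core layout; it requires root, `target_core_mod` loaded, and the exact error (ENOTEMPTY vs EBUSY) may vary, so treat it as illustrative rather than a recipe.]

```shell
# Independent create/delete per tenant, enforced by configfs itself:
cd /sys/kernel/config/target/core
mkdir fileio_0                  # backend HBA group
mkdir fileio_0/disk0            # backend device created under it
rmdir fileio_0                  # refused: disk0 still pins the parent
rmdir fileio_0/disk0            # delete the child first...
rmdir fileio_0                  # ...then the parent goes away cleanly
```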
We got the /sys/kernel/config/target/$FABRIC/ layout right the first time
back in 2008, have *never* had to break ABI compat, and are still able to
scale to the requirements of today.  I'd like to hope we'd be able to
achieve the same goals for nvmet in 2024.
Re: NVMe over Fabrics target implementation
On Wed, Jun 08, 2016 at 04:12:27PM +0300, Sagi Grimberg wrote:
>> Because it keeps the code simple.  If you had actually participated
>> on our development list you might have seen that until not too long
>> ago we had very fine-grained locks here.  In the end Armen convinced
>> me that it's easier to maintain if we don't bother with fine-grained
>> locking outside the fast path, especially as it significantly simplifies
>> the discovery implementation.  If it ever turns out to be an
>> issue we can change it easily, as the implementation is well encapsulated.
>
> We did change that, and Nic is raising a valid point in terms of having
> a global mutex around all the ports.  If the requirement of nvme
> subsystems and ports configuration is that it should happen fast enough
> and scale to the numbers that Nic is referring to, we'll need to change
> that back.
>
> Having said that, I'm not sure this is a real hard requirement for RDMA
> and FC in the mid-term, because from what I've seen, the workloads Nic
> is referring to are more typical for iscsi/tcp, where connections are
> cheaper and you need more of them to saturate a high-speed interconnect,
> so we'll probably see this when we have nvme over tcp working.

I'm not really worried about connection establishment - that can be changed
to RCU locking really easily.  I'm a bit more worried about the case where
a driver would block for a long time in ->add_port.  But let's worry about
that if an actual user comes up.

The last thing we need in a new driver is lots of complexity for
hypothetical use cases; I'm much more interested in having the driver
simple, testable and actually tested than in optimizing for something.
That is to say, the priorities here are very different from Nic's goals
for the target code.
Re: NVMe over Fabrics target implementation
>> *) Extensible to multiple types of backend drivers.
>>
>> nvme-target needs a way to absorb new backend drivers, that
>> does not affect existing configfs group layout or attributes.
>>
>> Looking at the nvmet/configfs layout as-is, there are no multiple
>> backend types defined, nor a way to control backend feature bits
>> exposed to nvme namespaces at runtime.

Hey Nic,

As for different types of backends, I still don't see a big justification
for adding the LIO backends pscsi (as it doesn't make sense), ramdisk (we
have brd), or file (losetup).

What kind of feature bits would you want to expose at runtime?

> And that's very much intentional.  We have a very well working block
> layer which we're going to use, no need to reinvent it.  The block
> layer supports NVMe pass-through just fine in case we'll need it,
> as I spent the last year preparing it for that.

>> Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
>> to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?

> Because it keeps the code simple.  If you had actually participated
> on our development list you might have seen that until not too long
> ago we had very fine-grained locks here.  In the end Armen convinced
> me that it's easier to maintain if we don't bother with fine-grained
> locking outside the fast path, especially as it significantly simplifies
> the discovery implementation.  If it ever turns out to be an
> issue we can change it easily, as the implementation is well encapsulated.

We did change that, and Nic is raising a valid point in terms of having a
global mutex around all the ports.  If the requirement of nvme subsystems
and ports configuration is that it should happen fast enough and scale to
the numbers that Nic is referring to, we'll need to change that back.

Having said that, I'm not sure this is a real hard requirement for RDMA
and FC in the mid-term, because from what I've seen, the workloads Nic is
referring to are more typical for iscsi/tcp, where connections are cheaper
and you need more of them to saturate a high-speed interconnect, so we'll
probably see this when we have nvme over tcp working.
Re: NVMe over Fabrics target implementation
On Tue, Jun 07, 2016 at 10:21:41PM -0700, Nicholas A. Bellinger wrote:
> *) Extensible to multiple types of backend drivers.
>
> nvme-target needs a way to absorb new backend drivers, that
> does not affect existing configfs group layout or attributes.
>
> Looking at the nvmet/configfs layout as-is, there are no multiple
> backend types defined, nor a way to control backend feature bits
> exposed to nvme namespaces at runtime.

And that's very much intentional.  We have a very well working block layer
which we're going to use, no need to reinvent it.  The block layer supports
NVMe pass-through just fine in case we'll need it, as I spent the last year
preparing it for that.

> Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
> to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?

Because it keeps the code simple.  If you had actually participated on our
development list you might have seen that until not too long ago we had
very fine-grained locks here.  In the end Armen convinced me that it's
easier to maintain if we don't bother with fine-grained locking outside the
fast path, especially as it significantly simplifies the discovery
implementation.  If it ever turns out to be an issue we can change it
easily, as the implementation is well encapsulated.
Re: NVMe over Fabrics target implementation
On Tue, 2016-06-07 at 12:55 +0200, Christoph Hellwig wrote:
> There is absolutely no point in dragging in an overcomplicated configfs
> structure for a very simple protocol which also is very different from
> SCSI in its nitty-gritty details.

Please be more specific wrt the two individual points that have been
raised.

> Keeping the nvme target self-contained
> allows it to be both much simpler and much easier to understand, as well
> as much better testable - see the amount of test coverage we could easily
> add for example.

I disagree.

> Or to put it the other way around - if there was any major synergy in
> reusing the SCSI target code that just shows we're missing functionality
> in the block layer or configfs.

To reiterate the points again.

*) Extensible to multiple types of backend drivers.

nvme-target needs a way to absorb new backend drivers that does not affect
existing configfs group layout or attributes.  Looking at the
nvmet/configfs layout as-is, there are no multiple backend types defined,
nor a way to control backend feature bits exposed to nvme namespaces at
runtime.

What is being proposed is a way to share target-core backends via existing
configfs symlinks across SCSI and NVMe targets.  Which means:

  - All I/O state + memory submission is done at the RCU-protected
    se_device level via sbc_ops
  - percpu reference counting is done outside of target-core
  - Absorb all nvmet/io-cmd optimizations into target_core_iblock.c
  - Base starting point for features in SCSI + NVMe that span multiple
    endpoints and instances (reservations + APTPL, multipath,
    copy-offload across fabric types)

Using target-core backends means we get features like T10-PI and
sbc_ops->write_same for free that don't exist in nvmet, and can utilize a
common set of backend drivers for SCSI and NVMe via an existing configfs
ABI and python userspace community.
And to the second, and more important, point: defining a configfs ABI that
works both for today's requirements and well into the 2020s without
breaking user-space compatibility.

As-is, the initial design using top-level nvmet configfs symlinks of
subsystem groups into individual port + host groups does not scale.  That
is, it currently does:

  - Sequential list lookup, under a global rw_mutex, of top-level
    nvmet_port and nvmet_host symlinks in the ->allow_link() and
    ->drop_link() configfs callbacks.
  - nvmet_fabrics_ops->add_port() callback invoked under the same global
    rw mutex.

This is very bad for several reasons.  As-is, this blocks all other
configfs port + host operations from occurring even during normal
operation, which makes it quite useless for any type of multi-tenant
target environment where the individual target endpoints *must* be able to
operate independently.  Seriously, there is never a good reason why
configfs group or item callbacks should be performing list lookups under a
global lock at this level.

Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO to
block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?

A simple example where this design breaks down quickly is a NVMf
ops->add_port() call that requires a HW reset, or say a firmware reload,
that can take multiple seconds (qla2xxx comes to mind).

There is a simple test to highlight this limitation.  Take any nvme-target
driver that is capable of multiple ports, and introduce a sleep(5) into
each ops->add_port() call.  Now create 256 different subsystem NQNs with
256 different ports across four different user-space processes.  What
happens to the other subsystem, port and host group configfs symlinks when
this occurs..?  What happens to the other user-space processes..?
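[Editor's note: the sleep(5) thought experiment above can be modeled in a few lines of shell, with `flock` standing in for the global nvmet_config_sem and `sleep` for a slow HW ->add_port(). This is a toy model of the contention argument, not the real nvmet code.]

```shell
#!/bin/sh
# Toy model: every ->add_port() runs under one global lock, so one
# slow port (firmware reset, etc.) stalls every other tenant.
LOCK=$(mktemp)

add_port() {
    flock "$LOCK" sleep 0.3   # "global mutex" held across the slow part
    echo "port $1 ready"
}

for p in 1 2 3 4; do
    add_port "$p" &           # four tenants configure "in parallel"...
done
wait                          # ...but the four sleeps complete serially
rm -f "$LOCK"
```

With the lock held across the sleep, total wall time is the sum of all add_port delays; with per-port locking the four would overlap, which is exactly the independence the thread is arguing about.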
Re: NVMe over Fabrics target implementation
On Tue, Jun 7, 2016 at 2:02 PM, Andy Grover wrote:
> On 06/06/2016 11:23 PM, Nicholas A. Bellinger wrote:
>>
>> Hi HCH & Co,
>>
>> On Mon, 2016-06-06 at 23:22 +0200, Christoph Hellwig wrote:
>>>
>>> This patch set adds a generic NVMe over Fabrics target.  The
>>> implementation conforms to the NVMe 1.2b specification (which
>>> includes Fabrics) and provides the NVMe over Fabrics access
>>> to Linux block devices.
>>
>> Thanks for all of the development work by the fabric_linux_driver team
>> (HCH, Sagi, Ming, James F., James S., and Dave M.) over the last year.
>>
>> Very excited to see this code get a public release now that the NVMf
>> specification is out.  Now that it's in the wild, it's a good
>> opportunity to discuss some of the more interesting implementation
>> details, beyond the new NVMf wire-protocol itself.
>
> I'm sorry but I missed it, can you repeat the link to the NVMe spec(s)?

http://www.nvmexpress.org/wp-content/uploads/NVM_Express_1_2_1_Gold_20160603.pdf
http://www.nvmexpress.org/wp-content/uploads/NVMe_over_Fabrics_1_0_Gold_20160605.pdf
Re: NVMe over Fabrics target implementation
On 06/06/2016 11:23 PM, Nicholas A. Bellinger wrote:
> Hi HCH & Co,
>
> On Mon, 2016-06-06 at 23:22 +0200, Christoph Hellwig wrote:
>> This patch set adds a generic NVMe over Fabrics target.  The
>> implementation conforms to the NVMe 1.2b specification (which
>> includes Fabrics) and provides the NVMe over Fabrics access
>> to Linux block devices.
>
> Thanks for all of the development work by the fabric_linux_driver team
> (HCH, Sagi, Ming, James F., James S., and Dave M.) over the last year.
>
> Very excited to see this code get a public release now that the NVMf
> specification is out.  Now that it's in the wild, it's a good
> opportunity to discuss some of the more interesting implementation
> details, beyond the new NVMf wire-protocol itself.

I'm sorry but I missed it, can you repeat the link to the NVMe spec(s)?

Thanks -- Andy
Re: NVMe over Fabrics target implementation
There is absolutely no point in dragging in an overcomplicated configfs
structure for a very simple protocol which also is very different from
SCSI in its nitty-gritty details.  Keeping the nvme target self-contained
allows it to be both much simpler and much easier to understand, as well
as much better testable - see the amount of test coverage we could easily
add for example.

Or to put it the other way around - if there was any major synergy in
reusing the SCSI target code, that would just show we're missing
functionality in the block layer or configfs.  The only thing where this
is the case mid-term is persistent reservations, but Mike Christie already
has a plan for a pluggable PR API, which I'm very interested in.
Re: NVMe over Fabrics target implementation
Hi HCH & Co,

On Mon, 2016-06-06 at 23:22 +0200, Christoph Hellwig wrote:
> This patch set adds a generic NVMe over Fabrics target.  The
> implementation conforms to the NVMe 1.2b specification (which
> includes Fabrics) and provides the NVMe over Fabrics access
> to Linux block devices.

Thanks for all of the development work by the fabric_linux_driver team
(HCH, Sagi, Ming, James F., James S., and Dave M.) over the last year.

Very excited to see this code get a public release now that the NVMf
specification is out.  Now that it's in the wild, it's a good opportunity
to discuss some of the more interesting implementation details, beyond the
new NVMf wire-protocol itself.

(Adding target-devel + linux-scsi CC)

> The target implementation consists of several elements:
>
> - NVMe target core: defines and manages the NVMe entities (subsystems,
>   controllers, namespaces, ...) and their allocation, responsible
>   for initial command processing and correct orchestration of
>   the stack setup and tear down.
>
> - NVMe admin command implementation: responsible for parsing and
>   servicing admin commands such as controller identify, set features,
>   keep-alive, log page, ...
>
> - NVMe I/O command implementation: responsible for performing the actual
>   I/O (Read, Write, Flush, Deallocate (aka Discard)).  It is a very thin
>   layer on top of the block layer and implements no logic of its own.
>   To support exporting file systems please use the loopback block driver
>   in direct I/O mode, which gives very good performance.
>
> - NVMe over Fabrics support: responsible for servicing Fabrics commands
>   (connect, property get/set).
>
> - NVMe over Fabrics discovery service: responsible for serving the
>   Discovery log page through a special cut-down Discovery controller.
>
> The target is configured using configfs, and configurable entities are:
>
>  - NVMe subsystems and namespaces
>  - NVMe over Fabrics ports and referrals
>  - Host ACLs for primitive access control - NVMe over Fabrics access
>    control is still work in progress at the specification level and
>    will be implemented once that work has finished.
>
> To configure the target use the nvmetcli tool from
> http://git.infradead.org/users/hch/nvmetcli.git, which includes detailed
> setup documentation.
>
> In addition to the Fabrics target implementation we provide a loopback
> driver which also conforms to the NVMe over Fabrics specification and
> allows evaluation of the target stack with local access without
> requiring a real fabric.

So as-is, I have two main objections that have been discussed off-list for
some time, and won't be a big surprise to anyone following the
fabrics_linux_driver list.  ;P

First topic: I think nvme-target namespaces should be utilizing existing
configfs logic, and sharing /sys/kernel/config/target/core/ backend driver
symlinks as individual nvme-target subsystem namespaces.

That is, we've already got a configfs ABI in place for target mode
back-ends that today is able to operate independently from SCSI
architecture model dependencies.  To that end, the prerequisite series to
allow target-core backends to operate independent of se_cmd, and to allow
se_device backends to be configfs symlinked directly into
/sys/kernel/config/nvmet/, outside of /sys/kernel/config/target/$FABRIC/,
has been posted earlier here:

http://marc.info/?l=linux-scsi&m=146527281416606&w=2

Note the -v2 series has absorbed the nvmet/io-cmd execute_rw()
improvements from Sagi + Ming (inline bio/bvec and blk_poll) into the
target_core_iblock.c driver code.

Second topic, and more important from a kernel ABI perspective, are the
current scale limitations in the first pass of the nvmet configfs layout
code in /sys/kernel/config/nvmet/.
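[Editor's note: for readers unfamiliar with the layout under discussion, here is roughly what configuring the nvmet configfs tree by hand looks like. Attribute names follow the nvmet ABI as merged upstream; the RFC version being debated in this thread may differ in detail, and the commands require root plus the nvmet modules loaded.]

```shell
cd /sys/kernel/config/nvmet

# Subsystem with one namespace backed by a block device:
mkdir subsystems/testnqn
mkdir subsystems/testnqn/namespaces/1
echo -n /dev/nvme0n1 > subsystems/testnqn/namespaces/1/device_path
echo 1 > subsystems/testnqn/namespaces/1/enable

# A fabrics port:
mkdir ports/1
echo rdma > ports/1/addr_trtype

# Symlinking the subsystem into the port is what triggers
# ->add_port() - under the single nvmet_config_sem being debated:
ln -s /sys/kernel/config/nvmet/subsystems/testnqn \
      ports/1/subsystems/testnqn
```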
Namely, the design of having three top-level configfs groups in
/sys/kernel/config/nvmet/[subsystems,ports,hosts] that are configfs
symlinked between each other, with a single rw_mutex (nvmet_config_sem)
used for global list lookup and for enforcing globally synchronized
nvmet_fabrics_ops->add_port() creation across all subsystem NQN ports.

From the shared experience in target_core_fabric_configfs.c over the last
8 years, perhaps the greatest strength of configfs has been its ability to
allow config_item_type parent/child relationships to exist and operate
independently of one another.  Specifically in the context of storage
tenants, this means creation + deletion of one backend + target fabric
endpoint tenant should not block creation + deletion of another backend +
target fabric endpoint tenant.

As-is, a nvmet configfs layout holding a global mutex across
subsystem/port/host creation + deletion, and doing internal list lookups
within the configfs ->allow_link + ->drop_link callbacks, ends up being
severely limiting when scaling up the total number of nvmet subsystem NQNs
and ports.  Specifically, modern deployments of
/sys/kernel/config/target/iscsi/ expect backends + fabric endpoints to be
configured in parallel at