Re: NVMe over Fabrics target implementation

2016-06-09 Thread Christoph Hellwig
On Wed, Jun 08, 2016 at 09:36:15PM -0700, Nicholas A. Bellinger wrote:
> The configfs ABI should not dictate a single backend use-case.

And it doesn't.  I actually had a file backend implemented to
benchmark it against the loopback driver.  It needed absolutely
zero new configfs interface.  And if we at some point want
different backends using different attributes we can trivially
add them using configfs_register_group.
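
For illustration, a minimal sketch of what registering such an extra
per-backend attribute group could look like with configfs_register_group();
all of the example_* names below are made up for this sketch and are not
actual nvmet or target-core symbols:

  #include <linux/configfs.h>
  #include <linux/module.h>

  /* hypothetical per-namespace private data carrying its own extra group */
  struct example_file_ns {
      struct config_group file_group;
      /* ... file-backend specific state ... */
  };

  static struct configfs_attribute *example_file_attrs[] = {
      /* e.g. &example_file_attr_buffered_io would go here */
      NULL,
  };

  static struct config_item_type example_file_type = {
      .ct_attrs = example_file_attrs,
      .ct_owner = THIS_MODULE,
  };

  /*
   * Called when a file-backed namespace is configured: adds a "file/"
   * directory with backend-specific attributes under the namespace,
   * without touching the existing group layout.
   */
  static int example_ns_add_file_group(struct example_file_ns *ns,
                                       struct config_group *ns_group)
  {
      config_group_init_type_name(&ns->file_group, "file",
                                  &example_file_type);
      return configfs_register_group(ns_group, &ns->file_group);
  }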

> Along with having common code and existing configfs
> ABI, we also get a proper starting point for target-core
> features that span across endpoints, and are defined for
> both scsi and nvme.  PR APTPL immediately comes to mind.

PRs are a useful feature on the road map.  However, we need a separate
pluggable backend anyway for distributed backends like RBD or Bart's
DLM implementation.  Also, the current LIO PR implementation will need
a lot of work to be usable for NVMe while actually following the
spec in all its details and to be power safe.  The right way to
go here is a PR API that allows different backends, and the existing
LIO one might be one of them once it has received the needed attention.
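
For reference, the block layer already has a small pluggable PR interface,
struct pr_ops, declared in include/linux/pr.h since the 4.4 merge window;
the snippet below is reproduced from memory, so treat the header itself as
authoritative.  Whether a target-side PR API would end up looking like this
is of course open:

  struct pr_ops {
      int (*pr_register)(struct block_device *bdev, u64 old_key,
                         u64 new_key, u32 flags);
      int (*pr_reserve)(struct block_device *bdev, u64 key,
                        enum pr_type type, u32 flags);
      int (*pr_release)(struct block_device *bdev, u64 key,
                        enum pr_type type);
      int (*pr_preempt)(struct block_device *bdev, u64 old_key,
                        u64 new_key, enum pr_type type, bool abort);
      int (*pr_clear)(struct block_device *bdev, u64 key);
  };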


Re: NVMe over Fabrics target implementation

2016-06-08 Thread Nicholas A. Bellinger
On Wed, 2016-06-08 at 16:12 +0300, Sagi Grimberg wrote:
> >> *) Extensible to multiple types of backend drivers.
> >>
> >> nvme-target needs a way to absorb new backend drivers that
> >> does not affect the existing configfs group layout or attributes.
> >>
> >> Looking at the nvmet/configfs layout as-is, there are no multiple
> >> backend types defined, nor a way to control backend feature bits
> >> exposed to nvme namespaces at runtime.
> 
> Hey Nic,
> 
> As for different types of backends, I still don't see a big justification
> for adding the LIO backends pscsi (as it doesn't make sense),
> ramdisk (we have brd), or file (losetup).
> 

The configfs ABI should not dictate a single backend use-case.

In the target-core ecosystem today, there are just as many
people using FILEIO atop local file-systems as there are
using IBLOCK and submit_bio().

As mentioned, target_core_iblock.c has absorbed the io-cmd.c
improvements so existing scsi target drivers can benefit
too.  Plus, having interested folks focusing on a single set of
backends for FILEIO + IBLOCK means both scsi and nvme
target drivers benefit from further improvements.

As we've already got both a target backend configfs ABI and a user
ecosystem using /sys/kernel/config/target/core/, it's a straightforward
way to share common code, while still allowing scsi and nvme to function
using their own independent fabric configfs ABI layouts.

Along with having common code and existing configfs
ABI, we also get a proper starting point for target-core
features that span across endpoints, and are defined for
both scsi and nvme.  PR APTPL immediately comes to mind.

Namely, the ability of one backend device to interact with both scsi
target luns and nvme target namespaces, as well as with different
backends across scsi and nvme exports.

> What kind of feature bits would you want to expose at runtime?

As for feature bits: basically everything currently reported, or to be
reported in the future, by ID_NS (Identify Namespace).  There is T10-PI,
and another example is copy-offload support, once the NVMe spec gets
that far.

The main point is that we should be able to add new feature bits to
common code in the target-core backend configfs ABI, without having to
change the individual scsi or nvme configfs ABIs.
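
As a hedged sketch of the kind of backend-level feature bit this implies,
using the stock configfs attribute helpers; example_dev and pi_enable are
made-up names for illustration, not actual target-core or nvmet code:

  #include <linux/configfs.h>
  #include <linux/kernel.h>
  #include <linux/module.h>

  /* hypothetical backend device; pi_enable stands in for any ID_NS bit */
  struct example_dev {
      struct config_group group;
      bool pi_enable;
  };

  static struct example_dev *to_example_dev(struct config_item *item)
  {
      return container_of(to_config_group(item), struct example_dev, group);
  }

  static ssize_t example_dev_pi_enable_show(struct config_item *item,
                                            char *page)
  {
      return sprintf(page, "%d\n", to_example_dev(item)->pi_enable);
  }

  static ssize_t example_dev_pi_enable_store(struct config_item *item,
                                             const char *page, size_t count)
  {
      bool val;

      if (kstrtobool(page, &val))
          return -EINVAL;
      to_example_dev(item)->pi_enable = val;
      return count;
  }
  CONFIGFS_ATTR(example_dev_, pi_enable);

  static struct configfs_attribute *example_dev_attrs[] = {
      &example_dev_attr_pi_enable,
      NULL,
  };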

> 
> > And that's very much intentional.  We have a very well working block
> > layer which we're going to use, no need to reinvent it.  The block
> > layer supports NVMe pass through just fine in case we'll need it,
> > as I spent the last year preparing it for that.
> >
> >> Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
> >> to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?
> >
> > Because it keeps the code simple.  If you had actually participated
> > on our development list you might have seen that until not too long
> > ago we had very fine-grained locks here.  In the end Armen convinced
> > me that it's easier to maintain if we don't bother with fine grained
> > locking outside the fast path, especially as it significantly simplifies
> > the discovery implementation.  If it ever turns out to be an
> > issue we can change it easily as the implementation is well encapsulated.
> 
> We did change that, and Nic is raising a valid point in terms of having
> a global mutex around all the ports. If the requirement of nvme
> subsystems and ports configuration is that it should happen fast enough
> and scale to the numbers that Nic is referring to, we'll need to change
> that back.
> 
> Having said that, I'm not sure this is a real hard requirement for RDMA
> and FC in the mid-term, because from what I've seen, the workloads Nic
> is referring to are more typical for iscsi/tcp where connections are
> cheaper and you need more of them to saturate high-speed interconnects, so
> we'll probably see this when we have nvme over tcp working.

Yes.

Further, my objections to the proposed nvmet configfs ABI are:

  - Doesn't support multiple backend types.
  - Doesn't provide a way to control backend feature bits separately from
    the fabric layout.
  - Doesn't provide a starting point for target features that span
    both scsi and nvme.
  - Doesn't allow for concurrent parallel configfs create + delete
    operations of subsystem NQNs across ports and host acls.
  - Global synchronization of nvmet_fabrics_ops->add_port().



Re: NVMe over Fabrics target implementation

2016-06-08 Thread Nicholas A. Bellinger
On Wed, 2016-06-08 at 14:19 +0200, Christoph Hellwig wrote:
> On Tue, Jun 07, 2016 at 10:21:41PM -0700, Nicholas A. Bellinger wrote:
> > *) Extensible to multiple types of backend drivers.
> > 
> > nvme-target needs a way to absorb new backend drivers that
> > does not affect the existing configfs group layout or attributes.
> > 
> > Looking at the nvmet/configfs layout as-is, there are no multiple
> > backend types defined, nor a way to control backend feature bits
> > exposed to nvme namespaces at runtime.
> 
> And that's very much intentional.  We have a very well working block
> > layer which we're going to use, no need to reinvent it.  The block
> layer supports NVMe pass through just fine in case we'll need it,
> as I spent the last year preparing it for that.
> 
> > Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
> > to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?
> 
> Because it keeps the code simple.  If you had actually participated
> on our development list you might have seen that until not too long
> > ago we had very fine-grained locks here.  In the end Armen convinced
> me that it's easier to maintain if we don't bother with fine grained
> locking outside the fast path, especially as it significantly simplifies
> > the discovery implementation.  If it ever turns out to be an
> issue we can change it easily as the implementation is well encapsulated.

Well, it's rather obvious from the design limitations highlighted
earlier that using a configfs model for nvmet was an afterthought.

You'll recall a prototype of nvmet using target_core_fabric_configfs.c,
which allowed nvmet to work out-of-the-box with LIO userspace, and which
fit directly into target_core_pr.c and other existing target-core
features that scsi and nvme share.

Since our discussions at LSF, I've taken the point that it makes sense
for nvmet to have its own configfs layout using common configfs
backends, which is reflected in the nvmet/configfs-ng RFC from Monday.

However, I completely disagree with you that scsi (specifically iscsi)
and nvme models are somehow incompatible from a target subsystem
configfs perspective.

They are not.

Point being: if you took the exact same three top-level configfs groups
in the nvmet implementation, using subsystem configfs symlinks into
ports and hosts, and did the same for iscsi by renaming them to:

/sys/kernel/config/iscsit/

subsystems -> iqns
ports -> network portals
hosts -> acls

You would have the exact same type of scale limitations with iscsi that
I have already highlighted.  There is nothing iscsi- or nvme-specific
about these limitations.

The point is that it's not a question of whether this configfs model is
required for an nvme target design; it's a bad model for any configfs
consumer design to follow.

To repeat, any time there are global locks and sequential lookups
across multiple top level configfs groups, you're doing configfs
wrong.

The whole point of configfs is to allow the vfs to reference-count data
structures using parent/child relationships, and to let configfs give
you the reference counting for free.

And getting that reference counting for free in configfs is what allows
existing target-core backends + endpoints to scale creation, operation
and deletion independently of each other.
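
Roughly what that looks like for a configfs consumer, with illustrative
names only: embed a config_group in your own structure and let configfs
pin it for you; explicit gets/puts are only needed when a reference has
to outlive the configfs callback that handed you the item:

  #include <linux/configfs.h>
  #include <linux/kernel.h>

  /* hypothetical backend device exposed as a configfs group */
  struct example_backend {
      struct config_group group;
      /* ... device state ... */
  };

  static struct example_backend *to_example_backend(struct config_item *item)
  {
      return container_of(to_config_group(item), struct example_backend,
                          group);
  }

  /*
   * While a child group or a fabric-side symlink references this group,
   * configfs keeps the item pinned; the helpers below are only needed
   * when a reference must outlive the configfs operation itself.
   */
  static struct example_backend *example_backend_get(struct config_item *item)
  {
      return to_example_backend(config_item_get(item));
  }

  static void example_backend_put(struct example_backend *be)
  {
      config_item_put(&be->group.cg_item);
  }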

Any type of medium or larger service provider and hosting environment
requires the ability of its target-mode storage control plane to
function independently of individual tenants.

The limitations of the nvmet/configfs design as-is make it rather
useless for these environments, and any type of upstream commitment to a
target-mode configfs-based ABI needs to be able to scale, at least, to
what we've already got in target_core_fabric_configfs.

Otherwise, it will eventually be thrown out for something that both fits
the needs of today's requirements, and can grow into the future while
never breaking user-space ABI compatibility.

We got the /sys/kernel/config/target/$FABRIC/ layout right the first
time back in 2008, have *never* had to break ABI compat, and we are
still able to scale to the requirements of today.

I'd like to hope we'd be able to achieve the same goals for nvmet in
2024.



Re: NVMe over Fabrics target implementation

2016-06-08 Thread Christoph Hellwig
On Wed, Jun 08, 2016 at 04:12:27PM +0300, Sagi Grimberg wrote:
>> Because it keeps the code simple.  If you had actually participated
>> on our development list you might have seen that until not too long
>> ago we had very fine-grained locks here.  In the end Armen convinced
>> me that it's easier to maintain if we don't bother with fine grained
>> locking outside the fast path, especially as it significantly simplifies
>> the discovery implementation.  If it ever turns out to be an
>> issue we can change it easily as the implementation is well encapsulated.
>
> We did change that, and Nic is raising a valid point in terms of having
> a global mutex around all the ports. If the requirement of nvme
> subsystems and ports configuration is that it should happen fast enough
> and scale to the numbers that Nic is referring to, we'll need to change
> that back.
>
> Having said that, I'm not sure this is a real hard requirement for RDMA
> and FC in the mid-term, because from what I've seen, the workloads Nic
> is referring to are more typical for iscsi/tcp where connections are
> cheaper and you need more of them to saturate high-speed interconnects, so
> we'll probably see this when we have nvme over tcp working.

I'm not really worried about connection establishment - that can be
changed to RCU locking really easily.  I'm a bit more worried about
the case where a driver would block for a long time in ->add_port.  But
let's worry about that if an actual user comes up.  The last thing we
need in a new driver is lots of complexity for hypothetical use cases;
I'm much more interested in having the driver simple, testable and
actually tested than in optimizing for something hypothetical.
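
A hedged sketch of what such an RCU conversion of the connect-time port
lookup could look like; the example_* names are illustrative only and
not the actual nvmet code:

  #include <linux/kernel.h>
  #include <linux/kref.h>
  #include <linux/list.h>
  #include <linux/rcupdate.h>
  #include <linux/slab.h>
  #include <linux/spinlock.h>
  #include <linux/types.h>

  struct example_port {
      struct list_head entry;
      u16 portid;
      struct kref ref;
      struct rcu_head rcu;
  };

  static LIST_HEAD(example_port_list);
  static DEFINE_SPINLOCK(example_port_lock);    /* writers only */

  static void example_port_free(struct kref *ref)
  {
      struct example_port *port = container_of(ref, struct example_port, ref);

      kfree_rcu(port, rcu);
  }

  /* configfs add path */
  static void example_port_add(struct example_port *port)
  {
      kref_init(&port->ref);
      spin_lock(&example_port_lock);
      list_add_tail_rcu(&port->entry, &example_port_list);
      spin_unlock(&example_port_lock);
  }

  /* connect path: lock-free lookup, grabs a reference before returning */
  static struct example_port *example_port_lookup(u16 portid)
  {
      struct example_port *port;

      rcu_read_lock();
      list_for_each_entry_rcu(port, &example_port_list, entry) {
          if (port->portid == portid && kref_get_unless_zero(&port->ref)) {
              rcu_read_unlock();
              return port;
          }
      }
      rcu_read_unlock();
      return NULL;
  }

  /* configfs remove path: unlink, then free once readers and refs are gone */
  static void example_port_remove(struct example_port *port)
  {
      spin_lock(&example_port_lock);
      list_del_rcu(&port->entry);
      spin_unlock(&example_port_lock);
      kref_put(&port->ref, example_port_free);
  }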

That is to say the priorities here are very different from Nic's goals
for the target code.


Re: NVMe over Fabrics target implementation

2016-06-08 Thread Sagi Grimberg



>> *) Extensible to multiple types of backend drivers.
>>
>> nvme-target needs a way to absorb new backend drivers that
>> does not affect the existing configfs group layout or attributes.
>>
>> Looking at the nvmet/configfs layout as-is, there are no multiple
>> backend types defined, nor a way to control backend feature bits
>> exposed to nvme namespaces at runtime.

Hey Nic,

As for different types of backends, I still don't see a big justification
for adding the LIO backends pscsi (as it doesn't make sense),
ramdisk (we have brd), or file (losetup).

What kind of feature bits would you want to expose at runtime?

> And that's very much intentional.  We have a very well working block
> layer which we're going to use, no need to reinvent it.  The block
> layer supports NVMe pass through just fine in case we'll need it,
> as I spent the last year preparing it for that.
>
>> Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
>> to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?
>
> Because it keeps the code simple.  If you had actually participated
> on our development list you might have seen that until not too long
> ago we had very fine-grained locks here.  In the end Armen convinced
> me that it's easier to maintain if we don't bother with fine grained
> locking outside the fast path, especially as it significantly simplifies
> the discovery implementation.  If it ever turns out to be an
> issue we can change it easily as the implementation is well encapsulated.

We did change that, and Nic is raising a valid point in terms of having
a global mutex around all the ports. If the requirement of nvme
subsystems and ports configuration is that it should happen fast enough
and scale to the numbers that Nic is referring to, we'll need to change
that back.

Having said that, I'm not sure this is a real hard requirement for RDMA
and FC in the mid-term, because from what I've seen, the workloads Nic
is referring to are more typical for iscsi/tcp where connections are
cheaper and you need more of them to saturate high-speed interconnects, so
we'll probably see this when we have nvme over tcp working.


Re: NVMe over Fabrics target implementation

2016-06-08 Thread Christoph Hellwig
On Tue, Jun 07, 2016 at 10:21:41PM -0700, Nicholas A. Bellinger wrote:
> *) Extensible to multiple types of backend drivers.
> 
> nvme-target needs a way to absorb new backend drivers that
> does not affect the existing configfs group layout or attributes.
> 
> Looking at the nvmet/configfs layout as-is, there are no multiple
> backend types defined, nor a way to control backend feature bits
> exposed to nvme namespaces at runtime.

And that's very much intentional.  We have a very well working block
layer which we're going to use, no need to reinvent it.  The block
layer supports NVMe pass through just fine in case we'll need it,
as I spent the last year preparing it for that.

> Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
> to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?

Because it keeps the code simple.  If you had actually participated
on our development list you might have seen that until not too long
ago we had very fine-grained locks here.  In the end Armen convinced
me that it's easier to maintain if we don't bother with fine grained
locking outside the fast path, especially as it significantly simplifies
the discovery implementation.  If it ever turns out to be an
issue we can change it easily as the implementation is well encapsulated.


Re: NVMe over Fabrics target implementation

2016-06-07 Thread Nicholas A. Bellinger
On Tue, 2016-06-07 at 12:55 +0200, Christoph Hellwig wrote:
> There is absolutely no point in dragging in an overcomplicated configfs 
> structure for a very simple protocol which also is very different from
> SCSI in its nitty-gritty details.

Please be more specific wrt the two individual points that have been
raised.

>  Keeping the nvme target self-contained
> allows it to be both much simpler and much easier to understand, as well
> as much more testable - see the amount of test coverage we could easily
> add for example.

I disagree.

> 
> Or to put it the other way around - if there was any major synergy in
> reusing the SCSI target code that just shows we're missing functionality
> in the block layer or configfs.
> 

To reiterate the points:

*) Extensible to multiple types of backend drivers.

nvme-target needs a way to absorb new backend drivers that
does not affect the existing configfs group layout or attributes.

Looking at the nvmet/configfs layout as-is, there are no multiple
backend types defined, nor a way to control backend feature bits
exposed to nvme namespaces at runtime.

What is being proposed is a way to share target-core backends via
existing configfs symlinks across SCSI and NVMe targets.

Which means:

   - All I/O state + memory submission is done at the RCU-protected
     se_device level via sbc_ops
   - percpu reference counting is done outside of target-core
     (see the sketch after this list)
   - Absorb all nvmet/io-cmd optimizations into target_core_iblock.c
   - Base starting point for features in SCSI + NVMe that span
     across multiple endpoints and instances (reservations + APTPL,
     multipath, copy-offload across fabric types)
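
A hedged sketch of the percpu reference counting referred to above, using
the kernel's generic percpu_ref API; example_ns and friends are made-up
names, not the actual target-core or nvmet symbols:

  #include <linux/completion.h>
  #include <linux/gfp.h>
  #include <linux/kernel.h>
  #include <linux/percpu-refcount.h>

  /* hypothetical namespace object; not an actual nvmet/target-core struct */
  struct example_ns {
      struct percpu_ref ref;        /* cheap to get/put in the I/O path */
      struct completion free_done;
  };

  static void example_ns_release(struct percpu_ref *ref)
  {
      struct example_ns *ns = container_of(ref, struct example_ns, ref);

      complete(&ns->free_done);     /* last reference dropped */
  }

  /* configfs enable path */
  static int example_ns_enable(struct example_ns *ns)
  {
      init_completion(&ns->free_done);
      return percpu_ref_init(&ns->ref, example_ns_release, 0, GFP_KERNEL);
  }

  /* I/O path: only succeed if the namespace is still live */
  static bool example_ns_get(struct example_ns *ns)
  {
      return percpu_ref_tryget_live(&ns->ref);
  }

  static void example_ns_put(struct example_ns *ns)
  {
      percpu_ref_put(&ns->ref);
  }

  /* configfs disable/teardown path: wait for outstanding I/O references */
  static void example_ns_disable(struct example_ns *ns)
  {
      percpu_ref_kill(&ns->ref);
      wait_for_completion(&ns->free_done);
      percpu_ref_exit(&ns->ref);
  }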

Using target-core backends means we get features that don't exist in
nvmet, like T10-PI and sbc_ops->write_same, for free, and we can utilize
a common set of backend drivers for SCSI and NVMe via an existing
configfs ABI and python userspace community.

And now to the second, and more important, point: defining a configfs
ABI that works both for today's requirements and into the 2020s without
breaking user-space compatibility.

As-is, the initial design using top-level nvmet configfs symlinks of
subsystem groups into individual port + host groups does not scale.

That is, the current code does the following:

  - Sequential list lookups under a global rw_mutex in the top-level
    nvmet_port and nvmet_host symlink ->allow_link() and ->drop_link()
    configfs callbacks.
  - The nvmet_fabrics_ops->add_port() callback invoked under that same
    global rw_mutex (roughly the pattern sketched below).
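
Roughly, the pattern being described looks like the sketch below; this is
an illustration of the structure, not the actual nvmet code, and all of
the example_* names are invented:

  #include <linux/configfs.h>
  #include <linux/kernel.h>
  #include <linux/list.h>
  #include <linux/rwsem.h>
  #include <linux/string.h>

  /* one lock serializing every subsystem/port/host configfs operation */
  static DECLARE_RWSEM(example_config_sem);
  static LIST_HEAD(example_subsys_list);

  struct example_subsys {
      struct list_head entry;
      const char *subsysnqn;
  };

  struct example_port {
      struct config_group group;
      int (*add_port)(struct example_port *port);
  };

  static struct example_port *to_example_port(struct config_item *item)
  {
      return container_of(to_config_group(item), struct example_port, group);
  }

  /* ->allow_link() for ports/<id>/subsystems/<nqn> symlink creation;
   * "src" is the subsystems group item, so its parent is the port */
  static int example_port_subsys_allow_link(struct config_item *src,
                                            struct config_item *target)
  {
      struct example_port *port = to_example_port(src->ci_parent);
      struct example_subsys *subsys;
      int ret = -EINVAL;

      down_write(&example_config_sem);
      /* sequential lookup of the symlink target under the global lock */
      list_for_each_entry(subsys, &example_subsys_list, entry) {
          if (!strcmp(subsys->subsysnqn, config_item_name(target))) {
              /* driver callback (which may sleep for seconds) invoked
               * under the same lock, stalling every other
               * subsystem/port/host configfs operation */
              ret = port->add_port(port);
              break;
          }
      }
      up_write(&example_config_sem);
      return ret;
  }

  static struct configfs_item_operations example_port_subsys_item_ops = {
      .allow_link = example_port_subsys_allow_link,
  };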

This is very bad for several reasons.

As-is, this blocks all other configfs port + host operations from
occurring even during normal operation, which makes it quite useless for
any type of multi-tenant target environment where the individual target
endpoints *must* be able to operate independently.

Seriously, there is never a good reason why configfs group or item
callbacks should be performing list lookup under a global lock at
this level.

Why does it ever make sense for $SUBSYSTEM_NQN_0 with $PORT_DRIVER_FOO
to block operation of $SUBSYSTEM_NQN_1 with $PORT_DRIVER_BAR..?

A simple example where this design breaks down quickly is an NVMf
ops->add_port() call that requires a HW reset or, say, a firmware reload
that can take multiple seconds (qla2xxx comes to mind).

There is a simple test to highlight this limitation.  Take any
nvme-target driver that is capable of multiple ports, and introduce
a sleep(5) into each ops->add_port() call.

Now create 256 different subsystem NQNs with 256 different ports
across four different user-space processes.

What happens to the other subsystem, port and host group configfs
symlinks when this occurs..?

What happens to the other user-space processes..?



Re: NVMe over Fabrics target implementation

2016-06-07 Thread Ming Lin
On Tue, Jun 7, 2016 at 2:02 PM, Andy Grover wrote:
> On 06/06/2016 11:23 PM, Nicholas A. Bellinger wrote:
>>
>> Hi HCH & Co,
>>
>> On Mon, 2016-06-06 at 23:22 +0200, Christoph Hellwig wrote:
>>>
>>> This patch set adds a generic NVMe over Fabrics target. The
>>> implementation conforms to the NVMe 1.2b specification (which
>>> includes Fabrics) and provides the NVMe over Fabrics access
>>> to Linux block devices.
>>>
>>
>> Thanks for all of the development work by the fabric_linux_driver team
>> (HCH, Sagi, Ming, James F., James S., and Dave M.) over the last year.
>>
>> Very excited to see this code get a public release now that NVMf
>> specification is out.  Now that it's in the wild, it's a good
>> opportunity to discuss some of the more interesting implementation
>> details, beyond the new NVMf wire-protocol itself.
>
>
> I'm sorry but I missed it, can you repeat the link to the NVMe spec(s)?

http://www.nvmexpress.org/wp-content/uploads/NVM_Express_1_2_1_Gold_20160603.pdf
http://www.nvmexpress.org/wp-content/uploads/NVMe_over_Fabrics_1_0_Gold_20160605.pdf


Re: NVMe over Fabrics target implementation

2016-06-07 Thread Andy Grover

On 06/06/2016 11:23 PM, Nicholas A. Bellinger wrote:

> Hi HCH & Co,
>
> On Mon, 2016-06-06 at 23:22 +0200, Christoph Hellwig wrote:
>> This patch set adds a generic NVMe over Fabrics target. The
>> implementation conforms to the NVMe 1.2b specification (which
>> includes Fabrics) and provides the NVMe over Fabrics access
>> to Linux block devices.
>
> Thanks for all of the development work by the fabric_linux_driver team
> (HCH, Sagi, Ming, James F., James S., and Dave M.) over the last year.
>
> Very excited to see this code get a public release now that NVMf
> specification is out.  Now that it's in the wild, it's a good
> opportunity to discuss some of the more interesting implementation
> details, beyond the new NVMf wire-protocol itself.


I'm sorry but I missed it, can you repeat the link to the NVMe spec(s)?

Thanks -- Andy



Re: NVMe over Fabrics target implementation

2016-06-07 Thread Bart Van Assche

On 06/07/2016 10:31 AM, Christoph Hellwig wrote:

> Of course you can, and every user of configfs relies on being able
> to create directories from the kernel using either the default_groups
> list, or through the use of configfs_register_group().

Apparently configfs_register_group() was added last year (Nov 2015;
commit 5cf6a51e6062)? This is interesting, and I think this functionality
would also be useful for LIO target drivers, e.g. ib_srpt and tcm_qla2xxx.


Bart.


Re: NVMe over Fabrics target implementation

2016-06-07 Thread Christoph Hellwig
Hi Bart,

> As you know with configfs it is not allowed to create directories in the 
> configfs hierarchy from inside the kernel.

Of course you can, and every user of configfs relies on being able
to create directories from the kernel using either the default_groups
list, or through the use of configfs_register_group().
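
A hedged sketch of the default_groups mechanism: a kernel-created child
directory appears as soon as the subsystem is registered.  The example_*
names are hypothetical, and note that the exact way default groups are
set up changed around v4.5 - older kernels used a NULL-terminated
default_groups array instead of configfs_add_default_group():

  #include <linux/configfs.h>
  #include <linux/init.h>
  #include <linux/module.h>
  #include <linux/mutex.h>

  static struct config_item_type example_child_type = {
      .ct_owner = THIS_MODULE,
  };

  static struct config_item_type example_root_type = {
      .ct_owner = THIS_MODULE,
  };

  static struct config_group example_ports_group;

  static struct configfs_subsystem example_configfs = {
      .su_group = {
          .cg_item = {
              .ci_namebuf = "example",
              .ci_type = &example_root_type,
          },
      },
  };

  static int __init example_init(void)
  {
      config_group_init(&example_configfs.su_group);
      mutex_init(&example_configfs.su_mutex);

      /* pre-create /sys/kernel/config/example/ports/ from the kernel */
      config_group_init_type_name(&example_ports_group, "ports",
                                  &example_child_type);
      configfs_add_default_group(&example_ports_group,
                                 &example_configfs.su_group);

      return configfs_register_subsystem(&example_configfs);
  }
  module_init(example_init);

  static void __exit example_exit(void)
  {
      configfs_unregister_subsystem(&example_configfs);
  }
  module_exit(example_exit);

  MODULE_LICENSE("GPL");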


Re: NVMe over Fabrics target implementation

2016-06-07 Thread Bart Van Assche

On 06/06/2016 02:22 PM, Christoph Hellwig wrote:

> The target is configured using configfs, and configurable entities are:
>  - NVMe subsystems and namespaces
>  - NVMe over Fabrics ports and referrals
>  - Host ACLs for primitive access control - NVMe over Fabrics access
>    control is still work in progress at the specification level and
>    will be implemented once that work has finished.


Hello Christoph,

As you know with configfs it is not allowed to create directories in the 
configfs hierarchy from inside the kernel. This means that users will 
have to create one directory per fabric port before configuring a target 
port. Other virtual filesystems, e.g. sysfs, allow directory creation by 
kernel drivers. This means that with sysfs the fabric port directory 
could be created by the nvmet driver. Is this how you want the nvmet 
driver to work?


Thanks,

Bart.


Re: NVMe over Fabrics target implementation

2016-06-07 Thread Christoph Hellwig
There is absolutely no point in dragging in an overcomplicated configfs 
structure for a very simple protocol which also is very different from
SCSI in its nitty-gritty details.  Keeping the nvme target self-contained
allows it to be both much simpler and much easier to understand, as well
as much more testable - see the amount of test coverage we could easily
add for example.

Or to put it the other way around - if there was any major synergy in
reusing the SCSI target code that just shows we're missing functionality
in the block layer or configfs.

The only thing where this is the case mid-term is persistent reservations,
but Mike Christie already has a plan for a pluggable PR API, which I'm
very interested in.


Re: NVMe over Fabrics target implementation

2016-06-07 Thread Nicholas A. Bellinger
Hi HCH & Co,

On Mon, 2016-06-06 at 23:22 +0200, Christoph Hellwig wrote:
> This patch set adds a generic NVMe over Fabrics target. The
> implementation conforms to the NVMe 1.2b specification (which
> includes Fabrics) and provides the NVMe over Fabrics access
> to Linux block devices.
> 

Thanks for all of the development work by the fabric_linux_driver team
(HCH, Sagi, Ming, James F., James S., and Dave M.) over the last year. 

Very excited to see this code get a public release now that NVMf
specification is out.  Now that it's in the wild, it's a good
opportunity to discuss some of the more interesting implementation
details, beyond the new NVMf wire-protocol itself.

(Adding target-devel + linux-scsi CC')

> The target implementation consists of several elements:
> 
> - NVMe target core: defines and manages the NVMe entities (subsystems,
>   controllers, namespaces, ...) and their allocation, responsible
>   for initial command processing and correct orchestration of
>   the stack setup and teardown.
> 
> - NVMe admin command implementation: responsible for parsing and
>   servicing admin commands such as controller identify, set features,
>   keep-alive, log page, ...
> 
> - NVMe I/O command implementation: responsible for performing the actual
>   I/O (Read, Write, Flush, Deallocate (aka Discard)).  It is a very thin
>   layer on top of the block layer and implements no logic of its own.
>   To support exporting file systems please use the loopback block driver
>   in direct I/O mode, which gives very good performance.
> 
> - NVMe over Fabrics support: responsible for servicing Fabrics commands
>   (connect, property get/set).
> 
> - NVMe over Fabrics discovery service: responsible for serving the Discovery
>   log page through a special cut-down Discovery controller.
> 
> The target is configured using configfs, and configurable entities are:
> 
>  - NVMe subsystems and namespaces
>  - NVMe over Fabrics ports and referrals
>  - Host ACLs for primitive access control - NVMe over Fabrics access
>control is still work in progress at the specification level and
>will be implemented once that work has finished.
> 
> To configure the target use the nvmetcli tool from
> http://git.infradead.org/users/hch/nvmetcli.git, which includes detailed
> setup documentation.
> 
> In addition to the Fabrics target implementation we provide a loopback
> driver which also conforms to the NVMe over Fabrics specification and allows
> evaluation of the target stack with local access without requiring a real
> fabric.
> 

So as-is, I have two main objections that have been discussed off-list for
some time, that won't be a big surprise to anyone following
fabrics_linux_driver list.  ;P

First topic: I think nvme-target namespaces should be utilizing
existing configfs logic, and sharing /sys/kernel/config/target/core/
backend driver symlinks as individual nvme-target subsystem namespaces.

That is, we've already got a configfs ABI in place for target mode
back-ends that today is able to operate independently from SCSI
architecture model dependencies.

To that end, the prerequisite series to allow target-core backends to
operate independently of se_cmd, and allow se_device backends to be
configfs symlinked directly into /sys/kernel/config/nvmet/, outside
of /sys/kernel/config/target/$FABRIC/, was posted earlier here:

http://marc.info/?l=linux-scsi&m=146527281416606&w=2

Note the -v2 series has absorbed the nvmet/io-cmd execute_rw()
improvements from Sagi + Ming (inline bio/bvec and blk_poll) into
target_core_iblock.c driver code.

Second topic, and more important from a kernel ABI perspective, are the
current scale limitations around the first pass of the nvmet configfs.c
layout code in /sys/kernel/config/nvmet/.

Namely, the design of having three top level configfs groups in
/sys/kernel/config/nvmet/[subsystems,ports,hosts] that are configfs
symlinked between each other, with a single rw_mutex (nvmet_config_sem)
used for global list lookup and enforcing a globally synchronized
nvmet_fabrics_ops->add_port() creation across all subsystem NQN ports.
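
For readers without the patches at hand, the layout being described is
roughly the following; this is a sketch of the structure described above,
not a verbatim listing from the posted series:

  /sys/kernel/config/nvmet/
  |-- hosts/
  |   `-- <host NQN>/
  |-- ports/
  |   `-- <port id>/
  `-- subsystems/
      `-- <subsystem NQN>/
          `-- namespaces/
              `-- <nsid>/

  ...plus configfs symlinks tying the three trees together, with every
  symlink create/delete and every ->add_port() call serialized under the
  single nvmet_config_sem.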

From the shared experience in target_core_fabric_configfs.c over the
last 8 years, perhaps the greatest strength of configfs has been its
ability to allow config_item_type parent/child relationships to exist
and operate independently of one another.

Specifically in the context of storage tenants, this means creation +
deletion of one backend + target fabric endpoint tenant, should not
block creation + deletion of another backend + target fabric endpoint
tenant.

As-is, an nvmet configfs layout holding a global mutex across
subsystem/port/host creation + deletion, and doing internal list lookups
within the configfs ->allow_link + ->drop_link callbacks, ends up being
severely limiting when scaling up the total number of nvmet subsystem
NQNs and ports.

Specifically, modern deployments of /sys/kernel/config/target/iscsi/
expect backends + fabric endpoints to be configured in parallel at
< 


NVMe over Fabrics target implementation

2016-06-06 Thread Christoph Hellwig
This patch set adds a generic NVMe over Fabrics target. The
implementation conforms to the NVMe 1.2b specification (which
includes Fabrics) and provides the NVMe over Fabrics access
to Linux block devices.

The target implementation consists of several elements:

- NVMe target core: defines and manages the NVMe entities (subsystems,
  controllers, namespaces, ...) and their allocation, responsible
  for initial command processing and correct orchestration of
  the stack setup and teardown.

- NVMe admin command implementation: responsible for parsing and
  servicing admin commands such as controller identify, set features,
  keep-alive, log page, ...

- NVMe I/O command implementation: responsible for performing the actual
  I/O (Read, Write, Flush, Deallocate (aka Discard)).  It is a very thin
  layer on top of the block layer and implements no logic of its own.
  To support exporting file systems please use the loopback block driver
  in direct I/O mode, which gives very good performance.

- NVMe over Fabrics support: responsible for servicing Fabrics commands
  (connect, property get/set).

- NVMe over Fabrics discovery service: responsible for serving the Discovery
  log page through a special cut-down Discovery controller.

The target is configured using configfs, and configurable entities are:

 - NVMe subsystems and namespaces
 - NVMe over Fabrics ports and referrals
 - Host ACLs for primitive access control - NVMe over Fabrics access
   control is still work in progress at the specification level and
   will be implemented once that work has finished.

To configure the target use the nvmetcli tool from
http://git.infradead.org/users/hch/nvmetcli.git, which includes detailed
setup documentation.

In addition to the Fabrics target implementation we provide a loopback
driver which also conforms to the NVMe over Fabrics specification and allows
evaluation of the target stack with local access without requiring a real
fabric.

Various test cases are provided for this implementation: nvmetcli
contains a python testsuite that mostly stresses the configfs interface
of the target, and we have various integration tests prepared for the
kernel host and target which are available at:

git://git.infradead.org/nvme-fabrics.git nvmf-selftests

Gitweb:


http://git.infradead.org/nvme-fabrics.git/shortlog/refs/heads/nvmf-selftests

This repository also contains patches from all the series posted today
in case you prefer using a git repository over collecting patches.


