Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-06 Thread Christoph Hellwig
On Tue, Jun 05, 2018 at 03:57:05PM -0700, Roland Dreier wrote:
> That makes sense but I'm not sure it covers everything.  Probably the
> most common way to do NVMe/RDMA will be with a single HCA that has
> multiple ports, so there's no sensible CPU locality.  On the other
> hand we want to keep both ports to the fabric busy.  Setting different
> paths for different queues makes sense, but there may be
> single-threaded applications that want a different policy.
> 
> I'm not saying anything very profound, but we have to find the right
> balance between too many and too few knobs.

Agreed.  And the philosophy here is to start with as few knobs
as possible and work from there based on actual use cases.
Single-threaded applications will run into issues with the general
blk-mq philosophy, so to work around that we'll need to dig deeper
and allow borrowing of other CPUs' queues if we want to cater to that.
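[Editorial illustration -- a toy userspace model, not blk-mq or nvme code,
with all names invented: under strict locality a single-threaded submitter
on one CPU only ever reaches its own queue (and hence one path); a
hypothetical "borrow" mode spreads its I/O over every queue so both
fabric ports see traffic.]

#include <stdbool.h>
#include <stdio.h>

#define NR_QUEUES 8     /* toy model: one hardware queue per CPU */

/* Strict locality: CPU n always uses queue n.  Borrow mode: a single
 * submitting CPU round-robins over every queue, reaching queues (and
 * therefore paths/ports) other than its own. */
static int pick_queue(int cpu, bool borrow, unsigned long seq)
{
        return borrow ? (int)(seq % NR_QUEUES) : cpu;
}

int main(void)
{
        for (unsigned long io = 0; io < 4; io++)
                printf("strict: io %lu from cpu 0 -> queue %d\n",
                       io, pick_queue(0, false, io));
        for (unsigned long io = 0; io < 4; io++)
                printf("borrow: io %lu from cpu 0 -> queue %d\n",
                       io, pick_queue(0, true, io));
        return 0;
}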


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-06 Thread Christoph Hellwig
On Wed, Jun 06, 2018 at 12:32:21PM +0300, Sagi Grimberg wrote:
> Huh? different paths == different controllers so this sentence can't
> be right... you mean that a path selector will select a controller
> based on the home node of the local rdma device connecting to it and
> the running cpu right?

Think of a system with, say, 8 CPU cores, and say we have two optimized
paths.

There is no point in going round-robin or service-time over the
two paths for each logical per-CPU queue.  Instead we should always
go to path A for a given CPU queue, or always to path B, to reduce
selection overhead and cache footprint.
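[Illustration: a minimal standalone sketch of the per-CPU-queue binding
described above -- each queue is tied to path A or path B once, up front,
so no selector runs per I/O.  Toy userspace code with invented names,
not the nvme driver.]

#include <stdio.h>

#define NR_CPUS 8

enum toy_path { PATH_A, PATH_B };

static enum toy_path queue_to_path[NR_CPUS];

/* Bind each per-CPU queue to a fixed path at setup time: half the
 * queues go to path A, half to path B, and the hot path never chooses. */
static void bind_queues(void)
{
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                queue_to_path[cpu] = (cpu < NR_CPUS / 2) ? PATH_A : PATH_B;
}

int main(void)
{
        bind_queues();
        for (int cpu = 0; cpu < NR_CPUS; cpu++)
                printf("cpu queue %d -> path %c\n", cpu,
                       queue_to_path[cpu] == PATH_A ? 'A' : 'B');
        return 0;
}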


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-06 Thread Sagi Grimberg

We plan to implement all the fancy NVMe standards like ANA, but it
seems that there is still a requirement to let the host side choose
policies about how to use paths (round-robin vs least queue depth for
example).  Even in the modern SCSI world with VPD pages and ALUA,
there are still knobs that are needed.  Maybe NVMe will be different
and we can find defaults that work in all cases but I have to admit
I'm skeptical...


The sensible thing to do in nvme is to use different paths for
different queues.


Huh? different paths == different controllers so this sentence can't
be right... you mean that a path selector will select a controller
based on the home node of the local rdma device connecting to it and
the running cpu right?


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-05 Thread Roland Dreier
> The sensible thing to do in nvme is to use different paths for
> different queues.  That is, e.g. in the RDMA case, use the HCA closer
> to a given CPU by default.  We might allow overriding this for
> cases where there is a good reason, but what I really don't want is
> configurability for configurability's sake.

That makes sense but I'm not sure it covers everything.  Probably the
most common way to do NVMe/RDMA will be with a single HCA that has
multiple ports, so there's no sensible CPU locality.  On the other
hand we want to keep both ports to the fabric busy.  Setting different
paths for different queues makes sense, but there may be
single-threaded applications that want a different policy.

I'm not saying anything very profound, but we have to find the right
balance between too many and too few knobs.

 - R.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-04 Thread Christoph Hellwig
On Mon, Jun 04, 2018 at 02:58:49PM -0700, Roland Dreier wrote:
> We plan to implement all the fancy NVMe standards like ANA, but it
> seems that there is still a requirement to let the host side choose
> policies about how to use paths (round-robin vs least queue depth for
> example).  Even in the modern SCSI world with VPD pages and ALUA,
> there are still knobs that are needed.  Maybe NVMe will be different
> and we can find defaults that work in all cases but I have to admit
> I'm skeptical...

The sensible thing to do in nvme is to use different paths for
different queues.  That is, e.g. in the RDMA case, use the HCA closer
to a given CPU by default.  We might allow overriding this for
cases where there is a good reason, but what I really don't want is
configurability for configurability's sake.
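[Illustration: a hedged sketch of the "HCA closer to a given CPU" default
as a standalone toy program.  Field names such as numa_node are
assumptions for illustration, not the driver's actual data structures:
prefer the path whose RDMA device shares the submitting CPU's NUMA node,
otherwise fall back to the first path.]

#include <stdio.h>

struct toy_path {
        const char *name;
        int numa_node;  /* home node of the RDMA device behind this path */
};

/* Prefer the path whose HCA lives on the same NUMA node as the CPU;
 * fall back to the first path if nothing is local. */
static const struct toy_path *pick_default_path(const struct toy_path *paths,
                                                int nr, int cpu_node)
{
        for (int i = 0; i < nr; i++)
                if (paths[i].numa_node == cpu_node)
                        return &paths[i];
        return &paths[0];
}

int main(void)
{
        struct toy_path paths[] = { { "rdma0", 0 }, { "rdma1", 1 } };

        printf("cpu on node 1 -> %s\n",
               pick_default_path(paths, 2, 1)->name);
        return 0;
}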


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-04 Thread Roland Dreier
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well... So there is another angle to this.

As a vendor who is building an NVMe-oF storage array, I can say that
clarity around how Linux wants to handle NVMe multipath would
definitely be appreciated.  It would be great if we could all converge
around the upstream native driver but right now it doesn't look
adequate - having only a single active path is not the best way to use
a multi-controller storage system.  Unfortunately it looks like we're
headed to a world where people have to write separate "best practices"
documents to cover RHEL, SLES and other vendors.

We plan to implement all the fancy NVMe standards like ANA, but it
seems that there is still a requirement to let the host side choose
policies about how to use paths (round-robin vs least queue depth for
example).  Even in the modern SCSI world with VPD pages and ALUA,
there are still knobs that are needed.  Maybe NVMe will be different
and we can find defaults that work in all cases but I have to admit
I'm skeptical...

 - R.
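[Illustration: the host-side policies Roland names differ roughly as
follows.  The toy sketch below shows a least-queue-depth selector --
pick the path with the fewest outstanding requests -- as opposed to
round-robin, which rotates regardless of load.  Nothing here is taken
from dm-multipath or the nvme code; all names are invented.]

#include <stdio.h>

struct toy_path {
        const char *name;
        int inflight;   /* outstanding requests on this path */
};

/* Least-queue-depth: choose the path with the fewest in-flight I/Os. */
static struct toy_path *least_queue_depth(struct toy_path *p, int nr)
{
        struct toy_path *best = &p[0];

        for (int i = 1; i < nr; i++)
                if (p[i].inflight < best->inflight)
                        best = &p[i];
        return best;
}

int main(void)
{
        struct toy_path paths[] = { { "path0", 12 }, { "path1", 3 } };

        printf("selected %s\n", least_queue_depth(paths, 2)->name);
        return 0;
}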


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-04 Thread Johannes Thumshirn
On Mon, Jun 04, 2018 at 02:46:47PM +0300, Sagi Grimberg wrote:
> I agree with Christoph that changing personality on the fly is going to
> be painful. This opt-in will need to be one-shot at connect time. For
> that, we will probably need to also expose an argument in nvme-cli too.
> Changing the mpath personality will need to involve disconnecting the
> controller and connecting again with the argument toggled. I think this
> is the only sane way to do this.

If we still want to make it dynamic, yes. I've raised this concern
while working on the patch as well.

> Another path we can make progress in is user visibility. We have
> topology in place and you mentioned primary path (which we could
> probably add). What else do you need for multipath-tools to support
> nvme?

I think the first priority is getting a notion of NVMe into multipath-tools,
like I said elsewhere, and then see from there. Martin Wilck was already
working on patches for this.

-- 
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-04 Thread Sagi Grimberg

[so much for putting out flames... :/]


This projecting onto me that I've not been keeping the conversation
technical is in itself hostile.  Sure I get frustrated and lash out (as
I'm _sure_ you'll feel in this reply)


You're right, I do feel this is lashing out. And I don't appreciate it.
Please stop it. We're not going to make progress otherwise.


Can you (or others) please try and articulate why a "fine grained"
multipathing is an absolute must? At the moment, I just don't
understand.


Already made the point multiple times in this thread [3][4][5][1].
Hint: it is about the users who have long-standing expertise and
automation built around dm-multipath and multipath-tools.  BUT those
same users may need/want to simultaneously use native NVMe multipath on
the same host.  Dismissing this point or acting like I haven't
articulated it just illustrates to me continuing this conversation is
not going to be fruitful.


The vast majority of the points are about the fact that people still
need to be able to use multipath-tools, which they still can today.
Personally, I question the existence of this user base you are referring
to, which would want to maintain both dm-multipath and nvme personalities
at the same time on the same host. But I do want us to make progress, so
I will have to take this need as a given.

I agree with Christoph that changing personality on the fly is going to
be painful. This opt-in will need to be one-shot at connect time. For
that, we will probably need to also expose an argument in nvme-cli too.
Changing the mpath personality will need to involve disconnecting the
controller and connecting again with the argument toggled. I think this
is the only sane way to do this.
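[Illustration: the shape of a connect-time-only opt-in, with identifiers
invented for the sketch (this is not the proposed ABI): the personality
is captured once when the controller is connected, and changing it means
disconnecting and connecting again with the toggle flipped.]

#include <stdbool.h>
#include <stdio.h>

enum toy_personality { MPATH_NATIVE, MPATH_DM };

struct toy_ctrl {
        enum toy_personality personality;  /* fixed when the controller is connected */
        bool connected;
};

/* The only way to change personality is a full disconnect/reconnect. */
static struct toy_ctrl connect_ctrl(enum toy_personality p)
{
        return (struct toy_ctrl){ .personality = p, .connected = true };
}

int main(void)
{
        struct toy_ctrl c = connect_ctrl(MPATH_NATIVE);

        c.connected = false;            /* disconnect ... */
        c = connect_ctrl(MPATH_DM);     /* ... reconnect with toggle flipped */
        printf("personality now %s\n",
               c.personality == MPATH_DM ? "dm-multipath" : "native");
        return 0;
}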

Another path we can make progress in is user visibility. We have
topology in place and you mentioned primary path (which we could
probably add). What else do you need for multipath-tools to support
nvme?


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-03 Thread Mike Snitzer
On Sun, Jun 03 2018 at  7:00P -0400,
Sagi Grimberg  wrote:

> 
> >I'm aware that most everything in multipath.conf is SCSI/FC specific.
> >That isn't the point.  dm-multipath and multipathd are an existing
> >framework for managing multipath storage.
> >
> >It could be made to work with NVMe.  But yes it would not be easy.
> >Especially not with the native NVMe multipath crew being so damn
> >hostile.
> 
> The resistance is not a hostile act. Please try and keep the
> discussion technical.

This projecting onto me that I've not been keeping the conversation
technical is in itself hostile.  Sure I get frustrated and lash out (as
I'm _sure_ you'll feel in this reply) but I've been beating my head
against the wall on the need for native NVMe multipath and dm-multipath
to coexist in a fine-grained manner for literally 2 years!

But for the time-being I was done dwelling on the need for a switch like
mpath_personality.  Yet you persist.  If you read the latest messages in
this thread [1] and still elected to send this message, then _that_ is a
hostile act.  Because I have been nothing but informative.  The fact you
choose not to care, appreciate or have concern for users' experience
isn't my fault.

And please don't pretend like the entire evolution of native NVMe
multipath was anything but one elaborate hostile act against
dm-multipath.  To deny that would simply discredit your entire
viewpoint on this topic.

Even smaller decisions that were communicated in person and then later
unilaterally reversed were hostile.  Examples:
1) ANA would serve as a scsi device handler-like (multipath agnostic)
   feature to enhance namespaces -- now you can see in the v2
   implementation that certainly isn't the case
2) The dm-multipath path-selectors were going to be elevated for use by
   both native NVMe multipath and dm-multipath -- now people are
   implementing yet another round-robin path selector directly in NVMe.

I get it, Christoph (and others by association) are operating from a
"winning" position that was hostilely taken and now the winning position
is being leveraged to further ensure dm-multipath has no hope of being a
viable alternative to native NVMe multipath -- at least not without a
lot of work to refactor code to be unnecessarily homed in the
CONFIG_NVME_MULTIPATH=y sandbox.

> >>But I don't think the burden of allowing multipathd/DM to inject
> >>themselves into the path transition state machine has any benefit
> >>whatsoever to the user. It's only complicating things and therefore we'd
> >>be doing people a disservice rather than a favor.
> >
> >This notion that only native NVMe multipath can be successful is utter
> >bullshit.  And the mere fact that I've gotten such a reaction from a
> >select few speaks to some serious control issues.
> >
> >Imagine if XFS developers just one day imposed that it is the _only_
> >filesystem that can be used on persistent memory.
> >
> >Just please dial it back.. seriously tiresome.
> 
> Mike, you make a fair point on multipath tools being more mature
> compared to NVMe multipathing. But this is not the discussion at all (at
> least not from my perspective). There was not a single use-case that
> gave a clear-cut justification for a per-subsystem personality switch
> (other than some far fetched imaginary scenarios). This is not unusual
> for the kernel community not to accept things with little to no use,
> especially when it involves exposing a userspace ABI.

The interfaces dm-multipath and multipath-tools provide are exactly the
issue.  So which is it: do I have a valid use case, like you indicated
before [2], or am I just talking nonsense (with hypotheticals because I
was baited to do so)?  NOTE: even in your [2] reply you also go on to
say that "no one is forbidden to use [dm-]multipath", when the reality
is that users will be forbidden as-is.

If you and others genuinely think that disallowing dm-multipath from
being able to manage NVMe devices if CONFIG_NVME_MULTIPATH is enabled
(and not shut off via nvme_core.multipath=N) is a reasonable action then
you're actively complicit in limiting users from continuing to use the
long-established dm-multipath based infrastructure that Linux has had
for over 10 years.

There is literally no reason why they need to be mutually exclusive
(other than that granting otherwise would erode the "winning" position hch
et al have been operating from).

The implementation of the switch to allow fine-grained control does need
proper care and review and buy-in.  But I'm sad to see there is literally
zero willingness to even acknowledge that it is "the right thing to do".
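[Illustration of the granularity at stake, as a toy model with invented
names (not the actual patch): today nvme_core.multipath is effectively a
single host-wide toggle, whereas the switch being argued for would let
each subsystem carry its own personality.]

#include <stdbool.h>
#include <stdio.h>

/* Today: one global knob decides for every NVMe subsystem on the host. */
static bool global_native_multipath = true;     /* analogue of nvme_core.multipath */

/* Fine-grained direction: each subsystem carries its own flag, so
 * dm-multipath could claim one subsystem while another stays native. */
struct toy_subsys {
        const char *name;
        bool native_mpath;
};

static bool use_native(const struct toy_subsys *s, bool fine_grained)
{
        return fine_grained ? s->native_mpath : global_native_multipath;
}

int main(void)
{
        struct toy_subsys a = { "subsys0", true };
        struct toy_subsys b = { "subsys1", false };

        printf("%s -> %s\n", a.name, use_native(&a, true) ? "native" : "dm-multipath");
        printf("%s -> %s\n", b.name, use_native(&b, true) ? "native" : "dm-multipath");
        return 0;
}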

> As for now, all I see is a disclaimer saying that it'd need to be
> nurtured over time as the NVMe spec evolves.
> 
> Can you (or others) please try and articulate why a "fine grained"
> multipathing is an absolute must? At the moment, I just don't
> understand.

Already made the point multiple times in this thread [3][4][5][1].
Hint: it is about the users who have long-standing expertise 


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-03 Thread Sagi Grimberg

I'm aware that most everything in multipath.conf is SCSI/FC specific.
That isn't the point.  dm-multipath and multipathd are an existing
framework for managing multipath storage.

It could be made to work with NVMe.  But yes it would not be easy.
Especially not with the native NVMe multipath crew being so damn
hostile.


The resistance is not a hostile act. Please try and keep the
discussion technical.


But I don't think the burden of allowing multipathd/DM to inject
themselves into the path transition state machine has any benefit
whatsoever to the user. It's only complicating things and therefore we'd
be doing people a disservice rather than a favor.


This notion that only native NVMe multipath can be successful is utter
bullshit.  And the mere fact that I've gotten such a reaction from a
select few speaks to some serious control issues.

Imagine if XFS developers just one day imposed that it is the _only_
filesystem that can be used on persistent memory.

Just please dial it back.. seriously tiresome.


Mike, you make a fair point on multipath tools being more mature
compared to NVMe multipathing. But this is not the discussion at all (at
least not from my perspective). There was not a single use-case that
gave a clear-cut justification for a per-subsystem personality switch
(other than some far fetched imaginary scenarios). This is not unusual
for the kernel community not to accept things with little to no use,
especially when it involves exposing a userspace ABI.

As for now, all I see is a disclaimer saying that it'd need to be
nurtured over time as the NVMe spec evolves.

Can you (or others) please try and articulate why a "fine grained"
multipathing is an absolute must? At the moment, I just don't
understand.

Also, I get your point that exposing state/stats information to
userspace is needed. That's a fair comment.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-01 Thread Mike Snitzer
On Fri, Jun 01 2018 at 10:09am -0400,
Martin K. Petersen  wrote:

> 
> Good morning Mike,
> 
> > This notion that only native NVMe multipath can be successful is utter
> > bullshit.  And the mere fact that I've gotten such a reaction from a
> > select few speaks to some serious control issues.
> 
> Please stop making this personal.

It cuts both ways, but I agree.

> > Imagine if XFS developers just one day imposed that it is the _only_
> > filesystem that can be used on persistent memory.
> 
> It's not about project X vs. project Y at all. This is about how we got
> to where we are today. And whether we are making right decisions that
> will benefit our users in the long run.
> 
> 20 years ago there were several device-specific SCSI multipath drivers
> available for Linux. All of them out-of-tree because there was no good
> way to consolidate them. They all worked in very different ways because
> the devices themselves were implemented in very different ways. It was a
> nightmare.
> 
> At the time we were very proud of our block layer, an abstraction none
> of the other operating systems really had. And along came Ingo and
> Miguel and did a PoC MD multipath implementation for devices that didn't
> have special needs. It was small, beautiful, and fit well into our shiny
> block layer abstraction. And therefore everyone working on Linux storage
> at the time was convinced that the block layer multipath model was the
> right way to go. Including, I must emphasize, yours truly.
> 
> There were several reasons why the block + userland model was especially
> compelling:
> 
>  1. There were no device serial numbers, UUIDs, or VPD pages. So short
>  of disk labels, there was no way to automatically establish that block
>  device sda was in fact the same LUN as sdb. MD and DM were existing
>  vehicles for describing block device relationships. Either via on-disk
>  metadata or config files and device mapper tables. And system
>  configurations were simple and static enough then that manually
>  maintaining a config file wasn't much of a burden.
> 
>  2. There was lots of talk in the industry about devices supporting
>  heterogeneous multipathing. As in ATA on one port and SCSI on the
>  other. So we deliberately did not want to put multipathing in SCSI,
>  anticipating that these hybrid devices might show up (this was in the
>  IDE days, obviously, predating libata sitting under SCSI). We made
>  several design compromises wrt. SCSI devices to accommodate future
>  coexistence with ATA. Then iSCSI came along and provided a "cheaper
>  than FC" solution and everybody instantly lost interest in ATA
>  multipath.
> 
>  3. The devices at the time needed all sorts of custom knobs to
>  function. Path checkers, load balancing algorithms, explicit failover,
>  etc. We needed a way to run arbitrary, potentially proprietary,
>  commands to initiate failover and failback. Absolute no-go for the
>  kernel so userland it was.
> 
> Those are some of the considerations that went into the original MD/DM
> multipath approach. Everything made lots of sense at the time. But
> obviously the industry constantly changes, things that were once
> important no longer matter. Some design decisions were made based on
> incorrect assumptions or lack of experience and we ended up with major
> ad-hoc workarounds to the originally envisioned approach. SCSI device
> handlers are the prime examples of how the original transport-agnostic
> model didn't quite cut it. Anyway. So here we are. Current DM multipath
> is a result of a whole string of design decisions, many of which are
> based on assumptions that were valid at the time but which are no longer
> relevant today.
> 
> ALUA came along in an attempt to standardize all the proprietary device
> interactions, thus obsoleting the userland plugin requirement. It also
> solved the ID/discovery aspect as well as provided a way to express
> fault domains. The main problem with ALUA was that it was too
> permissive, letting storage vendors get away with very suboptimal, yet
> compliant, implementations based on their older, proprietary multipath
> architectures. So we got the knobs standardized, but device behavior was
> still all over the place.
> 
> Now enter NVMe. The industry had a chance to clean things up. No legacy
> architectures to accommodate, no need for explicit failover, twiddling
> mode pages, reading sector 0, etc. The rationale behind ANA is for
> multipathing to work without any of the explicit configuration and
> management hassles which riddle SCSI devices for hysterical raisins.

Nice recap for those who aren't aware of the past (decision tree and
considerations that influenced the design of DM multipath).

> My objection to DM vs. NVMe enablement is that I think that the two
> models are a very poor fit (manually configured individual block device
> mapping vs. automatic grouping/failover above and below subsystem
> level). On top of that, no compelling technical reason 


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-06-01 Thread Martin K. Petersen


Good morning Mike,

> This notion that only native NVMe multipath can be successful is utter
> bullshit.  And the mere fact that I've gotten such a reaction from a
> select few speaks to some serious control issues.

Please stop making this personal.

> Imagine if XFS developers just one day imposed that it is the _only_
> filesystem that can be used on persistent memory.

It's not about project X vs. project Y at all. This is about how we got
to where we are today. And whether we are making right decisions that
will benefit our users in the long run.

20 years ago there were several device-specific SCSI multipath drivers
available for Linux. All of them out-of-tree because there was no good
way to consolidate them. They all worked in very different ways because
the devices themselves were implemented in very different ways. It was a
nightmare.

At the time we were very proud of our block layer, an abstraction none
of the other operating systems really had. And along came Ingo and
Miguel and did a PoC MD multipath implementation for devices that didn't
have special needs. It was small, beautiful, and fit well into our shiny
block layer abstraction. And therefore everyone working on Linux storage
at the time was convinced that the block layer multipath model was the
right way to go. Including, I must emphasize, yours truly.

There were several reasons why the block + userland model was especially
compelling:

 1. There were no device serial numbers, UUIDs, or VPD pages. So short
 of disk labels, there was no way to automatically establish that block
 device sda was in fact the same LUN as sdb. MD and DM were existing
 vehicles for describing block device relationships. Either via on-disk
 metadata or config files and device mapper tables. And system
 configurations were simple and static enough then that manually
 maintaining a config file wasn't much of a burden.

 2. There was lots of talk in the industry about devices supporting
 heterogeneous multipathing. As in ATA on one port and SCSI on the
 other. So we deliberately did not want to put multipathing in SCSI,
 anticipating that these hybrid devices might show up (this was in the
 IDE days, obviously, predating libata sitting under SCSI). We made
 several design compromises wrt. SCSI devices to accommodate future
 coexistence with ATA. Then iSCSI came along and provided a "cheaper
 than FC" solution and everybody instantly lost interest in ATA
 multipath.

 3. The devices at the time needed all sorts of custom knobs to
 function. Path checkers, load balancing algorithms, explicit failover,
 etc. We needed a way to run arbitrary, potentially proprietary,
 commands to initiate failover and failback. Absolute no-go for the
 kernel so userland it was.

Those are some of the considerations that went into the original MD/DM
multipath approach. Everything made lots of sense at the time. But
obviously the industry constantly changes, things that were once
important no longer matter. Some design decisions were made based on
incorrect assumptions or lack of experience and we ended up with major
ad-hoc workarounds to the originally envisioned approach. SCSI device
handlers are the prime examples of how the original transport-agnostic
model didn't quite cut it. Anyway. So here we are. Current DM multipath
is a result of a whole string of design decisions, many of which are
based on assumptions that were valid at the time but which are no longer
relevant today.

ALUA came along in an attempt to standardize all the proprietary device
interactions, thus obsoleting the userland plugin requirement. It also
solved the ID/discovery aspect as well as provided a way to express
fault domains. The main problem with ALUA was that it was too
permissive, letting storage vendors get away with very suboptimal, yet
compliant, implementations based on their older, proprietary multipath
architectures. So we got the knobs standardized, but device behavior was
still all over the place.

Now enter NVMe. The industry had a chance to clean things up. No legacy
architectures to accommodate, no need for explicit failover, twiddling
mode pages, reading sector 0, etc. The rationale behind ANA is for
multipathing to work without any of the explicit configuration and
management hassles which riddle SCSI devices for hysterical raisins.

My objection to DM vs. NVMe enablement is that I think that the two
models are a very poor fit (manually configured individual block device
mapping vs. automatic grouping/failover above and below subsystem
level). On top of that, no compelling technical reason has been offered
for why DM multipath is actually a benefit. Nobody enjoys pasting WWNs
or IQNs into multipath.conf to get things working. And there is no flag
day/transition path requirement for devices that (with very few
exceptions) don't actually exist yet.

So I really don't understand why we must pound a square peg into a round
hole. NVMe is a different protocol. It is based on several 


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Mike Snitzer
On Thu, May 31 2018 at 10:40pm -0400,
Martin K. Petersen  wrote:

> 
> Mike,
> 
> > 1) container A is tasked with managing some dedicated NVMe technology
> > that absolutely needs native NVMe multipath.
> 
> > 2) container B is tasked with offering some canned layered product
> > that was developed ontop of dm-multipath with its own multipath-tools
> > oriented APIs, etc. And it is to manage some other NVMe technology on
> > the same host as container A.
> 
> This assumes there is something to manage. And that the administrative
> model currently employed by DM multipath will be easily applicable to
> ANA devices. I don't believe that's the case. The configuration happens
> on the storage side, not on the host.

Fair point.

> With ALUA (and the proprietary implementations that predated the spec),
> it was very fuzzy whether it was the host or the target that owned
> responsibility for this or that. Part of the reason was that ALUA was
> deliberately vague to accommodate everybody's existing, non-standards
> compliant multipath storage implementations.
> 
> With ANA the heavy burden falls entirely on the storage. Most of the
> things you would currently configure in multipath.conf have no meaning
> in the context of ANA. Things that are currently the domain of
> dm-multipath or multipathd are inextricably living either in the storage
> device or in the NVMe ANA "device handler". And I think you are
> significantly underestimating the effort required to expose that
> information up the stack and to make use of it. That's not just a
> multipath personality toggle switch.

I'm aware that most everything in multipath.conf is SCSI/FC specific.
That isn't the point.  dm-multipath and multipathd are an existing
framework for managing multipath storage.

It could be made to work with NVMe.  But yes it would not be easy.
Especially not with the native NVMe multipath crew being so damn
hostile.

> If you want to make multipath -ll show something meaningful for ANA
> devices, then by all means go ahead. I don't have any problem with
> that.

Thanks so much for your permission ;)  But I'm actually not very
involved with multipathd development anyway.  It is likely a better use
of time in the near-term though.  Making the multipath tools and
libraries able to understand native NVMe multipath in all its glory
might be a means to an end, from the perspective of compatibility with
existing monitoring applications.

Though NVMe just doesn't have per-device accounting at all.  I'm also not
yet aware how nvme-cli conveys paths being down vs. up, etc.

Glad that isn't my problem ;)

> But I don't think the burden of allowing multipathd/DM to inject
> themselves into the path transition state machine has any benefit
> whatsoever to the user. It's only complicating things and therefore we'd
> be doing people a disservice rather than a favor.

This notion that only native NVMe multipath can be successful is utter
bullshit.  And the mere fact that I've gotten such a reaction from a
select few speaks to some serious control issues.

Imagine if XFS developers just one day imposed that it is the _only_
filesystem that can be used on persistent memory.

Just please dial it back.. seriously tiresome.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Mike Snitzer
On Thu, May 31 2018 at 12:34pm -0400,
Christoph Hellwig  wrote:

> On Thu, May 31, 2018 at 08:37:39AM -0400, Mike Snitzer wrote:
> > I saw your reply to the 1/3 patch.. I do agree it is broken for not
> > checking if any handles are active.  But that is easily fixed no?
> 
> Doing a switch at runtime simply is a really bad idea.  If for some
> reason we end up with a good per-controller switch it would have
> to be something set at probe time, and to get it on a controller
> you'd need to reset it first.

Yes, I see that now.  And the implementation would need to be something
you or other more seasoned NVMe developers pursued.  NVMe code is
pretty unforgiving.

I took a crack at aspects of this, and my head hurts.  While testing I hit
some "interesting" lack of self-awareness about NVMe resources that are
in use: lots of associations can be torn down rather than failing
gracefully.  It could be nvme_fcloop specific, but it is pretty easy to
reproduce using mptest's lib/unittests/nvme_4port_create.sh
followed by: modprobe -r nvme_fcloop
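
(Spelled out, the reproducer is roughly the following -- assuming mptest is
checked out and the unit test script is run from the top of the mptest tree:

  ./lib/unittests/nvme_4port_create.sh
  modprobe -r nvme_fcloop
)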

Results in an infinite spew of:
[14245.345759] nvme_fcloop: fcloop_exit: Failed deleting remote port
[14245.351851] nvme_fcloop: fcloop_exit: Failed deleting target port
[14245.357944] nvme_fcloop: fcloop_exit: Failed deleting remote port
[14245.364038] nvme_fcloop: fcloop_exit: Failed deleting target port

Another fun one is to run lib/unittests/nvme_4port_delete.sh while the
native NVMe multipath device (created by nvme_4port_create.sh) is
still in use by an xfs mount, so:
./nvme_4port_create.sh
mount /dev/nvme1n1 /mnt
./nvme_4port_delete.sh
umount /mnt

Those were clear screwups on my part but I wouldn't have expected them
to cause nvme to blow through so many stop signs.

Anyway, I put enough time into trying to make the previously thought
"simple" mpath_personality switch safe -- in the face of active handles
(the issue Sagi pointed out) -- that it is clear NVMe just doesn't have
enough state to do it in a clean way.  It would require a deeper
understanding of the code than I have.  Almost every NVMe function
returns void, so there is basically no potential for error handling (in
the face of a resource being in use).

The following is my WIP patch (built on top of the 3 patches from
this thread's series) that has cured me of wanting to continue pursuing
a robust implementation of the runtime 'mpath_personality' switch:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e018d0..80103b3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2146,10 +2146,8 @@ static ssize_t __nvme_subsys_store_mpath_personality(struct nvme_subsystem *subs
 		goto out;
 	}
 
-	if (subsys->native_mpath != native_mpath) {
-		subsys->native_mpath = native_mpath;
-		ret = nvme_mpath_change_personality(subsys);
-	}
+	if (subsys->native_mpath != native_mpath)
+		ret = nvme_mpath_change_personality(subsys, native_mpath);
 out:
 	return ret ? ret : count;
 }
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 53d2610..017c924 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -247,26 +247,57 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 	put_disk(head->disk);
 }
 
-int nvme_mpath_change_personality(struct nvme_subsystem *subsys)
+static bool __nvme_subsys_in_use(struct nvme_subsystem *subsys)
 {
 	struct nvme_ctrl *ctrl;
-	int ret = 0;
+	struct nvme_ns *ns, *next;
 
-restart:
-	mutex_lock(&subsys->lock);
 	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
-		if (!list_empty(&ctrl->namespaces)) {
-			mutex_unlock(&subsys->lock);
-			nvme_remove_namespaces(ctrl);
-			goto restart;
+		down_write(&ctrl->namespaces_rwsem);
+		list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
+			if ((kref_read(&ns->kref) > 1) ||
+			    // FIXME: need to compare with N paths
+			    (ns->head && (kref_read(&ns->head->ref) > 1))) {
+				printk("ns->kref = %d", kref_read(&ns->kref));
+				printk("ns->head->ref = %d", kref_read(&ns->head->ref));
+				up_write(&ctrl->namespaces_rwsem);
+				mutex_unlock(&subsys->lock);
+				return true;
+			}
 		}
+		up_write(&ctrl->namespaces_rwsem);
 	}
-	mutex_unlock(&subsys->lock);
+
+	return false;
+}
+
+int nvme_mpath_change_personality(struct nvme_subsystem *subsys, bool native)
+{
+	struct nvme_ctrl *ctrl;
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
-		nvme_queue_scan(ctrl);
+
+	if (__nvme_subsys_in_use(subsys)) {
+		mutex_unlock(&subsys->lock);
+		return -EBUSY;
+	}


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Martin K. Petersen


Mike,

> 1) container A is tasked with managing some dedicated NVMe technology
> that absolutely needs native NVMe multipath.

> 2) container B is tasked with offering some canned layered product
> that was developed ontop of dm-multipath with its own multipath-tools
> oriented APIs, etc. And it is to manage some other NVMe technology on
> the same host as container A.

This assumes there is something to manage. And that the administrative
model currently employed by DM multipath will be easily applicable to
ANA devices. I don't believe that's the case. The configuration happens
on the storage side, not on the host.

With ALUA (and the proprietary implementations that predated the spec),
it was very fuzzy whether it was the host or the target that owned
responsibility for this or that. Part of the reason was that ALUA was
deliberately vague to accommodate everybody's existing, non-standards
compliant multipath storage implementations.

With ANA the heavy burden falls entirely on the storage. Most of the
things you would currently configure in multipath.conf have no meaning
in the context of ANA. Things that are currently the domain of
dm-multipath or multipathd are inextricably living either in the storage
device or in the NVMe ANA "device handler". And I think you are
significantly underestimating the effort required to expose that
information up the stack and to make use of it. That's not just a
multipath personality toggle switch.

If you want to make multipath -ll show something meaningful for ANA
devices, then by all means go ahead. I don't have any problem with
that. But I don't think the burden of allowing multipathd/DM to inject
themselves into the path transition state machine has any benefit
whatsoever to the user. It's only complicating things and therefore we'd
be doing people a disservice rather than a favor.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Mike Snitzer
On Thu, May 31 2018 at 12:33pm -0400,
Christoph Hellwig  wrote:

> On Wed, May 30, 2018 at 06:02:06PM -0400, Mike Snitzer wrote:
> > Because once nvme_core.multipath=N is set: native NVMe multipath is then
> > not accessible from the same host.  The goal of this patchset is to give
> > users choice.  But not limit them to _only_ using dm-multipath if they
> > just have some legacy needs.
> 
> Choise by itself really isn't an argument.  We need a really good
> use case for all the complexity, and so far none has been presented.

OK, but it's choice that is governed by higher-level requirements that _I_
personally don't have.  They are large datacenter deployments like
Hannes alluded to [1], where there is heavy automation and/or layered
products that are developed around dm-multipath (via libraries to access
multipath-tools provided info, etc).

So trying to pin me down on _why_ users elect to make this choice (or
that there is such annoying inertia behind their choice) really isn't
fair TBH.  They exist.  Please just accept that.

Now another hypothetical usecase I thought of today that really drives
home _why_ it is useful to have this fine-grained 'mpath_personality'
flexibility is: admin containers.  (I'm not saying people or companies
currently do this, or plan to, but they very easily could...):
1) container A is tasked with managing some dedicated NVMe technology
   that absolutely needs native NVMe multipath.
2) container B is tasked with offering some canned layered product that
   was developed ontop of dm-multipath with its own multipath-tools
   oriented APIs, etc. And it is to manage some other NVMe technology on
   the same host as container A.

So, containers with conflicting requirements running on the same host.

Now you can say: sorry don't do that.  But that really isn't a valid
counter.

Point is it really is meaningful to offer this 'mpath_personality'
switch.  I'm obviously hopeful for it to not be heavily used BUT not
providing the ability for native NVMe multipath and dm-multipath to
coexist on the same Linux host really isn't viable in the near-term.

Mike

[1] https://lkml.org/lkml/2018/5/29/95


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Christoph Hellwig
On Thu, May 31, 2018 at 11:37:20AM +0300, Sagi Grimberg wrote:
>> the same host with PCI NVMe could be connected to a FC network that has
>> historically always been managed via dm-multipath.. but say that
>> FC-based infrastructure gets updated to use NVMe (to leverage a wider
>> NVMe investment, whatever?) -- but maybe admins would still prefer to
>> use dm-multipath for the NVMe over FC.
>
> You are referring to an array exposing media via nvmf and scsi
> simultaneously? I'm not sure that there is a clean definition of
> how that is supposed to work (ANA/ALUA, reservations, etc..)

It seems like this isn't what Mike wanted, but I actually got some
requests for limited support for that to do a storage live migration
from a SCSI array to NVMe.  I think it is really sketchy, but doable
if you are careful enough.  It would use dm-multipath, possibly
even on top of nvme multipathing if we have multiple nvme paths.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Christoph Hellwig
On Thu, May 31, 2018 at 08:37:39AM -0400, Mike Snitzer wrote:
> I saw your reply to the 1/3 patch.. I do agree it is broken for not
> checking if any handles are active.  But that is easily fixed no?

Doing a switch at runtime simply is a really bad idea.  If for some
reason we end up with a good per-controller switch it would have
to be something set at probe time, and to get it on a controller
you'd need to reset it first.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Christoph Hellwig
On Wed, May 30, 2018 at 06:02:06PM -0400, Mike Snitzer wrote:
> Because once nvme_core.multipath=N is set: native NVMe multipath is then
> not accessible from the same host.  The goal of this patchset is to give
> users choice.  But not limit them to _only_ using dm-multipath if they
> just have some legacy needs.

Choice by itself really isn't an argument.  We need a really good
use case for all the complexity, and so far none has been presented.

> Tough to be convincing with hypotheticals but I could imagine a very
> obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
> "fabrics" (especially if/when the numa-based path selector lands).  But
> the same host with PCI NVMe could be connected to a FC network that has
> historically always been managed via dm-multipath.. but say that
> FC-based infrastructure gets updated to use NVMe (to leverage a wider
> NVMe investment, whatever?) -- but maybe admins would still prefer to
> use dm-multipath for the NVMe over FC.

That is a lot of maybes.  If they prefer the good old way on FC they
can easily stay with SCSI, or for that matter use the global switch to
turn native multipathing off.
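
(For completeness, that global switch is just the nvme_core module
parameter already mentioned in this thread; a minimal sketch, assuming the
kernel is built with CONFIG_NVME_MULTIPATH:

  # on the kernel command line:  nvme_core.multipath=N
  # or via modprobe configuration, e.g. /etc/modprobe.d/nvme.conf:
  #   options nvme_core multipath=N
  # check what the running kernel is using:
  cat /sys/module/nvme_core/parameters/multipath
)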

> > This might sound stupid to you, but can't users that desperately must
> > keep using dm-multipath (for its mature toolset or what-not) just
> > stack it on multipath nvme device? (I might be completely off on
> > this so feel free to correct my ignorance).
> 
> We could certainly pursue adding multipath-tools support for native NVMe
> multipathing.  Not opposed to it (even if just reporting topology and
> state).  But given the extensive lengths NVMe multipath goes to hide
> devices we'd need some way to piercing through the opaque nvme device
> that native NVMe multipath exposes.  But that really is a tangent
> relative to this patchset.  Since that kind of visibility would also
> benefit the nvme cli... otherwise how are users to even be able to trust
> but verify native NVMe multipathing did what it expected it to?

Just look at the nvme-cli output or sysfs.  It's all been there since
the code was merged to mainline.
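
(A rough sketch of what that looks like, assuming a reasonably recent
nvme-cli; exact output differs between versions:

  # subsystem -> controller (path) topology
  nvme list-subsys
  # or straight from sysfs: each controller exposes its subsystem NQN,
  # transport and state
  grep . /sys/class/nvme/nvme*/subsysnqn \
         /sys/class/nvme/nvme*/transport \
         /sys/class/nvme/nvme*/state
)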


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Mike Snitzer
On Thu, May 31 2018 at  4:51am -0400,
Sagi Grimberg  wrote:

> 
> >>Moreover, I also wanted to point out that fabrics array vendors are
> >>building products that rely on standard nvme multipathing (and probably
> >>multipathing over dispersed namespaces as well), and keeping a knob that
> >>will keep nvme users with dm-multipath will probably not help them
> >>educate their customers as well... So there is another angle to this.
> >
> >Noticed I didn't respond directly to this aspect.  As I explained in
> >various replies to this thread: The users/admins would be the ones who
> >would decide to use dm-multipath.  It wouldn't be something that'd be
> >imposed by default.  If anything, the all-or-nothing
> >nvme_core.multipath=N would pose a much more serious concern for these
> >array vendors that do have designs to specifically leverage native NVMe
> >multipath.  Because if users were to get into the habit of setting that
> >on the kernel commandline they'd literally _never_ be able to leverage
> >native NVMe multipathing.
> >
> >We can also add multipath.conf docs (man page, etc) that caution admins
> >to consult their array vendors about whether using dm-multipath is to be
> >avoided, etc.
> >
> >Again, this is opt-in, so on a upstream Linux kernel level the default
> >of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
> >is configured).  Not seeing why there is so much angst and concern about
> >offering this flexibility via opt-in but I'm also glad we're having this
> >discussion to have our eyes wide open.
> 
> I think that the concern is valid and should not be dismissed. And
> at times flexibility is a real source of pain, both to users and
> developers.
> 
> The choice is there, no one is forbidden to use multipath. I'm just
> still not sure exactly why the subsystem granularity is an absolute
> must other than a volume exposed as a nvmf namespace and scsi lun (how
> would dm-multipath detect this is the same device btw?)

Please see my other reply; I was talking about completely disjoint
arrays in my hypothetical config, where having the ability to allow
simultaneous use of native NVMe multipath and dm-multipath is
meaningful.

Mike


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Mike Snitzer
On Thu, May 31 2018 at  4:37am -0400,
Sagi Grimberg  wrote:

> 
> >Wouldn't expect you guys to nurture this 'mpath_personality' knob.  SO
> >when features like "dispersed namespaces" land a negative check would
> >need to be added in the code to prevent switching from "native".
> >
> >And once something like "dispersed namespaces" lands we'd then have to
> >see about a more sophisticated switch that operates at a different
> >granularity.  Could also be that switching one subsystem that is part of
> >"dispersed namespaces" would then cascade to all other associated
> >subsystems?  Not that dissimilar from the 3rd patch in this series that
> >allows a 'device' switch to be done in terms of the subsystem.
> 
> Which I think is broken by allowing to change this personality on the
> fly.

I saw your reply to the 1/3 patch.. I do agree it is broken for not
checking if any handles are active.  But that is easily fixed no?

Or are you suggesting some other aspect of "broken"?

> >Anyway, I don't know the end from the beginning on something you just
> >told me about ;)  But we're all in this together.  And can take it as it
> >comes.
> 
> I agree but this will be exposed to user-space and we will need to live
> with it for a long long time...

OK, well dm-multipath has been around for a long long time.  We cannot
simply wish it away.  Regardless of whatever architectural grievances
are levied against it.

There are far more customer and vendor products that have been developed
to understand and consume dm-multipath and multipath-tools interfaces
than native NVMe multipath.

> >>Don't get me wrong, I do support your cause, and I think nvme should try
> >>to help, I just think that subsystem granularity is not the correct
> >>approach going forward.
> >
> >I understand there will be limits to this 'mpath_personality' knob's
> >utility and it'll need to evolve over time.  But the burden of making
> >more advanced NVMe multipath features accessible outside of native NVMe
> >isn't intended to be on any of the NVMe maintainers (other than maybe
> >remembering to disallow the switch where it makes sense in the future).
> 
> I would expect that any "advanced multipath features" would be properly
> brought up with the NVMe TWG as a ratified standard and find its way
> to nvme. So I don't think this particularly is a valid argument.

You're misreading me again.  I'm also saying stop worrying.  I'm saying
any future native NVMe multipath features that come about don't necessarily
get immediate dm-multipath parity.  The native NVMe multipath would need
appropriate negative checks.
 
> >>As I said, I've been off the grid, can you remind me why global knob is
> >>not sufficient?
> >
> >Because once nvme_core.multipath=N is set: native NVMe multipath is then
> >not accessible from the same host.  The goal of this patchset is to give
> >users choice.  But not limit them to _only_ using dm-multipath if they
> >just have some legacy needs.
> >
> >Tough to be convincing with hypotheticals but I could imagine a very
> >obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
> >"fabrics" (especially if/when the numa-based path selector lands).  But
> >the same host with PCI NVMe could be connected to a FC network that has
> >historically always been managed via dm-multipath.. but say that
> >FC-based infrastructure gets updated to use NVMe (to leverage a wider
> >NVMe investment, whatever?) -- but maybe admins would still prefer to
> >use dm-multipath for the NVMe over FC.
> 
> You are referring to an array exposing media via nvmf and scsi
> simultaneously? I'm not sure that there is a clean definition of
> how that is supposed to work (ANA/ALUA, reservations, etc..)

No I'm referring to completely disjoint arrays that are homed to the
same host.

> >>This might sound stupid to you, but can't users that desperately must
> >>keep using dm-multipath (for its mature toolset or what-not) just
> >>stack it on multipath nvme device? (I might be completely off on
> >>this so feel free to correct my ignorance).
> >
> >We could certainly pursue adding multipath-tools support for native NVMe
> >multipathing.  Not opposed to it (even if just reporting topology and
> >state).  But given the extensive lengths NVMe multipath goes to hide
> >devices we'd need some way to piercing through the opaque nvme device
> >that native NVMe multipath exposes.  But that really is a tangent
> >relative to this patchset.  Since that kind of visibility would also
> >benefit the nvme cli... otherwise how are users to even be able to trust
> >but verify native NVMe multipathing did what it expected it to?
> 
> Can you explain what is missing for multipath-tools to resolve topology?

I've not pored over these nvme interfaces (below I just learned
nvme-cli has since grown the capability).  So I'm not informed enough
to know if nvme-cli has grown other new capabilities.

In any case, training multipath-tools to understand native NVMe
multipath 


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Sagi Grimberg

>> Moreover, I also wanted to point out that fabrics array vendors are
>> building products that rely on standard nvme multipathing (and probably
>> multipathing over dispersed namespaces as well), and keeping a knob that
>> will keep nvme users with dm-multipath will probably not help them
>> educate their customers as well... So there is another angle to this.
>
> Noticed I didn't respond directly to this aspect.  As I explained in
> various replies to this thread: The users/admins would be the ones who
> would decide to use dm-multipath.  It wouldn't be something that'd be
> imposed by default.  If anything, the all-or-nothing
> nvme_core.multipath=N would pose a much more serious concern for these
> array vendors that do have designs to specifically leverage native NVMe
> multipath.  Because if users were to get into the habit of setting that
> on the kernel commandline they'd literally _never_ be able to leverage
> native NVMe multipathing.
>
> We can also add multipath.conf docs (man page, etc) that caution admins
> to consult their array vendors about whether using dm-multipath is to be
> avoided, etc.
>
> Again, this is opt-in, so on a upstream Linux kernel level the default
> of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
> is configured).  Not seeing why there is so much angst and concern about
> offering this flexibility via opt-in but I'm also glad we're having this
> discussion to have our eyes wide open.


I think that the concern is valid and should not be dismissed. And
at times flexibility is a real source of pain, both to users and
developers.

The choice is there, no one is forbidden to use multipath. I'm just
still not sure exactly why the subsystem granularity is an absolute
must other than a volume exposed as a nvmf namespace and scsi lun (how
would dm-multipath detect this is the same device btw?)


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-31 Thread Sagi Grimberg

> Wouldn't expect you guys to nurture this 'mpath_personality' knob.  SO
> when features like "dispersed namespaces" land a negative check would
> need to be added in the code to prevent switching from "native".
>
> And once something like "dispersed namespaces" lands we'd then have to
> see about a more sophisticated switch that operates at a different
> granularity.  Could also be that switching one subsystem that is part of
> "dispersed namespaces" would then cascade to all other associated
> subsystems?  Not that dissimilar from the 3rd patch in this series that
> allows a 'device' switch to be done in terms of the subsystem.

Which I think is broken by allowing to change this personality on the
fly.

> Anyway, I don't know the end from the beginning on something you just
> told me about ;)  But we're all in this together.  And can take it as it
> comes.

I agree but this will be exposed to user-space and we will need to live
with it for a long long time...

> I'm merely trying to bridge the gap from old dm-multipath while
> native NVMe multipath gets its legs.
>
> In time I really do have aspirations to contribute more to NVMe
> multipathing.  I think Christoph's NVMe multipath implementation of
> bio-based device ontop on NVMe core's blk-mq device(s) is very clever
> and effective (blk_steal_bios() hack and all).

That's great.

>> Don't get me wrong, I do support your cause, and I think nvme should try
>> to help, I just think that subsystem granularity is not the correct
>> approach going forward.
>
> I understand there will be limits to this 'mpath_personality' knob's
> utility and it'll need to evolve over time.  But the burden of making
> more advanced NVMe multipath features accessible outside of native NVMe
> isn't intended to be on any of the NVMe maintainers (other than maybe
> remembering to disallow the switch where it makes sense in the future).

I would expect that any "advanced multipath features" would be properly
brought up with the NVMe TWG as a ratified standard and find its way
to nvme. So I don't think this particularly is a valid argument.

>> As I said, I've been off the grid, can you remind me why global knob is
>> not sufficient?
>
> Because once nvme_core.multipath=N is set: native NVMe multipath is then
> not accessible from the same host.  The goal of this patchset is to give
> users choice.  But not limit them to _only_ using dm-multipath if they
> just have some legacy needs.
>
> Tough to be convincing with hypotheticals but I could imagine a very
> obvious usecase for native NVMe multipathing be PCI-based embedded NVMe
> "fabrics" (especially if/when the numa-based path selector lands).  But
> the same host with PCI NVMe could be connected to a FC network that has
> historically always been managed via dm-multipath.. but say that
> FC-based infrastructure gets updated to use NVMe (to leverage a wider
> NVMe investment, whatever?) -- but maybe admins would still prefer to
> use dm-multipath for the NVMe over FC.

You are referring to an array exposing media via nvmf and scsi
simultaneously? I'm not sure that there is a clean definition of
how that is supposed to work (ANA/ALUA, reservations, etc..)

>> This might sound stupid to you, but can't users that desperately must
>> keep using dm-multipath (for its mature toolset or what-not) just
>> stack it on multipath nvme device? (I might be completely off on
>> this so feel free to correct my ignorance).
>
> We could certainly pursue adding multipath-tools support for native NVMe
> multipathing.  Not opposed to it (even if just reporting topology and
> state).  But given the extensive lengths NVMe multipath goes to hide
> devices we'd need some way to piercing through the opaque nvme device
> that native NVMe multipath exposes.  But that really is a tangent
> relative to this patchset.  Since that kind of visibility would also
> benefit the nvme cli... otherwise how are users to even be able to trust
> but verify native NVMe multipathing did what it expected it to?

Can you explain what is missing for multipath-tools to resolve topology?

nvme list-subsys is doing just that, doesn't it? It lists subsys-ctrl
topology but that is sort of the important information as controllers
are the real paths.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-30 Thread Ming Lei
On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> On Mon, May 28, 2018 at 11:02:36PM -0400, Mike Snitzer wrote:
> > No, what both Red Hat and SUSE are saying is: cool let's have a go at
> > "Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm
> > multipath) to be conditionally enabled to coexist with native NVMe
> > multipath?
> 
> For a "Plan B" we can still use the global knob that's already in
> place (even if this reminds me so much about scsi-mq which at least we
> haven't turned on in fear of performance regressions).

BTW, for scsi-mq we have made a little progress with commit 2f31115e940c
(scsi: core: introduce force_blk_mq), and virtio-scsi now always runs in
scsi-mq mode. Each driver can then decide whether .force_blk_mq needs
to be set.

Hope progress can be made in this nvme mpath issue too.

Thanks,
Ming


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-30 Thread Mike Snitzer
On Wed, May 30 2018 at  5:20pm -0400,
Sagi Grimberg  wrote:
 
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well... So there is another angle to this.

Noticed I didn't respond directly to this aspect.  As I explained in
various replies to this thread: The users/admins would be the ones who
would decide to use dm-multipath.  It wouldn't be something that'd be
imposed by default.  If anything, the all-or-nothing
nvme_core.multipath=N would pose a much more serious concern for these
array vendors that do have designs to specifically leverage native NVMe
multipath.  Because if users were to get into the habit of setting that
on the kernel commandline they'd literally _never_ be able to leverage
native NVMe multipathing.

We can also add multipath.conf documentation (man page, etc.) that
cautions admins to consult their array vendors about whether dm-multipath
should be avoided.
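
For illustration, a minimal sketch of what such guidance might point at in
multipath.conf, assuming an admin wants dm-multipath to manage only a
specific set of NVMe namespaces and leave everything else to native NVMe
multipath.  The device-node regexes below are made up; the exact stanza
depends on the multipath-tools version and on how the namespaces are
identified (devnode, WWID, udev property, etc.):

    blacklist {
            devnode "^nvme"
    }
    blacklist_exceptions {
            devnode "^nvme[12]n[0-9]+"
    }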

Again, this is opt-in, so at the upstream Linux kernel level the default
of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
is configured).  Not seeing why there is so much angst and concern about
offering this flexibility via opt-in but I'm also glad we're having this
discussion to have our eyes wide open.

Mike


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-30 Thread Mike Snitzer
On Wed, May 30 2018 at  5:20pm -0400,
Sagi Grimberg  wrote:

> Hi Folks,
> 
> I'm sorry to chime in super late on this, but a lot has been
> going on for me lately which got me off the grid.
> 
> So I'll try to provide my input hopefully without starting any more
> flames..
> 
> >>>This patch series aims to provide a more fine grained control over
> >>>nvme's native multipathing, by allowing it to be switched on and off
> >>>on a per-subsystem basis instead of a big global switch.
> >>
> >>No.  The only reason we even allowed to turn multipathing off is
> >>because you complained about installer issues.  The path forward
> >>clearly is native multipathing and there will be no additional support
> >>for the use cases of not using it.
> >
> >We all basically knew this would be your position.  But at this year's
> >LSF we pretty quickly reached consensus that we do in fact need this.
> >Except for yourself, Sagi and afaik Martin George: all on the cc were in
> >attendance and agreed.
> 
> Correction, I wasn't able to attend LSF this year (unfortunately).

Yes, I was trying to say you weren't at LSF (but are on the cc).

> >And since then we've exchanged mails to refine and test Johannes'
> >implementation.
> >
> >You've isolated yourself on this issue.  Please just accept that we all
> >have a pretty solid command of what is needed to properly provide
> >commercial support for NVMe multipath.
> >
> >The ability to switch between "native" and "other" multipath absolutely
> >does _not_ imply anything about the winning disposition of native vs
> >other.  It is purely about providing commercial flexibility to use
> >whatever solution makes sense for a given environment.  The default _is_
> >native NVMe multipath.  It is on userspace solutions for "other"
> >multipath (e.g. multipathd) to allow users to whitelist an NVMe
> >subsystem to be switched to "other".
> >
> >Hopefully this clarifies things, thanks.
> 
> Mike, I understand what you're saying, but I also agree with hch on
> the simple fact that this is a burden on linux nvme (although less
> passionate about it than hch).
> 
> Beyond that, this is going to get much worse when we support "dispersed
> namespaces" which is a submitted TPAR in the NVMe TWG. "dispersed
> namespaces" makes NVMe namespaces share-able over different subsystems
> so changing the personality on a per-subsystem basis is just asking for
> trouble.
> 
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well... So there is another angle to this.

Wouldn't expect you guys to nurture this 'mpath_personality' knob.  So
when features like "dispersed namespaces" land, a negative check would
need to be added in the code to prevent switching away from "native".

And once something like "dispersed namespaces" lands we'd then have to
see about a more sophisticated switch that operates at a different
granularity.  Could also be that switching one subsystem that is part of
"dispersed namespaces" would then cascade to all other associated
subsystems?  Not that dissimilar from the 3rd patch in this series that
allows a 'device' switch to be done in terms of the subsystem.

Anyway, I don't know the end from the beginning on something you just
told me about ;)  But we're all in this together.  And can take it as it
comes.  I'm merely trying to bridge the gap from old dm-multipath while
native NVMe multipath gets its legs.

In time I really do have aspirations to contribute more to NVMe
multipathing.  I think Christoph's NVMe multipath implementation of a
bio-based device on top of NVMe core's blk-mq device(s) is very clever
and effective (blk_steal_bios() hack and all).

> Don't get me wrong, I do support your cause, and I think nvme should try
> to help, I just think that subsystem granularity is not the correct
> approach going forward.

I understand there will be limits to this 'mpath_personality' knob's
utility and it'll need to evolve over time.  But the burden of making
more advanced NVMe multipath features accessible outside of native NVMe
isn't intended to be on any of the NVMe maintainers (other than maybe
remembering to disallow the switch where it makes sense in the future).
 
> As I said, I've been off the grid, can you remind me why global knob is
> not sufficient?

Because once nvme_core.multipath=N is set, native NVMe multipath is no
longer accessible from that host.  The goal of this patchset is to give
users a choice, not to limit them to _only_ using dm-multipath just
because they have some legacy needs.
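
To make that contrast concrete, here is a rough sketch of the two knobs
being discussed.  The global module parameter exists today; the
per-subsystem 'mpath_personality' attribute is the one proposed by this
series, and the exact sysfs path shown here is only illustrative:

    # Global, boot-time, all-or-nothing (existing):
    #   kernel command line:   nvme_core.multipath=N
    #   or via modprobe.d:     options nvme_core multipath=N
    #
    # Per-subsystem, at runtime (as proposed by this series):
    echo other  > /sys/class/nvme-subsystem/nvme-subsys0/mpath_personality
    echo native > /sys/class/nvme-subsystem/nvme-subsys1/mpath_personality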

Tough to be convincing with hypotheticals, but I could imagine a very
obvious use case for native NVMe multipathing being PCI-based embedded
NVMe "fabrics" (especially if/when the NUMA-based path selector lands).  But
the same host with 

Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-30 Thread Sagi Grimberg

Hi Folks,

I'm sorry to chime in super late on this, but a lot has been
going on for me lately which got me off the grid.

So I'll try to provide my input hopefully without starting any more
flames..


This patch series aims to provide a more fine grained control over
nvme's native multipathing, by allowing it to be switched on and off
on a per-subsystem basis instead of a big global switch.


No.  The only reason we even allowed to turn multipathing off is
because you complained about installer issues.  The path forward
clearly is native multipathing and there will be no additional support
for the use cases of not using it.


We all basically knew this would be your position.  But at this year's
LSF we pretty quickly reached consensus that we do in fact need this.
Except for yourself, Sagi and afaik Martin George: all on the cc were in
attendance and agreed.


Correction, I wasn't able to attend LSF this year (unfortunately).


And since then we've exchanged mails to refine and test Johannes'
implementation.

You've isolated yourself on this issue.  Please just accept that we all
have a pretty solid command of what is needed to properly provide
commercial support for NVMe multipath.

The ability to switch between "native" and "other" multipath absolutely
does _not_ imply anything about the winning disposition of native vs
other.  It is purely about providing commercial flexibility to use
whatever solution makes sense for a given environment.  The default _is_
native NVMe multipath.  It is on userspace solutions for "other"
multipath (e.g. multipathd) to allow users to whitelist an NVMe
subsystem to be switched to "other".

Hopefully this clarifies things, thanks.


Mike, I understand what you're saying, but I also agree with hch on
the simple fact that this is a burden on linux nvme (although less 
passionate about it than hch).


Beyond that, this is going to get much worse when we support "dispersed
namespaces" which is a submitted TPAR in the NVMe TWG. "dispersed
namespaces" makes NVMe namespaces share-able over different subsystems
so changing the personality on a per-subsystem basis is just asking for
trouble.

Moreover, I also wanted to point out that fabrics array vendors are
building products that rely on standard nvme multipathing (and probably
multipathing over dispersed namespaces as well), and keeping a knob that
will keep nvme users with dm-multipath will probably not help them
educate their customers as well... So there is another angle to this.

Don't get me wrong, I do support your cause, and I think nvme should try
to help, I just think that subsystem granularity is not the correct
approach going forward.

As I said, I've been off the grid, can you remind me why global knob is
not sufficient?

This might sound stupid to you, but can't users who absolutely must
keep using dm-multipath (for its mature toolset or what-not) just
stack it on top of the multipath nvme device? (I might be completely
off on this, so feel free to correct my ignorance.)


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-29 Thread Mike Snitzer
On Tue, May 29 2018 at  4:09am -0400,
Christoph Hellwig  wrote:

> On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> > For a "Plan B" we can still use the global knob that's already in
> > place (even if this reminds me so much about scsi-mq which at least we
> > haven't turned on in fear of performance regressions).
> > 
> > Let's drop the discussion here, I don't think it leads to something
> > else than flamewars.

As the author of the original patch you're fine to want to step away from
this needlessly ugly aspect.  But it doesn't change the fact that we
need answers on _why_ it is a genuinely detrimental change. (hint: we
know it isn't).

The enterprise Linux people who directly need to support multipath want
the flexibility to allow dm-multipath while simultaneously allowing
native NVMe multipathing on the same host.

Hannes Reinecke and others, if you want the flexibility this patchset
offers please provide your review/acks.

> If our plan A doesn't work we can go back to these patches.  For now
> I'd rather have everyone spend their time on making Plan A work than
> preparing for contingencies.  Nothing prevents anyone from using these
> patches already out there if they really want to, but I'd recommend
> people be very careful about doing so as you'll lock yourself into
> a long-term maintenance burden.

This isn't about contingencies.  It is about continuing compatibility
with a sophisticated dm-multipath stack that is widely used by, and
familiar to, so many.

Christoph, you know you're being completely vague, right?  You're
actively denying the validity of our position (at least Hannes' and mine)
with handwaving and effectively FUD, e.g. "maze of new setups" and
"hairy runtime ABIs" here: https://lkml.org/lkml/2018/5/25/461

To restate my question, from https://lkml.org/lkml/2018/5/28/2179:
hch had some non-specific concern about this patch forcing
support of some "ABI".  Which ABI is that _exactly_?

The incremental effort required to support NVMe in dm-multipath isn't so
grim.  And those who will do that work are signing up for it -- while
still motivated to help make native NVMe multipath a success.
Can you please give us time to responsibly wean users off dm-multipath?

Mike


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-29 Thread Christoph Hellwig
On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> For a "Plan B" we can still use the global knob that's already in
> place (even if this reminds me so much about scsi-mq which at least we
> haven't turned on in fear of performance regressions).
> 
> Let's drop the discussion here, I don't think it leads to something
> else than flamewars.

If our plan A doesn't work we can go back to these patches.  For now
I'd rather have everyone spend their time on making Plan A work than
preparing for contingencies.  Nothing prevents anyone from using these
patches already out there if they really want to, but I'd recommend
people be very careful about doing so as you'll lock yourself into
a long-term maintenance burden.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-29 Thread Johannes Thumshirn
On Mon, May 28, 2018 at 11:02:36PM -0400, Mike Snitzer wrote:
> No, what both Red Hat and SUSE are saying is: cool let's have a go at
> "Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm
> multipath) to be conditionally enabled to coexist with native NVMe
> multipath?

For a "Plan B" we can still use the global knob that's already in
place (even if this reminds me so much about scsi-mq which at least we
haven't turned on in fear of performance regressions).

Let's drop the discussion here, I don't think it leads to something
else than flamewars.

 Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-29 Thread Hannes Reinecke
On Mon, 28 May 2018 23:02:36 -0400
Mike Snitzer  wrote:

> On Mon, May 28 2018 at  9:19pm -0400,
> Martin K. Petersen  wrote:
> 
> > 
> > Mike,
> > 
> > I understand and appreciate your position but I still don't think
> > the arguments for enabling DM multipath are sufficiently
> > compelling. The whole point of ANA is for things to be plug and
> > play without any admin intervention whatsoever.
> > 
> > I also think we're getting ahead of ourselves a bit. The assumption
> > seems to be that NVMe ANA devices are going to be broken--or that
> > they will require the same amount of tweaking as SCSI devices--and
> > therefore DM multipath support is inevitable. However, I'm not sure
> > that will be the case.
> >   
> > > Thing is you really don't get to dictate that to the industry.
> > > Sorry.  
> > 
> > We are in the fortunate position of being able to influence how the
> > spec is written. It's a great opportunity to fix the mistakes of
> > the past in SCSI. And to encourage the industry to ship products
> > that don't need the current level of manual configuration and
> > complex management.
> > 
> > So I am in favor of Johannes' patches *if* we get to the point
> > where a Plan B is needed. But I am not entirely convinced that's
> > the case just yet. Let's see some more ANA devices first. And once
> > we do, we are also in a position where we can put some pressure on
> > the vendors to either amend the specification or fix their
> > implementations to work with ANA.  
> 
> ANA really isn't a motivating factor for whether or not to apply this
> patch.  So no, I don't have any interest in waiting to apply it.
> 
Correct. That patch is _not_ to work around any perceived incompatibility
on the OS side.
The patch is primarily to give _admins_ a choice.
Some installations, like hosting providers etc., are running quite complex
scenarios, most of which are highly automated.
So for those there is a real benefit in being able to use dm-multipathing
for NVMe; they are totally fine with taking a performance impact if
they can avoid rewriting their infrastructure.

Cheers,

Hannes


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-28 Thread Mike Snitzer
On Mon, May 28 2018 at  9:19pm -0400,
Martin K. Petersen  wrote:

> 
> Mike,
> 
> I understand and appreciate your position but I still don't think the
> arguments for enabling DM multipath are sufficiently compelling. The
> whole point of ANA is for things to be plug and play without any admin
> intervention whatsoever.
> 
> I also think we're getting ahead of ourselves a bit. The assumption
> seems to be that NVMe ANA devices are going to be broken--or that they
> will require the same amount of tweaking as SCSI devices--and therefore
> DM multipath support is inevitable. However, I'm not sure that will be
> the case.
> 
> > Thing is you really don't get to dictate that to the industry.  Sorry.
> 
> We are in the fortunate position of being able to influence how the spec
> is written. It's a great opportunity to fix the mistakes of the past in
> SCSI. And to encourage the industry to ship products that don't need the
> current level of manual configuration and complex management.
> 
> So I am in favor of Johannes' patches *if* we get to the point where a
> Plan B is needed. But I am not entirely convinced that's the case just
> yet. Let's see some more ANA devices first. And once we do, we are also
> in a position where we can put some pressure on the vendors to either
> amend the specification or fix their implementations to work with ANA.

ANA really isn't a motivating factor for whether or not to apply this
patch.  So no, I don't have any interest in waiting to apply it.

You're somehow missing that your implied "Plan A" (native NVMe
multipath) has been pushed as the only way forward for NVMe multipath
despite it being unproven.  Worse, literally no userspace infrastructure
exists to control native NVMe multipath (and this is supposed to be
comforting because the spec is tightly coupled to hch's implementation
that he controls with an iron fist).

We're supposed to be OK with completely _forced_ obsolescence of
dm-multipath infrastructure that has proven itself capable of managing a
wide range of complex multipath deployments for a tremendous number of
enterprise Linux customers (of multiple vendors)!?  This is a tough sell
given the content of my previous paragraph (coupled with the fact the
next enterprise Linux versions are being hardened _now_).

No, what both Red Hat and SUSE are saying is: cool let's have a go at
"Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm
multipath) to be conditionally enabled to coexist with native NVMe
multipath?

Nobody can explain why this patch is some sort of detriment.  It
literally is an amazingly simple switch that provides flexibility we
_need_.  hch had some non-specific concern about this patch forcing
support of some "ABI".  Which ABI is that _exactly_?

Mike


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-28 Thread Martin K. Petersen


Mike,

I understand and appreciate your position but I still don't think the
arguments for enabling DM multipath are sufficiently compelling. The
whole point of ANA is for things to be plug and play without any admin
intervention whatsoever.

I also think we're getting ahead of ourselves a bit. The assumption
seems to be that NVMe ANA devices are going to be broken--or that they
will require the same amount of tweaking as SCSI devices--and therefore
DM multipath support is inevitable. However, I'm not sure that will be
the case.

> Thing is you really don't get to dictate that to the industry.  Sorry.

We are in the fortunate position of being able to influence how the spec
is written. It's a great opportunity to fix the mistakes of the past in
SCSI. And to encourage the industry to ship products that don't need the
current level of manual configuration and complex management.

So I am in favor of Johannes' patches *if* we get to the point where a
Plan B is needed. But I am not entirely convinced that's the case just
yet. Let's see some more ANA devices first. And once we do, we are also
in a position where we can put some pressure on the vendors to either
amend the specification or fix their implementations to work with ANA.

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-25 Thread Mike Snitzer
On Fri, May 25 2018 at 10:12am -0400,
Christoph Hellwig  wrote:

> On Fri, May 25, 2018 at 09:58:13AM -0400, Mike Snitzer wrote:
> > We all basically knew this would be your position.  But at this year's
> > LSF we pretty quickly reached consensus that we do in fact need this.
> > Except for yourself, Sagi and afaik Martin George: all on the cc were in
> > attendance and agreed.
> 
> And I very much disagree, and you'd better come up with a good reason
> to override me as the author and maintainer of this code.

I hope you don't truly think this is me vs you.

Some of the reasons are:
1) we need flexibility during the transition to native NVMe multipath
2) we need to support existing customers' dm-multipath storage networks
3) asking users to use an entirely new infrastructure that conflicts
   with their dm-multipath expertise and established norms is a hard
   sell.  Especially for environments that have a mix of traditional
   multipath (FC, iSCSI, whatever) and NVMe over fabrics.
4) Layered products (both vendor provided and user developed) have been
   trained to fully support and monitor dm-multipath; they have no
   understanding of native NVMe multipath

> > And since then we've exchanged mails to refine and test Johannes'
> > implementation.
> 
> Since when was acting behind the scenes a good argument for anything?

I mentioned our continued private collaboration to establish that this
wasn't a momentary weakness by anyone at LSF.  It has had a lot of soak
time in our heads.

We did it privately because we needed a concrete proposal that works for
our needs, rather than getting shot down over some shortcoming in an
RFC-style submission.
 
> > Hopefully this clarifies things, thanks.
> 
> It doesn't.
> 
> The whole reason we have native multipath in nvme is that dm-multipath
> is the wrong architecture (and has been, long predating you, nothing
> personal).  And I don't want to be stuck with this in nvme for additional
> decades.  We allowed a global opt-in to let the three people in the
> world with existing setups keep using that, but I also said I
> won't go a step further.  And I stand by that.

Thing is you really don't get to dictate that to the industry.  Sorry.

Reality is this ability to switch "native" vs "other" gives us the
options I've been talking about absolutely needing since the start of
this NVMe multipathing debate.

Your fighting against it for so long has prevented progress on NVMe
multipath in general.  Taking this change will increase native NVMe
multipath deployment.  Otherwise we're just going to have to disable
native multipath entirely for the time being.  That does users a
disservice because I completely agree that there _will_ be setups where
native NVMe multipath really does offer a huge win.  But those setups
could easily be deployed on the same hosts as another variant of NVMe
that really does want the use of the legacy DM multipath stack (possibly
even just for reason 4 above).

Mike


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-25 Thread Christoph Hellwig
On Fri, May 25, 2018 at 04:22:17PM +0200, Johannes Thumshirn wrote:
> But Mike's and Hannes' arguments were reasonable as well; we do not
> know if there are any existing setups we might break, leading to
> support calls which we have to deal with. Personally I don't believe
> there are lots of existing nvme multipath setups out there, but who
> am I to judge.

I don't think existing setups are very likely, but they absolutely
are a valid reason to support the legacy mode.  That is why we support
the legacy mode using the multipath module option.  Once you move
to a per-subsystem switch you don't support legacy setups, you
create a maze of new setups that we need to keep compatibility
support for forever.

> So can we find a middle ground on this? Or we'll have the
> all-or-nothing situation we have in scsi-mq now again. How about
> tying the switch to a config option which is off by default?

The middle ground is the module option.  It provides 100% backwards
compatibility if used, but more importantly doesn't create hairy
runtime ABIs that we will have to support forever.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-25 Thread Johannes Thumshirn
On Fri, May 25, 2018 at 03:05:35PM +0200, Christoph Hellwig wrote:
> On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> > Hi,
> > 
> > This patch series aims to provide a more fine grained control over
> > nvme's native multipathing, by allowing it to be switched on and off
> > on a per-subsystem basis instead of a big global switch.
> 
> No.  The only reason we even allowed to turn multipathing off is
> because you complained about installer issues.  The path forward
> clearly is native multipathing and there will be no additional support
> for the use cases of not using it.

First of all, it wasn't my idea and I'm just doing my job here, as I
got this task assigned at LSF and tried to do my best here.

Personally I _do_ agree with you and do not want to use dm-mpath in
nvme either (mainly because I don't really know the code and don't
want to learn yet another subsystem).

But Mike's and Hannes' arguments were reasonable as well; we do not
know if there are any existing setups we might break, leading to
support calls which we have to deal with. Personally I don't believe
there are lots of existing nvme multipath setups out there, but who
am I to judge.

So can we find a middle ground on this? Or we'll have the
all-or-nothing situation we have in scsi-mq now again. How about
tying the switch to a config option which is off by default?

Byte,
Johannes
-- 
Johannes Thumshirn  Storage
jthumsh...@suse.de+49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-25 Thread Christoph Hellwig
On Fri, May 25, 2018 at 09:58:13AM -0400, Mike Snitzer wrote:
> We all basically knew this would be your position.  But at this year's
> LSF we pretty quickly reached consensus that we do in fact need this.
> Except for yourself, Sagi and afaik Martin George: all on the cc were in
> attendance and agreed.

And I very much disagree, and you'd better come up with a good reason
to override me as the author and maintainer of this code.

> And since then we've exchanged mails to refine and test Johannes'
> implementation.

Since when was acting behind the scenes a good argument for anything?

> Hopefully this clarifies things, thanks.

It doesn't.

The whole reason we have native multipath in nvme is that dm-multipath
is the wrong architecture (and has been, long predating you, nothing
personal).  And I don't want to be stuck with this in nvme for additional
decades.  We allowed a global opt-in to let the three people in the
world with existing setups keep using that, but I also said I
won't go a step further.  And I stand by that.


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-25 Thread Mike Snitzer
On Fri, May 25 2018 at  9:05am -0400,
Christoph Hellwig  wrote:

> On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> > Hi,
> > 
> > This patch series aims to provide a more fine grained control over
> > nvme's native multipathing, by allowing it to be switched on and off
> > on a per-subsystem basis instead of a big global switch.
> 
> No.  The only reason we even allowed to turn multipathing off is
> because you complained about installer issues.  The path forward
> clearly is native multipathing and there will be no additional support
> for the use cases of not using it.

We all basically knew this would be your position.  But at this year's
LSF we pretty quickly reached consensus that we do in fact need this.
Except for yourself, Sagi and afaik Martin George: all on the cc were in
attendance and agreed.

And since then we've exchanged mails to refine and test Johannes'
implementation.

You've isolated yourself on this issue.  Please just accept that we all
have a pretty solid command of what is needed to properly provide
commercial support for NVMe multipath.

The ability to switch between "native" and "other" multipath absolutely
does _not_ imply anything about the winning disposition of native vs
other.  It is purely about providing commercial flexibility to use
whatever solution makes sense for a given environment.  The default _is_
native NVMe multipath.  It is on userspace solutions for "other"
multipath (e.g. multipathd) to allow users to whitelist an NVMe
subsystem to be switched to "other".

Hopefully this clarifies things, thanks.

Mike


Re: [PATCH 0/3] Provide more fine grained control over multipathing

2018-05-25 Thread Christoph Hellwig
On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> Hi,
> 
> This patch series aims to provide a more fine grained control over
> nvme's native multipathing, by allowing it to be switched on and off
> on a per-subsystem basis instead of a big global switch.

No.  The only reason we even allowed to turn multipathing off is
because you complained about installer issues.  The path forward
clearly is native multipathing and there will be no additional support
for the use cases of not using it.


[PATCH 0/3] Provide more fine grained control over multipathing

2018-05-25 Thread Johannes Thumshirn
Hi,

This patch series aims to provide a more fine grained control over
nvme's native multipathing, by allowing it to be switched on and off
on a per-subsystem basis instead of a big global switch.

The prime use case is mixed scenarios where a user might want to use
nvme's native multipathing on one subset of subsystems and
dm-multipath on another subset.

For example, using native multipathing for internal PCIe NVMe devices and dm-mpath for
the connection to an NVMe over Fabrics Array.

The initial discussion for this was held at this year's LSF/MM and the
architecture hasn't changed from what we discussed there.

The first patch implements said switch, and Mike added two follow-up
patches to expose the personality attribute from the block device's
sysfs directory as well.
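
As a rough idea of the kernel side (not the actual patches), the
attribute could boil down to a show/store pair along these lines; the
native_mpath flag is a hypothetical field on struct nvme_subsystem,
and the sketch uses plain DEVICE_ATTR_RW rather than the SUBSYS_ATTR_RW
helper the series adds:

/* sketch only - assumes a bool native_mpath in struct nvme_subsystem */
static ssize_t mpath_personality_show(struct device *dev,
				      struct device_attribute *attr, char *buf)
{
	struct nvme_subsystem *subsys =
		container_of(dev, struct nvme_subsystem, dev);

	return sprintf(buf, "%s\n",
		       subsys->native_mpath ? "native" : "other");
}

static ssize_t mpath_personality_store(struct device *dev,
				       struct device_attribute *attr,
				       const char *buf, size_t count)
{
	struct nvme_subsystem *subsys =
		container_of(dev, struct nvme_subsystem, dev);

	if (sysfs_streq(buf, "native"))
		subsys->native_mpath = true;
	else if (sysfs_streq(buf, "other"))
		subsys->native_mpath = false;
	else
		return -EINVAL;

	return count;
}
static DEVICE_ATTR_RW(mpath_personality);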

I do have a blktests test for it as well, but due to the fcloop bug I
reported I'm reluctant to include it in the series (or I would need to
uncomment the rmmods).

Johannes Thumshirn (1):
  nvme: provide a way to disable nvme mpath per subsystem

Mike Snitzer (2):
  nvme multipath: added SUBSYS_ATTR_RW
  nvme multipath: add dev_attr_mpath_personality

 drivers/nvme/host/core.c      | 112 --
 drivers/nvme/host/multipath.c |  34 +++--
 drivers/nvme/host/nvme.h      |   8 +++
 3 files changed, 144 insertions(+), 10 deletions(-)

-- 
2.16.3


