Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Tue, Jun 05, 2018 at 03:57:05PM -0700, Roland Dreier wrote:
> That makes sense but I'm not sure it covers everything. Probably the
> most common way to do NVMe/RDMA will be with a single HCA that has
> multiple ports, so there's no sensible CPU locality. On the other
> hand we want to keep both ports to the fabric busy. Setting different
> paths for different queues makes sense, but there may be
> single-threaded applications that want a different policy.
>
> I'm not saying anything very profound, but we have to find the right
> balance between too many and too few knobs.

Agreed. And the philosophy here is to start with as few knobs as possible and work from there based on actual use cases.

Single-threaded applications will run into issues with the general blk-mq philosophy, so to work around that we'll need to dig deeper and allow borrowing of other cpu queues if we want to cater for that.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Wed, Jun 06, 2018 at 12:32:21PM +0300, Sagi Grimberg wrote:
> Huh? different paths == different controllers so this sentence can't
> be right... you mean that a path selector will select a controller
> based on the home node of the local rdma device connecting to it and
> the running cpu right?

Think of a system with say 8 cpu cores. Say we have two optimized paths. There is no point in going round-robin or service-time over the two paths for each logical per-cpu queue. Instead we should always go to path A (or path B) for a given cpu queue, to reduce selection overhead and cache footprint.
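[Editorial note: to illustrate the per-cpu pinning idea, here is a minimal userspace sketch with made-up names -- not nvme-core code. Each per-cpu queue is bound to one path up front, so the hot path is a single array lookup with no shared selector state.]

/* Hedged sketch: statically pin each per-cpu queue to one of two
 * optimized paths at setup time. All names are illustrative. */
#include <stdio.h>

#define NR_CPUS  8
#define NR_PATHS 2

static int cpu_to_path[NR_CPUS];

/* Decided once at queue setup, never per-I/O. */
static void assign_paths(void)
{
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)
		cpu_to_path[cpu] = cpu % NR_PATHS;
}

/* Hot path: one lookup, no round-robin counters bouncing between
 * CPU caches. */
static int select_path(int cpu)
{
	return cpu_to_path[cpu];
}

int main(void)
{
	int cpu;

	assign_paths();
	for (cpu = 0; cpu < NR_CPUS; cpu++)
		printf("cpu %d -> path %c\n", cpu,
		       select_path(cpu) ? 'B' : 'A');
	return 0;
}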
Re: [PATCH 0/3] Provide more fine grained control over multipathing
> > We plan to implement all the fancy NVMe standards like ANA, but it
> > seems that there is still a requirement to let the host side choose
> > policies about how to use paths (round-robin vs least queue depth for
> > example). Even in the modern SCSI world with VPD pages and ALUA,
> > there are still knobs that are needed. Maybe NVMe will be different
> > and we can find defaults that work in all cases but I have to admit
> > I'm skeptical...
>
> The sensible thing to do in nvme is to use different paths for
> different queues.

Huh? different paths == different controllers, so this sentence can't be right... you mean that a path selector will select a controller based on the home node of the local rdma device connecting to it and the running cpu, right?
Re: [PATCH 0/3] Provide more fine grained control over multipathing
> The sensible thing to do in nvme is to use different paths for
> different queues. That is e.g. in the RDMA case use the HCA closer
> to a given CPU by default. We might allow to override this for
> cases where there is a good reason, but what I really don't want is
> configurability for configurability's sake.

That makes sense but I'm not sure it covers everything. Probably the most common way to do NVMe/RDMA will be with a single HCA that has multiple ports, so there's no sensible CPU locality. On the other hand we want to keep both ports to the fabric busy. Setting different paths for different queues makes sense, but there may be single-threaded applications that want a different policy.

I'm not saying anything very profound, but we have to find the right balance between too many and too few knobs.

 - R.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Mon, Jun 04, 2018 at 02:58:49PM -0700, Roland Dreier wrote:
> We plan to implement all the fancy NVMe standards like ANA, but it
> seems that there is still a requirement to let the host side choose
> policies about how to use paths (round-robin vs least queue depth for
> example). Even in the modern SCSI world with VPD pages and ALUA,
> there are still knobs that are needed. Maybe NVMe will be different
> and we can find defaults that work in all cases but I have to admit
> I'm skeptical...

The sensible thing to do in nvme is to use different paths for different queues. That is e.g. in the RDMA case use the HCA closer to a given CPU by default. We might allow to override this for cases where there is a good reason, but what I really don't want is configurability for configurability's sake.
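[Editorial note: a hedged sketch of the default being described here -- prefer the path whose HCA shares a NUMA node with the submitting CPU, and fall back to any usable path. The struct and function names are illustrative, not the driver's actual structures.]

#include <stdbool.h>
#include <stdio.h>

struct path_info {
	int  numa_node;   /* home node of the HCA behind this path */
	bool usable;
};

/* Return a usable path local to cpu_node, else the first usable
 * path, else -1. */
static int select_numa_local_path(const struct path_info *paths,
				  int npaths, int cpu_node)
{
	int i, fallback = -1;

	for (i = 0; i < npaths; i++) {
		if (!paths[i].usable)
			continue;
		if (paths[i].numa_node == cpu_node)
			return i;         /* local HCA wins by default */
		if (fallback < 0)
			fallback = i;     /* remember any usable path */
	}
	return fallback;
}

int main(void)
{
	struct path_info paths[] = {
		{ .numa_node = 0, .usable = true },   /* HCA on node 0 */
		{ .numa_node = 1, .usable = true },   /* HCA on node 1 */
	};

	/* A CPU on node 1 gets routed to the node-1 HCA. */
	printf("cpu on node 1 -> path %d\n",
	       select_numa_local_path(paths, 2, 1));
	return 0;
}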
Re: [PATCH 0/3] Provide more fine grained control over multipathing
> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well...

So there is another angle to this. As a vendor building an NVMe-oF storage array, I can say that clarity around how Linux wants to handle NVMe multipath would definitely be appreciated. It would be great if we could all converge on the upstream native driver, but right now it doesn't look adequate - having only a single active path is not the best way to use a multi-controller storage system. Unfortunately it looks like we're headed for a world where people have to write separate "best practices" documents to cover RHEL, SLES and other vendors.

We plan to implement all the fancy NVMe standards like ANA, but it seems that there is still a requirement to let the host side choose policies about how to use paths (round-robin vs least queue depth, for example). Even in the modern SCSI world with VPD pages and ALUA, there are still knobs that are needed. Maybe NVMe will be different and we can find defaults that work in all cases, but I have to admit I'm skeptical...

 - R.
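[Editorial note: for reference, the two policies named above differ only in the selection step. A least-queue-depth selector amounts to picking the usable path with the fewest in-flight commands; a hedged sketch with illustrative names follows -- this is not dm-multipath's actual selector code.]

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct lqd_path {
	atomic_int inflight;   /* ++ on submit, -- on completion */
	bool usable;
};

/* Pick the usable path with the fewest outstanding I/Os and account
 * for the new submission; returns -1 if no path is usable. */
static int select_least_queue_depth(struct lqd_path *paths, int npaths)
{
	int i, best = -1, best_depth = 0;

	for (i = 0; i < npaths; i++) {
		int depth = atomic_load(&paths[i].inflight);

		if (!paths[i].usable)
			continue;
		if (best < 0 || depth < best_depth) {
			best = i;
			best_depth = depth;
		}
	}
	if (best >= 0)
		atomic_fetch_add(&paths[best].inflight, 1);
	return best;
}

int main(void)
{
	struct lqd_path paths[2] = {
		{ .inflight = 3, .usable = true },
		{ .inflight = 1, .usable = true },
	};

	/* Path 1 has the shallower queue, so it gets the next I/O. */
	printf("next I/O -> path %d\n",
	       select_least_queue_depth(paths, 2));
	return 0;
}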
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Mon, Jun 04, 2018 at 02:46:47PM +0300, Sagi Grimberg wrote:
> I agree with Christoph that changing personality on the fly is going to
> be painful. This opt-in will need to be one-shot at connect time. For
> that, we will probably need to also expose an argument in nvme-cli too.
> Changing the mpath personality will need to involve disconnecting the
> controller and connecting again with the argument toggled. I think this
> is the only sane way to do this.

If we still want to make it dynamic, yes. I've raised this concern while working on the patch as well.

> Another path we can make progress in is user visibility. We have
> topology in place and you mentioned primary path (which we could
> probably add). What else do you need for multipath-tools to support
> nvme?

I think the first priority is getting a notion of nvme into multipath-tools, like I said elsewhere, and then see. Martin Wilck was already working on patches for this.

--
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
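[Editorial note: to make the connect-time opt-in concrete, the flow could look roughly like the following from the shell. The --mpath-personality option is purely hypothetical -- a sketch of the nvme-cli argument proposed above; only -t, -a and -n are existing nvme-cli connect flags, and the NQN and address are made up.]

    # Tear down the existing association...
    nvme disconnect -n nqn.2018-06.example:subsys0

    # ...and reconnect with the personality toggled (hypothetical flag).
    nvme connect -t rdma -a 192.168.1.10 -n nqn.2018-06.example:subsys0 \
            --mpath-personality=dm

The personality would then be fixed for the lifetime of the controller association, avoiding the runtime-switch hazards discussed elsewhere in the thread.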
Re: [PATCH 0/3] Provide more fine grained control over multipathing
[so much for putting out flames... :/]

> This projecting onto me that I've not been keeping the conversation
> technical is in itself hostile. Sure I get frustrated and lash out (as
> I'm _sure_ you'll feel in this reply)

You're right, I do feel this is lashing out. And I don't appreciate it. Please stop it. We're not going to make progress otherwise.

> > Can you (or others) please try and articulate why a "fine grained"
> > multipathing is an absolute must? At the moment, I just don't
> > understand.
>
> Already made the point multiple times in this thread [3][4][5][1].
> Hint: it is about the users who have long-standing expertise and
> automation built around dm-multipath and multipath-tools. BUT those
> same users may need/want to simultaneously use native NVMe multipath
> on the same host. Dismissing this point or acting like I haven't
> articulated it just illustrates to me continuing this conversation is
> not going to be fruitful.

The vast majority of the points are about the fact that people still need to be able to use multipath-tools, which they still can today.

Personally, I question the existence of this user base you are referring to, which would want to maintain both dm-multipath and nvme personalities at the same time on the same host. But I do want us to make progress, so I will have to take this need as a given.

I agree with Christoph that changing personality on the fly is going to be painful. This opt-in will need to be one-shot at connect time. For that, we will probably need to also expose an argument in nvme-cli too. Changing the mpath personality will need to involve disconnecting the controller and connecting again with the argument toggled. I think this is the only sane way to do this.

Another path we can make progress in is user visibility. We have topology in place and you mentioned primary path (which we could probably add). What else do you need for multipath-tools to support nvme?
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Sun, Jun 03 2018 at 7:00P -0400, Sagi Grimberg wrote:
>
> > I'm aware that most everything in multipath.conf is SCSI/FC specific.
> > That isn't the point. dm-multipath and multipathd are an existing
> > framework for managing multipath storage.
> >
> > It could be made to work with NVMe. But yes it would not be easy.
> > Especially not with the native NVMe multipath crew being so damn
> > hostile.
>
> The resistance is not a hostile act. Please try and keep the
> discussion technical.

This projecting onto me that I've not been keeping the conversation technical is in itself hostile. Sure I get frustrated and lash out (as I'm _sure_ you'll feel in this reply) but I've been beating my head against the wall on the need for native NVMe multipath and dm-multipath to coexist in a fine-grained manner for literally 2 years! But for the time-being I was done dwelling on the need for a switch like mpath_personality. Yet you persist. If you read the latest messages in this thread [1] and still elected to send this message, then _that_ is a hostile act. Because I have been nothing but informative. The fact you choose not to care about, appreciate, or have concern for users' experience isn't my fault.

And please don't pretend like the entire evolution of native NVMe multipath was anything but one elaborate hostile act against dm-multipath. To deny that would simply discredit your entire viewpoint on this topic. Even smaller decisions that were communicated in person and then later unilaterally reversed were hostile. Examples:

1) ANA would serve as a scsi device handler-like (multipath-agnostic) feature to enhance namespaces -- now you can see in the v2 implementation that certainly isn't the case

2) The dm-multipath path-selectors were going to be elevated for use by both native NVMe multipath and dm-multipath -- now people are implementing yet another round-robin path selector directly in NVMe.

I get it, Christoph (and others by association) are operating from a "winning" position that was hostilely taken, and now the winning position is being leveraged to further ensure dm-multipath has no hope of being a viable alternative to native NVMe multipath -- at least not without a lot of work to refactor code to be unnecessarily homed in the CONFIG_NVME_MULTIPATH=y sandbox.

> > > But I don't think the burden of allowing multipathd/DM to inject
> > > themselves into the path transition state machine has any benefit
> > > whatsoever to the user. It's only complicating things and therefore
> > > we'd be doing people a disservice rather than a favor.
> >
> > This notion that only native NVMe multipath can be successful is utter
> > bullshit. And the mere fact that I've gotten such a reaction from a
> > select few speaks to some serious control issues.
> >
> > Imagine if XFS developers just one day imposed that it is the _only_
> > filesystem that can be used on persistent memory.
> >
> > Just please dial it back.. seriously tiresome.
>
> Mike, you make a fair point on multipath tools being more mature
> compared to NVMe multipathing. But this is not the discussion at all (at
> least not from my perspective). There was not a single use-case that
> gave a clear-cut justification for a per-subsystem personality switch
> (other than some far fetched imaginary scenarios). This is not unusual
> for the kernel community not to accept things with little to no use,
> especially when it involves exposing a userspace ABI.

The interfaces dm-multipath and multipath-tools provide are exactly the issue.

So which is it, do I have a valid use case, like you indicated before [2], or am I just talking nonsense (with hypotheticals, because I was baited to do so)?

NOTE: even in your [2] reply you also go on to say that "no one is forbidden to use [dm-]multipath." when the reality is users will be, as-is.

If you and others genuinely think that disallowing dm-multipath from being able to manage NVMe devices if CONFIG_NVME_MULTIPATH is enabled (and not shut off via nvme_core.multipath=N) is a reasonable action, then you're actively complicit in limiting users from continuing to use the long-established dm-multipath based infrastructure that Linux has had for over 10 years.

There is literally no reason why they need to be mutually exclusive (other than that to grant otherwise would erode the "winning" position hch et al have been operating from). The implementation of the switch to allow fine-grained control does need proper care, review and buy-in. But I'm sad to see there literally is zero willingness to even acknowledge that it is "the right thing to do".

> As for now, all I see is a disclaimer saying that it'd need to be
> nurtured over time as the NVMe spec evolves.
>
> Can you (or others) please try and articulate why a "fine grained"
> multipathing is an absolute must? At the moment, I just don't
> understand.

Already made the point multiple times in this thread [3][4][5][1]. Hint: it is about the users who have long-standing expertise
Re: [PATCH 0/3] Provide more fine grained control over multipathing
> I'm aware that most everything in multipath.conf is SCSI/FC specific.
> That isn't the point. dm-multipath and multipathd are an existing
> framework for managing multipath storage.
>
> It could be made to work with NVMe. But yes it would not be easy.
> Especially not with the native NVMe multipath crew being so damn
> hostile.

The resistance is not a hostile act. Please try and keep the discussion technical.

> > But I don't think the burden of allowing multipathd/DM to inject
> > themselves into the path transition state machine has any benefit
> > whatsoever to the user. It's only complicating things and therefore
> > we'd be doing people a disservice rather than a favor.
>
> This notion that only native NVMe multipath can be successful is utter
> bullshit. And the mere fact that I've gotten such a reaction from a
> select few speaks to some serious control issues.
>
> Imagine if XFS developers just one day imposed that it is the _only_
> filesystem that can be used on persistent memory.
>
> Just please dial it back.. seriously tiresome.

Mike, you make a fair point on multipath tools being more mature compared to NVMe multipathing. But this is not the discussion at all (at least not from my perspective). There was not a single use-case that gave a clear-cut justification for a per-subsystem personality switch (other than some far-fetched imaginary scenarios). It is not unusual for the kernel community to reject things with little to no use, especially when it involves exposing a userspace ABI.

As for now, all I see is a disclaimer saying that it'd need to be nurtured over time as the NVMe spec evolves.

Can you (or others) please try and articulate why "fine grained" multipathing is an absolute must? At the moment, I just don't understand.

Also, I get your point that exposing state/stats information to userspace is needed. That's a fair comment.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Fri, Jun 01 2018 at 10:09am -0400, Martin K. Petersen wrote:
>
> Good morning Mike,
>
> > This notion that only native NVMe multipath can be successful is utter
> > bullshit. And the mere fact that I've gotten such a reaction from a
> > select few speaks to some serious control issues.
>
> Please stop making this personal.

It cuts both ways, but I agree.

> > Imagine if XFS developers just one day imposed that it is the _only_
> > filesystem that can be used on persistent memory.
>
> It's not about project X vs. project Y at all. This is about how we got
> to where we are today. And whether we are making the right decisions
> that will benefit our users in the long run.
>
> 20 years ago there were several device-specific SCSI multipath drivers
> available for Linux. All of them out-of-tree because there was no good
> way to consolidate them. They all worked in very different ways because
> the devices themselves were implemented in very different ways. It was
> a nightmare.
>
> At the time we were very proud of our block layer, an abstraction none
> of the other operating systems really had. And along came Ingo and
> Miguel and did a PoC MD multipath implementation for devices that
> didn't have special needs. It was small, beautiful, and fit well into
> our shiny block layer abstraction. And therefore everyone working on
> Linux storage at the time was convinced that the block layer multipath
> model was the right way to go. Including, I must emphasize, yours
> truly.
>
> There were several reasons why the block + userland model was
> especially compelling:
>
> 1. There were no device serial numbers, UUIDs, or VPD pages. So short
> of disk labels, there was no way to automatically establish that block
> device sda was in fact the same LUN as sdb. MD and DM were existing
> vehicles for describing block device relationships. Either via on-disk
> metadata or config files and device mapper tables. And system
> configurations were simple and static enough then that manually
> maintaining a config file wasn't much of a burden.
>
> 2. There was lots of talk in the industry about devices supporting
> heterogeneous multipathing. As in ATA on one port and SCSI on the
> other. So we deliberately did not want to put multipathing in SCSI,
> anticipating that these hybrid devices might show up (this was in the
> IDE days, obviously, predating libata sitting under SCSI). We made
> several design compromises wrt. SCSI devices to accommodate future
> coexistence with ATA. Then iSCSI came along and provided a "cheaper
> than FC" solution and everybody instantly lost interest in ATA
> multipath.
>
> 3. The devices at the time needed all sorts of custom knobs to
> function. Path checkers, load balancing algorithms, explicit failover,
> etc. We needed a way to run arbitrary, potentially proprietary,
> commands to initiate failover and failback. Absolute no-go for the
> kernel so userland it was.
>
> Those are some of the considerations that went into the original MD/DM
> multipath approach. Everything made lots of sense at the time. But
> obviously the industry constantly changes, things that were once
> important no longer matter. Some design decisions were made based on
> incorrect assumptions or lack of experience and we ended up with major
> ad-hoc workarounds to the originally envisioned approach. SCSI device
> handlers are the prime examples of how the original transport-agnostic
> model didn't quite cut it. Anyway. So here we are. Current DM multipath
> is a result of a whole string of design decisions, many of which are
> based on assumptions that were valid at the time but which are no
> longer relevant today.
>
> ALUA came along in an attempt to standardize all the proprietary device
> interactions, thus obsoleting the userland plugin requirement. It also
> solved the ID/discovery aspect as well as provided a way to express
> fault domains. The main problem with ALUA was that it was too
> permissive, letting storage vendors get away with very suboptimal, yet
> compliant, implementations based on their older, proprietary multipath
> architectures. So we got the knobs standardized, but device behavior
> was still all over the place.
>
> Now enter NVMe. The industry had a chance to clean things up. No legacy
> architectures to accommodate, no need for explicit failover, twiddling
> mode pages, reading sector 0, etc. The rationale behind ANA is for
> multipathing to work without any of the explicit configuration and
> management hassles which riddle SCSI devices for hysterical raisins.

Nice recap for those who aren't aware of the past (decision tree and considerations that influenced the design of DM multipath).

> My objection to DM vs. NVMe enablement is that I think that the two
> models are a very poor fit (manually configured individual block device
> mapping vs. automatic grouping/failover above and below subsystem
> level). On top of that, no compelling technical reason
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Good morning Mike,

> This notion that only native NVMe multipath can be successful is utter
> bullshit. And the mere fact that I've gotten such a reaction from a
> select few speaks to some serious control issues.

Please stop making this personal.

> Imagine if XFS developers just one day imposed that it is the _only_
> filesystem that can be used on persistent memory.

It's not about project X vs. project Y at all. This is about how we got to where we are today. And whether we are making the right decisions that will benefit our users in the long run.

20 years ago there were several device-specific SCSI multipath drivers available for Linux. All of them out-of-tree because there was no good way to consolidate them. They all worked in very different ways because the devices themselves were implemented in very different ways. It was a nightmare.

At the time we were very proud of our block layer, an abstraction none of the other operating systems really had. And along came Ingo and Miguel and did a PoC MD multipath implementation for devices that didn't have special needs. It was small, beautiful, and fit well into our shiny block layer abstraction. And therefore everyone working on Linux storage at the time was convinced that the block layer multipath model was the right way to go. Including, I must emphasize, yours truly.

There were several reasons why the block + userland model was especially compelling:

1. There were no device serial numbers, UUIDs, or VPD pages. So short of disk labels, there was no way to automatically establish that block device sda was in fact the same LUN as sdb. MD and DM were existing vehicles for describing block device relationships. Either via on-disk metadata or config files and device mapper tables. And system configurations were simple and static enough then that manually maintaining a config file wasn't much of a burden.

2. There was lots of talk in the industry about devices supporting heterogeneous multipathing. As in ATA on one port and SCSI on the other. So we deliberately did not want to put multipathing in SCSI, anticipating that these hybrid devices might show up (this was in the IDE days, obviously, predating libata sitting under SCSI). We made several design compromises wrt. SCSI devices to accommodate future coexistence with ATA. Then iSCSI came along and provided a "cheaper than FC" solution and everybody instantly lost interest in ATA multipath.

3. The devices at the time needed all sorts of custom knobs to function. Path checkers, load balancing algorithms, explicit failover, etc. We needed a way to run arbitrary, potentially proprietary, commands to initiate failover and failback. Absolute no-go for the kernel so userland it was.

Those are some of the considerations that went into the original MD/DM multipath approach. Everything made lots of sense at the time. But obviously the industry constantly changes, things that were once important no longer matter. Some design decisions were made based on incorrect assumptions or lack of experience and we ended up with major ad-hoc workarounds to the originally envisioned approach. SCSI device handlers are the prime examples of how the original transport-agnostic model didn't quite cut it. Anyway. So here we are. Current DM multipath is a result of a whole string of design decisions, many of which are based on assumptions that were valid at the time but which are no longer relevant today.

ALUA came along in an attempt to standardize all the proprietary device interactions, thus obsoleting the userland plugin requirement. It also solved the ID/discovery aspect as well as provided a way to express fault domains. The main problem with ALUA was that it was too permissive, letting storage vendors get away with very suboptimal, yet compliant, implementations based on their older, proprietary multipath architectures. So we got the knobs standardized, but device behavior was still all over the place.

Now enter NVMe. The industry had a chance to clean things up. No legacy architectures to accommodate, no need for explicit failover, twiddling mode pages, reading sector 0, etc. The rationale behind ANA is for multipathing to work without any of the explicit configuration and management hassles which riddle SCSI devices for hysterical raisins.

My objection to DM vs. NVMe enablement is that I think that the two models are a very poor fit (manually configured individual block device mapping vs. automatic grouping/failover above and below subsystem level). On top of that, no compelling technical reason has been offered for why DM multipath is actually a benefit. Nobody enjoys pasting WWNs or IQNs into multipath.conf to get things working. And there is no flag day/transition path requirement for devices that (with very few exceptions) don't actually exist yet. So I really don't understand why we must pound a square peg into a round hole. NVMe is a different protocol. It is based on several
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Thu, May 31 2018 at 10:40pm -0400, Martin K. Petersen wrote:
>
> Mike,
>
> > 1) container A is tasked with managing some dedicated NVMe technology
> > that absolutely needs native NVMe multipath.
> >
> > 2) container B is tasked with offering some canned layered product
> > that was developed on top of dm-multipath with its own multipath-tools
> > oriented APIs, etc. And it is to manage some other NVMe technology on
> > the same host as container A.
>
> This assumes there is something to manage. And that the administrative
> model currently employed by DM multipath will be easily applicable to
> ANA devices. I don't believe that's the case. The configuration happens
> on the storage side, not on the host.

Fair point.

> With ALUA (and the proprietary implementations that predated the spec),
> it was very fuzzy whether it was the host or the target that owned
> responsibility for this or that. Part of the reason was that ALUA was
> deliberately vague to accommodate everybody's existing, non-standards
> compliant multipath storage implementations.
>
> With ANA the heavy burden falls entirely on the storage. Most of the
> things you would currently configure in multipath.conf have no meaning
> in the context of ANA. Things that are currently the domain of
> dm-multipath or multipathd are inextricably living either in the
> storage device or in the NVMe ANA "device handler". And I think you are
> significantly underestimating the effort required to expose that
> information up the stack and to make use of it. That's not just a
> multipath personality toggle switch.

I'm aware that most everything in multipath.conf is SCSI/FC specific. That isn't the point. dm-multipath and multipathd are an existing framework for managing multipath storage.

It could be made to work with NVMe. But yes it would not be easy. Especially not with the native NVMe multipath crew being so damn hostile.

> If you want to make multipath -ll show something meaningful for ANA
> devices, then by all means go ahead. I don't have any problem with
> that.

Thanks so much for your permission ;) But I'm actually not very involved with multipathd development anyway. It is likely a better use of time in the near-term though. Making the multipath tools and libraries able to understand native NVMe multipath in all its glory might be a means to an end from a compatibility-with-existing-monitoring-applications perspective. Though NVMe just doesn't have per-device accounting at all. Also, I'm not yet aware of how nvme-cli conveys paths being down vs up, etc. Glad that isn't my problem ;)

> But I don't think the burden of allowing multipathd/DM to inject
> themselves into the path transition state machine has any benefit
> whatsoever to the user. It's only complicating things and therefore
> we'd be doing people a disservice rather than a favor.

This notion that only native NVMe multipath can be successful is utter bullshit. And the mere fact that I've gotten such a reaction from a select few speaks to some serious control issues.

Imagine if XFS developers just one day imposed that it is the _only_ filesystem that can be used on persistent memory.

Just please dial it back.. seriously tiresome.
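[Editorial note: for context on the multipath.conf point, the knobs in question look roughly like the following SCSI-era example. It is purely illustrative, not a recommended configuration; the directives shown are standard multipath.conf options.]

    defaults {
            user_friendly_names  yes
            path_selector        "service-time 0"
            path_grouping_policy group_by_prio
            prio                 alua
            path_checker         tur
            failback             immediate
            no_path_retry        18
    }

Under ANA, the prioritizer, path checker and grouping policy are exactly the pieces whose equivalent information would come from the ANA state reported by the storage rather than from host-side configuration.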
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Thu, May 31 2018 at 12:34pm -0400, Christoph Hellwig wrote:

> On Thu, May 31, 2018 at 08:37:39AM -0400, Mike Snitzer wrote:
> > I saw your reply to the 1/3 patch.. I do agree it is broken for not
> > checking if any handles are active. But that is easily fixed no?
>
> Doing a switch at runtime simply is a really bad idea. If for some
> reason we end up with a good per-controller switch it would have
> to be something set at probe time, and to get it on a controller
> you'd need to reset it first.

Yes, I see that now. And the implementation would need to be something you or other more seasoned NVMe developers pursued. NVMe code is pretty unforgiving. I took a crack at aspects of this, and my head hurts.

While testing I hit some "interesting" lack of self-awareness about NVMe resources that are in use. So lots of associations get torn down rather than failing gracefully. Could be nvme_fcloop specific, but it is pretty easy to reproduce using mptest's lib/unittests/nvme_4port_create.sh followed by:

modprobe -r nvme_fcloop

Results in an infinite spew of:

[14245.345759] nvme_fcloop: fcloop_exit: Failed deleting remote port
[14245.351851] nvme_fcloop: fcloop_exit: Failed deleting target port
[14245.357944] nvme_fcloop: fcloop_exit: Failed deleting remote port
[14245.364038] nvme_fcloop: fcloop_exit: Failed deleting target port

Another fun one is to run lib/unittests/nvme_4port_delete.sh while the native NVMe multipath device (created from nvme_4port_create.sh) is still in use by an xfs mount, so:

./nvme_4port_create.sh
mount /dev/nvme1n1 /mnt
./nvme_4port_delete.sh
umount /mnt

Those were clear screwups on my part, but I wouldn't have expected them to cause nvme to blow through so many stop signs.

Anyway, I put enough time into trying to make the previously-thought-"simple" runtime mpath_personality switch safe -- in the face of active handles (the issue Sagi pointed out) -- that it is clear NVMe just doesn't have enough state to do it in a clean way. It would require a deeper understanding of the code than I have. Almost every NVMe function returns void, so there is basically no potential for error handling in the face of a resource being in use.

The following is my WIP patch (built on top of the 3 patches from this thread's series) that has cured me of wanting to continue pursuing a robust implementation of the runtime 'mpath_personality' switch:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1e018d0..80103b3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2146,10 +2146,8 @@ static ssize_t __nvme_subsys_store_mpath_personality(struct nvme_subsystem *subs
 		goto out;
 	}
 
-	if (subsys->native_mpath != native_mpath) {
-		subsys->native_mpath = native_mpath;
-		ret = nvme_mpath_change_personality(subsys);
-	}
+	if (subsys->native_mpath != native_mpath)
+		ret = nvme_mpath_change_personality(subsys, native_mpath);
 
 out:
 	return ret ? ret : count;
 }
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 53d2610..017c924 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -247,26 +247,57 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 	put_disk(head->disk);
 }
 
-int nvme_mpath_change_personality(struct nvme_subsystem *subsys)
+static bool __nvme_subsys_in_use(struct nvme_subsystem *subsys)
 {
 	struct nvme_ctrl *ctrl;
-	int ret = 0;
+	struct nvme_ns *ns, *next;
 
-restart:
-	mutex_lock(&subsys->lock);
 	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
-		if (!list_empty(&ctrl->namespaces)) {
-			mutex_unlock(&subsys->lock);
-			nvme_remove_namespaces(ctrl);
-			goto restart;
+		down_write(&ctrl->namespaces_rwsem);
+		list_for_each_entry_safe(ns, next, &ctrl->namespaces, list) {
+			if ((kref_read(&ns->kref) > 1) ||
+			    // FIXME: need to compare with N paths
+			    (ns->head && (kref_read(&ns->head->ref) > 1))) {
+				printk("ns->kref = %d", kref_read(&ns->kref));
+				printk("ns->head->ref = %d", kref_read(&ns->head->ref));
+				up_write(&ctrl->namespaces_rwsem);
+				mutex_unlock(&subsys->lock);
+				return true;
+			}
 		}
+		up_write(&ctrl->namespaces_rwsem);
 	}
-	mutex_unlock(&subsys->lock);
+
+	return false;
+}
+
+int nvme_mpath_change_personality(struct nvme_subsystem *subsys, bool native)
+{
+	struct nvme_ctrl *ctrl;
 
 	mutex_lock(&subsys->lock);
-	list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry)
-		nvme_queue_scan(ctrl);
+
+	if (__nvme_subsys_in_use(subsys)) {
+		mutex_unlock(&subsys->lock);
+		return -EBUSY;
+	}
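For contrast, a probe-time latch along the lines Christoph describes above might look roughly like this. This is only a sketch: the native_mpath connect option and both helpers are made-up names that do not exist in the driver; the point is that the personality is read exactly once, when the controller is created, so there is never an active handle to race with and no runtime -EBUSY dance:

/* Hypothetical probe-time personality latch; none of these names are real. */
static bool nvme_use_native_mpath(struct nvme_ctrl *ctrl)
{
        /*
         * Read the requested personality once, at probe/connect time.
         * Changing it afterwards requires deleting or resetting the
         * controller, exactly as Christoph suggests.
         */
        return ctrl->opts ? ctrl->opts->native_mpath : true;
}

static void nvme_init_subsystem_personality(struct nvme_subsystem *subsys,
                                            struct nvme_ctrl *ctrl)
{
        /* Latched once for the life of the subsystem. */
        subsys->native_mpath = nvme_use_native_mpath(ctrl);
}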
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Mike,

> 1) container A is tasked with managing some dedicated NVMe technology
> that absolutely needs native NVMe multipath.
>
> 2) container B is tasked with offering some canned layered product
> that was developed on top of dm-multipath with its own multipath-tools
> oriented APIs, etc. And it is to manage some other NVMe technology on
> the same host as container A.

This assumes there is something to manage. And that the administrative model currently employed by DM multipath will be easily applicable to ANA devices. I don't believe that's the case. The configuration happens on the storage side, not on the host.

With ALUA (and the proprietary implementations that predated the spec), it was very fuzzy whether it was the host or the target that owned responsibility for this or that. Part of the reason was that ALUA was deliberately vague to accommodate everybody's existing, non-standards compliant multipath storage implementations.

With ANA the heavy burden falls entirely on the storage. Most of the things you would currently configure in multipath.conf have no meaning in the context of ANA. Things that are currently the domain of dm-multipath or multipathd are inextricably living either in the storage device or in the NVMe ANA "device handler". And I think you are significantly underestimating the effort required to expose that information up the stack and to make use of it. That's not just a multipath personality toggle switch.

If you want to make multipath -ll show something meaningful for ANA devices, then by all means go ahead. I don't have any problem with that.

But I don't think the burden of allowing multipathd/DM to inject themselves into the path transition state machine has any benefit whatsoever to the user. It's only complicating things and therefore we'd be doing people a disservice rather than a favor.

--
Martin K. Petersen
Oracle Linux Engineering
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Thu, May 31 2018 at 12:33pm -0400, Christoph Hellwig wrote:

> On Wed, May 30, 2018 at 06:02:06PM -0400, Mike Snitzer wrote:
> > Because once nvme_core.multipath=N is set: native NVMe multipath is then
> > not accessible from the same host. The goal of this patchset is to give
> > users choice. But not limit them to _only_ using dm-multipath if they
> > just have some legacy needs.
>
> Choice by itself really isn't an argument. We need a really good
> use case for all the complexity, and so far none has been presented.

OK, but it's choice that is governed by higher-level requirements that _I_ personally don't have. They are large datacenter deployments like Hannes alluded to [1], where there is heavy automation and/or layered products that are developed around dm-multipath (via libraries to access multipath-tools provided info, etc).

So trying to pin me down on _why_ users elect to make this choice (or why there is such annoying inertia behind their choice) really isn't fair TBH. They exist. Please just accept that.

Now another hypothetical usecase I thought of today, one that really drives home _why_ it is useful to have this fine-grained 'mpath_personality' flexibility, is: admin containers. (Not saying people or companies currently do this, or plan to, but they very easily could...)

1) container A is tasked with managing some dedicated NVMe technology that absolutely needs native NVMe multipath.

2) container B is tasked with offering some canned layered product that was developed on top of dm-multipath with its own multipath-tools oriented APIs, etc. And it is to manage some other NVMe technology on the same host as container A.

So, containers with conflicting requirements running on the same host. Now you can say: sorry, don't do that. But that really isn't a valid counter.

Point is, it really is meaningful to offer this 'mpath_personality' switch. I'm obviously hopeful for it to not be heavily used, BUT not providing the ability for native NVMe multipath and dm-multipath to coexist on the same Linux host really isn't viable in the near term.

Mike

[1] https://lkml.org/lkml/2018/5/29/95
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Thu, May 31, 2018 at 11:37:20AM +0300, Sagi Grimberg wrote:

> > the same host with PCI NVMe could be connected to a FC network that has
> > historically always been managed via dm-multipath.. but say that
> > FC-based infrastructure gets updated to use NVMe (to leverage a wider
> > NVMe investment, whatever?) -- but maybe admins would still prefer to
> > use dm-multipath for the NVMe over FC.
>
> You are referring to an array exposing media via nvmf and scsi
> simultaneously? I'm not sure that there is a clean definition of
> how that is supposed to work (ANA/ALUA, reservations, etc..)

It seems like this isn't what Mike wanted, but I actually got some requests for limited support for that, to do a storage live migration from a SCSI array to NVMe. I think it is really sketchy, but it is doable if you are careful enough. It would use dm-multipath, possibly even on top of nvme multipathing if we have multiple nvme paths.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Thu, May 31, 2018 at 08:37:39AM -0400, Mike Snitzer wrote:

> I saw your reply to the 1/3 patch.. I do agree it is broken for not
> checking if any handles are active. But that is easily fixed no?

Doing a switch at runtime simply is a really bad idea. If for some reason we end up with a good per-controller switch it would have to be something set at probe time, and to get it on a controller you'd need to reset it first.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Wed, May 30, 2018 at 06:02:06PM -0400, Mike Snitzer wrote:

> Because once nvme_core.multipath=N is set: native NVMe multipath is then
> not accessible from the same host. The goal of this patchset is to give
> users choice. But not limit them to _only_ using dm-multipath if they
> just have some legacy needs.

Choice by itself really isn't an argument. We need a really good use case for all the complexity, and so far none has been presented.

> Tough to be convincing with hypotheticals but I could imagine a very
> obvious usecase for native NVMe multipathing being PCI-based embedded NVMe
> "fabrics" (especially if/when the numa-based path selector lands). But
> the same host with PCI NVMe could be connected to a FC network that has
> historically always been managed via dm-multipath.. but say that
> FC-based infrastructure gets updated to use NVMe (to leverage a wider
> NVMe investment, whatever?) -- but maybe admins would still prefer to
> use dm-multipath for the NVMe over FC.

That is a lot of maybes. If they prefer the good old way on FC, then they can easily stay with SCSI, or for that matter use the global switch to turn native multipathing off.

> > This might sound stupid to you, but can't users that desperately must
> > keep using dm-multipath (for its mature toolset or what-not) just
> > stack it on a multipath nvme device? (I might be completely off on
> > this so feel free to correct my ignorance).
>
> We could certainly pursue adding multipath-tools support for native NVMe
> multipathing. Not opposed to it (even if just reporting topology and
> state). But given the extensive lengths NVMe multipath goes to hide
> devices we'd need some way of piercing through the opaque nvme device
> that native NVMe multipath exposes. But that really is a tangent
> relative to this patchset. Since that kind of visibility would also
> benefit nvme-cli... otherwise how are users even to be able to trust
> but verify that native NVMe multipathing did what they expected it to?

Just look at the nvme-cli output or sysfs. It's all been there since the code was merged to mainline.
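To make the sysfs suggestion concrete, here is a minimal userspace sketch of roughly what nvme list-subsys walks. It assumes the /sys/class/nvme-subsystem layout that native multipath exposes, and it does only crude name filtering (namespace head entries like nvme0n1 would match too; a real tool would tell them apart):

#include <dirent.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
        const char *base = "/sys/class/nvme-subsystem";
        DIR *subsys_dir = opendir(base);
        struct dirent *s, *c;

        if (!subsys_dir)
                return 1;       /* no native-multipath subsystems */

        while ((s = readdir(subsys_dir))) {
                char path[512];
                DIR *ctrl_dir;

                if (strncmp(s->d_name, "nvme-subsys", 11))
                        continue;
                printf("%s\n", s->d_name);

                /* controllers (the real paths) hang off the subsystem dir */
                snprintf(path, sizeof(path), "%s/%s", base, s->d_name);
                ctrl_dir = opendir(path);
                if (!ctrl_dir)
                        continue;
                while ((c = readdir(ctrl_dir))) {
                        if (!strncmp(c->d_name, "nvme", 4) &&
                            strncmp(c->d_name, "nvme-subsys", 11))
                                printf(" +- %s\n", c->d_name);
                }
                closedir(ctrl_dir);
        }
        closedir(subsys_dir);
        return 0;
}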
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Thu, May 31 2018 at 4:51am -0400, Sagi Grimberg wrote:

> >> Moreover, I also wanted to point out that fabrics array vendors are
> >> building products that rely on standard nvme multipathing (and probably
> >> multipathing over dispersed namespaces as well), and keeping a knob that
> >> will keep nvme users with dm-multipath will probably not help them
> >> educate their customers as well... So there is another angle to this.
> >
> > Noticed I didn't respond directly to this aspect. As I explained in
> > various replies to this thread: the users/admins would be the ones who
> > would decide to use dm-multipath. It wouldn't be something that'd be
> > imposed by default. If anything, the all-or-nothing
> > nvme_core.multipath=N would pose a much more serious concern for these
> > array vendors that do have designs to specifically leverage native NVMe
> > multipath. Because if users were to get into the habit of setting that
> > on the kernel commandline they'd literally _never_ be able to leverage
> > native NVMe multipathing.
> >
> > We can also add multipath.conf docs (man page, etc) that caution admins
> > to consult their array vendors about whether using dm-multipath is to be
> > avoided, etc.
> >
> > Again, this is opt-in, so at the upstream Linux kernel level the default
> > of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH
> > is configured). Not seeing why there is so much angst and concern about
> > offering this flexibility via opt-in, but I'm also glad we're having this
> > discussion to have our eyes wide open.
>
> I think that the concern is valid and should not be dismissed. And
> at times flexibility is a real source of pain, both to users and
> developers.
>
> The choice is there, no one is forbidden to use multipath. I'm just
> still not sure exactly why the subsystem granularity is an absolute
> must, other than a volume exposed as an nvmf namespace and scsi lun (how
> would dm-multipath detect this is the same device btw?)

Please see my other reply; I was talking about completely disjoint arrays in my hypothetical config, where having the ability to allow simultaneous use of native NVMe multipath and dm-multipath is meaningful.

Mike
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Thu, May 31 2018 at 4:37am -0400, Sagi Grimberg wrote:

> > Wouldn't expect you guys to nurture this 'mpath_personality' knob. So
> > when features like "dispersed namespaces" land, a negative check would
> > need to be added in the code to prevent switching from "native".
> >
> > And once something like "dispersed namespaces" lands we'd then have to
> > see about a more sophisticated switch that operates at a different
> > granularity. Could also be that switching one subsystem that is part of
> > "dispersed namespaces" would then cascade to all other associated
> > subsystems? Not that dissimilar from the 3rd patch in this series that
> > allows a 'device' switch to be done in terms of the subsystem.
>
> Which I think is broken by allowing to change this personality on the
> fly.

I saw your reply to the 1/3 patch.. I do agree it is broken for not checking if any handles are active. But that is easily fixed, no? Or are you suggesting some other aspect of "broken"?

> > Anyway, I don't know the end from the beginning on something you just
> > told me about ;) But we're all in this together. And can take it as it
> > comes.
>
> I agree but this will be exposed to user-space and we will need to live
> with it for a long long time...

OK, well dm-multipath has been around for a long long time. We cannot simply wish it away, regardless of whatever architectural grievances are levied against it. There are far more customer and vendor products that have been developed to understand and consume dm-multipath and multipath-tools interfaces than native NVMe multipath.

> >> Don't get me wrong, I do support your cause, and I think nvme should try
> >> to help, I just think that subsystem granularity is not the correct
> >> approach going forward.
> >
> > I understand there will be limits to this 'mpath_personality' knob's
> > utility and it'll need to evolve over time. But the burden of making
> > more advanced NVMe multipath features accessible outside of native NVMe
> > isn't intended to be on any of the NVMe maintainers (other than maybe
> > remembering to disallow the switch where it makes sense in the future).
>
> I would expect that any "advanced multipath features" would be properly
> brought up with the NVMe TWG as a ratified standard and find its way
> to nvme. So I don't think this particularly is a valid argument.

You're misreading me again. I'm also saying stop worrying. I'm saying any future native NVMe multipath features that come about don't necessarily get immediate dm-multipath parity. The native NVMe multipath would need appropriate negative checks.

> >> As I said, I've been off the grid, can you remind me why a global knob is
> >> not sufficient?
> >
> > Because once nvme_core.multipath=N is set: native NVMe multipath is then
> > not accessible from the same host. The goal of this patchset is to give
> > users choice. But not limit them to _only_ using dm-multipath if they
> > just have some legacy needs.
> >
> > Tough to be convincing with hypotheticals but I could imagine a very
> > obvious usecase for native NVMe multipathing being PCI-based embedded NVMe
> > "fabrics" (especially if/when the numa-based path selector lands). But
> > the same host with PCI NVMe could be connected to a FC network that has
> > historically always been managed via dm-multipath.. but say that
> > FC-based infrastructure gets updated to use NVMe (to leverage a wider
> > NVMe investment, whatever?) -- but maybe admins would still prefer to
> > use dm-multipath for the NVMe over FC.
>
> You are referring to an array exposing media via nvmf and scsi
> simultaneously? I'm not sure that there is a clean definition of
> how that is supposed to work (ANA/ALUA, reservations, etc..)

No, I'm referring to completely disjoint arrays that are homed to the same host.

> >> This might sound stupid to you, but can't users that desperately must
> >> keep using dm-multipath (for its mature toolset or what-not) just
> >> stack it on a multipath nvme device? (I might be completely off on
> >> this so feel free to correct my ignorance).
> >
> > We could certainly pursue adding multipath-tools support for native NVMe
> > multipathing. Not opposed to it (even if just reporting topology and
> > state). But given the extensive lengths NVMe multipath goes to hide
> > devices we'd need some way of piercing through the opaque nvme device
> > that native NVMe multipath exposes. But that really is a tangent
> > relative to this patchset. Since that kind of visibility would also
> > benefit nvme-cli... otherwise how are users even to be able to trust
> > but verify that native NVMe multipathing did what they expected it to?
>
> Can you explain what is missing for multipath-tools to resolve topology?

I've not pored over these nvme interfaces (below I just learned nvme-cli has since grown the capability). So I'm not informed enough to know if nvme-cli has grown other new capabilities. In any case, training multipath-tools to understand native NVMe multipath
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Moreover, I also wanted to point out that fabrics array vendors are building products that rely on standard nvme multipathing (and probably multipathing over dispersed namespaces as well), and keeping a knob that will keep nvme users with dm-multipath will probably not help them educate their customers as well... So there is another angle to this.

Noticed I didn't respond directly to this aspect. As I explained in various replies to this thread: the users/admins would be the ones who would decide to use dm-multipath. It wouldn't be something that'd be imposed by default. If anything, the all-or-nothing nvme_core.multipath=N would pose a much more serious concern for these array vendors that do have designs to specifically leverage native NVMe multipath. Because if users were to get into the habit of setting that on the kernel commandline they'd literally _never_ be able to leverage native NVMe multipathing.

We can also add multipath.conf docs (man page, etc) that caution admins to consult their array vendors about whether using dm-multipath is to be avoided, etc.

Again, this is opt-in, so at the upstream Linux kernel level the default of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH is configured). Not seeing why there is so much angst and concern about offering this flexibility via opt-in, but I'm also glad we're having this discussion to have our eyes wide open.

I think that the concern is valid and should not be dismissed. And at times flexibility is a real source of pain, both to users and developers.

The choice is there, no one is forbidden to use multipath. I'm just still not sure exactly why the subsystem granularity is an absolute must, other than a volume exposed as an nvmf namespace and scsi lun (how would dm-multipath detect this is the same device btw?)
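On the "how would dm-multipath detect this is the same device" question: multipathd keys paths by WWID, and the kernel already derives one per NVMe namespace from NGUID/EUI-64 and exposes it in sysfs; whether a SCSI LUN view of the same media would ever report a matching identifier is exactly the undefined part. A trivial sketch, assuming the /sys/block/<ns>/wwid attribute:

#include <stdio.h>

/* Print the kernel-derived WWID for an NVMe namespace block device. */
int main(int argc, char **argv)
{
        char path[256], wwid[256];
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/wwid",
                 argc > 1 ? argv[1] : "nvme0n1");
        f = fopen(path, "r");
        if (!f)
                return 1;
        if (fgets(wwid, sizeof(wwid), f))
                printf("%s", wwid);     /* e.g. "eui.0025388a710bc4d6" */
        fclose(f);
        return 0;
}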
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Wouldn't expect you guys to nurture this 'mpath_personality' knob. So when features like "dispersed namespaces" land, a negative check would need to be added in the code to prevent switching from "native".

And once something like "dispersed namespaces" lands we'd then have to see about a more sophisticated switch that operates at a different granularity. Could also be that switching one subsystem that is part of "dispersed namespaces" would then cascade to all other associated subsystems? Not that dissimilar from the 3rd patch in this series that allows a 'device' switch to be done in terms of the subsystem.

Which I think is broken by allowing to change this personality on the fly.

Anyway, I don't know the end from the beginning on something you just told me about ;) But we're all in this together. And can take it as it comes.

I agree, but this will be exposed to user-space and we will need to live with it for a long long time...

I'm merely trying to bridge the gap from old dm-multipath while native NVMe multipath gets its legs. In time I really do have aspirations to contribute more to NVMe multipathing. I think Christoph's NVMe multipath implementation of a bio-based device on top of NVMe core's blk-mq device(s) is very clever and effective (blk_steal_bios() hack and all -- see the sketch after this message). That's great.

Don't get me wrong, I do support your cause, and I think nvme should try to help, I just think that subsystem granularity is not the correct approach going forward.

I understand there will be limits to this 'mpath_personality' knob's utility and it'll need to evolve over time. But the burden of making more advanced NVMe multipath features accessible outside of native NVMe isn't intended to be on any of the NVMe maintainers (other than maybe remembering to disallow the switch where it makes sense in the future).

I would expect that any "advanced multipath features" would be properly brought up with the NVMe TWG as a ratified standard and find their way to nvme. So I don't think this particularly is a valid argument.

As I said, I've been off the grid, can you remind me why a global knob is not sufficient?

Because once nvme_core.multipath=N is set: native NVMe multipath is then not accessible from the same host. The goal of this patchset is to give users choice. But not limit them to _only_ using dm-multipath if they just have some legacy needs.

Tough to be convincing with hypotheticals but I could imagine a very obvious usecase for native NVMe multipathing being PCI-based embedded NVMe "fabrics" (especially if/when the numa-based path selector lands). But the same host with PCI NVMe could be connected to a FC network that has historically always been managed via dm-multipath.. but say that FC-based infrastructure gets updated to use NVMe (to leverage a wider NVMe investment, whatever?) -- but maybe admins would still prefer to use dm-multipath for the NVMe over FC.

You are referring to an array exposing media via nvmf and scsi simultaneously? I'm not sure that there is a clean definition of how that is supposed to work (ANA/ALUA, reservations, etc..)

This might sound stupid to you, but can't users that desperately must keep using dm-multipath (for its mature toolset or what-not) just stack it on a multipath nvme device? (I might be completely off on this so feel free to correct my ignorance).

We could certainly pursue adding multipath-tools support for native NVMe multipathing. Not opposed to it (even if just reporting topology and state). But given the extensive lengths NVMe multipath goes to hide devices we'd need some way of piercing through the opaque nvme device that native NVMe multipath exposes. But that really is a tangent relative to this patchset. Since that kind of visibility would also benefit nvme-cli... otherwise how are users even to be able to trust but verify that native NVMe multipathing did what they expected it to?

Can you explain what is missing for multipath-tools to resolve topology? nvme list-subsys is doing just that, doesn't it? It lists subsys-ctrl topology, but that is sort of the important information, as controllers are the real paths.
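For the curious, the blk_steal_bios() trick mentioned above is roughly the following -- paraphrased from memory from drivers/nvme/host/multipath.c of this era, so treat it as a sketch rather than the verbatim code. The failed request's bios are moved onto the shared ns_head's requeue list and the request itself is completed, so the bios can later be resubmitted down a different path:

void nvme_failover_req(struct request *req)
{
        struct nvme_ns *ns = req->q->queuedata;
        unsigned long flags;

        /* Steal the bios off the failed request onto the shared head... */
        spin_lock_irqsave(&ns->head->requeue_lock, flags);
        blk_steal_bios(&ns->head->requeue_list, req);
        spin_unlock_irqrestore(&ns->head->requeue_lock, flags);
        /* ...and complete the request itself without error. */
        blk_mq_end_request(req, 0);

        nvme_reset_ctrl(ns->ctrl);
        /* Resubmission picks a fresh controller (path) for each bio. */
        kblockd_schedule_work(&ns->head->requeue_work);
}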
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:

> On Mon, May 28, 2018 at 11:02:36PM -0400, Mike Snitzer wrote:
> > No, what both Red Hat and SUSE are saying is: cool let's have a go at
> > "Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm
> > multipath) to be conditionally enabled to coexist with native NVMe
> > multipath?
>
> For a "Plan B" we can still use the global knob that's already in
> place (even if this reminds me so much of scsi-mq, which at least we
> haven't turned on for fear of performance regressions).

BTW, for scsi-mq we have made a little progress via commit 2f31115e940c (scsi: core: introduce force_blk_mq), and virtio-scsi now always works in scsi-mq mode. Each driver can then decide if .force_blk_mq needs to be set.

Hope progress can be made on this nvme mpath issue too.

Thanks,
Ming
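For anyone who hasn't looked at that commit: it adds a force_blk_mq flag to the SCSI host template so an individual LLD can opt its hosts into blk-mq underneath the global scsi_mod.use_blk_mq switch. An abbreviated illustration (a made-up driver, not a complete host template):

#include <linux/module.h>
#include <scsi/scsi_host.h>

static struct scsi_host_template example_sht = {
        .module        = THIS_MODULE,
        .name          = "example",
        .this_id       = -1,
        /* per-driver opt-in: hosts using this template always run scsi-mq */
        .force_blk_mq  = 1,
};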
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Wed, May 30 2018 at 5:20pm -0400, Sagi Grimberg wrote:

> Moreover, I also wanted to point out that fabrics array vendors are
> building products that rely on standard nvme multipathing (and probably
> multipathing over dispersed namespaces as well), and keeping a knob that
> will keep nvme users with dm-multipath will probably not help them
> educate their customers as well... So there is another angle to this.

Noticed I didn't respond directly to this aspect. As I explained in various replies to this thread: the users/admins would be the ones who would decide to use dm-multipath. It wouldn't be something that'd be imposed by default. If anything, the all-or-nothing nvme_core.multipath=N would pose a much more serious concern for these array vendors that do have designs to specifically leverage native NVMe multipath. Because if users were to get into the habit of setting that on the kernel commandline they'd literally _never_ be able to leverage native NVMe multipathing.

We can also add multipath.conf docs (man page, etc) that caution admins to consult their array vendors about whether using dm-multipath is to be avoided, etc.

Again, this is opt-in, so at the upstream Linux kernel level the default of enabling native NVMe multipath stands (provided CONFIG_NVME_MULTIPATH is configured). Not seeing why there is so much angst and concern about offering this flexibility via opt-in, but I'm also glad we're having this discussion to have our eyes wide open.

Mike
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Wed, May 30 2018 at 5:20pm -0400, Sagi Grimberg wrote:

> Hi Folks,
>
> I'm sorry to chime in super late on this, but a lot has been going on for me lately which got me off the grid. So I'll try to provide my input hopefully without starting any more flames..
>
> >>> This patch series aims to provide a more fine grained control over nvme's native multipathing, by allowing it to be switched on and off on a per-subsystem basis instead of a big global switch.
> >>
> >> No. The only reason we even allowed to turn multipathing off is because you complained about installer issues. The path forward clearly is native multipathing and there will be no additional support for the use cases of not using it.
> >
> > We all basically knew this would be your position. But at this year's LSF we pretty quickly reached consensus that we do in fact need this. Except for yourself, Sagi and afaik Martin George: all on the cc were in attendance and agreed.
>
> Correction, I wasn't able to attend LSF this year (unfortunately).

Yes, I was trying to say you weren't at LSF (but are on the cc).

> > And since then we've exchanged mails to refine and test Johannes' implementation.
> >
> > You've isolated yourself on this issue. Please just accept that we all have a pretty solid command of what is needed to properly provide commercial support for NVMe multipath.
> >
> > The ability to switch between "native" and "other" multipath absolutely does _not_ imply anything about the winning disposition of native vs other. It is purely about providing commercial flexibility to use whatever solution makes sense for a given environment. The default _is_ native NVMe multipath. It is on userspace solutions for "other" multipath (e.g. multipathd) to allow users to whitelist an NVMe subsystem to be switched to "other".
> >
> > Hopefully this clarifies things, thanks.
>
> Mike, I understand what you're saying, but I also agree with hch on the simple fact that this is a burden on linux nvme (although less passionate about it than hch).
>
> Beyond that, this is going to get much worse when we support "dispersed namespaces", which is a submitted TPAR in the NVMe TWG. "dispersed namespaces" makes NVMe namespaces share-able over different subsystems, so changing the personality on a per-subsystem basis is just asking for trouble.
>
> Moreover, I also wanted to point out that fabrics array vendors are building products that rely on standard nvme multipathing (and probably multipathing over dispersed namespaces as well), and keeping a knob that will keep nvme users with dm-multipath will probably not help them educate their customers as well... So there is another angle to this.

I wouldn't expect you guys to nurture this 'mpath_personality' knob. So when features like "dispersed namespaces" land, a negative check would need to be added in the code to prevent switching from "native". And once something like "dispersed namespaces" lands we'd then have to see about a more sophisticated switch that operates at a different granularity. It could also be that switching one subsystem that is part of "dispersed namespaces" would then cascade to all other associated subsystems? Not that dissimilar from the 3rd patch in this series that allows a 'device' switch to be done in terms of the subsystem.

Anyway, I don't know the end from the beginning on something you just told me about ;) But we're all in this together. And we can take it as it comes.

I'm merely trying to bridge the gap from old dm-multipath while native NVMe multipath gets its legs. In time I really do have aspirations to contribute more to NVMe multipathing. I think Christoph's NVMe multipath implementation of a bio-based device on top of NVMe core's blk-mq device(s) is very clever and effective (blk_steal_bios() hack and all).

> Don't get me wrong, I do support your cause, and I think nvme should try to help, I just think that subsystem granularity is not the correct approach going forward.

I understand there will be limits to this 'mpath_personality' knob's utility and it'll need to evolve over time. But the burden of making more advanced NVMe multipath features accessible outside of native NVMe isn't intended to be on any of the NVMe maintainers (other than maybe remembering to disallow the switch where it makes sense in the future).

> As I said, I've been off the grid, can you remind me why a global knob is not sufficient?

Because once nvme_core.multipath=N is set, native NVMe multipath is then not accessible from the same host. The goal of this patchset is to give users choice, not to limit them to _only_ using dm-multipath if they just have some legacy needs. It is tough to be convincing with hypotheticals, but I could imagine a very obvious use case for native NVMe multipathing being PCI-based embedded NVMe "fabrics" (especially if/when the numa-based path selector lands). But the same host with
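To make the "negative check" idea above concrete, here is a minimal hypothetical sketch in the style of the kernel's sysfs store handlers. The helper nvme_subsys_has_dispersed_ns() and the native_mpath flag are illustrative assumptions, not code from this series; only the "native"/"other" personality values mirror the discussion.

static ssize_t mpath_personality_store(struct device *dev,
                                       struct device_attribute *attr,
                                       const char *buf, size_t count)
{
        struct nvme_subsystem *subsys =
                container_of(dev, struct nvme_subsystem, dev);

        if (sysfs_streq(buf, "native")) {
                subsys->native_mpath = true;
        } else if (sysfs_streq(buf, "other")) {
                /*
                 * Hypothetical guard: dispersed namespaces span
                 * subsystems, so refuse to opt this one out of native
                 * multipathing once they are in use.
                 */
                if (nvme_subsys_has_dispersed_ns(subsys))
                        return -EBUSY;
                subsys->native_mpath = false;
        } else {
                return -EINVAL;
        }

        return count;
}

A cascading variant, as mused above, would walk the associated subsystems and flip them together instead of returning -EBUSY.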
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Hi Folks,

I'm sorry to chime in super late on this, but a lot has been going on for me lately which got me off the grid. So I'll try to provide my input hopefully without starting any more flames..

>>> This patch series aims to provide a more fine grained control over nvme's native multipathing, by allowing it to be switched on and off on a per-subsystem basis instead of a big global switch.
>>
>> No. The only reason we even allowed to turn multipathing off is because you complained about installer issues. The path forward clearly is native multipathing and there will be no additional support for the use cases of not using it.
>
> We all basically knew this would be your position. But at this year's LSF we pretty quickly reached consensus that we do in fact need this. Except for yourself, Sagi and afaik Martin George: all on the cc were in attendance and agreed.

Correction, I wasn't able to attend LSF this year (unfortunately).

> And since then we've exchanged mails to refine and test Johannes' implementation.
>
> You've isolated yourself on this issue. Please just accept that we all have a pretty solid command of what is needed to properly provide commercial support for NVMe multipath.
>
> The ability to switch between "native" and "other" multipath absolutely does _not_ imply anything about the winning disposition of native vs other. It is purely about providing commercial flexibility to use whatever solution makes sense for a given environment. The default _is_ native NVMe multipath. It is on userspace solutions for "other" multipath (e.g. multipathd) to allow users to whitelist an NVMe subsystem to be switched to "other".
>
> Hopefully this clarifies things, thanks.

Mike, I understand what you're saying, but I also agree with hch on the simple fact that this is a burden on linux nvme (although less passionate about it than hch).

Beyond that, this is going to get much worse when we support "dispersed namespaces", which is a submitted TPAR in the NVMe TWG. "dispersed namespaces" makes NVMe namespaces share-able over different subsystems, so changing the personality on a per-subsystem basis is just asking for trouble.

Moreover, I also wanted to point out that fabrics array vendors are building products that rely on standard nvme multipathing (and probably multipathing over dispersed namespaces as well), and keeping a knob that will keep nvme users with dm-multipath will probably not help them educate their customers as well... So there is another angle to this.

Don't get me wrong, I do support your cause, and I think nvme should try to help, I just think that subsystem granularity is not the correct approach going forward.

As I said, I've been off the grid, can you remind me why a global knob is not sufficient? This might sound stupid to you, but can't users who desperately must keep using dm-multipath (for its mature toolset or what-not) just stack it on the multipath nvme device? (I might be completely off on this so feel free to correct my ignorance).
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Tue, May 29 2018 at 4:09am -0400, Christoph Hellwig wrote:

> On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:
> > For a "Plan B" we can still use the global knob that's already in place (even if this reminds me so much of scsi-mq, which at least we haven't turned on for fear of performance regressions).
> >
> > Let's drop the discussion here, I don't think it leads to anything other than flamewars.

As the author of the original patch you're fine to want to step away from this needlessly ugly aspect. But it doesn't change the fact that we need answers on _why_ it is a genuinely detrimental change. (hint: we know it isn't)

The enterprise Linux people who directly need to support multipath want the flexibility to allow dm-multipath while simultaneously allowing native NVMe multipathing on the same host. Hannes Reinecke and others, if you want the flexibility this patchset offers please provide your review/acks.

> If our plan A doesn't work we can go back to these patches. For now I'd rather have everyone spend their time on making Plan A work than preparing for contingencies. Nothing prevents anyone from using these patches already out there if they really want to, but I'd recommend people are very careful about doing so as you'll lock yourself into a long-term maintenance burden.

This isn't about contingencies. It is about continuing compatibility with a sophisticated dm-multipath stack that is widely used by, and familiar to, so many.

Christoph, you know you're being completely vague, right? You're actively denying the validity of our position (at least Hannes's and mine) with handwaving and effectively FUD, e.g. "maze of new setups" and "hairy runtime ABIs" here: https://lkml.org/lkml/2018/5/25/461

To restate my question, from https://lkml.org/lkml/2018/5/28/2179: hch had some non-specific concern about this patch forcing support of some "ABI". Which ABI is that _exactly_?

The incremental effort required to support NVMe in dm-multipath isn't so grim. And those who will do that work are signing up for it -- while still motivated to help make native NVMe multipath a success.

Can you please give us time to responsibly wean users off dm-multipath?

Mike
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Tue, May 29, 2018 at 09:22:40AM +0200, Johannes Thumshirn wrote:

> For a "Plan B" we can still use the global knob that's already in place (even if this reminds me so much of scsi-mq, which at least we haven't turned on for fear of performance regressions).
>
> Let's drop the discussion here, I don't think it leads to anything other than flamewars.

If our plan A doesn't work we can go back to these patches. For now I'd rather have everyone spend their time on making Plan A work than preparing for contingencies. Nothing prevents anyone from using these patches already out there if they really want to, but I'd recommend people are very careful about doing so as you'll lock yourself into a long-term maintenance burden.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Mon, May 28, 2018 at 11:02:36PM -0400, Mike Snitzer wrote:

> No, what both Red Hat and SUSE are saying is: cool, let's have a go at "Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm multipath) to be conditionally enabled to coexist with native NVMe multipath?

For a "Plan B" we can still use the global knob that's already in place (even if this reminds me so much of scsi-mq, which at least we haven't turned on for fear of performance regressions).

Let's drop the discussion here, I don't think it leads to anything other than flamewars.

Johannes
--
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Mon, 28 May 2018 23:02:36 -0400 Mike Snitzer wrote:

> On Mon, May 28 2018 at 9:19pm -0400, Martin K. Petersen wrote:
>
> > Mike,
> >
> > I understand and appreciate your position but I still don't think the arguments for enabling DM multipath are sufficiently compelling. The whole point of ANA is for things to be plug and play without any admin intervention whatsoever.
> >
> > I also think we're getting ahead of ourselves a bit. The assumption seems to be that NVMe ANA devices are going to be broken--or that they will require the same amount of tweaking as SCSI devices--and therefore DM multipath support is inevitable. However, I'm not sure that will be the case.
> >
> > > Thing is you really don't get to dictate that to the industry. Sorry.
> >
> > We are in the fortunate position of being able to influence how the spec is written. It's a great opportunity to fix the mistakes of the past in SCSI. And to encourage the industry to ship products that don't need the current level of manual configuration and complex management.
> >
> > So I am in favor of Johannes' patches *if* we get to the point where a Plan B is needed. But I am not entirely convinced that's the case just yet. Let's see some more ANA devices first. And once we do, we are also in a position where we can put some pressure on the vendors to either amend the specification or fix their implementations to work with ANA.
>
> ANA really isn't a motivating factor for whether or not to apply this patch. So no, I don't have any interest in waiting to apply it.

Correct. That patch is _not_ to work around any perceived incompatibility on the OS side. The patch is primarily to give _admins_ a choice. Some installations, like hosting providers, are running quite complex scenarios, most of which are highly automated. So for those there is a real benefit in being able to use dm-multipathing for NVMe; they are totally fine with having a performance impact if they can avoid rewriting their infrastructure.

Cheers,

Hannes
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Mon, May 28 2018 at 9:19pm -0400, Martin K. Petersen wrote:

> Mike,
>
> I understand and appreciate your position but I still don't think the arguments for enabling DM multipath are sufficiently compelling. The whole point of ANA is for things to be plug and play without any admin intervention whatsoever.
>
> I also think we're getting ahead of ourselves a bit. The assumption seems to be that NVMe ANA devices are going to be broken--or that they will require the same amount of tweaking as SCSI devices--and therefore DM multipath support is inevitable. However, I'm not sure that will be the case.
>
> > Thing is you really don't get to dictate that to the industry. Sorry.
>
> We are in the fortunate position of being able to influence how the spec is written. It's a great opportunity to fix the mistakes of the past in SCSI. And to encourage the industry to ship products that don't need the current level of manual configuration and complex management.
>
> So I am in favor of Johannes' patches *if* we get to the point where a Plan B is needed. But I am not entirely convinced that's the case just yet. Let's see some more ANA devices first. And once we do, we are also in a position where we can put some pressure on the vendors to either amend the specification or fix their implementations to work with ANA.

ANA really isn't a motivating factor for whether or not to apply this patch. So no, I don't have any interest in waiting to apply it.

You're somehow missing that your implied "Plan A" (native NVMe multipath) has been pushed as the only way forward for NVMe multipath despite it being unproven. Worse, literally no userspace infrastructure exists to control native NVMe multipath (and this is supposed to be comforting because the spec is tightly coupled to hch's implementation that he controls with an iron fist).

We're supposed to be OK with completely _forced_ obsolescence of dm-multipath infrastructure that has proven itself capable of managing a wide range of complex multipath deployments for a tremendous number of enterprise Linux customers (of multiple vendors)!? This is a tough sell given the content of my previous paragraph (coupled with the fact the next enterprise Linux versions are being hardened _now_).

No, what both Red Hat and SUSE are saying is: cool, let's have a go at "Plan A" but, in parallel, what harm is there in allowing "Plan B" (dm multipath) to be conditionally enabled to coexist with native NVMe multipath?

Nobody can explain why this patch is some sort of detriment. It literally is an amazingly simple switch that provides flexibility we _need_. hch had some non-specific concern about this patch forcing support of some "ABI". Which ABI is that _exactly_?

Mike
Re: [PATCH 0/3] Provide more fine grained control over multipathing
Mike,

I understand and appreciate your position but I still don't think the arguments for enabling DM multipath are sufficiently compelling. The whole point of ANA is for things to be plug and play without any admin intervention whatsoever.

I also think we're getting ahead of ourselves a bit. The assumption seems to be that NVMe ANA devices are going to be broken--or that they will require the same amount of tweaking as SCSI devices--and therefore DM multipath support is inevitable. However, I'm not sure that will be the case.

> Thing is you really don't get to dictate that to the industry. Sorry.

We are in the fortunate position of being able to influence how the spec is written. It's a great opportunity to fix the mistakes of the past in SCSI. And to encourage the industry to ship products that don't need the current level of manual configuration and complex management.

So I am in favor of Johannes' patches *if* we get to the point where a Plan B is needed. But I am not entirely convinced that's the case just yet. Let's see some more ANA devices first. And once we do, we are also in a position where we can put some pressure on the vendors to either amend the specification or fix their implementations to work with ANA.

--
Martin K. Petersen	Oracle Linux Engineering
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Fri, May 25 2018 at 10:12am -0400, Christoph Hellwig wrote:

> On Fri, May 25, 2018 at 09:58:13AM -0400, Mike Snitzer wrote:
> > We all basically knew this would be your position. But at this year's LSF we pretty quickly reached consensus that we do in fact need this. Except for yourself, Sagi and afaik Martin George: all on the cc were in attendance and agreed.
>
> And I very much disagree, and you'd better come up with a good reason to override me as the author and maintainer of this code.

I hope you don't truly think this is me vs you. Some of the reasons are:

1) we need flexibility during the transition to native NVMe multipath

2) we need to support existing customers' dm-multipath storage networks

3) asking users to use an entirely new infrastructure that conflicts with their dm-multipath expertise and established norms is a hard sell. Especially for environments that have a mix of traditional multipath (FC, iSCSI, whatever) and NVMe over fabrics.

4) layered products (both vendor provided and user developed) have been trained to fully support and monitor dm-multipath; they have no understanding of native NVMe multipath

> > And since then we've exchanged mails to refine and test Johannes' implementation.
>
> Since when was acting behind the scenes a good argument for anything?

I mentioned our continued private collaboration to establish that this wasn't a momentary weakness by anyone at LSF. It has had a lot of soak time in our heads. We did it privately because we needed a concrete proposal that works for our needs, rather than getting shot down over some shortcoming in an RFC-style submission.

> > Hopefully this clarifies things, thanks.
>
> It doesn't.
>
> The whole point we have native multipath in nvme is because dm-multipath is the wrong architecture (and has been, long predating you, nothing personal). And I don't want to be stuck additional decades with this in nvme. We allowed a global opt-in to ease the three people in the world with existing setups to keep using that, but I also said I won't go any step further. And I stand by that.

Thing is you really don't get to dictate that to the industry. Sorry.

Reality is this ability to switch "native" vs "other" gives us the options I've been talking about absolutely needing since the start of this NVMe multipathing debate. Your fighting against it for so long has prevented progress on NVMe multipath in general. Taking this change will increase native NVMe multipath deployment.

Otherwise we're just going to have to disable native multipath entirely for the time being. That does users a disservice because I completely agree that there _will_ be setups where native NVMe multipath really does offer a huge win. But those setups could easily be deployed on the same hosts as another variant of NVMe that really does want the use of the legacy DM multipath stack (possibly even just for reason 4 above).

Mike
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Fri, May 25, 2018 at 04:22:17PM +0200, Johannes Thumshirn wrote:

> But Mike's and Hannes' arguments were reasonable as well; we do not know if there are any existing setups we might break, leading to support calls, which we have to deal with. Personally I don't believe there are lots of existing nvme multipath setups out there, but who am I to judge.

I don't think existing setups are very likely, but they absolutely are a valid reason to support the legacy mode. That is why we support the legacy mode using the multipath module option. Once you move to a per-subsystem switch you don't support legacy setups, you create a maze of new setups that we need to keep compatibility support for forever.

> So can we find a middle ground to this? Or we'll have the all-or-nothing situation we have in scsi-mq now again. How about tying the switch to a config option which is off per default?

The middle ground is the module option. It provides 100% backwards compatibility if used, but more importantly doesn't create hairy runtime ABIs that we will have to support forever.
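For reference, the global opt-out being defended here is a load-time module option rather than a runtime switch. Below is a minimal sketch of that pattern, modeled on nvme_core's "multipath" parameter; it is simplified for illustration, and the description string is paraphrased rather than quoted from the driver.

#include <linux/module.h>
#include <linux/moduleparam.h>

/*
 * Load-time opt-out in the style of nvme_core's "multipath" option: a
 * single read-only bool (mode 0444) set only at boot or module load,
 * e.g. nvme_core.multipath=N on the kernel command line, with no
 * per-subsystem runtime state to stay compatible with later.
 */
static bool multipath = true;
module_param(multipath, bool, 0444);
MODULE_PARM_DESC(multipath,
	"turn on native support for multiple controllers per subsystem");

The 0444 permission is the crux of the argument: the value is visible in sysfs but cannot be changed at runtime, so there is exactly one ABI to support per boot.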
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Fri, May 25, 2018 at 03:05:35PM +0200, Christoph Hellwig wrote:

> On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> > Hi,
> >
> > This patch series aims to provide a more fine grained control over nvme's native multipathing, by allowing it to be switched on and off on a per-subsystem basis instead of a big global switch.
>
> No. The only reason we even allowed to turn multipathing off is because you complained about installer issues. The path forward clearly is native multipathing and there will be no additional support for the use cases of not using it.

First of all, it wasn't my idea and I'm just doing my job here, as I got this task assigned at LSF and tried to do my best. Personally I _do_ agree with you and do not want to use dm-mpath in nvme either (mainly because I don't really know the code and don't want to learn yet another subsystem).

But Mike's and Hannes' arguments were reasonable as well; we do not know if there are any existing setups we might break, leading to support calls, which we have to deal with. Personally I don't believe there are lots of existing nvme multipath setups out there, but who am I to judge.

So can we find a middle ground to this? Or we'll have the all-or-nothing situation we have in scsi-mq now again. How about tying the switch to a config option which is off per default?

Byte,
Johannes
--
Johannes Thumshirn                                          Storage
jthumsh...@suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Fri, May 25, 2018 at 09:58:13AM -0400, Mike Snitzer wrote:

> We all basically knew this would be your position. But at this year's LSF we pretty quickly reached consensus that we do in fact need this. Except for yourself, Sagi and afaik Martin George: all on the cc were in attendance and agreed.

And I very much disagree, and you'd better come up with a good reason to override me as the author and maintainer of this code.

> And since then we've exchanged mails to refine and test Johannes' implementation.

Since when was acting behind the scenes a good argument for anything?

> Hopefully this clarifies things, thanks.

It doesn't.

The whole point we have native multipath in nvme is because dm-multipath is the wrong architecture (and has been, long predating you, nothing personal). And I don't want to be stuck additional decades with this in nvme. We allowed a global opt-in to ease the three people in the world with existing setups to keep using that, but I also said I won't go any step further. And I stand by that.
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Fri, May 25 2018 at 9:05am -0400, Christoph Hellwig wrote:

> On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:
> > Hi,
> >
> > This patch series aims to provide a more fine grained control over nvme's native multipathing, by allowing it to be switched on and off on a per-subsystem basis instead of a big global switch.
>
> No. The only reason we even allowed to turn multipathing off is because you complained about installer issues. The path forward clearly is native multipathing and there will be no additional support for the use cases of not using it.

We all basically knew this would be your position. But at this year's LSF we pretty quickly reached consensus that we do in fact need this. Except for yourself, Sagi and afaik Martin George: all on the cc were in attendance and agreed. And since then we've exchanged mails to refine and test Johannes' implementation.

You've isolated yourself on this issue. Please just accept that we all have a pretty solid command of what is needed to properly provide commercial support for NVMe multipath.

The ability to switch between "native" and "other" multipath absolutely does _not_ imply anything about the winning disposition of native vs other. It is purely about providing commercial flexibility to use whatever solution makes sense for a given environment. The default _is_ native NVMe multipath. It is on userspace solutions for "other" multipath (e.g. multipathd) to allow users to whitelist an NVMe subsystem to be switched to "other".

Hopefully this clarifies things, thanks.

Mike
Re: [PATCH 0/3] Provide more fine grained control over multipathing
On Fri, May 25, 2018 at 02:53:19PM +0200, Johannes Thumshirn wrote:

> Hi,
>
> This patch series aims to provide a more fine grained control over nvme's native multipathing, by allowing it to be switched on and off on a per-subsystem basis instead of a big global switch.

No. The only reason we even allowed to turn multipathing off is because you complained about installer issues. The path forward clearly is native multipathing and there will be no additional support for the use cases of not using it.
[PATCH 0/3] Provide more fine grained control over multipathing
Hi,

This patch series aims to provide a more fine grained control over nvme's native multipathing, by allowing it to be switched on and off on a per-subsystem basis instead of a big global switch.

The prime use-case is for mixed scenarios where a user might want to use nvme's native multipathing on one subset of subsystems and dm-multipath on another subset. For example, using native for the internal PCIe NVMe and dm-mpath for the connection to an NVMe over Fabrics array.

The initial discussion for this was held at this year's LSF/MM and the architecture hasn't changed from what we discussed there.

The first patch does the said switch, and Mike added two follow-up patches to access the personality attribute from the block device's sysfs directory as well.

I do have a blktests test for it as well, but due to the fcloop bug I reported I'm reluctant to include it in the series (or I would need to uncomment the rmmods).

Johannes Thumshirn (1):
  nvme: provide a way to disable nvme mpath per subsystem

Mike Snitzer (2):
  nvme multipath: added SUBSYS_ATTR_RW
  nvme multipath: add dev_attr_mpath_personality

 drivers/nvme/host/core.c      | 112 --
 drivers/nvme/host/multipath.c |  34 +++--
 drivers/nvme/host/nvme.h      |   8 +++
 3 files changed, 144 insertions(+), 10 deletions(-)

--
2.16.3
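For orientation, the show side of the 'mpath_personality' attribute named in the patch titles could plausibly look like the sketch below (a matching store handler is sketched earlier in this thread). The native_mpath field, the DEVICE_ATTR_RW pairing, and the sysfs path in the usage note are assumptions for illustration, not the actual patches.

static ssize_t mpath_personality_show(struct device *dev,
                                      struct device_attribute *attr,
                                      char *buf)
{
        struct nvme_subsystem *subsys =
                container_of(dev, struct nvme_subsystem, dev);

        /* report which stack currently owns this subsystem's paths */
        return sprintf(buf, "%s\n",
                       subsys->native_mpath ? "native" : "other");
}
static DEVICE_ATTR_RW(mpath_personality);

With something along these lines in place, an admin could inspect or flip a single subsystem, e.g. 'cat /sys/class/nvme-subsystem/nvme-subsys0/mpath_personality' (path assumed for illustration), while leaving the rest of the host on the default native personality.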