Re: [PATCH 2/2] nvme: add emulation for zone-append
On Thu, 2020-08-20 at 12:45 +0900, Keith Busch wrote:
> For the record, the suggestion provided, which you agreed to look into, most broadly enables your hardware on Linux and was entirely to your benefit. Not quite as dramatic as a political conspiracy.
>
> You later responded with a technical argument against that suggestion; however, your reason didn't add up, and that's where you left the thread.

The suggestion to the rejected patch was passed on to the related FW team, and the "technical argument" was our FW team's response to the suggestion, which I relayed to the list. At this point, there's no closure on whether the device will get NOIOB.

My point in bringing up this example was that a one-line, highly maintainable patch which improves the performance of Linux should not have been immediately NAK'ed as it was. If you believe it should have been, we'll agree to disagree.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 20.08.2020 13:07, Kanchan Joshi wrote:
> On Thu, Aug 20, 2020 at 3:22 AM Keith Busch wrote:
>> On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
>>> Intel does not support making *optional* NVMe spec features *required* by the NVMe driver.
>>
>> This is inaccurate. As another example, the spec optionally allows a zone size to be a power of 2, but linux requires a power of 2 if you want to use it with this driver.
>>
>>> Provided there's no glaring technical issues
>>
>> There are many. Some may be improved through a serious review process, but the mess it makes out of the fast path is measurably harmful to devices that don't subscribe to this. That issue is not so easily remedied.
>>
>> Since this patch is a copy of the scsi implementation, the reasonable thing is to take this fight to the generic block layer for a common solution. We're not taking this in the nvme driver.
>
> I sincerely want to minimize any adverse impact to the fast-path of non-zoned devices. My understanding of that aspect is (I feel it is good to confirm, irrespective of the future of this patch):
>
> 1. Submission path: There is no extra code for non-zoned device IO. For append, there is this "ns->append_emulate = 1" check. Snippet -
>
>          case REQ_OP_ZONE_APPEND:
> -               ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
> +               if (!nvme_is_append_emulated(ns))
> +                       ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
> +               else {
> +                       /* prepare append like write, and adjust lba afterwards */
>
> 2. Completion: Not as clean as submission for sure. The extra check is "ns && ns->append_emulate == 1" for completions if CONFIG_ZONED is enabled.
> A slightly better completion-handling version (than the submitted patch) is -
>
> -       } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
> -                  req_op(req) == REQ_OP_ZONE_APPEND) {
> -               req->__sector = nvme_lba_to_sect(req->q->queuedata,
> -                       le64_to_cpu(nvme_req(req)->result.u64));
> +       } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
> +               struct nvme_ns *ns = req->q->queuedata;
> +               /* append-emulation requires wp update for some cmds */
> +               if (ns && nvme_is_append_emulated(ns)) {
> +                       if (nvme_need_zone_wp_update(req))
> +                               nvme_zone_wp_update(ns, req, status);
> +               } else if (req_op(req) == REQ_OP_ZONE_APPEND)
> +                       req->__sector = nvme_lba_to_sect(ns,
> +                               le64_to_cpu(nvme_req(req)->result.u64));
>
> Am I missing any other fast-path pain-points?
>
> A quick 1 minute 4K randwrite run (QD 64, 4 jobs, libaio) shows:
>
> before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
> after:  IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)

It is good to use the QEMU "simulation" path that we implemented to test performance with different delays, etc., but for these numbers to make sense we need to put them in contrast to the simulated NAND speed, etc.

> This may not be the best test to see the cost, and I am happy to conduct more and improvise.
>
> As for the volume of the code - it is the same as the SCSI emulation. And I can make efforts to reduce that by moving common code to blk-zone, and reuse it in the SCSI/NVMe emulation.
>
> In the patch I tried to isolate append-emulation by keeping everything in "nvme_za_emul". It contains nothing nvme'ish except nvme_ns, which can be turned into "driver_data".
>
> +#ifdef CONFIG_BLK_DEV_ZONED
> +struct nvme_za_emul {
> +       unsigned int nr_zones;
> +       spinlock_t zones_wp_offset_lock;
> +       u32 *zones_wp_offset;
> +       u32 *rev_wp_offset;
> +       struct work_struct zone_wp_offset_work;
> +       char *zone_wp_update_buf;
> +       struct mutex rev_mutex;
> +       struct nvme_ns *ns;
> +};
> +#endif
>
> Will that be an acceptable line of further work/discussions?
I know we spent time enabling this path, but I don't think that moving the discussion to the block layer will have much more benefit. Let's keep the support for these non-append devices in xNVMe and focus on the support for the append-enabled ones in Linux. We have a lot of good stuff in the backlog that we can start pushing.
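[Editorial note: the write-pointer bookkeeping behind the proposed "nvme_za_emul" struct discussed above (the zones_wp_offset array) boils down to caching a per-zone write pointer and turning an append into a regular write at that offset. A minimal standalone sketch, with all names invented for illustration and no claim to match the patch's actual code:]

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the per-zone write pointer cache. */
struct za_state {
    uint64_t zone_sectors;      /* zone size in 512B sectors */
    uint32_t *zones_wp_offset;  /* cached write pointer per zone */
};

/* Emulated append: place the write at the cached write pointer, then
 * advance it. The returned start sector is what a native Zone Append
 * would report back on completion. */
static uint64_t za_emulated_append(struct za_state *za, uint32_t zone,
                                   uint32_t nr_sectors)
{
    uint64_t sector = zone * za->zone_sectors + za->zones_wp_offset[zone];

    za->zones_wp_offset[zone] += nr_sectors;
    return sector;
}
```

Because the cached pointer is shared mutable state, the real emulation must also write-lock the zone so two appends cannot race on it; the sketch omits that locking.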
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 20.08.2020 08:52, Christoph Hellwig wrote:
> On Thu, Aug 20, 2020 at 08:37:19AM +0200, Javier Gonzalez wrote:
>> We will stop pushing for this emulation. We have a couple of SSDs where we disabled Append, we implemented support for them, and we wanted to push the changes upstream. That's it. This is not politics nor a conspiracy against the current ZNS spec. We spent a lot of time working on this spec and are actually doing a fair amount of work to support Append in other places in the stack. In any case, the fuzz stops here.
>
> FYI, from knowing you personally I'm pretty confident you are not part of a conspiracy and you are just doing your best given the context, and I appreciate all your work!

Thanks Christoph.

> I'm a lot less happy about things that happen in other groups not involving you directly, and I'm still pretty mad at how the games were played there, and especially other actors that seem to be reading the list here, and instead of taking part in the discussion are causing fuzz in completely unrelated venues.

Yes. Hopefully, we will start keeping things separated and using this list for Linux, technical conversations only.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Thu, Aug 20, 2020 at 3:22 AM Keith Busch wrote:
> On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
>> Intel does not support making *optional* NVMe spec features *required* by the NVMe driver.
>
> This is inaccurate. As another example, the spec optionally allows a zone size to be a power of 2, but linux requires a power of 2 if you want to use it with this driver.
>
>> Provided there's no glaring technical issues
>
> There are many. Some may be improved through a serious review process, but the mess it makes out of the fast path is measurably harmful to devices that don't subscribe to this. That issue is not so easily remedied.
>
> Since this patch is a copy of the scsi implementation, the reasonable thing is to take this fight to the generic block layer for a common solution. We're not taking this in the nvme driver.

I sincerely want to minimize any adverse impact to the fast-path of non-zoned devices. My understanding of that aspect is (I feel it is good to confirm, irrespective of the future of this patch):

1. Submission path: There is no extra code for non-zoned device IO. For append, there is this "ns->append_emulate = 1" check. Snippet -

         case REQ_OP_ZONE_APPEND:
-               ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+               if (!nvme_is_append_emulated(ns))
+                       ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+               else {
+                       /* prepare append like write, and adjust lba afterwards */

2. Completion: Not as clean as submission for sure. The extra check is "ns && ns->append_emulate == 1" for completions if CONFIG_ZONED is enabled.
A slightly better completion-handling version (than the submitted patch) is -

-       } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-                  req_op(req) == REQ_OP_ZONE_APPEND) {
-               req->__sector = nvme_lba_to_sect(req->q->queuedata,
-                       le64_to_cpu(nvme_req(req)->result.u64));
+       } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+               struct nvme_ns *ns = req->q->queuedata;
+               /* append-emulation requires wp update for some cmds */
+               if (ns && nvme_is_append_emulated(ns)) {
+                       if (nvme_need_zone_wp_update(req))
+                               nvme_zone_wp_update(ns, req, status);
+               } else if (req_op(req) == REQ_OP_ZONE_APPEND)
+                       req->__sector = nvme_lba_to_sect(ns,
+                               le64_to_cpu(nvme_req(req)->result.u64));

Am I missing any other fast-path pain-points?

A quick 1 minute 4K randwrite run (QD 64, 4 jobs, libaio) shows:

before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
after:  IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)

This may not be the best test to see the cost, and I am happy to conduct more and improvise.

As for the volume of the code - it is the same as the SCSI emulation. And I can make efforts to reduce that by moving common code to blk-zone, and reuse it in the SCSI/NVMe emulation.

In the patch I tried to isolate append-emulation by keeping everything in "nvme_za_emul". It contains nothing nvme'ish except nvme_ns, which can be turned into "driver_data".

+#ifdef CONFIG_BLK_DEV_ZONED
+struct nvme_za_emul {
+       unsigned int nr_zones;
+       spinlock_t zones_wp_offset_lock;
+       u32 *zones_wp_offset;
+       u32 *rev_wp_offset;
+       struct work_struct zone_wp_offset_work;
+       char *zone_wp_update_buf;
+       struct mutex rev_mutex;
+       struct nvme_ns *ns;
+};
+#endif

Will that be an acceptable line of further work/discussions?

--
Kanchan
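[Editorial note: in the completion snippet above, nvme_lba_to_sect() converts the device-reported LBA of the append into a 512-byte block layer sector for req->__sector. A standalone sketch of that arithmetic, assuming a power-of-2 LBA size; the helper name here is illustrative, not the kernel's:]

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9  /* Linux block layer sectors are 512 bytes */

/* Convert a device LBA to a 512B sector, given the namespace's LBA
 * size expressed as a shift (e.g. 12 for 4096-byte LBAs). */
static inline uint64_t lba_to_sect(uint64_t lba, unsigned int lba_shift)
{
    /* a 4K-formatted namespace maps LBA n to sector n * 8 */
    return lba << (lba_shift - SECTOR_SHIFT);
}
```

This is why the driver keeps LBA sizes as shifts: the hot-path conversion is a single shift rather than a multiply or divide.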
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Thu, Aug 20, 2020 at 08:37:19AM +0200, Javier Gonzalez wrote:
> We will stop pushing for this emulation. We have a couple of SSDs where we disabled Append, we implemented support for them, and we wanted to push the changes upstream. That's it. This is not politics nor a conspiracy against the current ZNS spec. We spent a lot of time working on this spec and are actually doing a fair amount of work to support Append in other places in the stack. In any case, the fuzz stops here.

FYI, from knowing you personally I'm pretty confident you are not part of a conspiracy and you are just doing your best given the context, and I appreciate all your work!

I'm a lot less happy about things that happen in other groups not involving you directly, and I'm still pretty mad at how the games were played there, and especially other actors that seem to be reading the list here, and instead of taking part in the discussion are causing fuzz in completely unrelated venues.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 19.08.2020 12:43, Christoph Hellwig wrote:
> On Wed, Aug 19, 2020 at 09:14:13AM +, Damien Le Moal wrote:
>> While defining a zone append command for SCSI/ZBC is possible (using sense data for returning the written offset), there is no way to define zone append for SATA/ZAC without entirely breaking the ATA command model. This is why we went after an emulation implementation instead of trying to standardize native commands. That implementation does not have any performance impact over regular writes *and* zone write locking does not in general degrade HDD write performance (only a few corner cases suffer from it). Comparing things equally, the same could be said of NVMe drives that do not have native zone append support: performance will be essentially the same using regular writes and emulated zone append. But mq-deadline and zone write locking will significantly lower performance for emulated zone append compared to native zone append support by the drive.
>
> And to summarize the most important point - Zone Append doesn't exist in ZAC/ZBC. For people that spent the last years trying to make zoned storage work, the lack of such a primitive has been the major pain point. That's why I came up with the Zone Append design in response to a request for such an operation from another company that is now heavily involved in both Linux development and hosting Linux VMs. For ZAC and ZBC the best we can do is to emulate the approach in the driver, but for NVMe we can do it.
>
> ZNS until just before the release had Zone Append mandatory, and it did so for a very good reason. While making it optional allows OEMs to request drives without it, I fundamentally think we should not support that in Linux and should request that vendors implement writes to zones the right way.

Ok. We will just pursue Linux support for ZNS following the append model.

> And just as some OEMs can request certain TPs or optional features to be implemented, so can Linux.
> Just to give an example from the zone world - Linux requires uniform and power of two zone sizes, which are not required in ZAC and ZBC.
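[Editorial note: the reason Linux insists on power-of-2 zone sizes is that mapping a sector to its zone and its offset within the zone then reduces to a shift and a mask, avoiding a 64-bit division in the I/O hot path. A hypothetical sketch, with all function names invented for illustration:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A power of 2 has exactly one bit set. */
static inline bool zone_size_ok(uint64_t zone_sectors)
{
    return zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0;
}

/* Which zone does this sector live in? (zone_shift = log2(zone size)) */
static inline uint64_t zone_no(uint64_t sector, unsigned int zone_shift)
{
    return sector >> zone_shift;
}

/* Offset of the sector inside its zone; valid only for pow2 sizes. */
static inline uint64_t zone_offset(uint64_t sector, uint64_t zone_sectors)
{
    return sector & (zone_sectors - 1);
}
```

With a non-power-of-2 zone size both lookups would need a 64-bit divide/modulo per request, which is exactly the kind of cost the fast path avoids.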
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 19.08.2020 13:25, Jens Axboe wrote:
> On 8/19/20 12:11 PM, David Fugate wrote:
>> On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:
>>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>>> If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone.
>>>
>>> I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path.
>>
>> Intel does not support making *optional* NVMe spec features *required* by the NVMe driver.
>
> It's not required, the driver will function quite fine without it. If you want to use ZNS it's required. The Linux driver thankfully doesn't need any vendor to sign off on what it can or cannot do, or what features are acceptable.
>
>> It's forgivable WDC's accepted contribution didn't work with other vendors' devices choosing not to implement the optional Zone Append, but it's not OK to reject contributions remedying this. Provided there's no glaring technical issues, Samsung's contribution should be accepted to maintain both spec compliance as well as vendor neutrality.
>
> It's *always* ok to reject contributions, if those contributions cause maintainability issues, unacceptable slowdowns, or whatever other issue that the maintainers of said driver don't want to deal with. Any contribution should be judged on merit, not based on political decisions or opinions. Obviously this thread reeks of it.

I'll reply here, where the discussion diverges from this particular patch.

We will stop pushing for this emulation. We have a couple of SSDs where we disabled Append, we implemented support for them, and we wanted to push the changes upstream. That's it. This is not politics nor a conspiracy against the current ZNS spec. We spent a lot of time working on this spec and are actually doing a fair amount of work to support Append in other places in the stack. In any case, the fuzz stops here.

Javier
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote:
> There were queries? My key takeaways were a maintainer NAK followed by instructions to make the Intel drive align with the driver by implementing NOIOB. While I disagree with the rejection as it appeared to be based entirely on politics, I can accept it as the quirk wasn't in the spec.

Stop the crap now, please.

I'm really offended. I'm offended by you implying that I play corporate politics. I'm primarily a Linux developer, and I stand for what I think is good for Linux. I fight inside the companies that pay me for what is good for Linux, I work hard with companies that do not pay me to convince them of what is right. I waste a lot of precious time sitting in standards meetings to do what is right for Linux. And now you come here with the politics argument. Stop it, please.

The Intel NOIOB quirk has been in the driver since day 1, added by Matthew probably with no bad intent, but we kept growing the lists and finally got Intel to standardize it because it was good for Linux, and to be honest good for Intel, as updating quirks in half a dozen drivers simply does not scale. We got a really nice trivial TP out of it. But we really should not see any new drives that do not implement one damn trivial field a long time after the TP has been published. That is not politics. Making a fuzz and playing the corporate card on a Linux mailing list is politics, really bad politics, and you need to stop it.

> It's not fair to make this same "your drive should align with the driver" demand of Samsung because we *are* talking about a spec'ed feature here.

Huh. Next time Dell or NetApp or Facebook are going to require an optional NVMe feature, an OCP feature or even a vendor specific feature, are you going to go to them and say "you don't play fair"?
> Technical critiques of their patches and real performance degradations observed are fair game and objective; "your company did the nastiest possible move violating the normal NVMe procedures to make it optional" is not.

It is pretty fair game if you know the background, that is, that I spent countless hours to make this feature fit the Linux I/O stack requirements perfectly and worked to make sure it is there. We might as well rely on it.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>> If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone.
>>
>> I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path.
>
> Intel does not support making *optional* NVMe spec features *required* by the NVMe driver.

That is a great demand, but please stop talking of companies here, because companies simply don't matter for Linux development. People and use cases do, companies don't, and your mail talking about companies really can't be taken seriously.

And I'm not sure why you think Linux is different from any other NVMe host OS. If random NVMe host A decided they need feature X, they are going to require it, because why not. Especially if the developers of host A helped to drive the development of feature X. I'm pretty sure with a little background in storage standards you've seen that play out a lot before. And it is no different here.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote:
> There were queries? My key takeaways were a maintainer NAK followed by instructions to make the Intel drive align with the driver by implementing NOIOB. While I disagree with the rejection as it appeared to be based entirely on politics, I can accept it as the quirk wasn't in the spec.

For the record, the suggestion provided, which you agreed to look into, most broadly enables your hardware on Linux and was entirely to your benefit. Not quite as dramatic as a political conspiracy.

You later responded with a technical argument against that suggestion; however, your reason didn't add up, and that's where you left the thread.

> It's not fair to make this same "your drive should align with the driver" demand of Samsung because we *are* talking about a spec'ed feature here. Technical critiques of their patches and real performance degradations observed are fair game and objective; "your company did the nastiest possible move violating the normal NVMe procedures to make it optional" is not.

Sure, but you're cherry-picking comments from the discussion. The performance impact exists, and it's generally not acceptable from a maintenance point of view to duplicate significant code without at least trying to provide a common solution.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, 2020-08-19 at 15:10 -0700, Keith Busch wrote:
> You're the one who left that thread dangling. You offered to have your firmware accommodate the Intel sponsored feature that makes your patch unnecessary in the first place. Your follow up made no sense and you have not responded to the queries about it.

There were queries? My key takeaways were a maintainer NAK followed by instructions to make the Intel drive align with the driver by implementing NOIOB. While I disagree with the rejection as it appeared to be based entirely on politics, I can accept it as the quirk wasn't in the spec.

It's not fair to make this same "your drive should align with the driver" demand of Samsung because we *are* talking about a spec'ed feature here. Technical critiques of their patches and real performance degradations observed are fair game and objective; "your company did the nastiest possible move violating the normal NVMe procedures to make it optional" is not.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 03:54:20PM -0600, David Fugate wrote:
> On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote:
>> It's not required, the driver will function quite fine without it. If you want to use ZNS it's required.
>
> The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral* Linux driver should not either.

The spec was developed over the course of years with your employer's involvement, and the software enabling efforts occurred in parallel. The "optional" part was made that way at the final hour, so please align your expectations accordingly.

> Agreed, but this standard needs to be applied equally to everyone. E.g., harmless contributions such as https://lore.kernel.org/linux-nvme/20200611054156.gb3...@lst.de/ get rejected yet clear spec violations from maintainers are accepted? This type of behavior encourages forking, vendor-specific drivers, etc. which is somewhere I hope none of us want to go.

You're the one who left that thread dangling. You offered to have your firmware accommodate the Intel sponsored feature that makes your patch unnecessary in the first place. Your follow up made no sense and you have not responded to the queries about it.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote:
> It's not required, the driver will function quite fine without it. If you want to use ZNS it's required.

The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral* Linux driver should not either.

> The Linux driver thankfully doesn't need any vendor to sign off on what it can or cannot do, or what features are acceptable.

The problem is the driver needs one *particular* vendor to sign off. Existing driver behavior aligns with WDC drives instead of the spec, giving WDC an unfair advantage. Couple this with NVMe maintainer(s?) working for WDC, and there's a conflict of interest.

> It's *always* ok to reject contributions, if those contributions cause maintainability issues, unacceptable slowdowns, or whatever other issue that the maintainers of said driver don't want to deal with. Any contribution should be judged on merit, not based on political decisions or opinions.

Agreed, but this standard needs to be applied equally to everyone. E.g., harmless contributions such as https://lore.kernel.org/linux-nvme/20200611054156.gb3...@lst.de/ get rejected yet clear spec violations from maintainers are accepted? This type of behavior encourages forking, vendor-specific drivers, etc. which is somewhere I hope none of us want to go.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> Intel does not support making *optional* NVMe spec features *required* by the NVMe driver.

This is inaccurate. As another example, the spec optionally allows a zone size to be a power of 2, but linux requires a power of 2 if you want to use it with this driver.

> Provided there's no glaring technical issues

There are many. Some may be improved through a serious review process, but the mess it makes out of the fast path is measurably harmful to devices that don't subscribe to this. That issue is not so easily remedied.

Since this patch is a copy of the scsi implementation, the reasonable thing is to take this fight to the generic block layer for a common solution. We're not taking this in the nvme driver.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 8/19/20 12:11 PM, David Fugate wrote:
> On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>> If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone.
>>
>> I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path.
>
> Intel does not support making *optional* NVMe spec features *required* by the NVMe driver.

It's not required, the driver will function quite fine without it. If you want to use ZNS it's required. The Linux driver thankfully doesn't need any vendor to sign off on what it can or cannot do, or what features are acceptable.

> It's forgivable WDC's accepted contribution didn't work with other vendors' devices choosing not to implement the optional Zone Append, but it's not OK to reject contributions remedying this. Provided there's no glaring technical issues, Samsung's contribution should be accepted to maintain both spec compliance as well as vendor neutrality.

It's *always* ok to reject contributions, if those contributions cause maintainability issues, unacceptable slowdowns, or whatever other issue that the maintainers of said driver don't want to deal with. Any contribution should be judged on merit, not based on political decisions or opinions. Obviously this thread reeks of it.

--
Jens Axboe
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:
> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>> If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone.
>
> I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path.

Intel does not support making *optional* NVMe spec features *required* by the NVMe driver.

It's forgivable WDC's accepted contribution didn't work with other vendors' devices choosing not to implement the optional Zone Append, but it's not OK to reject contributions remedying this. Provided there's no glaring technical issues, Samsung's contribution should be accepted to maintain both spec compliance as well as vendor neutrality.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 10:33:53AM +0200, Javier Gonzalez wrote:
> I would ask you to reconsider this position. I have a hard time understanding how zone append emulation is a good idea in SCSI and not in NVMe, when there is no performance penalty.

Per the btrfs and zonefs numbers, zone append emulation is faster than using serialized writes, but also way slower than native zone append. Zone append emulation for SCSI is the way to bring support for pre-existing standards into the brave new world. Not requiring Zone Append in a standard designed around it, because of bowing down to last-minute political interventions in the standards committee, is shooting ourselves in the foot.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Wed, Aug 19, 2020 at 09:14:13AM +, Damien Le Moal wrote:
> While defining a zone append command for SCSI/ZBC is possible (using sense data for returning the written offset), there is no way to define zone append for SATA/ZAC without entirely breaking the ATA command model. This is why we went after an emulation implementation instead of trying to standardize native commands. That implementation does not have any performance impact over regular writes *and* zone write locking does not in general degrade HDD write performance (only a few corner cases suffer from it). Comparing things equally, the same could be said of NVMe drives that do not have native zone append support: performance will be essentially the same using regular writes and emulated zone append. But mq-deadline and zone write locking will significantly lower performance for emulated zone append compared to native zone append support by the drive.

And to summarize the most important point - Zone Append doesn't exist in ZAC/ZBC. For people that spent the last years trying to make zoned storage work, the lack of such a primitive has been the major pain point. That's why I came up with the Zone Append design in response to a request for such an operation from another company that is now heavily involved in both Linux development and hosting Linux VMs. For ZAC and ZBC the best we can do is to emulate the approach in the driver, but for NVMe we can do it.

ZNS until just before the release had Zone Append mandatory, and it did so for a very good reason. While making it optional allows OEMs to request drives without it, I fundamentally think we should not support that in Linux and should request that vendors implement writes to zones the right way. And just as some OEMs can request certain TPs or optional features to be implemented, so can Linux. Just to give an example from the zone world - Linux requires uniform and power of two zone sizes, which are not required in ZAC and ZBC.
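[Editorial note: the mq-deadline cost mentioned above comes from per-zone write locking. Emulated append is issued as a regular write, so the scheduler may dispatch at most one write per zone at a time, serializing writers to the same zone, whereas native append lets the device order concurrent appends itself. A toy sketch of that gating, with illustrative names that are not mq-deadline's actual code:]

```c
#include <assert.h>
#include <stdbool.h>

/* One in-flight write allowed per zone. */
struct zone {
    bool write_locked;
};

/* Scheduler-side gate: refuse to dispatch a write to a zone that
 * already has one in flight; the caller must retry later. */
static bool zone_try_dispatch_write(struct zone *z)
{
    if (z->write_locked)
        return false;
    z->write_locked = true;
    return true;
}

/* Completion releases the zone for the next queued write. */
static void zone_write_complete(struct zone *z)
{
    z->write_locked = false;
}
```

With queue depth N on a single zone, this gate collapses the effective depth to 1, which is exactly why emulated append trails native append on NVMe while costing little on HDDs whose media is serialized anyway.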
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 2020/08/19 17:34, Javier Gonzalez wrote: > On 19.08.2020 09:40, Christoph Hellwig wrote: >> On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote: >>> I understand that you want vendor alignment in the NVMe driver and I >>> agree. We are not pushing for a non-append model - you can see that we >>> are investing effort in implementing the append path in thee block layer >>> and io_uring and we will continue doing so as patches get merged. >>> >>> This said, we do have some OEM models that do not implement append and I >>> would like them to be supported in Linux. As you know, new TPs are being >>> standardized now and the append emulation is the based for adding >>> support for this. I do not believe it is unreasonable to find a way to >>> add support for this SSDs. >> >> I do not think we should support anything but Zone Append, especially not >> the new TP, which is going to add even more horrible code for absolutely >> no good reason. > > I must admit that this is a bit frustrating. The new TP adds > functionality beyond operating as an Append alternative that I would > very much like to see upstream (do want to discuss details here). > > I understand the concerns about deviating from the Append model, but I > believe we should find a way to add these new features. We are hiding > all the logic in the NVMe driver and not touching the interface with the > block layer, so the overall model is really not changed. > >> >>> If you completely close the door this approach, the alternative is >>> carrying off-tree patches to the several OEMs that use these devices. >>> This is not good for the zoned ecosystem nor for the future of Zone >>> Append. >> >> I really don't have a problem with that. If these OEMs want to use >> an inferior access model only, they have to pay the price for it. >> I also don't think that proxy arguments are very useful. 
If your OEMs >> are troubled by carrying patches because they decided to buy inferior >> drives, they are welcome to argue their cause here on the list. > > I am not arguing as a proxy; I am stating the trouble we see from our > perspective in having to diverge from mainline when our approach is > being upstream first. > > Whether the I/O mode is inferior or superior, they can answer that > themselves if they read this list. >> >>> Are you open to us doing some characterization and, if the impact >>> to the fast path is not significant, moving ahead with a Zone Append >>> emulation like in SCSI? I promise that we will remove this path if >>> demand for these devices terminates. >> >> As said, I do not think implementing zone append emulation or the TP that >> shall not be named is a good idea for Linux. > > I would ask you to reconsider this position. I have a hard time > understanding how zone append emulation is a good idea in SCSI and not > in NVMe, when there is no performance penalty. While defining a zone append command for SCSI/ZBC is possible (using sense data for returning the written offset), there is no way to define zone append for SATA/ZAC without entirely breaking the ATA command model. This is why we went after an emulation implementation instead of trying to standardize native commands. That implementation does not have any performance impact over regular writes *and* zone write locking does not in general degrade HDD write performance (only a few corner cases suffer from it). Comparing things equally, the same could be said of NVMe drives that do not have native zone append support: performance will be essentially the same using regular writes and emulated zone append. But mq-deadline and zone write locking will significantly lower performance for emulated zone append compared to native zone append support in the drive. -- Damien Le Moal Western Digital Research
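The performance asymmetry Damien describes comes down to serialization. A toy model (hypothetical names, not kernel code) of why emulated zone append caps each zone at an effective queue depth of 1:

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Toy model of the trade-off described above (illustrative only):
 * emulated zone append must hold a per-zone write lock so the cached
 * write pointer stays coherent, which serializes all writers to a
 * zone. Native zone append needs no such lock.
 */
struct zone {
	pthread_mutex_t wlock;	/* per-zone write lock (emulation only) */
	uint64_t wp;		/* cached write pointer, in sectors */
};

/* Emulated append: take the zone lock, write at wp, advance wp. */
static uint64_t emulated_append(struct zone *z, uint64_t nr_sects)
{
	uint64_t start;

	pthread_mutex_lock(&z->wlock);	/* serializes writers per zone */
	start = z->wp;
	/* ... a regular write command would be submitted at 'start' ... */
	z->wp += nr_sects;
	pthread_mutex_unlock(&z->wlock);
	return start;	/* the written offset, as native append reports */
}
```

Native zone append needs no lock because the drive chooses the write location itself, so many appends to the same zone can be in flight at once.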
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 19.08.2020 09:40, Christoph Hellwig wrote: On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote: I understand that you want vendor alignment in the NVMe driver and I agree. We are not pushing for a non-append model - you can see that we are investing effort in implementing the append path in the block layer and io_uring, and we will continue doing so as patches get merged. This said, we do have some OEM models that do not implement append and I would like them to be supported in Linux. As you know, new TPs are being standardized now and the append emulation is the basis for adding support for this. I do not believe it is unreasonable to find a way to add support for these SSDs. I do not think we should support anything but Zone Append, especially not the new TP, which is going to add even more horrible code for absolutely no good reason. I must admit that this is a bit frustrating. The new TP adds functionality beyond operating as an Append alternative that I would very much like to see upstream (I do not want to discuss details here). I understand the concerns about deviating from the Append model, but I believe we should find a way to add these new features. We are hiding all the logic in the NVMe driver and not touching the interface with the block layer, so the overall model is really not changed. If you completely close the door on this approach, the alternative is carrying off-tree patches for the several OEMs that use these devices. This is not good for the zoned ecosystem nor for the future of Zone Append. I really don't have a problem with that. If these OEMs want to use an inferior access model only, they have to pay the price for it. I also don't think that proxy arguments are very useful. If your OEMs are troubled by carrying patches because they decided to buy inferior drives, they are welcome to argue their cause here on the list. 
I am not arguing as a proxy; I am stating the trouble we see from our perspective in having to diverge from mainline when our approach is being upstream first. Whether the I/O mode is inferior or superior, they can answer that themselves if they read this list. Are you open to us doing some characterization and, if the impact to the fast path is not significant, moving ahead with a Zone Append emulation like in SCSI? I promise that we will remove this path if demand for these devices terminates. As said, I do not think implementing zone append emulation or the TP that shall not be named is a good idea for Linux. I would ask you to reconsider this position. I have a hard time understanding how zone append emulation is a good idea in SCSI and not in NVMe, when there is no performance penalty.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote: > I understand that you want vendor alignment in the NVMe driver and I > agree. We are not pushing for a non-append model - you can see that we > are investing effort in implementing the append path in the block layer > and io_uring, and we will continue doing so as patches get merged. > > This said, we do have some OEM models that do not implement append and I > would like them to be supported in Linux. As you know, new TPs are being > standardized now and the append emulation is the basis for adding > support for this. I do not believe it is unreasonable to find a way to > add support for these SSDs. I do not think we should support anything but Zone Append, especially not the new TP, which is going to add even more horrible code for absolutely no good reason. > If you completely close the door on this approach, the alternative is > carrying off-tree patches for the several OEMs that use these devices. > This is not good for the zoned ecosystem nor for the future of Zone > Append. I really don't have a problem with that. If these OEMs want to use an inferior access model only, they have to pay the price for it. I also don't think that proxy arguments are very useful. If your OEMs are troubled by carrying patches because they decided to buy inferior drives, they are welcome to argue their cause here on the list. > Are you open to us doing some characterization and, if the impact > to the fast path is not significant, moving ahead with a Zone Append > emulation like in SCSI? I promise that we will remove this path if > demand for these devices terminates. As said, I do not think implementing zone append emulation or the TP that shall not be named is a good idea for Linux.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 18.08.2020 10:39, Keith Busch wrote: On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote: On 18.08.2020 09:58, Keith Busch wrote: > On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: > > a number of customers are requiring the use of normal writes, which we > > want to support. > > A device that supports append is completely usable for those customers, > too. There's no need to create divergence in this driver. Not really. You know as well as I do that some features are disabled for a particular SSD model based on customer requirements. Generic models implementing append can submit both I/Os, but those that remove append are left out. You are only supporting my point: if your device supports append, you get to work in every ZNS use case, otherwise you only get to work in a subset. There are several devices. Some of them enable append and some do not. I would like to support the latter too.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 18.08.2020 12:51, Matias Bjørling wrote: On 18/08/2020 11.50, Javier Gonzalez wrote: On 18.08.2020 09:12, Christoph Hellwig wrote: On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote: If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone. I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path. I understand that the NVMe process was agitated and that the current ZNS implementation in Linux relies on append support from the device perspective. However, the current TP does allow for not implementing append, and a number of customers are requiring the use of normal writes, which we want to support. There are a lot of things that are specified in NVMe but not implemented in the Linux kernel. That your company is not able to efficiently implement the Zone Append command (this is the only reason I can think of that makes you and your company cause such a fuss), This comment is out of place and I will choose to ignore it. shouldn't mean that everyone else has to suffer. This is not a quirk, nor a software work-around, This is a design decision affecting the storage stack of several OEMs. As such, I would like to find a way to implement this functionality. Last time we discussed this on the mailing list, you among others pointed to append emulation as the best way to enable this path, and here we are. Can you explain what changed? In any case, SPDK offers adequate support and can be used today. We can take the SPDK discussion to the appropriate mailing lists and slack channels.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 18.08.2020 17:50, Christoph Hellwig wrote: On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: I understand that the NVMe process was agitated and that the current ZNS implementation in Linux relies on append support from the device perspective. However, the current TP does allow for not implementing append, and a number of customers are requiring the use of normal writes, which we want to support. The NVMe TPs allow for lots of things, but that doesn't mean we have to support them. Agree. As I replied to Keith, I am just interested in enabling the ZNS models that come with append disabled. Do you have any early suggestion on how this patch should look to be upstreamable? My position is that at this point in time we should not consider it. Zone Append is the major feature in ZNS that solves the issue in ZAC/ZBC. I want to see broad industry support for it instead of having to add more code just for zone append emulation than for actual current ZNS support. If in a few years the marketplace has decided and there are lots of drives available in the consumer market or OEM channels, we'll have to reconsider and potentially merge Zone Append emulation. But my deep hope is that this does not happen, as it sets us back 10 years in the standardization of zoned storage support again. I understand that you want vendor alignment in the NVMe driver and I agree. We are not pushing for a non-append model - you can see that we are investing effort in implementing the append path in the block layer and io_uring, and we will continue doing so as patches get merged. This said, we do have some OEM models that do not implement append and I would like them to be supported in Linux. As you know, new TPs are being standardized now and the append emulation is the basis for adding support for this. I do not believe it is unreasonable to find a way to add support for these SSDs. 
If you completely close the door on this approach, the alternative is carrying off-tree patches for the several OEMs that use these devices. This is not good for the zoned ecosystem nor for the future of Zone Append. Are you open to us doing some characterization and, if the impact to the fast path is not significant, moving ahead with a Zone Append emulation like in SCSI? I promise that we will remove this path if demand for these devices terminates.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote: > On 18.08.2020 09:58, Keith Busch wrote: > > On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: > > > a number of customers are requiring the use of normal writes, which we > > > want to support. > > > > A device that supports append is completely usable for those customers, > > too. There's no need to create divergence in this driver. > > Not really. You know as well as I do that some features are disabled for > a particular SSD model based on customer requirements. Generic models > implementing append can submit both I/Os, but those that remove append > are left out. You are only supporting my point: if your device supports append, you get to work in every ZNS use case, otherwise you only get to work in a subset.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 18.08.2020 09:58, Keith Busch wrote: On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: a number of customers are requiring the use of normal writes, which we want to support. A device that supports append is completely usable for those customers, too. There's no need to create divergence in this driver. Not really. You know as well as I do that some features are disabled for a particular SSD model based on customer requirements. Generic models implementing append can submit both I/Os, but those that remove append are left out. I would like to understand how we can enable these NVMe-compatible models in Linux. If it is a performance concern, we will address it.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: > a number of customers are requiring the use of normal writes, which we > want to support. A device that supports append is completely usable for those customers, too. There's no need to create divergence in this driver.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote: > I understand that the NVMe process was agitated and that the current ZNS > implementation in Linux relies on append support from the device > perspective. However, the current TP does allow for not implementing > append, and a number of customers are requiring the use of normal > writes, which we want to support. The NVMe TPs allow for lots of things, but that doesn't mean we have to support them. > Do you have any early suggestion on how this patch should look > to be upstreamable? My position is that at this point in time we should not consider it. Zone Append is the major feature in ZNS that solves the issue in ZAC/ZBC. I want to see broad industry support for it instead of having to add more code just for zone append emulation than for actual current ZNS support. If in a few years the marketplace has decided and there are lots of drives available in the consumer market or OEM channels, we'll have to reconsider and potentially merge Zone Append emulation. But my deep hope is that this does not happen, as it sets us back 10 years in the standardization of zoned storage support again.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 18/08/2020 11.50, Javier Gonzalez wrote: On 18.08.2020 09:12, Christoph Hellwig wrote: On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote: If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone. I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path. I understand that the NVMe process was agitated and that the current ZNS implementation in Linux relies on append support from the device perspective. However, the current TP does allow for not implementing append, and a number of customers are requiring the use of normal writes, which we want to support. There are a lot of things that are specified in NVMe but not implemented in the Linux kernel. That your company is not able to efficiently implement the Zone Append command (this is the only reason I can think of that makes you and your company cause such a fuss), shouldn't mean that everyone else has to suffer. In any case, SPDK offers adequate support and can be used today.
Re: [PATCH 2/2] nvme: add emulation for zone-append
On 18.08.2020 09:12, Christoph Hellwig wrote: On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote: If drive does not support zone-append natively, enable emulation using regular write. Make emulated zone-append cmd write-lock the zone, preventing concurrent append/write on the same zone. I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path. I understand that the NVMe process was agitated and that the current ZNS implementation in Linux relies on append support from the device perspective. However, the current TP does allow for not implementing append, and a number of customers are requiring the use of normal writes, which we want to support. During the initial patch review we discussed this and we agreed that the block layer is designed for append on zoned devices, and that for the foreseeable future this was not going to change. We therefore took the feedback and followed a similar approach as in the SCSI driver for implementing append emulation. We are happy to do more characterization of the impact of these hooks on the non-zoned fast path and eventually change the approach if this proves to be a problem. Our thought is to isolate any potential performance degradation to the zoned path using the emulation (we do not see any ATM). Do you have any early suggestion on how this patch should look to be upstreamable? Javier
Re: [PATCH 2/2] nvme: add emulation for zone-append
On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote: > If drive does not support zone-append natively, enable emulation using > regular write. > Make emulated zone-append cmd write-lock the zone, preventing > concurrent append/write on the same zone. I really don't think we should add this. ZNS and the Linux support were all designed with Zone Append in mind, and then your company did the nastiest possible move violating the normal NVMe procedures to make it optional. But that doesn't change the fact that Linux should keep requiring it, especially with the amount of code added here and how it hooks in the fast path.
[PATCH 2/2] nvme: add emulation for zone-append
If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.

To determine the start-lba for such writes, an array of 32 bit
zone-relative write-pointer (WP) positions is attached with namespace.
This cached WP-position is updated on successful completion as follows:
- APPEND/WRITE/WRITE_ZEROS/WRITE_SAME update it by number of sectors
  (512b) copied
- ZONE_RESET updates it to 0 for target zone. ZONE_RESET_ALL does the
  same for all zones.
- ZONE_FINISH sets it to zone-size.

On failed-completion for above requests, cached WP-position of target
zone is marked invalid. On subsequent zone-append to that zone, WP
position is refreshed by querying it from device (i.e. zone-report).

If emulated-append cannot immediately proceed due to zone write-lock
or invalid WP position, block-layer is asked to retry it.

Signed-off-by: Kanchan Joshi
Signed-off-by: Nitesh Shetty
Signed-off-by: SelvaKumar S
Signed-off-by: Javier Gonzalez
---
 drivers/nvme/host/core.c |  41 +-
 drivers/nvme/host/nvme.h |  60
 drivers/nvme/host/zns.c  | 306 ++-
 3 files changed, 398 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 88cff309d8e4..78faddf444c3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -287,10 +287,17 @@ void nvme_complete_rq(struct request *req)
 			nvme_retry_req(req);
 			return;
 		}
-	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-		   req_op(req) == REQ_OP_ZONE_APPEND) {
-		req->__sector = nvme_lba_to_sect(req->q->queuedata,
-			le64_to_cpu(nvme_req(req)->result.u64));
+	} else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+		bool need_wp_offset_update = false;
+		struct nvme_ns *ns = req->q->queuedata;
+
+		/* append-emulation requires wp update for some cmds */
+		if (ns && nvme_is_append_emulated(ns))
+			need_wp_offset_update = nvme_need_zone_wp_update(req);
+		if (need_wp_offset_update)
+			nvme_zone_wp_update(ns, req, status);
+		else if (req_op(req) == REQ_OP_ZONE_APPEND)
+			req->__sector = nvme_lba_to_sect(ns,
+				le64_to_cpu(nvme_req(req)->result.u64));
 	}

 	nvme_trace_bio_complete(req, status);
@@ -456,6 +463,8 @@ static void nvme_free_ns(struct kref *kref)
 {
 	struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);

+	if (nvme_is_append_emulated(ns))
+		nvme_teardown_append_emulate(ns);
 	if (ns->ndev)
 		nvme_nvm_unregister(ns);
@@ -809,7 +818,15 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
 		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
 		break;
 	case REQ_OP_ZONE_APPEND:
-		ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+		if (!nvme_is_append_emulated(ns))
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+		else {
+			/* prepare append like write, and adjust lba afterwards */
+			ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+			if (ret)
+				break;
+			ret = nvme_append_to_write(ns, req, cmd);
+		}
 		break;
 	default:
 		WARN_ON_ONCE(1);
@@ -2150,7 +2167,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
 	struct nvme_ns *ns = disk->private_data;
 	struct nvme_ctrl *ctrl = ns->ctrl;

-	ret = blk_revalidate_disk_zones(disk, NULL);
+	ret = nvme_revalidate_disk_zones(disk);
 	if (!ret)
 		blk_queue_max_zone_append_sectors(disk->queue,
 						  ctrl->max_zone_append);
@@ -3900,6 +3917,18 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
 	if (__nvme_revalidate_disk(disk, id))
 		goto out_put_disk;

+	/* setup append-emulation if required */
+	if (nvme_is_append_emulated(ns)) {
+		ret = nvme_setup_append_emulate(ns);
+		if (ret) {
+			dev_warn(ns->ctrl->device,
+				"append-emulation failed, zoned namespace:%d\n",
+				ns->head->ns_id);
+			nvme_clear_append_emulated(ns);
+			goto out_put_disk;
+		}
+	}
+
 	if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
 		ret = nvme_nvm_register(ns, disk_name, node);
 		if (ret)
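The cached write-pointer rules from the commit message above can be sketched as a small state machine (a minimal illustration under assumed names, not the patch's actual per-namespace code in zns.c):

```c
#include <stdint.h>

/*
 * Minimal model of the cached write-pointer rules described in the
 * commit message (illustrative only): successful appends/writes
 * advance the per-zone WP by the sectors copied, reset returns it to
 * 0, finish pins it to the zone size, and a failed completion
 * invalidates it so the next emulated append must refresh it from the
 * device with a zone report before proceeding.
 */
#define WP_INVALID UINT32_MAX

struct zone_wp {
	uint32_t wp;	/* zone-relative write pointer, in 512b sectors */
};

/* APPEND/WRITE/WRITE_ZEROS/WRITE_SAME completed successfully */
static void wp_on_write_done(struct zone_wp *z, uint32_t nr_sects)
{
	if (z->wp != WP_INVALID)
		z->wp += nr_sects;
}

/* ZONE_RESET (ZONE_RESET_ALL applies this to every zone) */
static void wp_on_reset(struct zone_wp *z)
{
	z->wp = 0;
}

/* ZONE_FINISH */
static void wp_on_finish(struct zone_wp *z, uint32_t zone_sects)
{
	z->wp = zone_sects;
}

/* Failed completion: WP unknown until refreshed by a zone report */
static void wp_on_failure(struct zone_wp *z)
{
	z->wp = WP_INVALID;
}
```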