Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-20 Thread David Fugate
On Thu, 2020-08-20 at 12:45 +0900, Keith Busch wrote:
> For the record, the suggestion provided, which you agreed to look into,
> most broadly enables your hardware on Linux and was entirely to your
> benefit. Not quite as dramatic as a political conspiracy.
> 
> You later responded with a technical argument against that suggestion;
> however, your reason didn't add up, and that's where you left the
> thread.

The suggestion on the rejected patch was passed on to the related FW
team, and the "technical argument" was our FW team's response to that
suggestion, which I relayed to the list. At this point, there's no
closure on whether the device will get NOIOB.

My point in bringing up this example was that a one-line, highly
maintainable patch which improves the performance of Linux should not
have been immediately NAK'ed as it was. If you believe it should have
been, we'll agree to disagree.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-20 Thread Javier Gonzalez

On 20.08.2020 13:07, Kanchan Joshi wrote:

On Thu, Aug 20, 2020 at 3:22 AM Keith Busch  wrote:


On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver.

This is inaccurate. As another example, the spec optionally allows a
zone size to be a power of 2, but linux requires a power of 2 if you
want to use it with this driver.

> Provided there's no glaring technical issues

There are many. Some may be improved through a serious review process,
but the mess it makes out of the fast path is measurably harmful to
devices that don't subscribe to this. That issue is not so easily
remedied.

Since this patch is a copy of the SCSI implementation, the reasonable
thing is to take this fight to the generic block layer for a common
solution. We're not taking this in the nvme driver.


I sincerely want to minimize any adverse impact to the fast-path of
non-zoned devices.
My understanding of that aspect is (I feel it is good to confirm,
irrespective of the future of this patch):

1. Submission path:
There is no extra code for non-zoned device IO. For append, there is
this "ns->append_emulate = 1" check.
Snippet -
   case REQ_OP_ZONE_APPEND:
-   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+   if (!nvme_is_append_emulated(ns))
+   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+   else {
+   /* prepare append like write, and adjust lba afterwards */

2. Completion:
Not as clean as submission for sure.
The extra check is "ns && ns->append_emulate == 1" for completions if
CONFIG_BLK_DEV_ZONED is enabled.
A slightly better completion-handling version (than the submitted patch) is -

-   } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-  req_op(req) == REQ_OP_ZONE_APPEND) {
-   req->__sector = nvme_lba_to_sect(req->q->queuedata,
-   le64_to_cpu(nvme_req(req)->result.u64));
+   } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+   struct nvme_ns *ns = req->q->queuedata;
+   /* append-emulation requires wp update for some cmds */
+   if (ns && nvme_is_append_emulated(ns)) {
+   if (nvme_need_zone_wp_update(req))
+   nvme_zone_wp_update(ns, req, status);
+   }
+   else if (req_op(req) == REQ_OP_ZONE_APPEND)
+   req->__sector = nvme_lba_to_sect(ns,
+   le64_to_cpu(nvme_req(req)->result.u64));

Am I missing any other fast-path pain-points?

A quick 1 minute 4K randwrite run (QD 64, 4 jobs, libaio) shows:
before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
after:  IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)


It is good to use the QEMU "simulation" path that we implemented to test
performance with different delays, etc., but for these numbers to make
sense we need to put them in contrast to the simulated NAND speed, etc.



This may not be the best test to see the cost, and I am happy to
conduct more and improve it.

As for the volume of the code - it is the same as the SCSI emulation. I
can make an effort to reduce it by moving common code into blk-zone and
reusing it in the SCSI/NVMe emulation.
In the patch I tried to isolate append-emulation by keeping everything
in "nvme_za_emul". It contains nothing nvme'ish except nvme_ns,
which can be turned into "driver_data".

+#ifdef CONFIG_BLK_DEV_ZONED
+struct nvme_za_emul {
+   unsigned int nr_zones;
+   spinlock_t zones_wp_offset_lock;
+   u32 *zones_wp_offset;
+   u32 *rev_wp_offset;
+   struct work_struct zone_wp_offset_work;
+   char *zone_wp_update_buf;
+   struct mutex rev_mutex;
+   struct nvme_ns *ns;
+};
+#endif

Will that be an acceptable line of further work/discussions?


I know we spent time enabling this path, but I don't think that moving
the discussion to the block layer will have much more benefit.

Let's keep the support for these non-append devices in xNVMe and focus
on the support for the append-enabled ones in Linux. We have a lot of
good stuff in the backlog that we can start pushing.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-20 Thread Javier Gonzalez

On 20.08.2020 08:52, Christoph Hellwig wrote:

On Thu, Aug 20, 2020 at 08:37:19AM +0200, Javier Gonzalez wrote:

We will stop pushing for this emulation. We have a couple of SSDs where
we disabled Append, we implemented support for them, and we wanted to
push the changes upstream. That's it. This is not politics nor a
conspiracy against the current ZNS spec. We spent a lot of time working
on this spec and are actually doing a fair amount of work to support
Append other places in the stack. In any case, the fuzz stops here.


FYI, from knowing you personally I'm pretty confident you are not
part of a conspiracy and you are just doing your best given the context,
and I appreciate all your work!


Thanks Christoph.



I'm a lot less happy about things that happen in other groups not
involving you directly, and I'm still pretty mad at how the games were
played there, and especially at other actors that seem to be reading the
list here and, instead of taking part in the discussion, are causing fuzz
in completely unrelated venues.


Yes. Hopefully, we will start keeping things separated and using this
list for Linux, technical conversations only.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-20 Thread Kanchan Joshi
On Thu, Aug 20, 2020 at 3:22 AM Keith Busch  wrote:
>
> On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> > Intel does not support making *optional* NVMe spec features *required*
> > by the NVMe driver.
>
> This is inaccurate. As another example, the spec optionally allows a
> zone size to be a power of 2, but linux requires a power of 2 if you
> want to use it with this driver.
>
> > Provided there's no glaring technical issues
>
> There are many. Some may be improved through a serious review process,
> but the mess it makes out of the fast path is measurably harmful to
> devices that don't subscribe to this. That issue is not so easily
> remedied.
>
> Since this patch is a copy of the SCSI implementation, the reasonable
> thing is to take this fight to the generic block layer for a common
> solution. We're not taking this in the nvme driver.

I sincerely want to minimize any adverse impact to the fast-path of
non-zoned devices.
My understanding of that aspect is (I feel it is good to confirm,
irrespective of the future of this patch):

1. Submission path:
There is no extra code for non-zoned device IO. For append, there is
this "ns->append_emulate = 1" check.
Snippet -
case REQ_OP_ZONE_APPEND:
-   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+   if (!nvme_is_append_emulated(ns))
+   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+   else {
+   /* prepare append like write, and adjust lba afterwards */
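
The gist of the emulated submission path can be modeled in a few lines of userspace C. This is a hedged sketch with hypothetical names (`zone_state`, `emulated_append_prepare`), not the kernel code: an append is rewritten as a regular write at the zone's cached write pointer, and the zone stays write-locked until completion so no concurrent append/write can race it.

```c
#include <stdint.h>
#include <stdbool.h>

/* Userspace model of one zone's emulation state (hypothetical names). */
struct zone_state {
	uint64_t start;       /* first LBA of the zone */
	uint64_t wp_offset;   /* cached write pointer, relative to start */
	bool     write_locked;
};

/* Map an append to a concrete LBA; returns -1 if the zone is busy. */
static int64_t emulated_append_prepare(struct zone_state *z)
{
	if (z->write_locked)
		return -1;                  /* serialize append vs. write */
	z->write_locked = true;
	return (int64_t)(z->start + z->wp_offset); /* issue as a plain write */
}

/* On completion, advance the cached wp and drop the zone write lock. */
static void emulated_append_complete(struct zone_state *z, uint32_t nr_lbas)
{
	z->wp_offset += nr_lbas;
	z->write_locked = false;
}
```

The returned LBA is what the real patch later stores back into `req->__sector` so the caller sees where the append landed.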

2. Completion:
Not as clean as submission for sure.
The extra check is "ns && ns->append_emulate == 1" for completions if
CONFIG_BLK_DEV_ZONED is enabled.
A slightly better completion-handling version (than the submitted patch) is -

-   } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-  req_op(req) == REQ_OP_ZONE_APPEND) {
-   req->__sector = nvme_lba_to_sect(req->q->queuedata,
-   le64_to_cpu(nvme_req(req)->result.u64));
+   } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+   struct nvme_ns *ns = req->q->queuedata;
+   /* append-emulation requires wp update for some cmds */
+   if (ns && nvme_is_append_emulated(ns)) {
+   if (nvme_need_zone_wp_update(req))
+   nvme_zone_wp_update(ns, req, status);
+   }
+   else if (req_op(req) == REQ_OP_ZONE_APPEND)
+   req->__sector = nvme_lba_to_sect(ns,
+   le64_to_cpu(nvme_req(req)->result.u64));
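
The completion-side bookkeeping above can likewise be sketched in userspace C (hypothetical names mirroring `nvme_need_zone_wp_update()` and `nvme_zone_wp_update()` from the diff; not the kernel code): only commands that move the write pointer touch the cached per-zone wp, which is why reads skip the extra branch entirely.

```c
#include <stdint.h>
#include <stdbool.h>

enum req_op { OP_READ, OP_WRITE, OP_ZONE_APPEND, OP_ZONE_RESET, OP_ZONE_FINISH };

struct zone { uint64_t len; uint64_t wp; }; /* zone length and cached wp */

/* Which completions require the cached write pointer to be adjusted? */
static bool need_zone_wp_update(enum req_op op)
{
	return op == OP_WRITE || op == OP_ZONE_APPEND ||
	       op == OP_ZONE_RESET || op == OP_ZONE_FINISH;
}

/* Adjust the cached write pointer for a completed command. */
static void zone_wp_update(struct zone *z, enum req_op op, uint32_t nr_blocks)
{
	switch (op) {
	case OP_WRITE:
	case OP_ZONE_APPEND:
		z->wp += nr_blocks;  /* advance by blocks written */
		break;
	case OP_ZONE_RESET:
		z->wp = 0;           /* back to the zone start */
		break;
	case OP_ZONE_FINISH:
		z->wp = z->len;      /* zone is now full */
		break;
	default:
		break;
	}
}
```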

Am I missing any other fast-path pain-points?

A quick 1 minute 4K randwrite run (QD 64, 4 jobs, libaio) shows:
before: IOPS=270k, BW=1056MiB/s (1107MB/s)(61.9GiB/60002msec)
after:  IOPS=270k, BW=1055MiB/s (1106MB/s)(61.8GiB/60005msec)

This may not be the best test to see the cost, and I am happy to
conduct more and improve it.

As for the volume of the code - it is the same as the SCSI emulation. I
can make an effort to reduce it by moving common code into blk-zone and
reusing it in the SCSI/NVMe emulation.
In the patch I tried to isolate append-emulation by keeping everything
in "nvme_za_emul". It contains nothing nvme'ish except nvme_ns,
which can be turned into "driver_data".

+#ifdef CONFIG_BLK_DEV_ZONED
+struct nvme_za_emul {
+   unsigned int nr_zones;
+   spinlock_t zones_wp_offset_lock;
+   u32 *zones_wp_offset;
+   u32 *rev_wp_offset;
+   struct work_struct zone_wp_offset_work;
+   char *zone_wp_update_buf;
+   struct mutex rev_mutex;
+   struct nvme_ns *ns;
+};
+#endif
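
A driver-neutral variant of that structure, as suggested, might look roughly like this. The names (`blk_zone_emul`, `driver_data`) are hypothetical; kernel-only members (spinlock, mutex, work_struct) are elided in this userspace sketch:

```c
#include <stdint.h>

/*
 * Sketch: the same bookkeeping moved out of nvme into a generic
 * structure, with the nvme_ns pointer replaced by opaque driver_data
 * so the SCSI and NVMe emulation paths could share it.
 */
struct blk_zone_emul {
	unsigned int nr_zones;       /* zones on the device */
	uint32_t *zones_wp_offset;   /* cached per-zone write pointers */
	uint32_t *rev_wp_offset;     /* scratch array used during revalidation */
	char *zone_wp_update_buf;    /* buffer for report-zones refreshes */
	void *driver_data;           /* was: struct nvme_ns *ns */
};
```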

Will that be an acceptable line of further work/discussions?

-- 
Kanchan


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-20 Thread Christoph Hellwig
On Thu, Aug 20, 2020 at 08:37:19AM +0200, Javier Gonzalez wrote:
> We will stop pushing for this emulation. We have a couple of SSDs where
> we disabled Append, we implemented support for them, and we wanted to
push the changes upstream. That's it. This is not politics nor a
> conspiracy against the current ZNS spec. We spent a lot of time working
> on this spec and are actually doing a fair amount of work to support
> Append other places in the stack. In any case, the fuzz stops here.

FYI, from knowing you personally I'm pretty confident you are not
part of a conspiracy and you are just doing your best given the context,
and I appreciate all your work!

I'm a lot less happy about things that happen in other groups not
involving you directly, and I'm still pretty mad at how the games were
played there, and especially at other actors that seem to be reading the
list here and, instead of taking part in the discussion, are causing fuzz
in completely unrelated venues.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-20 Thread Javier Gonzalez

On 19.08.2020 12:43, Christoph Hellwig wrote:

On Wed, Aug 19, 2020 at 09:14:13AM +, Damien Le Moal wrote:

While defining a zone append command for SCSI/ZBC is possible (using sense data
for returning the written offset), there is no way to define zone append for
SATA/ZAC without entirely breaking the ATA command model. This is why we went
after an emulation implementation instead of trying to standardize native
commands. That implementation does not have any performance impact over regular
writes *and* zone write locking does not in general degrade HDD write
performance (only a few corner cases suffer from it). Comparing things equally,
the same could be said of NVMe drives that do not have zone append native
support: performance will be essentially the same using regular writes and
emulated zone append. But mq-deadline and zone write locking will significantly
lower performance for emulated zone append compared to native zone append
support by the drive.


And to summarize the most important point - Zone Append doesn't exist
in ZAC/ZBC.  For people that spent the last years trying to make zoned
storage work, the lack of such a primitive has been the major pain point.
That's why I came up with the Zone Append design in response to a
request for such an operation from another company that is now heavily
involved in both Linux development and hosting Linux VMs.  For ZAC and
ZBC the best we can do is to emulate the approach in the driver, but
for NVMe we can do it natively.  ZNS until just before the release had Zone
Append mandatory, and it did so for a very good reason.  While making
it optional allows OEMs to request drives without it, I fundamentally
think we should not support that in Linux and request that vendors
implement writes to zones the right way.


Ok. We will just pursue Linux support for the ZNS following the append
model.



And just as some OEMs can request certain TPs or optional features to
be implemented, so can Linux.  Just to give an example from the zone
world - Linux requires uniform and power of two zone sizes, which in
ZAC and ZBC are not required.




Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-20 Thread Javier Gonzalez

On 19.08.2020 13:25, Jens Axboe wrote:

On 8/19/20 12:11 PM, David Fugate wrote:

On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:

On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:

If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.


I really don't think we should add this.  ZNS and the Linux support
were all designed with Zone Append in mind, and then your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional.  But that doesn't change the fact that Linux should keep
requiring it, especially with the amount of code added here and how it
hooks in the fast path.


Intel does not support making *optional* NVMe spec features *required*
by the NVMe driver.


It's not required, the driver will function quite fine without it. If you
want to use ZNS it's required. The Linux driver thankfully doesn't need
any vendor to sign off on what it can or cannot do, or what features
are acceptable.


It's forgivable WDC's accepted contribution didn't work with other
vendors' devices choosing not to implement the optional Zone Append,
but it's not OK to reject contributions remedying this.  Provided
there's no glaring technical issues, Samsung's contribution should be
accepted to maintain both spec compliance as well as vendor neutrality.


It's *always* ok to reject contributions, if those contributions cause
maintainability issues, unacceptable slowdowns, or whatever other issue
that the maintainers of said driver don't want to deal with. Any
contribution should be judged on merit, not based on political decisions
or opinions. Obviously this thread reeks of it.



I'll reply here, where the discussion diverges from this particular
patch.

We will stop pushing for this emulation. We have a couple of SSDs where
we disabled Append, we implemented support for them, and we wanted to
push the changes upstream. That's it. This is not politics nor a
conspiracy against the current ZNS spec. We spent a lot of time working
on this spec and are actually doing a fair amount of work to support
Append other places in the stack. In any case, the fuzz stops here.

Javier


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Christoph Hellwig
On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote:
> There were queries? My key takeaways were a maintainer NAK followed by
> instructions to make the Intel drive align with the driver by
> implementing NOIOB. While I disagree with the rejection as it appeared
> to be based entirely on politics, I can accept it as the quirk wasn't
> in the spec.

Stop the crap now please.  I'm really offended.  I'm offended by you
implying that I play corporate politics.  I'm primarily a Linux developer,
and I stand for what I think is good for Linux.  I fight inside the companies
that pay me for what is good for Linux, I work hard with companies that do not
pay to convince them what is right.  I waste a lot of precious time sitting
in standards meetings to do what is right for Linux.  And now you come
here with the politics arguments.  Stop it, please.

The Intel NOIOB quirk was in the driver since day 1, added by Matthew
probably with no bad intent, but we kept growing the lists and finally
got Intel to standardize it because it was good for Linux, and to be
honest good for Intel, as updating quirks in half a dozen drivers
simply does not scale.  We got a really nice trivial TP out of it.  But
we really should not see any new drives that do not implement one damn
trivial field a long time after the TP has been published.  That is not
politics.

Making a fuzz and playing the corporate card on a Linux mailing list
is politics, really bad politics and you need to stop it.

> It's not fair to make this same "your drive should align with the
> driver" demand of Samsung because we *are* talking about a spec'ed
> feature here.

Huh.  Next time Dell or NetApp or Facebook are going to require an
optional NVMe feature, an OCP feature, or even a vendor specific
feature, are you going to go to them and say "you don't play fair"?

> Technical critiques of their patches and real performance
> degradations observed are fair game and objective; "your company did
> the nastiest possible move violating the normal NVMe procedures to make
> it optional" is not.

It is pretty fair game if you know the background, that is that I spent
countless hours to make this feature fit the Linux I/O stack requirements
perfectly and worked to make sure it is there.  We might as well rely on it.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Christoph Hellwig
On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:
> > On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
> > > If drive does not support zone-append natively, enable emulation using
> > > regular write.
> > > Make emulated zone-append cmd write-lock the zone, preventing
> > > concurrent append/write on the same zone.
> > 
> > I really don't think we should add this.  ZNS and the Linux support
> > were all designed with Zone Append in mind, and then your company did
> > the nastiest possible move violating the normal NVMe procedures to make
> > it optional.  But that doesn't change the fact that Linux should keep
> > requiring it, especially with the amount of code added here and how it
> > hooks in the fast path.
> 
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver. 

That is a great demand, but please stop talking about companies here,
because companies simply don't matter for Linux development.  People
and use cases do, companies don't, and your mail talking about companies
really can't be taken seriously.

And I'm not sure why you think Linux is different from any other NVMe
host OS.  If random NVMe host A decided they need feature X they are
going to require it, because why not.  Especially if the developers of
host A helped to drive the development of feature X.  I'm pretty sure
with a little background in storage standards you've seen that play out
a lot before.  And it is no different here.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Keith Busch
On Wed, Aug 19, 2020 at 05:43:29PM -0600, David Fugate wrote:
> There were queries? My key takeaways were a maintainer NAK followed by
> instructions to make the Intel drive align with the driver by
> implementing NOIOB. While I disagree with the rejection as it appeared
> to be based entirely on politics, I can accept it as the quirk wasn't
> in the spec.

For the record, the suggestion provided, which you agreed to look into,
most broadly enables your hardware on Linux and was entirely to your
benefit. Not quite as dramatic as a political conspiracy.

You later responded with a technical argument against that suggestion;
however, your reason didn't add up, and that's where you left the
thread.
 
> It's not fair to make this same "your drive should align with the
> driver" demand of Samsung because we *are* talking about a spec'ed
> feature here. Technical critiques of their patches and real performance
> degradations observed are fair game and objective; "your company did
> the nastiest possible move violating the normal NVMe procedures to make
> it optional" is not.

Sure, but you're cherry picking comments from the discussion. The
performance impact exists, and it's generally not acceptable from a
maintenance point to duplicate significant code without at least trying
to provide a common solution. 


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread David Fugate
On Wed, 2020-08-19 at 15:10 -0700, Keith Busch wrote:
> You're the one who left that thread dangling. You offered to have your
> firmware accommodate the Intel sponsored feature that makes your patch
> unnecessary in the first place. Your follow up made no sense and you
> have not responded to the queries about it.

There were queries? My key takeaways were a maintainer NAK followed by
instructions to make the Intel drive align with the driver by
implementing NOIOB. While I disagree with the rejection as it appeared
to be based entirely on politics, I can accept it as the quirk wasn't
in the spec.

It's not fair to make this same "your drive should align with the
driver" demand of Samsung because we *are* talking about a spec'ed
feature here. Technical critiques of their patches and real performance
degradations observed are fair game and objective; "your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional" is not.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Keith Busch
On Wed, Aug 19, 2020 at 03:54:20PM -0600, David Fugate wrote:
> On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote:
> > It's not required, the driver will function quite fine without it. If you
> > want to use ZNS it's required.
> 
> The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral*
> Linux driver should not either. 

The spec was developed over the course of years with your employer's
involvement, and the software enabling efforts occurred in parallel. The
"optional" part was made that way at the final hour, so please align
your expectations accordingly.
 
> Agreed, but this standard needs to be applied equally to everyone.
> E.g., harmless contributions such as 
> https://lore.kernel.org/linux-nvme/20200611054156.gb3...@lst.de/ get
> rejected yet clear spec violations from maintainers are accepted? This
> type of behavior encourages forking, vendor-specific drivers, etc.
> which is somewhere I hope none of us want to go.

You're the one who left that thread dangling. You offered to have your
firmware accommodate the Intel sponsored feature that makes your patch
unnecessary in the first place. Your follow up made no sense and you
have not responded to the queries about it.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread David Fugate
On Wed, 2020-08-19 at 13:25 -0600, Jens Axboe wrote:
> It's not required, the driver will function quite fine without it. If you
> want to use ZNS it's required.

The NVMe spec does not require Zone Append for ZNS; a *vendor-neutral*
Linux driver should not either. 

> The Linux driver thankfully doesn't need
> any vendor to sign off on what it can or cannot do, or what features
> are acceptable.

The problem is the driver needs one *particular* vendor to sign off.
Existing driver behavior aligns with WDC drives instead of the spec,
giving WDC an unfair advantage. Couple this with NVMe maintainer(s?)
working for WDC, and there's a conflict of interest.

> It's *always* ok to reject contributions, if those contributions cause
> maintainability issues, unacceptable slowdowns, or whatever other issue
> that the maintainers of said driver don't want to deal with. Any
> contribution should be judged on merit, not based on political decisions
> or opinions.

Agreed, but this standard needs to be applied equally to everyone.
E.g., harmless contributions such as 
https://lore.kernel.org/linux-nvme/20200611054156.gb3...@lst.de/ get
rejected yet clear spec violations from maintainers are accepted? This
type of behavior encourages forking, vendor-specific drivers, etc.
which is somewhere I hope none of us want to go.




Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Keith Busch
On Wed, Aug 19, 2020 at 01:11:58PM -0600, David Fugate wrote:
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver. 

This is inaccurate. As another example, the spec optionally allows a
zone size to be a power of 2, but linux requires a power of 2 if you
want to use it with this driver.
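
The power-of-2 requirement referred to here lets the zoned block layer use shifts and masks for per-zone math instead of 64-bit division. A minimal sketch of the check (illustrative; `zone_size_supported` is a hypothetical name, not the kernel's):

```c
#include <stdint.h>
#include <stdbool.h>

/* A zone size is usable only if it is a non-zero power of two. */
static bool zone_size_supported(uint64_t zone_sectors)
{
	/* exactly one bit set <=> power of two */
	return zone_sectors != 0 && (zone_sectors & (zone_sectors - 1)) == 0;
}
```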

> Provided there's no glaring technical issues

There are many. Some may be improved through a serious review process,
but the mess it makes out of the fast path is measurably harmful to
devices that don't subscribe to this. That issue is not so easily
remedied.

Since this patch is a copy of the SCSI implementation, the reasonable
thing is to take this fight to the generic block layer for a common
solution. We're not taking this in the nvme driver.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Jens Axboe
On 8/19/20 12:11 PM, David Fugate wrote:
> On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
>>> If drive does not support zone-append natively, enable emulation using
>>> regular write.
>>> Make emulated zone-append cmd write-lock the zone, preventing
>>> concurrent append/write on the same zone.
>>
>> I really don't think we should add this.  ZNS and the Linux support
>> were all designed with Zone Append in mind, and then your company did
>> the nastiest possible move violating the normal NVMe procedures to make
>> it optional.  But that doesn't change the fact that Linux should keep
>> requiring it, especially with the amount of code added here and how it
>> hooks in the fast path.
> 
> Intel does not support making *optional* NVMe spec features *required*
> by the NVMe driver.

It's not required, the driver will function quite fine without it. If you
want to use ZNS it's required. The Linux driver thankfully doesn't need
any vendor to sign off on what it can or cannot do, or what features
are acceptable.

> It's forgivable WDC's accepted contribution didn't work with other
> vendors' devices choosing not to implement the optional Zone Append,
> but it's not OK to reject contributions remedying this.  Provided
> there's no glaring technical issues, Samsung's contribution should be
> accepted to maintain both spec compliance as well as vendor neutrality.

It's *always* ok to reject contributions, if those contributions cause
maintainability issues, unacceptable slowdowns, or whatever other issue
that the maintainers of said driver don't want to deal with. Any
contribution should be judged on merit, not based on political decisions
or opinions. Obviously this thread reeks of it.

-- 
Jens Axboe



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread David Fugate
On Tue, 2020-08-18 at 07:12 +, Christoph Hellwig wrote:
> On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
> > If drive does not support zone-append natively, enable emulation using
> > regular write.
> > Make emulated zone-append cmd write-lock the zone, preventing
> > concurrent append/write on the same zone.
> 
> I really don't think we should add this.  ZNS and the Linux support
> were all designed with Zone Append in mind, and then your company did
> the nastiest possible move violating the normal NVMe procedures to make
> it optional.  But that doesn't change the fact that Linux should keep
> requiring it, especially with the amount of code added here and how it
> hooks in the fast path.

Intel does not support making *optional* NVMe spec features *required*
by the NVMe driver. 

It's forgivable WDC's accepted contribution didn't work with other
vendors' devices choosing not to implement the optional Zone Append,
but it's not OK to reject contributions remedying this.  Provided
there's no glaring technical issues, Samsung's contribution should be
accepted to maintain both spec compliance as well as vendor neutrality.




Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Christoph Hellwig
On Wed, Aug 19, 2020 at 10:33:53AM +0200, Javier Gonzalez wrote:
> I would ask you to reconsider this position. I have a hard time
> understanding how zone append emulation is a good idea in SCSI and not
> in NVMe, when there is no performance penalty.

Per the btrfs and zonefs numbers, zone append emulation is
faster than using serialized writes, but also way slower than native
zone append.  Zone append emulation for SCSI is the way to bring
support for pre-existing standards into the brave new world.  Not
requiring Zone Append in a standard designed around it, because of
bowing down to last minute political interventions in the standards
committee, is shooting ourselves in the foot.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Christoph Hellwig
On Wed, Aug 19, 2020 at 09:14:13AM +, Damien Le Moal wrote:
> While defining a zone append command for SCSI/ZBC is possible (using sense data
> for returning the written offset), there is no way to define zone append for
> SATA/ZAC without entirely breaking the ATA command model. This is why we went
> after an emulation implementation instead of trying to standardize native
> commands. That implementation does not have any performance impact over regular
> writes *and* zone write locking does not in general degrade HDD write
> performance (only a few corner cases suffer from it). Comparing things equally,
> the same could be said of NVMe drives that do not have zone append native
> support: performance will be essentially the same using regular writes and
> emulated zone append. But mq-deadline and zone write locking will significantly
> lower performance for emulated zone append compared to native zone append
> support by the drive.

And to summarize the most important point - Zone Append doesn't exist
in ZAC/ZBC.  For people that spent the last years trying to make zoned
storage work, the lack of such a primitive has been the major pain point.
That's why I came up with the Zone Append design in response to a
request for such an operation from another company that is now heavily
involved in both Linux development and hosting Linux VMs.  For ZAC and
ZBC the best we can do is to emulate the approach in the driver, but
for NVMe we can do it natively.  ZNS until just before the release had Zone
Append mandatory, and it did so for a very good reason.  While making
it optional allows OEMs to request drives without it, I fundamentally
think we should not support that in Linux and request that vendors
implement writes to zones the right way.

And just as some OEMs can request certain TPs or optional features to
be implemented, so can Linux.  Just to give an example from the zone
world - Linux requires uniform and power of two zone sizes, which in
ZAC and ZBC are not required.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Damien Le Moal
On 2020/08/19 17:34, Javier Gonzalez wrote:
> On 19.08.2020 09:40, Christoph Hellwig wrote:
>> On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote:
>>> I understand that you want vendor alignment in the NVMe driver and I
>>> agree. We are not pushing for a non-append model - you can see that we
>>> are investing effort in implementing the append path in the block layer
>>> and io_uring and we will continue doing so as patches get merged.
>>>
>>> This said, we do have some OEM models that do not implement append and I
>>> would like them to be supported in Linux. As you know, new TPs are being
>>> standardized now and the append emulation is the basis for adding
>>> support for this. I do not believe it is unreasonable to find a way to
>>> add support for these SSDs.
>>
>> I do not think we should support anything but Zone Append, especially not
>> the new TP, which is going to add even more horrible code for absolutely
>> no good reason.
> 
> I must admit that this is a bit frustrating. The new TP adds
> functionality beyond operating as an Append alternative that I would
> very much like to see upstream (I do not want to discuss details here).
> 
> I understand the concerns about deviating from the Append model, but I
> believe we should find a way to add these new features. We are hiding
> all the logic in the NVMe driver and not touching the interface with the
> block layer, so the overall model is really not changed.
> 
>>
>>> If you completely close the door this approach, the alternative is
>>> carrying off-tree patches to the several OEMs that use these devices.
>>> This is not good for the zoned ecosystem nor for the future of Zone
>>> Append.
>>
>> I really don't have a problem with that.  If these OEMs want to use
>> an inferior access model only, they have to pay the price for it.
>> I also don't think that proxy arguments are very useful.  If your OEMs
>> are troubled by carrying patches because they decided to buy inferior
>> drives they are perfectly welcome to argue their cause here on the list.
> 
> I am not arguing as a proxy, I am stating the trouble we see from our
> perspective in having to diverge from mainline when our approach is
> being upstream first.
> 
> Whether the I/O mode is inferior or superior, they can answer that
> themselves if they read this list.
>>
>>> Are you open to us doing some characterization and if the impact
>>> to the fast path is not significant, moving ahead to a Zone Append
>>> emulation like in SCSI? I will promise that we will remove this path if
>>> requests for these devices terminate.
>>
>> As said I do not think implementing zone append emulation or the TP that
>> shall not be named are a good idea for Linux.
> 
> I would ask you to reconsider this position. I have a hard time
> understanding how zone append emulation is a good idea in SCSI and not
> in NVMe, when there is no performance penalty.

While defining a zone append command for SCSI/ZBC is possible (using sense data
for returning the written offset), there is no way to define zone append for
SATA/ZAC without entirely breaking the ATA command model. This is why we went
after an emulation implementation instead of trying to standardize native
commands. That implementation does not have any performance impact over regular
writes *and* zone write locking does not in general degrade HDD write
performance (only a few corner cases suffer from it). Comparing things equally,
the same could be said of NVMe drives that do not have zone append native
support: performance will be essentially the same using regular writes and
emulated zone append. But mq-deadline and zone write locking will significantly
lower performance for emulated zone append compared to a native zone append
support by the drive.


-- 
Damien Le Moal
Western Digital Research


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Javier Gonzalez

On 19.08.2020 09:40, Christoph Hellwig wrote:

On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote:

I understand that you want vendor alignment in the NVMe driver and I
agree. We are not pushing for a non-append model - you can see that we
are investing effort in implementing the append path in the block layer
and io_uring and we will continue doing so as patches get merged.

This said, we do have some OEM models that do not implement append and I
would like them to be supported in Linux. As you know, new TPs are being
standardized now and the append emulation is the basis for adding
support for this. I do not believe it is unreasonable to find a way to
add support for these SSDs.


I do not think we should support anything but Zone Append, especially not
the new TP, which is going to add even more horrible code for absolutely
no good reason.


I must admit that this is a bit frustrating. The new TP adds
functionality beyond operating as an Append alternative that I would
very much like to see upstream (I do not want to discuss details here).

I understand the concerns about deviating from the Append model, but I
believe we should find a way to add these new features. We are hiding
all the logic in the NVMe driver and not touching the interface with the
block layer, so the overall model is really not changed.




If you completely close the door this approach, the alternative is
carrying off-tree patches to the several OEMs that use these devices.
This is not good for the zoned ecosystem nor for the future of Zone
Append.


I really don't have a problem with that.  If these OEMs want to use
an inferior access model only, they have to pay the price for it.
I also don't think that proxy arguments are very useful.  If your OEMs
are troubled by carrying patches because they decided to buy inferior
drives they are perfectly welcome to argue their cause here on the list.


I am not arguing as a proxy, I am stating the trouble we see from our
perspective in having to diverge from mainline when our approach is
being upstream first.

Whether the I/O mode is inferior or superior, they can answer that
themselves if they read this list.



Are you open to us doing some characterization and if the impact
to the fast path is not significant, moving ahead to a Zone Append
emulation like in SCSI? I will promise that we will remove this path if
requests for these devices terminate.


As said I do not think implementing zone append emulation or the TP that
shall not be named are a good idea for Linux.


I would ask you to reconsider this position. I have a hard time
understanding how zone append emulation is a good idea in SCSI and not
in NVMe, when there is no performance penalty.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-19 Thread Christoph Hellwig
On Tue, Aug 18, 2020 at 08:04:28PM +0200, Javier Gonzalez wrote:
> I understand that you want vendor alignment in the NVMe driver and I
> agree. We are not pushing for a non-append model - you can see that we
> are investing effort in implementing the append path in the block layer
> and io_uring and we will continue doing so as patches get merged.
>
> This said, we do have some OEM models that do not implement append and I
> would like them to be supported in Linux. As you know, new TPs are being
> standardized now and the append emulation is the basis for adding
> support for this. I do not believe it is unreasonable to find a way to
> add support for these SSDs.

I do not think we should support anything but Zone Append, especially not
the new TP, which is going to add even more horrible code for absolutely
no good reason.

> If you completely close the door this approach, the alternative is
> carrying off-tree patches to the several OEMs that use these devices.
> This is not good for the zoned ecosystem nor for the future of Zone
> Append.

I really don't have a problem with that.  If these OEMs want to use
an inferior access model only, they have to pay the price for it.
I also don't think that proxy arguments are very useful.  If your OEMs
are troubled by carrying patches because they decided to buy inferior
drives they are perfectly welcome to argue their cause here on the list.

> Are you open to us doing some characterization and if the impact
> to the fast path is not significant, moving ahead to a Zone Append
> emulation like in SCSI? I will promise that we will remove this path if
> requests for these devices terminate.

As said I do not think implementing zone append emulation or the TP that
shall not be named are a good idea for Linux.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Javier Gonzalez

On 18.08.2020 10:39, Keith Busch wrote:

On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote:

On 18.08.2020 09:58, Keith Busch wrote:
> On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> > a number of customers are requiring the use of normal writes, which we
> > want to support.
>
> A device that supports append is completely usable for those customers,
> too. There's no need to create divergence in this driver.

Not really. You know as well as I do that some features are disabled for
a particular SSD model on customer requirements. Generic models
implementing append can submit both I/Os, but those that remove append
are left out.


You are only supporting my point: if your device supports append, you
get to work in every ZNS use case, otherwise you only get to work in a
subset.


There are several devices. Some of them enable append and some do not. I
would like to support the latter too.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Javier Gonzalez

On 18.08.2020 12:51, Matias Bjørling wrote:

On 18/08/2020 11.50, Javier Gonzalez wrote:

On 18.08.2020 09:12, Christoph Hellwig wrote:

On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:

If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.


I really don't think we should add this.  ZNS and the Linux support
were all designed with Zone Append in mind, and then your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional.  But that doesn't change the fact the Linux should keep
requiring it, especially with the amount of code added here and how it
hooks in the fast path.


I understand that the NVMe process was agitated and that the current ZNS
implementation in Linux relies on append support from the device
perspective. However, the current TP does allow for not implementing
append, and a number of customers are requiring the use of normal
writes, which we want to support.


There are a lot of things that are specified in NVMe but not
implemented in the Linux kernel. That your company is not able to
efficiently implement the Zone Append command (this is the only reason
I can think of that makes you and your company cause such a fuss),


This comment is out of place and I will choose to ignore it.


shouldn't mean that everyone else has to suffer.


This is not a quirk nor a software work-around; this is a design
decision affecting the storage stack of several OEMs. As such, I would
like to find a way to implement this functionality.

Last time we discussed this in the mailing list, you among others
pointed to append emulation as the best way to enable this path and here
we are. Can you explained what changed?



In any case, SPDK offers adequate support and can be used today.


We take the SPDK discussion in the appropriate mailing lists and slack
channels.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Javier Gonzalez

On 18.08.2020 17:50, Christoph Hellwig wrote:

On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:

I understand that the NVMe process was agitated and that the current ZNS
implementation in Linux relies on append support from the device
perspective. However, the current TP does allow for not implementing
append, and a number of customers are requiring the use of normal
writes, which we want to support.


The NVMe TPs allow for lots of things, but that doesn't mean we have
to support it.


Agree. As I replied to Keith, I am just interested in enabling the ZNS
models that come with append disabled.




Do you have any early suggestion on how this patch should look
to be upstreamable?


My position is that at this point in time we should not consider it.
Zone Append is the major feature in ZNS that solves the issue in ZAC/ZBC.
I want to see broad industry support for it instead of having to add more
code just for zone append emulation than actual current ZNS support.  If
in a few years the marketplace has decided and has lots of drives
available in the consumer market or OEM channels we'll have to reconsider
and potentially merge Zone Append emulation.  But my deep hope is that
this does not happen, as it sets us back 10 years in the standards of
zoned storage support again.


I understand that you want vendor alignment in the NVMe driver and I
agree. We are not pushing for a non-append model - you can see that we
are investing effort in implementing the append path in the block layer
and io_uring and we will continue doing so as patches get merged.

This said, we do have some OEM models that do not implement append and I
would like them to be supported in Linux. As you know, new TPs are being
standardized now and the append emulation is the basis for adding
support for this. I do not believe it is unreasonable to find a way to
add support for these SSDs.

If you completely close the door this approach, the alternative is
carrying off-tree patches to the several OEMs that use these devices.
This is not good for the zoned ecosystem nor for the future of Zone
Append.

Are you open to us doing some characterization and if the impact
to the fast path is not significant, moving ahead to a Zone Append
emulation like in SCSI? I will promise that we will remove this path if
requests for these devices terminate.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Keith Busch
On Tue, Aug 18, 2020 at 07:29:12PM +0200, Javier Gonzalez wrote:
> On 18.08.2020 09:58, Keith Busch wrote:
> > On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> > > a number of customers are requiring the use of normal writes, which we
> > > want to support.
> > 
> > A device that supports append is completely usable for those customers,
> > too. There's no need to create divergence in this driver.
> 
> Not really. You know as well as I do that some features are disabled for
> a particular SSD model on customer requirements. Generic models
> implementing append can submit both I/Os, but those that remove append
> are left out.

You are only supporting my point: if your device supports append, you
get to work in every ZNS use case, otherwise you only get to work in a
subset.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Javier Gonzalez

On 18.08.2020 09:58, Keith Busch wrote:

On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:

a number of customers are requiring the use of normal writes, which we
want to support.


A device that supports append is completely usable for those customers,
too. There's no need to create divergence in this driver.


Not really. You know as well as I do that some features are disabled for
a particular SSD model on customer requirements. Generic models
implementing append can submit both I/Os, but those that remove append
are left out.

I would like to understand how we can enable these NVMe-compatible
models in Linux. If it is a performance concern, we will address it.



Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Keith Busch
On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> a number of customers are requiring the use of normal writes, which we
> want to support.

A device that supports append is completely usable for those customers,
too. There's no need to create divergence in this driver.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Christoph Hellwig
On Tue, Aug 18, 2020 at 11:50:33AM +0200, Javier Gonzalez wrote:
> I understand that the NVMe process was agitated and that the current ZNS
> implementation in Linux relies on append support from the device
> perspective. However, the current TP does allow for not implementing
> append, and a number of customers are requiring the use of normal
> writes, which we want to support.

The NVMe TPs allow for lots of things, but that doesn't mean we have
to support it.

> Do you have any early suggestion on how this patch should look
> to be upstreamable?

My position is that at this point in time we should not consider it.
Zone Append is the major feature in ZNS that solves the issue in ZAC/ZBC.
I want to see broad industry support for it instead of having to add more
code just for zone append emulation than actual current ZNS support.  If
in a few years the marketplace has decided and has lots of drives
available in the consumer market or OEM channels we'll have to reconsider
and potentially merge Zone Append emulation.  But my deep hope is that
this does not happen, as it sets us back 10 years in the standards of
zoned storage support again.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Matias Bjørling

On 18/08/2020 11.50, Javier Gonzalez wrote:

On 18.08.2020 09:12, Christoph Hellwig wrote:

On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:

If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.


I really don't think we should add this.  ZNS and the Linux support
were all designed with Zone Append in mind, and then your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional.  But that doesn't change the fact the Linux should keep
requiring it, especially with the amount of code added here and how it
hooks in the fast path.


I understand that the NVMe process was agitated and that the current ZNS
implementation in Linux relies on append support from the device
perspective. However, the current TP does allow for not implementing
append, and a number of customers are requiring the use of normal
writes, which we want to support.


There are a lot of things that are specified in NVMe but not implemented
in the Linux kernel. That your company is not able to efficiently
implement the Zone Append command (this is the only reason I can think
of that makes you and your company cause such a fuss), shouldn't mean
that everyone else has to suffer.


In any case, SPDK offers adequate support and can be used today.


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Javier Gonzalez

On 18.08.2020 09:12, Christoph Hellwig wrote:

On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:

If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.


I really don't think we should add this.  ZNS and the Linux support
were all designed with Zone Append in mind, and then your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional.  But that doesn't change the fact the Linux should keep
requiring it, especially with the amount of code added here and how it
hooks in the fast path.


I understand that the NVMe process was agitated and that the current ZNS
implementation in Linux relies on append support from the device
perspective. However, the current TP does allow for not implementing
append, and a number of customers are requiring the use of normal
writes, which we want to support.

During the initial patch review we discussed this and we agreed that the
block layer is designed for append on zoned devices, and that for the
foreseeable future this was not going to change. We therefore took the
feedback and followed a similar approach as in the SCSI driver for
implementing append emulation.

We are happy to do more characterization on the impact of these hooks in
the non-zoned fast path and eventually changing the approach if this
proves to be a problem. Our thought is to isolate any potential
performance degradation to the zoned path using the emulation (we do not
see any ATM).

Do you have any early suggestion on how this patch should look
to be upstreamable?

Javier


Re: [PATCH 2/2] nvme: add emulation for zone-append

2020-08-18 Thread Christoph Hellwig
On Tue, Aug 18, 2020 at 10:59:36AM +0530, Kanchan Joshi wrote:
> If drive does not support zone-append natively, enable emulation using
> regular write.
> Make emulated zone-append cmd write-lock the zone, preventing
> concurrent append/write on the same zone.

I really don't think we should add this.  ZNS and the Linux support
were all designed with Zone Append in mind, and then your company did
the nastiest possible move violating the normal NVMe procedures to make
it optional.  But that doesn't change the fact the Linux should keep
requiring it, especially with the amount of code added here and how it
hooks in the fast path.


[PATCH 2/2] nvme: add emulation for zone-append

2020-08-17 Thread Kanchan Joshi
If drive does not support zone-append natively, enable emulation using
regular write.
Make emulated zone-append cmd write-lock the zone, preventing
concurrent append/write on the same zone.

To determine the start-lba for such writes, an array of 32 bit
zone-relative write-pointer (WP) positions is attached with namespace.
This cached WP-position is updated on successful completion as follows:
- APPEND/WRITE/WRITE_ZEROS/WRITE_SAME update it by number of sectors
(512b) copied
- ZONE_RESET updates it to 0 for target zone. ZONE_RESET_ALL does the
same for all zones.
- ZONE_FINISH sets it to zone-size.

On failed-completion for above requests, cached WP-position of target zone
is marked invalid. On subsequent zone-append to that zone, WP position is
refreshed by querying it from device (i.e. zone-report).

If emulated-append cannot immediately proceed due to zone write-lock
or invalid WP position, block-layer is asked to retry it.

Signed-off-by: Kanchan Joshi 
Signed-off-by: Nitesh Shetty 
Signed-off-by: SelvaKumar S 
Signed-off-by: Javier Gonzalez 
---
 drivers/nvme/host/core.c |  41 +-
 drivers/nvme/host/nvme.h |  60 
 drivers/nvme/host/zns.c  | 306 ++-
 3 files changed, 398 insertions(+), 9 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 88cff309d8e4..78faddf444c3 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -287,10 +287,17 @@ void nvme_complete_rq(struct request *req)
nvme_retry_req(req);
return;
}
-   } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED) &&
-  req_op(req) == REQ_OP_ZONE_APPEND) {
-   req->__sector = nvme_lba_to_sect(req->q->queuedata,
-   le64_to_cpu(nvme_req(req)->result.u64));
+   } else if (IS_ENABLED(CONFIG_BLK_DEV_ZONED)) {
+   bool need_wp_offset_update = false;
+   struct nvme_ns *ns = req->q->queuedata;
+   /* append-emulation requires wp update for some cmds*/
+   if (ns && nvme_is_append_emulated(ns))
+   need_wp_offset_update = nvme_need_zone_wp_update(req);
+   if (need_wp_offset_update)
+   nvme_zone_wp_update(ns, req, status);
+   else if (req_op(req) == REQ_OP_ZONE_APPEND)
+   req->__sector = nvme_lba_to_sect(ns,
+   le64_to_cpu(nvme_req(req)->result.u64));
}
 
nvme_trace_bio_complete(req, status);
@@ -456,6 +463,8 @@ static void nvme_free_ns(struct kref *kref)
 {
struct nvme_ns *ns = container_of(kref, struct nvme_ns, kref);
 
+   if (nvme_is_append_emulated(ns))
+   nvme_teardown_append_emulate(ns);
if (ns->ndev)
nvme_nvm_unregister(ns);
 
@@ -809,7 +818,15 @@ blk_status_t nvme_setup_cmd(struct nvme_ns *ns, struct request *req,
ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
break;
case REQ_OP_ZONE_APPEND:
-   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+   if (!nvme_is_append_emulated(ns))
+   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_zone_append);
+   else {
+   /* prepare append like write, and adjust lba afterwards */
+   ret = nvme_setup_rw(ns, req, cmd, nvme_cmd_write);
+   if (ret)
+   break;
+   ret = nvme_append_to_write(ns, req, cmd);
+   }
break;
default:
WARN_ON_ONCE(1);
@@ -2150,7 +2167,7 @@ static int nvme_revalidate_disk(struct gendisk *disk)
struct nvme_ns *ns = disk->private_data;
struct nvme_ctrl *ctrl = ns->ctrl;
 
-   ret = blk_revalidate_disk_zones(disk, NULL);
+   ret = nvme_revalidate_disk_zones(disk);
if (!ret)
blk_queue_max_zone_append_sectors(disk->queue,
  ctrl->max_zone_append);
@@ -3900,6 +3917,18 @@ static void nvme_alloc_ns(struct nvme_ctrl *ctrl, unsigned nsid)
if (__nvme_revalidate_disk(disk, id))
goto out_put_disk;
 
+   /* setup append-emulation if required */
+   if (nvme_is_append_emulated(ns)) {
+   ret = nvme_setup_append_emulate(ns);
+   if (ret) {
+   dev_warn(ns->ctrl->device,
+   "append-emulation failed, zoned namespace:%d\n",
+   ns->head->ns_id);
+   nvme_clear_append_emulated(ns);
+   goto out_put_disk;
+   }
+   }
+
if ((ctrl->quirks & NVME_QUIRK_LIGHTNVM) && id->vs[0] == 0x1) {
ret = nvme_nvm_register(ns, disk_name, node);
if (ret)