[PATCH v3] hw/ide/ahci: fix legacy software reset

2023-11-08 Thread Niklas Cassel
From: Niklas Cassel 

Legacy software contains a standard mechanism for generating a reset to a
Serial ATA device - setting the SRST (software reset) bit in the Device
Control register.

Serial ATA has a more robust mechanism called COMRESET, also referred to
as port reset. A port reset is the preferred mechanism for error
recovery and should be used in place of software reset.

Commit e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
improved the handling of PxCI, such that PxCI gets cleared after handling
a non-NCQ, or NCQ command (instead of incorrectly clearing PxCI after
receiving anything - even a FIS that failed to parse, which should NOT
clear PxCI, so that you can see which command slot that caused an error).

However, simply clearing PxCI after a non-NCQ, or NCQ command, is not
enough, we also need to clear PxCI when receiving a SRST in the Device
Control register.

A legacy software reset is performed by the host sending two H2D FISes,
the first H2D FIS asserts SRST, and the second H2D FIS deasserts SRST.

The first H2D FIS will not get a D2H reply, and requires the FIS to have
the C bit set to one, such that the HBA itself will clear the bit in PxCI.

The second H2D FIS will get a D2H reply once the diagnostic is completed.
The clearing of the bit in PxCI for this command should ideally be done
in ahci_init_d2h() (if it was a legacy software reset that caused the
reset (a COMRESET does not use a command slot)). However, since the reset
value for PxCI is 0, modify ahci_reset_port() to actually clear PxCI to 0,
that way we can avoid complex logic in ahci_init_d2h().

This fixes an issue for FreeBSD where the device would fail to reset.
The problem was not noticed in Linux, because Linux uses a COMRESET
instead of a legacy software reset by default.

Fixes: e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
Reported-by: Marcin Juszkiewicz 
Signed-off-by: Niklas Cassel 
---
Changes since v3: remove superfluous ahci_write_fis_d2h(), and add
comments describing how the (quite confusing) spec actually works.

 hw/ide/ahci.c | 27 ++-
 1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index fcc5476e9e..56c7672eb9 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -623,9 +623,13 @@ static void ahci_init_d2h(AHCIDevice *ad)
 return;
 }
 
+/*
+ * For simplicity, do not call ahci_clear_cmd_issue() for this
+ * ahci_write_fis_d2h(). (The reset value for PxCI is 0.)
+ */
 if (ahci_write_fis_d2h(ad, true)) {
 ad->init_d2h_sent = true;
-/* We're emulating receiving the first Reg H2D Fis from the device;
+/* We're emulating receiving the first Reg D2H FIS from the device;
  * Update the SIG register, but otherwise proceed as normal. */
 pr->sig = ((uint32_t)ide_state->hcyl << 24) |
 (ide_state->lcyl << 16) |
@@ -663,6 +667,7 @@ static void ahci_reset_port(AHCIState *s, int port)
 pr->scr_act = 0;
 pr->tfdata = 0x7F;
 pr->sig = 0x;
+pr->cmd_issue = 0;
 d->busy_slot = -1;
 d->init_d2h_sent = false;
 
@@ -1243,10 +1248,30 @@ static void handle_reg_h2d_fis(AHCIState *s, int port,
 case STATE_RUN:
 if (cmd_fis[15] & ATA_SRST) {
 s->dev[port].port_state = STATE_RESET;
+/*
+ * When setting SRST in the first H2D FIS in the reset 
sequence,
+ * the device does not send a D2H FIS. Host software thus has 
to
+ * set the "Clear Busy upon R_OK" bit such that PxCI (and BUSY)
+ * gets cleared. See AHCI 1.3.1, section 10.4.1 Software Reset.
+ */
+if (opts & AHCI_CMD_CLR_BUSY) {
+ahci_clear_cmd_issue(ad, slot);
+}
 }
 break;
 case STATE_RESET:
 if (!(cmd_fis[15] & ATA_SRST)) {
+/*
+ * When clearing SRST in the second H2D FIS in the reset
+ * sequence, the device will execute diagnostics. When this is
+ * done, the device will send a D2H FIS with the good status.
+ * See SATA 3.5a Gold, section 11.4 Software reset protocol.
+ *
+ * This D2H FIS is the first D2H FIS received from the device,
+ * and is received regardless if the reset was performed by a
+ * COMRESET or by setting and clearing the SRST bit. Therefore,
+ * the logic for this is found in ahci_init_d2h() and not here.
+ */
 ahci_reset_port(s, port);
 }
 break;
-- 
2.41.0




Re: [PATCH v2] hw/ide/ahci: fix legacy software reset

2023-11-07 Thread Niklas Cassel
On Tue, Nov 07, 2023 at 07:14:07PM +0100, Kevin Wolf wrote:
> Am 05.10.2023 um 12:04 hat Niklas Cassel geschrieben:
> > From: Niklas Cassel 
> > 
> > Legacy software contains a standard mechanism for generating a reset to a
> > Serial ATA device - setting the SRST (software reset) bit in the Device
> > Control register.
> > 
> > Serial ATA has a more robust mechanism called COMRESET, also referred to
> > as port reset. A port reset is the preferred mechanism for error
> > recovery and should be used in place of software reset.
> > 
> > Commit e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
> > improved the handling of PxCI, such that PxCI gets cleared after handling
> > a non-NCQ, or NCQ command (instead of incorrectly clearing PxCI after
> > receiving an arbitrary FIS).
> > 
> > However, simply clearing PxCI after a non-NCQ, or NCQ command, is not
> > enough, we also need to clear PxCI when receiving a SRST in the Device
> > Control register.
> > 
> > This fixes an issue for FreeBSD where the device would fail to reset.
> > The problem was not noticed in Linux, because Linux uses a COMRESET
> > instead of a legacy software reset by default.
> > 
> > Fixes: e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
> > Reported-by: Marcin Juszkiewicz 
> > Signed-off-by: Niklas Cassel 
> > ---
> > Changes since v1: write the D2H FIS before clearing PxCI.
> > 
> >  hw/ide/ahci.c | 16 
> >  1 file changed, 16 insertions(+)
> > 
> > diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
> > index babdd7b458..7269dabbdb 100644
> > --- a/hw/ide/ahci.c
> > +++ b/hw/ide/ahci.c
> > @@ -1254,10 +1254,26 @@ static void handle_reg_h2d_fis(AHCIState *s, int 
> > port,
> >  case STATE_RUN:
> >  if (cmd_fis[15] & ATA_SRST) {
> >  s->dev[port].port_state = STATE_RESET;
> > +/*
> > + * When setting SRST in the first H2D FIS in the reset 
> > sequence,
> > + * the device does not send a D2H FIS. Host software thus 
> > has to
> > + * set the "Clear Busy upon R_OK" bit such that PxCI (and 
> > BUSY)
> > + * gets cleared. See AHCI 1.3.1, section 10.4.1 Software 
> > Reset.
> > + */
> > +if (opts & AHCI_CMD_CLR_BUSY) {
> > +ahci_clear_cmd_issue(ad, slot);
> > +}
> 
> I suspect that AHCI_CMD_CLR_BUSY really should be checked in a more
> generic place, but this will do for fixing software reset.

This weirdo option "Clear Busy upon R_OK" is for the HBA itself to clear
the bit for the command in PxCI, when it gets the R_OK that the Host to Device
(H2D) FIS was transmitted successfully to the device.
The normal way is that the device sends back a Device to Host (D2H) FIS
which then causes the bit in PxCI to be cleared in the HBA.

Yes, theoretically you could probably build a FIS that has the
"Clear Busy upon R_OK" even for e.g. a regular non-NCQ I/O command
(where PxCI/BUSY is first supposed to be cleared once the command is done...),
however this would surely cause problems as PxCI would then no longer shadow
the state of the device.

If you search for "Clear Busy upon R_OK", the only usage seems
to be for legacy software reset.

I'm quite sure that the option was specifically created for legacy
software reset, since to do a legacy software reset, you need to
create and issue two FISes.
The first FIS has SRST set (assert reset), the second FIS has SRST
cleared (deassert reset).

For regular I/O non-NCQ commands, PxCI will be cleared when the
I/O has completed (by the device sending back a D2H FIS).
For NCQ commands, PxCI will be cleared when the drive has
successfully queued the command (by the device sending back a D2H FIS).

The spec states that there should NOT be a D2H FIS sent back to
the HBA when getting the first H2D FIS (which asserts reset).
(The device will still send back a R_OK that the FIS was transmitted
successfully.)

For the second H2D FIS (that deasserts reset), the spec says that the
device SHOULD send a D2H FIS back (after the device diagnostic is done).


> 
> >  }
> >  break;
> >  case STATE_RESET:
> >  if (!(cmd_fis[15] & ATA_SRST)) {
> > +/*
> > + * When clearing SRST in the second H2D FIS in the reset
> > + * sequence, the device will send a D2H FIS. See SATA 3.5a 
> > Gold,
> > + * section 11.4 Software reset protocol.
> &g

IDE pending patches

2023-11-02 Thread Niklas Cassel
Hello Philippe, Kevin,

The QEMU 8.2 freeze is next week,
and the IDE maintainer (John) hasn't been replying to emails lately.

Kevin, considering that you picked up Fiona's series:
https://lore.kernel.org/qemu-devel/d6286ef8-6cf0-4e72-90e9-e91cef9da...@proxmox.com/
which was sent 2023-09-06, via your tree, do you think that you could
queue up some additional pending IDE patches?

If you don't want to take them, perhaps Kevin can take them?



I have these two patches:
https://lists.nongnu.org/archive/html/qemu-block/2023-10/msg00172.html
which was sent 2023-10-05
and
https://lists.nongnu.org/archive/html/qemu-block/2023-10/msg00382.html
which was sent 2023-10-11

Looking at the list, Mark's series:
https://lists.nongnu.org/archive/html/qemu-block/2023-10/msg01289.html
v2 was sent quite recently, 2023-10-24, but seems to have sufficient
tags to be ready to go in this cycle as well.


Kind regards,
Niklas


Re: [PATCH] hw/ide/ahci: trigger either error IRQ or regular IRQ, not both

2023-10-26 Thread Niklas Cassel
On Wed, Oct 11, 2023 at 03:12:20PM +0200, Niklas Cassel wrote:
> From: Niklas Cassel 
> 
> According to AHCI 1.3.1, 5.3.8.1 RegFIS:Entry, if ERR_STAT is set,
> we jump to state ERR:FatalTaskfile, which will raise a TFES IRQ
> unconditionally, regardless if the I bit is set in the FIS or not.
> 
> Thus, we should never raise a normal IRQ after having sent an error
> IRQ.
> 
> NOTE: for QEMU platforms that use SeaBIOS, this patch depends on QEMU
> commit 784155cdcb02 ("seabios: update submodule to git snapshot"), and
> QEMU commit 14f5a7bae4cb ("seabios: update binaries to git snapshot").
> 
> Signed-off-by: Niklas Cassel 
> ---
>  hw/ide/ahci.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
> index fcc5476e9e..7676e2d871 100644
> --- a/hw/ide/ahci.c
> +++ b/hw/ide/ahci.c
> @@ -897,11 +897,10 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad, bool 
> d2h_fis_i)
>  pr->tfdata = (ad->port.ifs[0].error << 8) |
>  ad->port.ifs[0].status;
>  
> +/* TFES IRQ is always raised if ERR_STAT is set, regardless of I bit. */
>  if (d2h_fis[2] & ERR_STAT) {
>  ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_TFES);
> -}
> -
> -if (d2h_fis_i) {
> +} else if (d2h_fis_i) {
>  ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
>  }
>  
> -- 
> 2.41.0
> 

Gentle ping :)


Kind regards,
Niklas


Re: [PATCH] hw/ide/ahci: trigger either error IRQ or regular IRQ, not both

2023-10-11 Thread Niklas Cassel
On Wed, Oct 11, 2023 at 05:44:28PM +0300, Michael Tokarev wrote:
> 11.10.2023 16:12, Niklas Cassel wrote:
> > From: Niklas Cassel 
> > 
> > According to AHCI 1.3.1, 5.3.8.1 RegFIS:Entry, if ERR_STAT is set,
> > we jump to state ERR:FatalTaskfile, which will raise a TFES IRQ
> > unconditionally, regardless if the I bit is set in the FIS or not.
> > 
> > Thus, we should never raise a normal IRQ after having sent an error
> > IRQ.
> > 
> > NOTE: for QEMU platforms that use SeaBIOS, this patch depends on QEMU
> > commit 784155cdcb02 ("seabios: update submodule to git snapshot"), and
> > QEMU commit 14f5a7bae4cb ("seabios: update binaries to git snapshot").
> 
> Am I right that without commit 1281e340a "ahci: handle TFES irq correctly"
> in seabios (which is included in this 784155cd update above), seabios
> will hang at boot for 32s when qemu is booted with ahci drives?
> 
> I mean, is it the only implication of not updating seabios after
> this patch?

That is correct.

If you don't have the seabios patch, the seabios ATAPI command will timeout
after 30 seconds, after which QEMU will boot and behave like normal.


Kind regards,
Niklas


[PATCH] hw/ide/ahci: trigger either error IRQ or regular IRQ, not both

2023-10-11 Thread Niklas Cassel
From: Niklas Cassel 

According to AHCI 1.3.1, 5.3.8.1 RegFIS:Entry, if ERR_STAT is set,
we jump to state ERR:FatalTaskfile, which will raise a TFES IRQ
unconditionally, regardless if the I bit is set in the FIS or not.

Thus, we should never raise a normal IRQ after having sent an error
IRQ.

NOTE: for QEMU platforms that use SeaBIOS, this patch depends on QEMU
commit 784155cdcb02 ("seabios: update submodule to git snapshot"), and
QEMU commit 14f5a7bae4cb ("seabios: update binaries to git snapshot").

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index fcc5476e9e..7676e2d871 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -897,11 +897,10 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad, bool 
d2h_fis_i)
 pr->tfdata = (ad->port.ifs[0].error << 8) |
 ad->port.ifs[0].status;
 
+/* TFES IRQ is always raised if ERR_STAT is set, regardless of I bit. */
 if (d2h_fis[2] & ERR_STAT) {
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_TFES);
-}
-
-if (d2h_fis_i) {
+} else if (d2h_fis_i) {
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
 }
 
-- 
2.41.0




[PATCH v2] hw/ide/ahci: fix legacy software reset

2023-10-05 Thread Niklas Cassel
From: Niklas Cassel 

Legacy software contains a standard mechanism for generating a reset to a
Serial ATA device - setting the SRST (software reset) bit in the Device
Control register.

Serial ATA has a more robust mechanism called COMRESET, also referred to
as port reset. A port reset is the preferred mechanism for error
recovery and should be used in place of software reset.

Commit e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
improved the handling of PxCI, such that PxCI gets cleared after handling
a non-NCQ, or NCQ command (instead of incorrectly clearing PxCI after
receiving an arbitrary FIS).

However, simply clearing PxCI after a non-NCQ, or NCQ command, is not
enough, we also need to clear PxCI when receiving a SRST in the Device
Control register.

This fixes an issue for FreeBSD where the device would fail to reset.
The problem was not noticed in Linux, because Linux uses a COMRESET
instead of a legacy software reset by default.

Fixes: e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
Reported-by: Marcin Juszkiewicz 
Signed-off-by: Niklas Cassel 
---
Changes since v1: write the D2H FIS before clearing PxCI.

 hw/ide/ahci.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index babdd7b458..7269dabbdb 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1254,10 +1254,26 @@ static void handle_reg_h2d_fis(AHCIState *s, int port,
 case STATE_RUN:
 if (cmd_fis[15] & ATA_SRST) {
 s->dev[port].port_state = STATE_RESET;
+/*
+ * When setting SRST in the first H2D FIS in the reset 
sequence,
+ * the device does not send a D2H FIS. Host software thus has 
to
+ * set the "Clear Busy upon R_OK" bit such that PxCI (and BUSY)
+ * gets cleared. See AHCI 1.3.1, section 10.4.1 Software Reset.
+ */
+if (opts & AHCI_CMD_CLR_BUSY) {
+ahci_clear_cmd_issue(ad, slot);
+}
 }
 break;
 case STATE_RESET:
 if (!(cmd_fis[15] & ATA_SRST)) {
+/*
+ * When clearing SRST in the second H2D FIS in the reset
+ * sequence, the device will send a D2H FIS. See SATA 3.5a 
Gold,
+ * section 11.4 Software reset protocol.
+ */
+ahci_clear_cmd_issue(ad, slot);
+ahci_write_fis_d2h(ad, false);
 ahci_reset_port(s, port);
 }
 break;
-- 
2.41.0




[PATCH] hw/ide/ahci: fix legacy software reset

2023-10-05 Thread Niklas Cassel
From: Niklas Cassel 

Legacy software contains a standard mechanism for generating a reset to a
Serial ATA device - setting the SRST (software reset) bit in the Device
Control register.

Serial ATA has a more robust mechanism called COMRESET, also referred to
as port reset. A port reset is the preferred mechanism for error
recovery and should be used in place of software reset.

Commit e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
improved the handling of PxCI, such that PxCI gets cleared after handling
a non-NCQ, or NCQ command (instead of incorrectly clearing PxCI after
receiving an arbitrary FIS).

However, simply clearing PxCI after a non-NCQ, or NCQ command, is not
enough, we also need to clear PxCI when receiving a SRST in the Device
Control register.

This fixes an issue for FreeBSD where the device would fail to reset.
The problem was not noticed in Linux, because Linux uses a COMRESET
instead of a legacy software reset by default.

Fixes: e2a5d9b3d9c3 ("hw/ide/ahci: simplify and document PxCI handling")
Reported-by: Marcin Juszkiewicz 
Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 16 
 1 file changed, 16 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index babdd7b458..3a8b97c325 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1254,10 +1254,26 @@ static void handle_reg_h2d_fis(AHCIState *s, int port,
 case STATE_RUN:
 if (cmd_fis[15] & ATA_SRST) {
 s->dev[port].port_state = STATE_RESET;
+/*
+ * When setting SRST in the first H2D FIS in the reset 
sequence,
+ * the device does not send a D2H FIS. Host software thus has 
to
+ * set the "Clear Busy upon R_OK" bit such that PxCI (and BUSY)
+ * gets cleared. See AHCI 1.3.1, section 10.4.1 Software Reset.
+ */
+if (opts & AHCI_CMD_CLR_BUSY) {
+ahci_clear_cmd_issue(ad, slot);
+}
 }
 break;
 case STATE_RESET:
 if (!(cmd_fis[15] & ATA_SRST)) {
+/*
+ * When clearing SRST in the second H2D FIS in the reset
+ * sequence, the device will send a D2H FIS. See SATA 3.5a 
Gold,
+ * section 11.4 Software reset protocol.
+ */
+ahci_write_fis_d2h(ad, false);
+ahci_clear_cmd_issue(ad, slot);
 ahci_reset_port(s, port);
 }
 break;
-- 
2.41.0




Re: FreeBSD 13.2 installer does not see AHCI devices on aarch64/sbsa-ref and x86-64/q35

2023-10-03 Thread Niklas Cassel
On Tue, Oct 03, 2023 at 08:11:39PM +0300, Michael Tokarev wrote:
> 26.09.2023 15:05, Niklas Cassel:
> > Hello Marcin,
> > 
> > I will have a look at this.
> 
> Hi Marcin, Hi Niklas!
> 
> Niklas, I remember asking you if the whole thing is okay for the -stable,
> and you was a bit unsure about it :)  Regardless, I picked the changes
> up for -stable. I don't think it was anyone's fault though, - after all,
> I guess, without the change being in -stable, we'd know about this issue
> in some distant future instead of now :)
> 
> I'm planning to release 8.1.2 soon, with freeze being at Oct-14.  It'd
> be really great if we can include a fix for this both in master and in
> 8.1.2 (8.1.2 should have a long-awaited fix for a quite serious long-
> standing issue in 8.1).
> 
> Were you able to take a look at what's going on here?  I wish I were
> able to help here but I know right to nothing about ahci emulation..

I was away on a conference all last week, so I didn't have much time to
look at this yet. I will debug the problem this week.

>From at quick look at Marcin logs:
ahcich0: Poll timeout on slot 1 port 0
ahcich0: is  cs 0002 ss  rs 0002 tfd 170 serr  
cmd c017

This log seems to come from:
http://fxr.watson.org/fxr/source/dev/ahci/ahci.c?v=FREEBSD-13-STABLE#L1795

Looking at the print "cs 0002" means PxCI: 0x2.
So PxCI is never cleared.

I will need to run FreeBSD so I can see which command it is that never gets
PxCI cleared.

NCQ commands will always clear PxCI in process_ncq_command().
Non-NCQ commands will always clear PxCI in ahci_clear_cmd_issue().

>From a quick glance, the only time we do not clear PxCI is if we receive
a FIS that we do not handle (trace_handle_cmd_unhandled_fis()).
Will try to see exactly which FIS FreeBSD is sending tomorrow.

(Perhaps we should simply clear PxCI for these FISes that we do not handle...
Or at least make sure that we set ERR_STAT in PxTFD.)


Kind regards,
Niklas




Re: [PATCH 1/1] hw/ide/core: terminate in-flight DMA on IDE bus reset

2023-10-03 Thread Niklas Cassel
On Mon, Sep 25, 2023 at 03:53:23PM -0400, John Snow wrote:
> Niklas, I'm sorry to lean on you here a little bit - You've been
> working on the SATA side of this a bit more often, can you let me know
> if you think this patch is safe?

FWIW, I prefer Fiona's patch series which calls blk_aio_cancel() before
calling ide_reset():
https://lore.kernel.org/qemu-devel/20230906130922.142845-1-f.eb...@proxmox.com/T/#u

Patch 2/2 also adds a qtest which fails before patch 1/2, and passes after
patch 1/2.


> 
> I'm not immediately sure what the impact of applying it is, but I have
> some questions about it:
> 
> (1) When does ide_dma_cb get invoked when DRQ_STAT is set but the
> return code we were passed is not < 0, and
> 
> (2) what's the impact of just *not* executing the end-of-transfer
> block here; what happens to the state machine?
> 
> On Thu, Sep 21, 2023 at 12:07 PM Simon Rowe  wrote:
> >
> > When an IDE controller is reset, its internal state is being cleared
> > before any outstanding I/O is cancelled. If a response to DMA is
> > received in this window, the aio callback will incorrectly continue
> > with the next part of the transfer (now using sector 0 from
> > the cleared controller state).
> 
> Eugh, yikes. It feels like we should fix the cancellation ... I'm
> worried that if we've reset the state machine and we need to bail on a
> DMA callback that the heuristics we use for that will eventually be
> wrong, unless I am mistaken and this is totally safe and reliable for
> a reason I don't intuitively see right away.
> 
> Thoughts?

Since Simon has a reliable reproducer, and has offered to test Fiona's patch,
perhaps we should simply take him up on that offer :)


Kind regards,
Niklas

Re: FreeBSD 13.2 installer does not see AHCI devices on aarch64/sbsa-ref and x86-64/q35

2023-09-26 Thread Niklas Cassel
Hello Marcin,

I will have a look at this.


Kind regards,
Niklas



On 26 September 2023 13:23:46 CEST, Marcin Juszkiewicz 
 wrote:
>I work on SBSA Reference Platform (sbsa-ref) at Linaro. And yesterday I
>wanted to check how non-Linux operating systems work on sbsa-ref machine.
>
>One of them was FreeBSD 13.2 - the latest one. Fetched bootonly ISO
>image [1] and booted system.
>
>1. 
>https://download.freebsd.org/releases/arm64/aarch64/ISO-IMAGES/13.2/FreeBSD-13.2-RELEASE-arm64-aarch64-bootonly.iso
>
>QEMU command line arguments:
>
>-drive 
>if=ide,file=disks/FreeBSD-13.2-RELEASE-arm64-aarch64-bootonly.iso,media=cdrom
>-machine sbsa-ref
>-m 4096
>-smp 2
>-cpu neoverse-n1
>-drive 
>file=fat:rw:/home/marcin/devel/linaro/sbsa-qemu/sbsa-ref-status/disks/virtual/,format=raw
>-drive 
>format=raw,file=/home/marcin/devel/linaro/sbsa-qemu/sbsa-ref-status/disks/full-debian.hddimg
>-watchdog-action none
>-no-reboot
>-monitor telnet::45454,server,nowait
>-serial stdio
>-device igb
>-nographic
>-drive if=pflash,file=SBSA_FLASH0.fd,format=raw
>-drive if=pflash,file=SBSA_FLASH1.fd,format=raw
>
>
>Firmware loaded FreeBSD loader, kernel booted but it does not see
>any AHCI devices:
>
>ahci0:  iomem 0x6010-0x6010 irq 1 on acpi0
>ahci0: AHCI v1.00 with 6 1.5Gbps ports, Port Multiplier not supported
>ahci0: Caps: 64bit NCQ 1.5Gbps 32cmd 6ports
>ahcich0:  at channel 0 on ahci0
>ahcich0: Caps:
>[..]
>ahcich0: AHCI reset...
>ahcich0: SATA connect time=0us status=0113
>ahcich0: AHCI reset: device found
>ahcich0: AHCI reset: device ready after 0ms
>ahcich1: AHCI reset...
>ahcich1: SATA connect time=0us status=0113
>ahcich1: AHCI reset: device found
>ahcich1: AHCI reset: device ready after 0ms
>ahcich2: AHCI reset...
>ahcich2: SATA connect time=0us status=0113
>ahcich2: AHCI reset: device found
>ahcich2: AHCI reset: device ready after 0ms
>[..]
>Trying to mount root from cd9660:/dev/iso9660/13_2_RELEASE_AARCH64_BO [ro]...
>Root mount waiting for: CAM
>[..]
>Root mount waiting for: CAM
>ahcich0: Poll timeout on slot 1 port 0
>ahcich0: is  cs 0002 ss  rs 0002 tfd 170 serr  
>cmd c017
>
>And finally it gives up.
>
>
>v8.1.1 was bad, v8.0.5 was bad so I did git bisecting.
>Which gave me this commit:
>
>commit 7bcd32128b227cee1fb39ff242d486ed9fff7648
>Author: Niklas Cassel 
>Date:   Fri Jun 9 16:08:40 2023 +0200
>
>hw/ide/ahci: simplify and document PxCI handling
>
>The AHCI spec states that:
>For NCQ, PxCI is cleared on command queued successfully.
>
>
>
>I built x86_64-softmmu target and checked both "pc" and "q35"
>machines.
>
>./build/x86_64-softmmu/qemu-system-x86_64
>-cdrom FreeBSD-13.2-RELEASE-amd64-bootonly.iso
>-m 2048 -serial stdio  -monitor telnet::45454,server,nowait
>
>PC target ("-M pc") booted fine. But Q35 ("-M q35") failed
>similar way as aarch64/sbsa-ref did:
>
>ahci0:  port 0xc060-0xc07f mem 
>0xfebd5000-0xfebd5fff irq 16 at device 31.2 on pci0
>ahci0: attempting to allocate 1 MSI vectors (1 supported)
>msi: routing MSI IRQ 26 to local APIC 0 vector 52
>ahci0: using IRQ 26 for MSI
>ahci0: AHCI v1.00 with 6 1.5Gbps ports, Port Multiplier not supported
>ahci0: Caps: 64bit NCQ 1.5Gbps 32cmd 6ports
>ahcich0:  at channel 0 on ahci0
>ahcich0: Caps:
>ahcich1:  at channel 1 on ahci0
>ahcich1: Caps:
>ahcich2:  at channel 2 on ahci0
>ahcich2: Caps:
>[..]
>ahcich2: AHCI reset...
>ahcich2: SATA connect time=0us status=0113
>ahcich2: AHCI reset: device found
>ahcich2: AHCI reset: device ready after 0ms
>[..]
>Trying to mount root from cd9660:/dev/iso9660/13_2_RELEASE_AMD64_BO [ro]...
>ahcich2: Poll timeout on slot 1 port 0
>ahcich2: is  cs 0002 ss  rs 0002 tfd 170 serr  
>cmd c017
>(aprobe2:ahcich2:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
>(aprobe2:ahcich2:0:0:0): CAM status: Command timeout
>(aprobe2:ahcich2:0:0:0): Error 5, Retries exhausted
>ahcich2: Poll timeout on slot 2 port 0
>ahcich2: is  cs 0006 ss  rs 0004 tfd 170 serr  
>cmd c017
>(aprobe2:ahcich2:0:0:0): SOFT_RESET. ACB: 00 00 00 00 00 00 00 00 00 00 00 00
>(aprobe2:ahcich2:0:0:0): CAM status: Command timeout
>(aprobe2:ahcich2:0:0:0): Error 5, Retries exhausted
>mountroot: waiting for device /dev/iso9660/13_2_RELEASE_AMD64_BO...
>Mounting from cd9660:/dev/iso9660/13_2_RELEASE_AMD64_BO failed with error 19.
>
>Same thing happens with current qemu HEAD:
>
>commit 494a6a2cf7f775d2c20fd6df9601e30606cc2014
>Merge: 29578f5757 b821109583
>Author: Stefan Hajnoczi 
>Date:   Mon Sep 25 10:10:30 2023 -0400
>
>Merge tag 'pull-request-2023-09-25' of https://gitlab.com/thuth/qemu into 
> staging
>
>
>Any ideas?


Re: [PATCH v3 0/8] misc AHCI cleanups

2023-08-07 Thread Niklas Cassel
On Tue, Jul 25, 2023 at 03:00:56PM -0400, John Snow wrote:
> On Tue, Jul 25, 2023 at 9:04 AM Philippe Mathieu-Daudé
>  wrote:
> >
> > Hi Niklas, John, Paolo, Kevin,
> >
> > On 19/7/23 12:47, Niklas Cassel wrote:
> >
> > >> Niklas Cassel (8):
> > >>hw/ide/ahci: remove stray backslash
> > >>hw/ide/core: set ERR_STAT in unsupported command completion
> > >>hw/ide/ahci: write D2H FIS when processing NCQ command
> > >>hw/ide/ahci: simplify and document PxCI handling
> > >>hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared
> > >>hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set
> > >>hw/ide/ahci: fix ahci_write_fis_sdb()
> > >>hw/ide/ahci: fix broken SError handling
> > >>
> > >>   hw/ide/ahci.c | 112 +++---
> > >>   hw/ide/core.c |   2 +-
> > >>   tests/qtest/libqos/ahci.c | 106 +++-
> > >>   tests/qtest/libqos/ahci.h |   8 +--
> > >>   4 files changed, 164 insertions(+), 64 deletions(-)
> > >>
> > >> --
> > >> 2.40.1
> > >>
> > >>
> > >
> > > Hello Philippe,
> > >
> > > Considering that you picked up my patch,
> > > "hw/ide/ahci: remove stray backslash" (patch 1/8 in this series),
> > > and since John seems to have gone silent for 40+ days,
> > > could you please consider taking this series through your misc tree?
> >
> 
> 40 days, ouch. I kept thinking it had been a week. Don't trust me with time.

Well, it is summer vacation times, so the days tend to fly by quite
quickly :)


> 
> > (First patch was a cleanup)
> >
> > Niklas, I don't feel confident enough :/
> >
> > John, Paolo, Kevin, do you Ack?
> >
> > Regards,
> >
> > Phil.
> 
> I'm staging it, but it's for next release. We'll get it in early and
> it gives us a chance to fix anything that's amiss before the next RC
> window.
> 

Thank you John!

I don't expect any (further) issues, but I will of course be available
in case something unexpectedly pops up.


Kind regards,
Niklas

Re: [PATCH v3 0/8] misc AHCI cleanups

2023-07-19 Thread Niklas Cassel
On Fri, Jun 09, 2023 at 04:08:36PM +0200, Niklas Cassel wrote:
> From: Niklas Cassel 
> 
> Hello John,
> 
> Here comes some misc AHCI cleanups.
> 
> Most are related to error handling.
> 
> Please review.
> 
> Changes since v2:
> -Squashed in the test commits that were sent out as a separate series into
>  the patch "hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set",
>  and reordered some of the patches, such that each and every commit passes
>  the ahci test suite as a separate unit. This way it will be possible to
>  perform a git bisect without seeing any failures in the ahci test suite.
> 
> 
> Kind regards,
> Niklas
> 
> Niklas Cassel (8):
>   hw/ide/ahci: remove stray backslash
>   hw/ide/core: set ERR_STAT in unsupported command completion
>   hw/ide/ahci: write D2H FIS when processing NCQ command
>   hw/ide/ahci: simplify and document PxCI handling
>   hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared
>   hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set
>   hw/ide/ahci: fix ahci_write_fis_sdb()
>   hw/ide/ahci: fix broken SError handling
> 
>  hw/ide/ahci.c | 112 +++---
>  hw/ide/core.c |   2 +-
>  tests/qtest/libqos/ahci.c | 106 +++-
>  tests/qtest/libqos/ahci.h |   8 +--
>  4 files changed, 164 insertions(+), 64 deletions(-)
> 
> -- 
> 2.40.1
> 
> 

Hello Philippe,

Considering that you picked up my patch,
"hw/ide/ahci: remove stray backslash" (patch 1/8 in this series),
and since John seems to have gone silent for 40+ days,
could you please consider taking this series through your misc tree?

The series still applies, you just need to skip patch 1/8
(which has already landed in qemu master).


Kind regards,
Niklas


[PATCH v3 4/8] hw/ide/ahci: simplify and document PxCI handling

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

The AHCI spec states that:
For NCQ, PxCI is cleared on command queued successfully.

For non-NCQ, PxCI is cleared on command completed successfully.
(A non-NCQ command that completes with error does not clear PxCI.)

The current QEMU implementation either clears PxCI in check_cmd(),
or in ahci_cmd_done().

check_cmd() will clear PxCI for a command if handle_cmd() returns 0.
handle_cmd() will return -1 if BUSY or DRQ is set.

The QEMU implementation for NCQ commands will currently not set BUSY
or DRQ, so they will always have PxCI cleared by handle_cmd().
ahci_cmd_done() will never even get called for NCQ commands.

Non-NCQ commands are executed by ide_bus_exec_cmd().
Non-NCQ commands in QEMU are implemented either in a sync or in an async
way.

For non-NCQ commands implemented in a sync way, the command handler will
return true, and when ide_bus_exec_cmd() sees that a command handler
returns true, it will call ide_cmd_done() (which will call
ahci_cmd_done()). For a command implemented in a sync way,
ahci_cmd_done() will do nothing (since busy_slot is not set). Instead,
after ide_bus_exec_cmd() has finished, check_cmd() will clear PxCI for
these commands.

For non-NCQ commands implemented in an async way (using either aiocb or
pio_aiocb), the command handler will return false, ide_bus_exec_cmd()
will not call ide_cmd_done(), instead it is expected that the async
callback function will call ide_cmd_done() once the async command is
done. handle_cmd() will set busy_slot, if and only if BUSY or DRQ is
set, and this is checked _after_ ide_bus_exec_cmd() has returned.
handle_cmd() will return -1, so check_cmd() will not clear PxCI.
When the async callback calls ide_cmd_done() (which will call
ahci_cmd_done()), it will see that busy_slot is set, and
ahci_cmd_done() will clear PxCI.

This seems racy, since busy_slot is set _after_ ide_bus_exec_cmd() has
returned. The callback might come before busy_slot gets set. And it is
quite confusing that ahci_cmd_done() will be called for all non-NCQ
commands when the command is done, but will only clear PxCI in certain
cases, even though it will always write a D2H FIS and raise an IRQ.

Even worse, in the case where ahci_cmd_done() does not clear PxCI, it
still raises an IRQ. Host software might thus read an old PxCI value,
since PxCI is cleared (by check_cmd()) after the IRQ has been raised.

Try to simplify this by always setting busy_slot for non-NCQ commands,
such that ahci_cmd_done() will always be responsible for clearing PxCI
for non-NCQ commands.

For NCQ commands, clear PxCI when we receive the D2H FIS, but before
raising the IRQ, see AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and
RegFIS:ClearCI.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 70 ---
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 4b272397fd..3deaf01add 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -41,9 +41,10 @@
 #include "trace.h"
 
 static void check_cmd(AHCIState *s, int port);
-static int handle_cmd(AHCIState *s, int port, uint8_t slot);
+static void handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
 static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
+static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t slot);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -591,9 +592,8 @@ static void check_cmd(AHCIState *s, int port)
 
 if ((pr->cmd & PORT_CMD_START) && pr->cmd_issue) {
 for (slot = 0; (slot < 32) && pr->cmd_issue; slot++) {
-if ((pr->cmd_issue & (1U << slot)) &&
-!handle_cmd(s, port, slot)) {
-pr->cmd_issue &= ~(1U << slot);
+if (pr->cmd_issue & (1U << slot)) {
+handle_cmd(s, port, slot);
 }
 }
 }
@@ -1123,6 +1123,22 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+/*
+ * A NCQ command clears the bit in PxCI after the command has been QUEUED
+ * successfully (ERROR not set, BUSY and DRQ cleared).
+ *
+ * For NCQ commands, PxCI will always be cleared here.
+ *
+ * (Once the NCQ command is COMPLETED, the device will send a SDB FIS with
+ * the interrupt bit set, which will clear PxSACT and raise an interrupt.)
+ */
+ahci_clear_cmd_issue(ad, slot);
+
+/*
+ * In reality, for NCQ commands, PxCI is cleared after receiving a D2H FIS
+ * without the interrupt bit set, but since ahci_write_fis_d2h() can raise
+ * an IRQ on error, we need to call them in reverse order.
+ */
 ahci_write_fis_d2h(ad, false);
 
 ncq_tfs->used = 1;
@@ -1197,6 +1213,7 @@ st

[PATCH v3 3/8] hw/ide/ahci: write D2H FIS when processing NCQ command

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

The way that BUSY + PxCI is cleared for NCQ (FPDMA QUEUED) commands is
described in SATA 3.5a Gold:

11.15 FPDMA QUEUED command protocol
DFPDMAQ2: ClearInterfaceBsy
"Transmit Register Device to Host FIS with the BSY bit cleared to zero
and the DRQ bit cleared to zero and Interrupt bit cleared to zero to
mark interface ready for the next command."

PxCI is currently cleared by handle_cmd(), but we don't write the D2H
FIS to the FIS Receive Area that actually caused PxCI to be cleared.

Similar to how ahci_pio_transfer() calls ahci_write_fis_pio() with an
additional parameter to write a PIO Setup FIS without raising an IRQ,
add a parameter to ahci_write_fis_d2h() so that ahci_write_fis_d2h()
also can write the FIS to the FIS Receive Area without raising an IRQ.

Change process_ncq_command() to call ahci_write_fis_d2h() without
raising an IRQ (similar to ahci_pio_transfer()), such that the FIS
Receive Area is in sync with the PxTFD shadow register.

E.g. Linux reads status and error fields from the FIS Receive Area
directly, so it is wise to keep the FIS Receive Area and the PxTFD
shadow register in sync.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 48d550f633..4b272397fd 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -43,7 +43,7 @@
 static void check_cmd(AHCIState *s, int port);
 static int handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
-static bool ahci_write_fis_d2h(AHCIDevice *ad);
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -618,7 +618,7 @@ static void ahci_init_d2h(AHCIDevice *ad)
 return;
 }
 
-if (ahci_write_fis_d2h(ad)) {
+if (ahci_write_fis_d2h(ad, true)) {
 ad->init_d2h_sent = true;
 /* We're emulating receiving the first Reg H2D Fis from the device;
  * Update the SIG register, but otherwise proceed as normal. */
@@ -850,7 +850,7 @@ static void ahci_write_fis_pio(AHCIDevice *ad, uint16_t 
len, bool pio_fis_i)
 }
 }
 
-static bool ahci_write_fis_d2h(AHCIDevice *ad)
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i)
 {
 AHCIPortRegs *pr = >port_regs;
 uint8_t *d2h_fis;
@@ -864,7 +864,7 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 d2h_fis = >res_fis[RES_FIS_RFIS];
 
 d2h_fis[0] = SATA_FIS_TYPE_REGISTER_D2H;
-d2h_fis[1] = (1 << 6); /* interrupt bit */
+d2h_fis[1] = d2h_fis_i ? (1 << 6) : 0; /* interrupt bit */
 d2h_fis[2] = s->status;
 d2h_fis[3] = s->error;
 
@@ -890,7 +890,10 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_TFES);
 }
 
-ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+if (d2h_fis_i) {
+ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+}
+
 return true;
 }
 
@@ -1120,6 +1123,8 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+ahci_write_fis_d2h(ad, false);
+
 ncq_tfs->used = 1;
 ncq_tfs->drive = ad;
 ncq_tfs->slot = slot;
@@ -1506,7 +1511,7 @@ static void ahci_cmd_done(const IDEDMA *dma)
 }
 
 /* update d2h status */
-ahci_write_fis_d2h(ad);
+ahci_write_fis_d2h(ad, true);
 
 if (ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new_guarded(ahci_check_cmd_bh, ad,
-- 
2.40.1




[PATCH v3 7/8] hw/ide/ahci: fix ahci_write_fis_sdb()

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

When there is an error, we need to raise a TFES error irq, see AHCI 1.3.1,
5.3.13.1 SDB:Entry.

If ERR_STAT is set, we jump to state ERR:FatalTaskfile, which will raise
a TFES IRQ unconditionally, regardless if the I bit is set in the FIS or
not.

Thus, we should never raise a normal IRQ after having sent an error IRQ.

It is valid to signal successfully completed commands as finished in the
same SDB FIS that generates the error IRQ. The important thing is that
commands that did not complete successfully (e.g. commands that were
aborted, do not get the finished bit set).

Before this commit, there was never a TFES IRQ raised on NCQ error.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 12aaadc554..ef6c9fc378 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -806,8 +806,14 @@ static void ahci_write_fis_sdb(AHCIState *s, 
NCQTransferState *ncq_tfs)
 pr->scr_act &= ~ad->finished;
 ad->finished = 0;
 
-/* Trigger IRQ if interrupt bit is set (which currently, it always is) */
-if (sdb_fis->flags & 0x40) {
+/*
+ * TFES IRQ is always raised if ERR_STAT is set, regardless of I bit.
+ * If ERR_STAT is not set, trigger SDBS IRQ if interrupt bit is set
+ * (which currently, it always is).
+ */
+if (sdb_fis->status & ERR_STAT) {
+ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_TFES);
+} else if (sdb_fis->flags & 0x40) {
 ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_SDBS);
 }
 }
-- 
2.40.1




[PATCH v3 2/8] hw/ide/core: set ERR_STAT in unsupported command completion

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

Currently, the first time sending an unsupported command
(e.g. READ LOG DMA EXT) will not have ERR_STAT set in the completion.
Sending the unsupported command again, will correctly have ERR_STAT set.

When ide_cmd_permitted() returns false, it calls ide_abort_command().
ide_abort_command() first calls ide_transfer_stop(), which will call
ide_transfer_halt() and ide_cmd_done(), after that ide_abort_command()
sets ERR_STAT in status.

ide_cmd_done() for AHCI will call ahci_write_fis_d2h() which writes the
current status in the FIS, and raises an IRQ. (The status here will not
have ERR_STAT set!).

Thus, we cannot call ide_transfer_stop() before setting ERR_STAT, as
ide_transfer_stop() will result in the FIS being written and an IRQ
being raised.

The reason why it works the second time, is that ERR_STAT will still
be set from the previous command, so when writing the FIS, the
completion will correctly have ERR_STAT set.

Set ERR_STAT before writing the FIS (calling cmd_done), so that we will
raise an error IRQ correctly when receiving an unsupported command.

Signed-off-by: Niklas Cassel 
---
 hw/ide/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index de48ff9f86..07971c0218 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -533,9 +533,9 @@ BlockAIOCB *ide_issue_trim(
 
 void ide_abort_command(IDEState *s)
 {
-ide_transfer_stop(s);
 s->status = READY_STAT | ERR_STAT;
 s->error = ABRT_ERR;
+ide_transfer_stop(s);
 }
 
 static void ide_set_retry(IDEState *s)
-- 
2.40.1




[PATCH v3 5/8] hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

According to AHCI 1.3.1 definition of PxSACT:
This field is cleared when PxCMD.ST is written from a '1' to a '0' by
software. This field is not cleared by a COMRESET or a software reset.

According to AHCI 1.3.1 definition of PxCI:
This field is also cleared when PxCMD.ST is written from a '1' to a '0'
by software.

Clearing PxCMD.ST is part of the error recovery procedure, see
AHCI 1.3.1, section "6.2 Error Recovery".

If we don't clear PxCI on error recovery, the previous command will
incorrectly still be marked as pending after error recovery.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 3deaf01add..a31e6fa65e 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -329,6 +329,11 @@ static void ahci_port_write(AHCIState *s, int port, int 
offset, uint32_t val)
 ahci_check_irq(s);
 break;
 case AHCI_PORT_REG_CMD:
+if ((pr->cmd & PORT_CMD_START) && !(val & PORT_CMD_START)) {
+pr->scr_act = 0;
+pr->cmd_issue = 0;
+}
+
 /* Block any Read-only fields from being set;
  * including LIST_ON and FIS_ON.
  * The spec requires to set ICC bits to zero after the ICC change
-- 
2.40.1




[PATCH v3 8/8] hw/ide/ahci: fix broken SError handling

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

When encountering an NCQ error, you should not write the NCQ tag to the
SError register. This is completely wrong.

The SError register has a clear definition, where each bit represents a
different error, see PxSERR definition in AHCI 1.3.1.

If we write a random value (like the NCQ tag) in SError, e.g. Linux will
read SError, and will trigger arbitrary error handling depending on the
NCQ tag that happened to be executing.

In case of success, ncq_cb() will call ncq_finish().
In case of error, ncq_cb() will call ncq_err() (which will clear
ncq_tfs->used), and then call ncq_finish(), thus using ncq_tfs->used is
sufficient to tell if finished should get set or not.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index ef6c9fc378..d0a774bc17 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1012,7 +1012,6 @@ static void ncq_err(NCQTransferState *ncq_tfs)
 
 ide_state->error = ABRT_ERR;
 ide_state->status = READY_STAT | ERR_STAT;
-ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
 qemu_sglist_destroy(_tfs->sglist);
 ncq_tfs->used = 0;
 }
@@ -1022,7 +1021,7 @@ static void ncq_finish(NCQTransferState *ncq_tfs)
 /* If we didn't error out, set our finished bit. Errored commands
  * do not get a bit set for the SDB FIS ACT register, nor do they
  * clear the outstanding bit in scr_act (PxSACT). */
-if (!(ncq_tfs->drive->port_regs.scr_err & (1 << ncq_tfs->tag))) {
+if (ncq_tfs->used) {
 ncq_tfs->drive->finished |= (1 << ncq_tfs->tag);
 }
 
-- 
2.40.1




[PATCH v3 0/8] misc AHCI cleanups

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

Hello John,

Here comes some misc AHCI cleanups.

Most are related to error handling.

Please review.

Changes since v2:
-Squashed in the test commits that were sent out as a separate series into
 the patch "hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set",
 and reordered some of the patches, such that each and every commit passes
 the ahci test suite as a separate unit. This way it will be possible to
 perform a git bisect without seeing any failures in the ahci test suite.


Kind regards,
Niklas

Niklas Cassel (8):
  hw/ide/ahci: remove stray backslash
  hw/ide/core: set ERR_STAT in unsupported command completion
  hw/ide/ahci: write D2H FIS when processing NCQ command
  hw/ide/ahci: simplify and document PxCI handling
  hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared
  hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set
  hw/ide/ahci: fix ahci_write_fis_sdb()
  hw/ide/ahci: fix broken SError handling

 hw/ide/ahci.c | 112 +++---
 hw/ide/core.c |   2 +-
 tests/qtest/libqos/ahci.c | 106 +++-
 tests/qtest/libqos/ahci.h |   8 +--
 4 files changed, 164 insertions(+), 64 deletions(-)

-- 
2.40.1




[PATCH v3 6/8] hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

For NCQ, PxCI is cleared on command queued successfully.
For non-NCQ, PxCI is cleared on command completed successfully.
Successfully means ERR_STAT, BUSY and DRQ are all cleared.

A command that has ERR_STAT set, does not get to clear PxCI.
See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
and 5.3.16.5 ERR:FatalTaskfile.

In the case of non-NCQ commands, not clearing PxCI is needed in order
for host software to be able to see which command slot that failed.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c |   7 ++-
 tests/qtest/libqos/ahci.c | 106 --
 tests/qtest/libqos/ahci.h |   8 ++-
 3 files changed, 88 insertions(+), 33 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index a31e6fa65e..12aaadc554 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1523,7 +1523,8 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 {
 IDEState *ide_state = >port.ifs[0];
 
-if (!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
+if (!(ide_state->status & ERR_STAT) &&
+!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
 ad->port_regs.cmd_issue &= ~(1 << slot);
 }
 }
@@ -1532,6 +1533,7 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 static void ahci_cmd_done(const IDEDMA *dma)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
+IDEState *ide_state = >port.ifs[0];
 
 trace_ahci_cmd_done(ad->hba, ad->port_no);
 
@@ -1548,7 +1550,8 @@ static void ahci_cmd_done(const IDEDMA *dma)
  */
 ahci_write_fis_d2h(ad, true);
 
-if (ad->port_regs.cmd_issue && !ad->check_bh) {
+if (!(ide_state->status & ERR_STAT) &&
+ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new_guarded(ahci_check_cmd_bh, ad,
>mem_reentrancy_guard);
 qemu_bh_schedule(ad->check_bh);
diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index f53f12aa99..a2c94c6e06 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -404,57 +404,110 @@ void ahci_port_clear(AHCIQState *ahci, uint8_t port)
 /**
  * Check a port for errors.
  */
-void ahci_port_check_error(AHCIQState *ahci, uint8_t port,
-   uint32_t imask, uint8_t emask)
+void ahci_port_check_error(AHCIQState *ahci, AHCICommand *cmd)
 {
+uint8_t port = cmd->port;
 uint32_t reg;
 
-/* The upper 9 bits of the IS register all indicate errors. */
-reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
-reg &= ~imask;
-reg >>= 23;
-g_assert_cmphex(reg, ==, 0);
+/* If expecting TF error, ensure that TFES is set. */
+if (cmd->errors) {
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+ASSERT_BIT_SET(reg, AHCI_PX_IS_TFES);
+} else {
+/* The upper 9 bits of the IS register all indicate errors. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+reg &= ~cmd->interrupts;
+reg >>= 23;
+g_assert_cmphex(reg, ==, 0);
+}
 
-/* The Sata Error Register should be empty. */
+/* The Sata Error Register should be empty, even when expecting TF error. 
*/
 reg = ahci_px_rreg(ahci, port, AHCI_PX_SERR);
 g_assert_cmphex(reg, ==, 0);
 
+/* If expecting TF error, and TFES was set, perform error recovery
+ * (see AHCI 1.3 section 6.2.2.1) such that we can send new commands. */
+if (cmd->errors) {
+/* This will clear PxCI. */
+ahci_px_clr(ahci, port, AHCI_PX_CMD, AHCI_PX_CMD_ST);
+
+/* The port has 500ms to disengage. */
+usleep(50);
+reg = ahci_px_rreg(ahci, port, AHCI_PX_CMD);
+ASSERT_BIT_CLEAR(reg, AHCI_PX_CMD_CR);
+
+/* Clear PxIS. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+ahci_px_wreg(ahci, port, AHCI_PX_IS, reg);
+
+/* Check if we need to perform a COMRESET.
+ * Not implemented right now, as there is no reason why our QEMU model
+ * should need a COMRESET when expecting TF error. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
+ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_STS_BSY | AHCI_PX_TFD_STS_DRQ);
+
+/* Enable issuing new commands. */
+ahci_px_set(ahci, port, AHCI_PX_CMD, AHCI_PX_CMD_ST);
+}
+
 /* The TFD also has two error sections. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
-if (!emask) {
+if (!cmd->errors) {
 ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_STS_ERR);
 } else {
 ASSERT_BIT_SET(reg, AHCI_PX_TFD_STS_ERR);
 }
-ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_ERR & (~emask << 8));
-ASSERT_BIT_SET(reg, AHCI_PX_TFD_ERR & (emask << 8));
+ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_ERR & (~cmd->errors << 8));
+ASSERT_BIT_SET(reg, AHCI_PX_TFD_ERR &

[PATCH v3 1/8] hw/ide/ahci: remove stray backslash

2023-06-09 Thread Niklas Cassel
From: Niklas Cassel 

This backslash obviously does not belong here, so remove it.

Signed-off-by: Niklas Cassel 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: John Snow 
---
 hw/ide/ahci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 4e76d6b191..48d550f633 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -690,7 +690,7 @@ static void ahci_reset_port(AHCIState *s, int port)
 
 s->dev[port].port_state = STATE_RUN;
 if (ide_state->drive_kind == IDE_CD) {
-ahci_set_signature(d, SATA_SIGNATURE_CDROM);\
+ahci_set_signature(d, SATA_SIGNATURE_CDROM);
 ide_state->status = SEEK_STAT | WRERR_STAT | READY_STAT;
 } else {
 ahci_set_signature(d, SATA_SIGNATURE_DISK);
-- 
2.40.1




[PATCH 4/5] libqos/ahci: improve ahci_port_check_error()

2023-06-08 Thread Niklas Cassel
From: Niklas Cassel 

Improve ahci_port_check_error() to also assert that PxIS.TFES is set when
expecting errors.

Signed-off-by: Niklas Cassel 
---
 tests/qtest/libqos/ahci.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index 644ed7e79f..b216f61f14 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -409,13 +409,19 @@ void ahci_port_check_error(AHCIQState *ahci, AHCICommand 
*cmd)
 uint8_t port = cmd->port;
 uint32_t reg;
 
-/* The upper 9 bits of the IS register all indicate errors. */
-reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
-reg &= ~cmd->interrupts;
-reg >>= 23;
-g_assert_cmphex(reg, ==, 0);
+/* If expecting TF error, ensure that TFES is set. */
+if (cmd->errors) {
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+ASSERT_BIT_SET(reg, AHCI_PX_IS_TFES);
+} else {
+/* The upper 9 bits of the IS register all indicate errors. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+reg &= ~cmd->interrupts;
+reg >>= 23;
+g_assert_cmphex(reg, ==, 0);
+}
 
-/* The Sata Error Register should be empty. */
+/* The Sata Error Register should be empty, even when expecting TF error. 
*/
 reg = ahci_px_rreg(ahci, port, AHCI_PX_SERR);
 g_assert_cmphex(reg, ==, 0);
 
-- 
2.40.1




[PATCH 5/5] libqos/ahci: perform mandatory error recovery on error

2023-06-08 Thread Niklas Cassel
From: Niklas Cassel 

When the HBA encouters an error, the host has to perform error recovery,
see AHCI 1.3.1 section 6.2.2.1, in order to be able issue new commands.
If we don't do this, all the commands that we queue will get aborted.

Some tests, e.g. test_atapi_tray() call ahci_atapi_test_ready() with
ready == false, intentionally sending a command that will cause an error.

After test_atapi_tray() has called ahci_atapi_test_ready(), it directly
calls ahci_atapi_get_sense() to send a REQUEST SENSE command.

If we don't perform error recovery, the REQUEST SENSE command will get
aborted, and will not return the sense data that the test case expects.

Since the error recovery will clear PxCI, we also need to move the
ahci_port_check_nonbusy() call to before the error handling takes place.

Signed-off-by: Niklas Cassel 
---
 tests/qtest/libqos/ahci.c | 44 +--
 tests/qtest/libqos/ahci.h |  3 +--
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index b216f61f14..a2c94c6e06 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -425,6 +425,31 @@ void ahci_port_check_error(AHCIQState *ahci, AHCICommand 
*cmd)
 reg = ahci_px_rreg(ahci, port, AHCI_PX_SERR);
 g_assert_cmphex(reg, ==, 0);
 
+/* If expecting TF error, and TFES was set, perform error recovery
+ * (see AHCI 1.3 section 6.2.2.1) such that we can send new commands. */
+if (cmd->errors) {
+/* This will clear PxCI. */
+ahci_px_clr(ahci, port, AHCI_PX_CMD, AHCI_PX_CMD_ST);
+
+/* The port has 500ms to disengage. */
+usleep(50);
+reg = ahci_px_rreg(ahci, port, AHCI_PX_CMD);
+ASSERT_BIT_CLEAR(reg, AHCI_PX_CMD_CR);
+
+/* Clear PxIS. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
+ahci_px_wreg(ahci, port, AHCI_PX_IS, reg);
+
+/* Check if we need to perform a COMRESET.
+ * Not implemented right now, as there is no reason why our QEMU model
+ * should need a COMRESET when expecting TF error. */
+reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
+ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_STS_BSY | AHCI_PX_TFD_STS_DRQ);
+
+/* Enable issuing new commands. */
+ahci_px_set(ahci, port, AHCI_PX_CMD, AHCI_PX_CMD_ST);
+}
+
 /* The TFD also has two error sections. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
 if (!cmd->errors) {
@@ -436,17 +461,24 @@ void ahci_port_check_error(AHCIQState *ahci, AHCICommand 
*cmd)
 ASSERT_BIT_SET(reg, AHCI_PX_TFD_ERR & (cmd->errors << 8));
 }
 
-void ahci_port_check_interrupts(AHCIQState *ahci, uint8_t port,
-uint32_t intr_mask)
+void ahci_port_check_interrupts(AHCIQState *ahci, AHCICommand *cmd)
 {
+uint8_t port = cmd->port;
 uint32_t reg;
 
+/* If we expect errors, error handling in ahci_port_check_error() will
+ * already have cleared PxIS, so in that case this function cannot verify
+ * and clear expected interrupts. */
+if (cmd->errors) {
+return;
+}
+
 /* Check for expected interrupts */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
-ASSERT_BIT_SET(reg, intr_mask);
+ASSERT_BIT_SET(reg, cmd->interrupts);
 
 /* Clear expected interrupts and assert all interrupts now cleared. */
-ahci_px_wreg(ahci, port, AHCI_PX_IS, intr_mask);
+ahci_px_wreg(ahci, port, AHCI_PX_IS, cmd->interrupts);
 g_assert_cmphex(ahci_px_rreg(ahci, port, AHCI_PX_IS), ==, 0);
 }
 
@@ -1248,9 +1280,9 @@ void ahci_command_verify(AHCIQState *ahci, AHCICommand 
*cmd)
 uint8_t slot = cmd->slot;
 uint8_t port = cmd->port;
 
-ahci_port_check_error(ahci, cmd);
-ahci_port_check_interrupts(ahci, port, cmd->interrupts);
 ahci_port_check_nonbusy(ahci, cmd);
+ahci_port_check_error(ahci, cmd);
+ahci_port_check_interrupts(ahci, cmd);
 ahci_port_check_cmd_sanity(ahci, cmd);
 if (cmd->interrupts & AHCI_PX_IS_DHRS) {
 ahci_port_check_d2h_sanity(ahci, port, slot);
diff --git a/tests/qtest/libqos/ahci.h b/tests/qtest/libqos/ahci.h
index 137858d79c..48017864bf 100644
--- a/tests/qtest/libqos/ahci.h
+++ b/tests/qtest/libqos/ahci.h
@@ -591,8 +591,7 @@ void ahci_destroy_command(AHCIQState *ahci, uint8_t port, 
uint8_t slot);
 
 /* AHCI sanity check routines */
 void ahci_port_check_error(AHCIQState *ahci, AHCICommand *cmd);
-void ahci_port_check_interrupts(AHCIQState *ahci, uint8_t port,
-uint32_t intr_mask);
+void ahci_port_check_interrupts(AHCIQState *ahci, AHCICommand *cmd);
 void ahci_port_check_nonbusy(AHCIQState *ahci, AHCICommand *cmd);
 void ahci_port_check_d2h_sanity(AHCIQState *ahci, uint8_t port, uint8_t slot);
 void ahci_port_check_pio_sanity(AHCIQState *ahci, AHCICommand *cmd);
-- 
2.40.1




[PATCH 3/5] libqos/ahci: simplify ahci_port_check_error()

2023-06-08 Thread Niklas Cassel
From: Niklas Cassel 

Modify ahci_port_check_error() to simply take a struct AHCICommand.
This way, the conditionals are in line which the existing code,
e.g. ahci_port_check_nonbusy(), which checks for cmd->errors.

This makes the code easier to reason with, we don't want to use
cmd->errors in some functions and emask in some functions.

No functional changes intended.

Signed-off-by: Niklas Cassel 
---
 tests/qtest/libqos/ahci.c | 14 +++---
 tests/qtest/libqos/ahci.h |  3 +--
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index 2d4981dae4..644ed7e79f 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -404,14 +404,14 @@ void ahci_port_clear(AHCIQState *ahci, uint8_t port)
 /**
  * Check a port for errors.
  */
-void ahci_port_check_error(AHCIQState *ahci, uint8_t port,
-   uint32_t imask, uint8_t emask)
+void ahci_port_check_error(AHCIQState *ahci, AHCICommand *cmd)
 {
+uint8_t port = cmd->port;
 uint32_t reg;
 
 /* The upper 9 bits of the IS register all indicate errors. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_IS);
-reg &= ~imask;
+reg &= ~cmd->interrupts;
 reg >>= 23;
 g_assert_cmphex(reg, ==, 0);
 
@@ -421,13 +421,13 @@ void ahci_port_check_error(AHCIQState *ahci, uint8_t port,
 
 /* The TFD also has two error sections. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
-if (!emask) {
+if (!cmd->errors) {
 ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_STS_ERR);
 } else {
 ASSERT_BIT_SET(reg, AHCI_PX_TFD_STS_ERR);
 }
-ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_ERR & (~emask << 8));
-ASSERT_BIT_SET(reg, AHCI_PX_TFD_ERR & (emask << 8));
+ASSERT_BIT_CLEAR(reg, AHCI_PX_TFD_ERR & (~cmd->errors << 8));
+ASSERT_BIT_SET(reg, AHCI_PX_TFD_ERR & (cmd->errors << 8));
 }
 
 void ahci_port_check_interrupts(AHCIQState *ahci, uint8_t port,
@@ -1242,7 +1242,7 @@ void ahci_command_verify(AHCIQState *ahci, AHCICommand 
*cmd)
 uint8_t slot = cmd->slot;
 uint8_t port = cmd->port;
 
-ahci_port_check_error(ahci, port, cmd->interrupts, cmd->errors);
+ahci_port_check_error(ahci, cmd);
 ahci_port_check_interrupts(ahci, port, cmd->interrupts);
 ahci_port_check_nonbusy(ahci, cmd);
 ahci_port_check_cmd_sanity(ahci, cmd);
diff --git a/tests/qtest/libqos/ahci.h b/tests/qtest/libqos/ahci.h
index 2234f46862..137858d79c 100644
--- a/tests/qtest/libqos/ahci.h
+++ b/tests/qtest/libqos/ahci.h
@@ -590,8 +590,7 @@ void ahci_set_command_header(AHCIQState *ahci, uint8_t port,
 void ahci_destroy_command(AHCIQState *ahci, uint8_t port, uint8_t slot);
 
 /* AHCI sanity check routines */
-void ahci_port_check_error(AHCIQState *ahci, uint8_t port,
-   uint32_t imask, uint8_t emask);
+void ahci_port_check_error(AHCIQState *ahci, AHCICommand *cmd);
 void ahci_port_check_interrupts(AHCIQState *ahci, uint8_t port,
 uint32_t intr_mask);
 void ahci_port_check_nonbusy(AHCIQState *ahci, AHCICommand *cmd);
-- 
2.40.1




[PATCH 2/5] libqos/ahci: fix ahci_port_check_nonbusy()

2023-06-08 Thread Niklas Cassel
From: Niklas Cassel 

A command that has ERR_STAT set, does not get to clear PxCI.
See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
and 5.3.16.5 ERR:FatalTaskfile.

Some tests, e.g. test_atapi_tray() call ahci_atapi_test_ready() with
ready == false, intentionally sending a command that will cause an error.

Thus, ahci_port_check_nonbusy() cannot assume that PxCI and PxSACT will
always be cleared for the executed command.

Signed-off-by: Niklas Cassel 
---
 tests/qtest/libqos/ahci.c | 27 +--
 tests/qtest/libqos/ahci.h |  2 +-
 2 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index c1d571e477..2d4981dae4 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -444,17 +444,32 @@ void ahci_port_check_interrupts(AHCIQState *ahci, uint8_t 
port,
 g_assert_cmphex(ahci_px_rreg(ahci, port, AHCI_PX_IS), ==, 0);
 }
 
-void ahci_port_check_nonbusy(AHCIQState *ahci, uint8_t port, uint8_t slot)
+void ahci_port_check_nonbusy(AHCIQState *ahci, AHCICommand *cmd)
 {
+uint8_t slot = cmd->slot;
+uint8_t port = cmd->port;
 uint32_t reg;
 
-/* Assert that the command slot is no longer busy (NCQ) */
+/* For NCQ command with error PxSACT bit should still be set.
+ * For NCQ command without error, PxSACT bit should be cleared.
+ * For non-NCQ command, PxSACT bit should always be cleared. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_SACT);
-ASSERT_BIT_CLEAR(reg, (1 << slot));
+if (cmd->props->ncq && cmd->errors) {
+ASSERT_BIT_SET(reg, (1 << slot));
+} else {
+ASSERT_BIT_CLEAR(reg, (1 << slot));
+}
 
-/* Non-NCQ */
+/* For non-NCQ command with error, PxCI bit should still be set.
+ * For non-NCQ command without error, PxCI bit should be cleared.
+ * For NCQ command without error, PxCI bit should be cleared.
+ * For NCQ command with error, PxCI bit may or may not be cleared. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_CI);
-ASSERT_BIT_CLEAR(reg, (1 << slot));
+if (!cmd->props->ncq && cmd->errors) {
+ASSERT_BIT_SET(reg, (1 << slot));
+} else if (!cmd->errors) {
+ASSERT_BIT_CLEAR(reg, (1 << slot));
+}
 
 /* And assert that we are generally not busy. */
 reg = ahci_px_rreg(ahci, port, AHCI_PX_TFD);
@@ -1229,7 +1244,7 @@ void ahci_command_verify(AHCIQState *ahci, AHCICommand 
*cmd)
 
 ahci_port_check_error(ahci, port, cmd->interrupts, cmd->errors);
 ahci_port_check_interrupts(ahci, port, cmd->interrupts);
-ahci_port_check_nonbusy(ahci, port, slot);
+ahci_port_check_nonbusy(ahci, cmd);
 ahci_port_check_cmd_sanity(ahci, cmd);
 if (cmd->interrupts & AHCI_PX_IS_DHRS) {
 ahci_port_check_d2h_sanity(ahci, port, slot);
diff --git a/tests/qtest/libqos/ahci.h b/tests/qtest/libqos/ahci.h
index 88835b6228..2234f46862 100644
--- a/tests/qtest/libqos/ahci.h
+++ b/tests/qtest/libqos/ahci.h
@@ -594,7 +594,7 @@ void ahci_port_check_error(AHCIQState *ahci, uint8_t port,
uint32_t imask, uint8_t emask);
 void ahci_port_check_interrupts(AHCIQState *ahci, uint8_t port,
 uint32_t intr_mask);
-void ahci_port_check_nonbusy(AHCIQState *ahci, uint8_t port, uint8_t slot);
+void ahci_port_check_nonbusy(AHCIQState *ahci, AHCICommand *cmd);
 void ahci_port_check_d2h_sanity(AHCIQState *ahci, uint8_t port, uint8_t slot);
 void ahci_port_check_pio_sanity(AHCIQState *ahci, AHCICommand *cmd);
 void ahci_port_check_cmd_sanity(AHCIQState *ahci, AHCICommand *cmd);
-- 
2.40.1




[PATCH 0/5] improve ahci test suite

2023-06-08 Thread Niklas Cassel
From: Niklas Cassel 

Hello John,

This series should be applied on top of the series called:
"[PATCH v2 0/8] misc AHCI cleanups"
which can be found here:
https://lists.nongnu.org/archive/html/qemu-block/2023-06/msg00038.html

This series improves the ahci test suite to be in line with
the AHCI specification when it comes to taskfile errors.

Theoretically these commits could be squashed with the corresponding
patch for the QEMU model, however, that would lose the commit log for
the test patches, so I prefer to keep them separate.

Please review.


Kind regards,
Niklas

Niklas Cassel (5):
  libqos/ahci: fix ahci_command_wait()
  libqos/ahci: fix ahci_port_check_nonbusy()
  libqos/ahci: simplify ahci_port_check_error()
  libqos/ahci: improve ahci_port_check_error()
  libqos/ahci: perform mandatory error recovery on error

 tests/qtest/libqos/ahci.c | 106 --
 tests/qtest/libqos/ahci.h |   8 ++-
 2 files changed, 83 insertions(+), 31 deletions(-)

-- 
2.40.1




[PATCH 1/5] libqos/ahci: fix ahci_command_wait()

2023-06-08 Thread Niklas Cassel
From: Niklas Cassel 

A command that has ERR_STAT set, does not get to clear PxCI.
See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
and 5.3.16.5 ERR:FatalTaskfile.

Some tests, e.g. test_atapi_tray() call ahci_atapi_test_ready() with
ready == false, intentionally sending a command that will cause an error.

Thus, ahci_command_wait() cannot simply wait indefinitely for PxCI to be
cleared, it also has to take ERR_STAT into account.

Signed-off-by: Niklas Cassel 
---
 tests/qtest/libqos/ahci.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tests/qtest/libqos/ahci.c b/tests/qtest/libqos/ahci.c
index f53f12aa99..c1d571e477 100644
--- a/tests/qtest/libqos/ahci.c
+++ b/tests/qtest/libqos/ahci.c
@@ -1207,9 +1207,10 @@ void ahci_command_wait(AHCIQState *ahci, AHCICommand 
*cmd)
 
 #define RSET(REG, MASK) (BITSET(ahci_px_rreg(ahci, cmd->port, (REG)), (MASK)))
 
-while (RSET(AHCI_PX_TFD, AHCI_PX_TFD_STS_BSY) ||
-   RSET(AHCI_PX_CI, 1 << cmd->slot) ||
-   (cmd->props->ncq && RSET(AHCI_PX_SACT, 1 << cmd->slot))) {
+while (!RSET(AHCI_PX_TFD, AHCI_PX_TFD_STS_ERR) &&
+   (RSET(AHCI_PX_TFD, AHCI_PX_TFD_STS_BSY) ||
+RSET(AHCI_PX_CI, 1 << cmd->slot) ||
+(cmd->props->ncq && RSET(AHCI_PX_SACT, 1 << cmd->slot {
 usleep(50);
 }
 
-- 
2.40.1




Re: [PATCH v2 5/8] hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set

2023-06-07 Thread Niklas Cassel
On Wed, Jun 07, 2023 at 06:01:17PM +0200, Niklas Cassel wrote:
> On Mon, Jun 05, 2023 at 08:19:43PM -0400, John Snow wrote:
> > On Thu, Jun 1, 2023 at 9:46 AM Niklas Cassel  wrote:
> > >
> > > From: Niklas Cassel 
> > >
> > > For NCQ, PxCI is cleared on command queued successfully.
> > > For non-NCQ, PxCI is cleared on command completed successfully.
> > > Successfully means ERR_STAT, BUSY and DRQ are all cleared.
> > >
> > > A command that has ERR_STAT set, does not get to clear PxCI.
> > > See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
> > > and 5.3.16.5 ERR:FatalTaskfile.
> > >
> > > In the case of non-NCQ commands, not clearing PxCI is needed in order
> > > for host software to be able to see which command slot that failed.
> > >
> > > Signed-off-by: Niklas Cassel 
> > 
> > This patch causes the ahci test suite to hang. You might just need to
> > update the AHCI test suite.
> > 
> > "make check" will hang on the ahci-test as of this patch.
> 
> Argh :)
> 
> Is there any simple way to run only the ahci test suite?

To answer my own question:
QTEST_QEMU_BINARY=./build/qemu-system-x86_64 QTEST_QEMU_IMG=./build/qemu-img 
gtester -k --verbose -m=quick build/tests/qtest/ahci-test -o test_log.xml


Kind regards,
Niklas

Re: [PATCH v2 5/8] hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set

2023-06-07 Thread Niklas Cassel
On Mon, Jun 05, 2023 at 08:19:43PM -0400, John Snow wrote:
> On Thu, Jun 1, 2023 at 9:46 AM Niklas Cassel  wrote:
> >
> > From: Niklas Cassel 
> >
> > For NCQ, PxCI is cleared on command queued successfully.
> > For non-NCQ, PxCI is cleared on command completed successfully.
> > Successfully means ERR_STAT, BUSY and DRQ are all cleared.
> >
> > A command that has ERR_STAT set, does not get to clear PxCI.
> > See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
> > and 5.3.16.5 ERR:FatalTaskfile.
> >
> > In the case of non-NCQ commands, not clearing PxCI is needed in order
> > for host software to be able to see which command slot that failed.
> >
> > Signed-off-by: Niklas Cassel 
> 
> This patch causes the ahci test suite to hang. You might just need to
> update the AHCI test suite.
> 
> "make check" will hang on the ahci-test as of this patch.

Argh :)

Is there any simple way to run only the ahci test suite?

"make check" and "make check-qtest" are running many tests that I'm not
interested in.


Kind regards,
Niklas

Re: [PATCH 0/9] misc AHCI cleanups

2023-06-01 Thread Niklas Cassel
On Wed, May 17, 2023 at 01:06:06PM -0400, John Snow wrote:
> On Fri, Apr 28, 2023 at 9:22 AM Niklas Cassel  wrote:
> >
> > From: Niklas Cassel 
> >
> > Hello John,
> >
> 
> Hi Niklas!
> 
> I haven't been actively involved with AHCI for a while, so I am not
> sure I can find the time to give this a deep scrub. I'm going to
> assume based on '@wdc.com` that you probably know a thing or two more
> about AHCI than I do, though. Can you tell me what testing you've
> performed on this? As long as it doesn't cause any obvious
> regressions, we might be able to push it through, but it might not be
> up to me anymore. I can give it a review on technical merit, but with
> regards to "correctness" I have to admit I am flying blind.

Hello John,

The testing is mostly using linux and injecting NCQ errors using some
additional QEMU patches that are not part of this series.

> 
> (I haven't looked at the patches yet, but if in your commit messages
> you can point to the relevant sections of the relevant specifications,
> that'd help immensely.)
> 
> > Here comes some misc AHCI cleanups.
> >
> > Most are related to error handling.
> 
> I've always found this the most difficult part to verify. In a real
> system, the registers between AHCI and the actual hard disk are
> *physically separate*, and they update at specific times based on the
> transmission of the FIS packets. The model in QEMU doesn't bother with
> a perfect reproduction of that, and so it's been a little tough to
> verify correctness. I tried to improve it a bit back in the day, but
> my understanding has always been a bit limited :)
> 
> Are there any vendor tools you're aware of that have test suites we
> could use to verify behavior?

Unfortunately, I don't know of any good test suite.

> A question for you: is it worth solidifying which ATA specification we
> conform to? I don't believe we adhere to any one specific model,
> because a lot of the code is shared between PATA and SATA, and we "in
> theory" support IDE hard drives for fairly old guest operating systems
> that may or may not predate the DMA extensions. As a result, the
> actual device emulation is kind of a mish-mash of different ATA
> specifications, generally whichever version provided the
> least-abstracted detail and was easy to implement.
> 
> If you're adding the logging features, that seems to push us towards
> the newer end of the spectrum, but I'm not sure if this causes any
> problems for guest operating systems doing probing to guess what kind
> of device they're talking to.
> 
> Any input?

I agree.

In my next series, after we have General Purpose Logging support,
I intend to bump the major version to indicate ACS-5 support.
I will need to verify that we are not missing any other major
feature from ACS-5 first though (other than GPL).


Kind regards,
Niklas

[PATCH v2 8/8] hw/ide/ahci: fix broken SError handling

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

When encountering an NCQ error, you should not write the NCQ tag to the
SError register. This is completely wrong.

The SError register has a clear definition, where each bit represents a
different error, see PxSERR definition in AHCI 1.3.1.

If we write a random value (like the NCQ tag) in SError, e.g. Linux will
read SError, and will trigger arbitrary error handling depending on the
NCQ tag that happened to be executing.

In case of success, ncq_cb() will call ncq_finish().
In case of error, ncq_cb() will call ncq_err() (which will clear
ncq_tfs->used), and then call ncq_finish(), thus using ncq_tfs->used is
sufficient to tell if finished should get set or not.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index ef6c9fc378..d0a774bc17 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1012,7 +1012,6 @@ static void ncq_err(NCQTransferState *ncq_tfs)
 
 ide_state->error = ABRT_ERR;
 ide_state->status = READY_STAT | ERR_STAT;
-ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
 qemu_sglist_destroy(_tfs->sglist);
 ncq_tfs->used = 0;
 }
@@ -1022,7 +1021,7 @@ static void ncq_finish(NCQTransferState *ncq_tfs)
 /* If we didn't error out, set our finished bit. Errored commands
  * do not get a bit set for the SDB FIS ACT register, nor do they
  * clear the outstanding bit in scr_act (PxSACT). */
-if (!(ncq_tfs->drive->port_regs.scr_err & (1 << ncq_tfs->tag))) {
+if (ncq_tfs->used) {
 ncq_tfs->drive->finished |= (1 << ncq_tfs->tag);
 }
 
-- 
2.40.1




[PATCH v2 7/8] hw/ide/ahci: fix ahci_write_fis_sdb()

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

When there is an error, we need to raise a TFES error irq, see AHCI 1.3.1,
5.3.13.1 SDB:Entry.

If ERR_STAT is set, we jump to state ERR:FatalTaskfile, which will raise
a TFES IRQ unconditionally, regardless if the I bit is set in the FIS or
not.

Thus, we should never raise a normal IRQ after having sent an error IRQ.

It is valid to signal successfully completed commands as finished in the
same SDB FIS that generates the error IRQ. The important thing is that
commands that did not complete successfully (e.g. commands that were
aborted, do not get the finished bit set).

Before this commit, there was never a TFES IRQ raised on NCQ error.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 12aaadc554..ef6c9fc378 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -806,8 +806,14 @@ static void ahci_write_fis_sdb(AHCIState *s, 
NCQTransferState *ncq_tfs)
 pr->scr_act &= ~ad->finished;
 ad->finished = 0;
 
-/* Trigger IRQ if interrupt bit is set (which currently, it always is) */
-if (sdb_fis->flags & 0x40) {
+/*
+ * TFES IRQ is always raised if ERR_STAT is set, regardless of I bit.
+ * If ERR_STAT is not set, trigger SDBS IRQ if interrupt bit is set
+ * (which currently, it always is).
+ */
+if (sdb_fis->status & ERR_STAT) {
+ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_TFES);
+} else if (sdb_fis->flags & 0x40) {
 ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_SDBS);
 }
 }
-- 
2.40.1




[PATCH v2 4/8] hw/ide/ahci: simplify and document PxCI handling

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

The AHCI spec states that:
For NCQ, PxCI is cleared on command queued successfully.

For non-NCQ, PxCI is cleared on command completed successfully.
(A non-NCQ command that completes with error does not clear PxCI.)

The current QEMU implementation either clears PxCI in check_cmd(),
or in ahci_cmd_done().

check_cmd() will clear PxCI for a command if handle_cmd() returns 0.
handle_cmd() will return -1 if BUSY or DRQ is set.

The QEMU implementation for NCQ commands will currently not set BUSY
or DRQ, so they will always have PxCI cleared by handle_cmd().
ahci_cmd_done() will never even get called for NCQ commands.

Non-NCQ commands are executed by ide_bus_exec_cmd().
Non-NCQ commands in QEMU are implemented either in a sync or in an async
way.

For non-NCQ commands implemented in a sync way, the command handler will
return true, and when ide_bus_exec_cmd() sees that a command handler
returns true, it will call ide_cmd_done() (which will call
ahci_cmd_done()). For a command implemented in a sync way,
ahci_cmd_done() will do nothing (since busy_slot is not set). Instead,
after ide_bus_exec_cmd() has finished, check_cmd() will clear PxCI for
these commands.

For non-NCQ commands implemented in an async way (using either aiocb or
pio_aiocb), the command handler will return false, ide_bus_exec_cmd()
will not call ide_cmd_done(), instead it is expected that the async
callback function will call ide_cmd_done() once the async command is
done. handle_cmd() will set busy_slot, if and only if BUSY or DRQ is
set, and this is checked _after_ ide_bus_exec_cmd() has returned.
handle_cmd() will return -1, so check_cmd() will not clear PxCI.
When the async callback calls ide_cmd_done() (which will call
ahci_cmd_done()), it will see that busy_slot is set, and
ahci_cmd_done() will clear PxCI.

This seems racy, since busy_slot is set _after_ ide_bus_exec_cmd() has
returned. The callback might come before busy_slot gets set. And it is
quite confusing that ahci_cmd_done() will be called for all non-NCQ
commands when the command is done, but will only clear PxCI in certain
cases, even though it will always write a D2H FIS and raise an IRQ.

Even worse, in the case where ahci_cmd_done() does not clear PxCI, it
still raises an IRQ. Host software might thus read an old PxCI value,
since PxCI is cleared (by check_cmd()) after the IRQ has been raised.

Try to simplify this by always setting busy_slot for non-NCQ commands,
such that ahci_cmd_done() will always be responsible for clearing PxCI
for non-NCQ commands.

For NCQ commands, clear PxCI when we receive the D2H FIS, but before
raising the IRQ, see AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and
RegFIS:ClearCI.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 70 ---
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 4b272397fd..3deaf01add 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -41,9 +41,10 @@
 #include "trace.h"
 
 static void check_cmd(AHCIState *s, int port);
-static int handle_cmd(AHCIState *s, int port, uint8_t slot);
+static void handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
 static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
+static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t slot);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -591,9 +592,8 @@ static void check_cmd(AHCIState *s, int port)
 
 if ((pr->cmd & PORT_CMD_START) && pr->cmd_issue) {
 for (slot = 0; (slot < 32) && pr->cmd_issue; slot++) {
-if ((pr->cmd_issue & (1U << slot)) &&
-!handle_cmd(s, port, slot)) {
-pr->cmd_issue &= ~(1U << slot);
+if (pr->cmd_issue & (1U << slot)) {
+handle_cmd(s, port, slot);
 }
 }
 }
@@ -1123,6 +1123,22 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+/*
+ * A NCQ command clears the bit in PxCI after the command has been QUEUED
+ * successfully (ERROR not set, BUSY and DRQ cleared).
+ *
+ * For NCQ commands, PxCI will always be cleared here.
+ *
+ * (Once the NCQ command is COMPLETED, the device will send a SDB FIS with
+ * the interrupt bit set, which will clear PxSACT and raise an interrupt.)
+ */
+ahci_clear_cmd_issue(ad, slot);
+
+/*
+ * In reality, for NCQ commands, PxCI is cleared after receiving a D2H FIS
+ * without the interrupt bit set, but since ahci_write_fis_d2h() can raise
+ * an IRQ on error, we need to call them in reverse order.
+ */
 ahci_write_fis_d2h(ad, false);
 
 ncq_tfs->used = 1;
@@ -1197,6 +1213,7 @@ st

[PATCH v2 6/8] hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

According to AHCI 1.3.1 definition of PxSACT:
This field is cleared when PxCMD.ST is written from a '1' to a '0' by
software. This field is not cleared by a COMRESET or a software reset.

According to AHCI 1.3.1 definition of PxCI:
This field is also cleared when PxCMD.ST is written from a '1' to a '0'
by software.

Clearing PxCMD.ST is part of the error recovery procedure, see
AHCI 1.3.1, section "6.2 Error Recovery".

If we don't clear PxCI on error recovery, the previous command will
incorrectly still be marked as pending after error recovery.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 1237f94ddc..12aaadc554 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -329,6 +329,11 @@ static void ahci_port_write(AHCIState *s, int port, int 
offset, uint32_t val)
 ahci_check_irq(s);
 break;
 case AHCI_PORT_REG_CMD:
+if ((pr->cmd & PORT_CMD_START) && !(val & PORT_CMD_START)) {
+pr->scr_act = 0;
+pr->cmd_issue = 0;
+}
+
 /* Block any Read-only fields from being set;
  * including LIST_ON and FIS_ON.
  * The spec requires to set ICC bits to zero after the ICC change
-- 
2.40.1




[PATCH v2 5/8] hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

For NCQ, PxCI is cleared on command queued successfully.
For non-NCQ, PxCI is cleared on command completed successfully.
Successfully means ERR_STAT, BUSY and DRQ are all cleared.

A command that has ERR_STAT set, does not get to clear PxCI.
See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
and 5.3.16.5 ERR:FatalTaskfile.

In the case of non-NCQ commands, not clearing PxCI is needed in order
for host software to be able to see which command slot that failed.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 3deaf01add..1237f94ddc 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1518,7 +1518,8 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 {
 IDEState *ide_state = >port.ifs[0];
 
-if (!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
+if (!(ide_state->status & ERR_STAT) &&
+!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
 ad->port_regs.cmd_issue &= ~(1 << slot);
 }
 }
@@ -1527,6 +1528,7 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 static void ahci_cmd_done(const IDEDMA *dma)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
+IDEState *ide_state = >port.ifs[0];
 
 trace_ahci_cmd_done(ad->hba, ad->port_no);
 
@@ -1543,7 +1545,8 @@ static void ahci_cmd_done(const IDEDMA *dma)
  */
 ahci_write_fis_d2h(ad, true);
 
-if (ad->port_regs.cmd_issue && !ad->check_bh) {
+if (!(ide_state->status & ERR_STAT) &&
+ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new_guarded(ahci_check_cmd_bh, ad,
>mem_reentrancy_guard);
 qemu_bh_schedule(ad->check_bh);
-- 
2.40.1




[PATCH v2 3/8] hw/ide/ahci: write D2H FIS on when processing NCQ command

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

The way that BUSY + PxCI is cleared for NCQ (FPDMA QUEUED) commands is
described in SATA 3.5a Gold:

11.15 FPDMA QUEUED command protocol
DFPDMAQ2: ClearInterfaceBsy
"Transmit Register Device to Host FIS with the BSY bit cleared to zero
and the DRQ bit cleared to zero and Interrupt bit cleared to zero to
mark interface ready for the next command."

PxCI is currently cleared by handle_cmd(), but we don't write the D2H
FIS to the FIS Receive Area that actually caused PxCI to be cleared.

Similar to how ahci_pio_transfer() calls ahci_write_fis_pio() with an
additional parameter to write a PIO Setup FIS without raising an IRQ,
add a parameter to ahci_write_fis_d2h() so that ahci_write_fis_d2h()
also can write the FIS to the FIS Receive Area without raising an IRQ.

Change process_ncq_command() to call ahci_write_fis_d2h() without
raising an IRQ (similar to ahci_pio_transfer()), such that the FIS
Receive Area is in sync with the PxTFD shadow register.

E.g. Linux reads status and error fields from the FIS Receive Area
directly, so it is wise to keep the FIS Receive Area and the PxTFD
shadow register in sync.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 48d550f633..4b272397fd 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -43,7 +43,7 @@
 static void check_cmd(AHCIState *s, int port);
 static int handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
-static bool ahci_write_fis_d2h(AHCIDevice *ad);
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -618,7 +618,7 @@ static void ahci_init_d2h(AHCIDevice *ad)
 return;
 }
 
-if (ahci_write_fis_d2h(ad)) {
+if (ahci_write_fis_d2h(ad, true)) {
 ad->init_d2h_sent = true;
 /* We're emulating receiving the first Reg H2D Fis from the device;
  * Update the SIG register, but otherwise proceed as normal. */
@@ -850,7 +850,7 @@ static void ahci_write_fis_pio(AHCIDevice *ad, uint16_t 
len, bool pio_fis_i)
 }
 }
 
-static bool ahci_write_fis_d2h(AHCIDevice *ad)
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i)
 {
 AHCIPortRegs *pr = >port_regs;
 uint8_t *d2h_fis;
@@ -864,7 +864,7 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 d2h_fis = >res_fis[RES_FIS_RFIS];
 
 d2h_fis[0] = SATA_FIS_TYPE_REGISTER_D2H;
-d2h_fis[1] = (1 << 6); /* interrupt bit */
+d2h_fis[1] = d2h_fis_i ? (1 << 6) : 0; /* interrupt bit */
 d2h_fis[2] = s->status;
 d2h_fis[3] = s->error;
 
@@ -890,7 +890,10 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_TFES);
 }
 
-ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+if (d2h_fis_i) {
+ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+}
+
 return true;
 }
 
@@ -1120,6 +1123,8 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+ahci_write_fis_d2h(ad, false);
+
 ncq_tfs->used = 1;
 ncq_tfs->drive = ad;
 ncq_tfs->slot = slot;
@@ -1506,7 +1511,7 @@ static void ahci_cmd_done(const IDEDMA *dma)
 }
 
 /* update d2h status */
-ahci_write_fis_d2h(ad);
+ahci_write_fis_d2h(ad, true);
 
 if (ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new_guarded(ahci_check_cmd_bh, ad,
-- 
2.40.1




[PATCH v2 2/8] hw/ide/core: set ERR_STAT in unsupported command completion

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

Currently, the first time sending an unsupported command
(e.g. READ LOG DMA EXT) will not have ERR_STAT set in the completion.
Sending the unsupported command again, will correctly have ERR_STAT set.

When ide_cmd_permitted() returns false, it calls ide_abort_command().
ide_abort_command() first calls ide_transfer_stop(), which will call
ide_transfer_halt() and ide_cmd_done(), after that ide_abort_command()
sets ERR_STAT in status.

ide_cmd_done() for AHCI will call ahci_write_fis_d2h() which writes the
current status in the FIS, and raises an IRQ. (The status here will not
have ERR_STAT set!).

Thus, we cannot call ide_transfer_stop() before setting ERR_STAT, as
ide_transfer_stop() will result in the FIS being written and an IRQ
being raised.

The reason why it works the second time, is that ERR_STAT will still
be set from the previous command, so when writing the FIS, the
completion will correctly have ERR_STAT set.

Set ERR_STAT before writing the FIS (calling cmd_done), so that we will
raise an error IRQ correctly when receiving an unsupported command.

Signed-off-by: Niklas Cassel 
---
 hw/ide/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index de48ff9f86..07971c0218 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -533,9 +533,9 @@ BlockAIOCB *ide_issue_trim(
 
 void ide_abort_command(IDEState *s)
 {
-ide_transfer_stop(s);
 s->status = READY_STAT | ERR_STAT;
 s->error = ABRT_ERR;
+ide_transfer_stop(s);
 }
 
 static void ide_set_retry(IDEState *s)
-- 
2.40.1




[PATCH v2 1/8] hw/ide/ahci: remove stray backslash

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

This backslash obviously does not belong here, so remove it.

Signed-off-by: Niklas Cassel 
Reviewed-by: Philippe Mathieu-Daudé 
Reviewed-by: John Snow 
---
 hw/ide/ahci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 4e76d6b191..48d550f633 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -690,7 +690,7 @@ static void ahci_reset_port(AHCIState *s, int port)
 
 s->dev[port].port_state = STATE_RUN;
 if (ide_state->drive_kind == IDE_CD) {
-ahci_set_signature(d, SATA_SIGNATURE_CDROM);\
+ahci_set_signature(d, SATA_SIGNATURE_CDROM);
 ide_state->status = SEEK_STAT | WRERR_STAT | READY_STAT;
 } else {
 ahci_set_signature(d, SATA_SIGNATURE_DISK);
-- 
2.40.1




[PATCH v2 0/8] misc AHCI cleanups

2023-06-01 Thread Niklas Cassel
From: Niklas Cassel 

Hello John,

Here comes some misc AHCI cleanups.

Most are related to error handling.

Please review.

(I'm also working on a second series which will add support for
READ LOG EXT and READ LOG DMA EXT, but I will send that one out
once it is ready. (It might take a couple of weeks still, since
I've been a bit busy lately.))


Changes since v1:
-Picked up Reviewed-by tags.
 (I did not convert your ACK to explicit Acked-by tags, since I assume
 that the patches will go via your tree).
-Rebased on master in order to fix a conflict in patch
 "hw/ide/ahci: simplify and document PxCI handling".
-Dropped patch "hw/ide/ahci: trigger either error IRQ or regular IRQ, not both"
 for now, as it caused a boot time regression in SeaBIOS.
 This appears to be a bug in SeaBIOS, for more info see:
 
https://mail.coreboot.org/hyperkitty/list/seab...@seabios.org/thread/RIHV3FZ4EVMAJA4TEDPASKNYV7V72O4C/
 I will resend the QEMU patch separately once the SeaBIOS patch has been
 merged, and once QEMU has updated to a SeaBIOS tag that includes the fix.


Kind regards,
Niklas

Niklas Cassel (8):
  hw/ide/ahci: remove stray backslash
  hw/ide/core: set ERR_STAT in unsupported command completion
  hw/ide/ahci: write D2H FIS on when processing NCQ command
  hw/ide/ahci: simplify and document PxCI handling
  hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set
  hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared
  hw/ide/ahci: fix ahci_write_fis_sdb()
  hw/ide/ahci: fix broken SError handling

 hw/ide/ahci.c | 112 +++---
 hw/ide/core.c |   2 +-
 2 files changed, 81 insertions(+), 33 deletions(-)

-- 
2.40.1




Re: [PATCH 2/9] hw/ide/core: set ERR_STAT in unsupported command completion

2023-06-01 Thread Niklas Cassel
On Wed, May 17, 2023 at 05:12:57PM -0400, John Snow wrote:
> On Fri, Apr 28, 2023 at 9:22 AM Niklas Cassel  wrote:
> >
> > From: Niklas Cassel 
> >
> > Currently, the first time sending an unsupported command
> > (e.g. READ LOG DMA EXT) will not have ERR_STAT set in the completion.
> > Sending the unsupported command again, will correctly have ERR_STAT set.
> >
> > When ide_cmd_permitted() returns false, it calls ide_abort_command().
> > ide_abort_command() first calls ide_transfer_stop(), which will call
> > ide_transfer_halt() and ide_cmd_done(), after that ide_abort_command()
> > sets ERR_STAT in status.
> >
> > ide_cmd_done() for AHCI will call ahci_write_fis_d2h() which writes the
> > current status in the FIS, and raises an IRQ. (The status here will not
> > have ERR_STAT set!).
> >
> > Thus, we cannot call ide_transfer_stop() before setting ERR_STAT, as
> > ide_transfer_stop() will result in the FIS being written and an IRQ
> > being raised.
> >
> > The reason why it works the second time, is that ERR_STAT will still
> > be set from the previous command, so when writing the FIS, the
> > completion will correctly have ERR_STAT set.
> >
> > Set ERR_STAT before writing the FIS (calling cmd_done), so that we will
> > raise an error IRQ correctly when receiving an unsupported command.
> >
> > Signed-off-by: Niklas Cassel 
> > ---
> >  hw/ide/core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/hw/ide/core.c b/hw/ide/core.c
> > index 45d14a25e9..c144d1155d 100644
> > --- a/hw/ide/core.c
> > +++ b/hw/ide/core.c
> > @@ -531,9 +531,9 @@ BlockAIOCB *ide_issue_trim(
> >
> >  void ide_abort_command(IDEState *s)
> >  {
> > -ide_transfer_stop(s);
> >  s->status = READY_STAT | ERR_STAT;
> >  s->error = ABRT_ERR;
> > +ide_transfer_stop(s);
> >  }
> >
> >  static void ide_set_retry(IDEState *s)
> > --
> > 2.40.0
> >
> 
> Seems OK at a glance. Does this change the behavior of
> ide_transfer_stop at all? I guess we've been using this order of
> operations since 2008 at least. I didn't know C then.

Hello John,

Not as far as I can see.


Kind regards,
Niklas

[PATCH 9/9] hw/ide/ahci: fix broken SError handling

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

When encountering an NCQ error, you should not write the NCQ tag to the
SError register. This is completely wrong.

The SError register has a clear definition, where each bit represents a
different error, see PxSERR definition in AHCI 1.3.1.

If we write a random value (like the NCQ tag) in SError, e.g. Linux will
read SError, and will trigger arbitrary error handling depending on the
NCQ tag that happened to be executing.

In case of success, ncq_cb() will call ncq_finish().
In case of error, ncq_cb() will call ncq_err() (which will clear
ncq_tfs->used), and then call ncq_finish(), thus using ncq_tfs->used is
sufficient to tell if finished should get set or not.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 4950d3575e..09a14408e3 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1011,7 +1011,6 @@ static void ncq_err(NCQTransferState *ncq_tfs)
 
 ide_state->error = ABRT_ERR;
 ide_state->status = READY_STAT | ERR_STAT;
-ncq_tfs->drive->port_regs.scr_err |= (1 << ncq_tfs->tag);
 qemu_sglist_destroy(_tfs->sglist);
 ncq_tfs->used = 0;
 }
@@ -1021,7 +1020,7 @@ static void ncq_finish(NCQTransferState *ncq_tfs)
 /* If we didn't error out, set our finished bit. Errored commands
  * do not get a bit set for the SDB FIS ACT register, nor do they
  * clear the outstanding bit in scr_act (PxSACT). */
-if (!(ncq_tfs->drive->port_regs.scr_err & (1 << ncq_tfs->tag))) {
+if (ncq_tfs->used) {
 ncq_tfs->drive->finished |= (1 << ncq_tfs->tag);
 }
 
-- 
2.40.0




[PATCH 4/9] hw/ide/ahci: simplify and document PxCI handling

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

The AHCI spec states that:
For NCQ, PxCI is cleared on command queued successfully.

For non-NCQ, PxCI is cleared on command completed successfully.
(A non-NCQ command that completes with error does not clear PxCI.)

The current QEMU implementation either clears PxCI in check_cmd(),
or in ahci_cmd_done().

check_cmd() will clear PxCI for a command if handle_cmd() returns 0.
handle_cmd() will return -1 if BUSY or DRQ is set.

The QEMU implementation for NCQ commands will currently not set BUSY
or DRQ, so they will always have PxCI cleared by handle_cmd().
ahci_cmd_done() will never even get called for NCQ commands.

Non-NCQ commands are executed by ide_bus_exec_cmd().
Non-NCQ commands in QEMU are implemented either in a sync or in an async
way.

For non-NCQ commands implemented in a sync way, the command handler will
return true, and when ide_bus_exec_cmd() sees that a command handler
returns true, it will call ide_cmd_done() (which will call
ahci_cmd_done()). For a command implemented in a sync way,
ahci_cmd_done() will do nothing (since busy_slot is not set). Instead,
after ide_bus_exec_cmd() has finished, check_cmd() will clear PxCI for
these commands.

For non-NCQ commands implemented in an async way (using either aiocb or
pio_aiocb), the command handler will return false, ide_bus_exec_cmd()
will not call ide_cmd_done(), instead it is expected that the async
callback function will call ide_cmd_done() once the async command is
done. handle_cmd() will set busy_slot, if and only if BUSY or DRQ is
set, and this is checked _after_ ide_bus_exec_cmd() has returned.
handle_cmd() will return -1, so check_cmd() will not clear PxCI.
When the async callback calls ide_cmd_done() (which will call
ahci_cmd_done()), it will see that busy_slot is set, and
ahci_cmd_done() will clear PxCI.

This seems racy, since busy_slot is set _after_ ide_bus_exec_cmd() has
returned. The callback might come before busy_slot gets set. And it is
quite confusing that ahci_cmd_done() will be called for all non-NCQ
commands when the command is done, but will only clear PxCI in certain
cases, even though it will always write a D2H FIS and raise an IRQ.

Even worse, in the case where ahci_cmd_done() does not clear PxCI, it
still raises an IRQ. Host software might thus read an old PxCI value,
since PxCI is cleared (by check_cmd()) after the IRQ has been raised.

Try to simplify this by always setting busy_slot for non-NCQ commands,
such that ahci_cmd_done() will always be responsible for clearing PxCI
for non-NCQ commands.

For NCQ commands, clear PxCI when we receive the D2H FIS, but before
raising the IRQ, see AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and
RegFIS:ClearCI.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 70 ---
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 62aebc8de7..9d79b071b8 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -41,9 +41,10 @@
 #include "trace.h"
 
 static void check_cmd(AHCIState *s, int port);
-static int handle_cmd(AHCIState *s, int port, uint8_t slot);
+static void handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
 static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
+static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t slot);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -591,9 +592,8 @@ static void check_cmd(AHCIState *s, int port)
 
 if ((pr->cmd & PORT_CMD_START) && pr->cmd_issue) {
 for (slot = 0; (slot < 32) && pr->cmd_issue; slot++) {
-if ((pr->cmd_issue & (1U << slot)) &&
-!handle_cmd(s, port, slot)) {
-pr->cmd_issue &= ~(1U << slot);
+if (pr->cmd_issue & (1U << slot)) {
+handle_cmd(s, port, slot);
 }
 }
 }
@@ -1123,6 +1123,22 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+/*
+ * A NCQ command clears the bit in PxCI after the command has been QUEUED
+ * successfully (ERROR not set, BUSY and DRQ cleared).
+ *
+ * For NCQ commands, PxCI will always be cleared here.
+ *
+ * (Once the NCQ command is COMPLETED, the device will send a SDB FIS with
+ * the interrupt bit set, which will clear PxSACT and raise an interrupt.)
+ */
+ahci_clear_cmd_issue(ad, slot);
+
+/*
+ * In reality, for NCQ commands, PxCI is cleared after receiving a D2H FIS
+ * without the interrupt bit set, but since ahci_write_fis_d2h() can raise
+ * an IRQ on error, we need to call them in reverse order.
+ */
 ahci_write_fis_d2h(ad, false);
 
 ncq_tfs->used = 1;
@@ -1197,6 +1213,7 @@ st

[PATCH 5/9] hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

For NCQ, PxCI is cleared on command queued successfully.
For non-NCQ, PxCI is cleared on command completed successfully.
Successfully means ERR_STAT, BUSY and DRQ are all cleared.

A command that has ERR_STAT set, does not get to clear PxCI.
See AHCI 1.3.1, section 5.3.8, states RegFIS:Entry and RegFIS:ClearCI,
and 5.3.16.5 ERR:FatalTaskfile.

In the case of non-NCQ commands, not clearing PxCI is needed in order
for host software to be able to see which command slot that failed.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 9d79b071b8..366929132b 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -1518,7 +1518,8 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 {
 IDEState *ide_state = >port.ifs[0];
 
-if (!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
+if (!(ide_state->status & ERR_STAT) &&
+!(ide_state->status & (BUSY_STAT | DRQ_STAT))) {
 ad->port_regs.cmd_issue &= ~(1 << slot);
 }
 }
@@ -1527,6 +1528,7 @@ static void ahci_clear_cmd_issue(AHCIDevice *ad, uint8_t 
slot)
 static void ahci_cmd_done(const IDEDMA *dma)
 {
 AHCIDevice *ad = DO_UPCAST(AHCIDevice, dma, dma);
+IDEState *ide_state = >port.ifs[0];
 
 trace_ahci_cmd_done(ad->hba, ad->port_no);
 
@@ -1543,7 +1545,8 @@ static void ahci_cmd_done(const IDEDMA *dma)
  */
 ahci_write_fis_d2h(ad, true);
 
-if (ad->port_regs.cmd_issue && !ad->check_bh) {
+if (!(ide_state->status & ERR_STAT) &&
+ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new(ahci_check_cmd_bh, ad);
 qemu_bh_schedule(ad->check_bh);
 }
-- 
2.40.0




[PATCH 3/9] hw/ide/ahci: write D2H FIS on when processing NCQ command

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

The way that BUSY + PxCI is cleared for NCQ (FPDMA QUEUED) commands is
described in SATA 3.5a Gold:

11.15 FPDMA QUEUED command protocol
DFPDMAQ2: ClearInterfaceBsy
"Transmit Register Device to Host FIS with the BSY bit cleared to zero
and the DRQ bit cleared to zero and Interrupt bit cleared to zero to
mark interface ready for the next command."

PxCI is currently cleared by handle_cmd(), but we don't write the D2H
FIS to the FIS Receive Area that actually caused PxCI to be cleared.

Similar to how ahci_pio_transfer() calls ahci_write_fis_pio() with an
additional parameter to write a PIO Setup FIS without raising an IRQ,
add a parameter to ahci_write_fis_d2h() so that ahci_write_fis_d2h()
also can write the FIS to the FIS Receive Area without raising an IRQ.

Change process_ncq_command() to call ahci_write_fis_d2h() without
raising an IRQ (similar to ahci_pio_transfer()), such that the FIS
Receive Area is in sync with the PxTFD shadow register.

E.g. Linux reads status and error fields from the FIS Receive Area
directly, so it is wise to keep the FIS Receive Area and the PxTFD
shadow register in sync.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 17 +++--
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index a36e3fb77c..62aebc8de7 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -43,7 +43,7 @@
 static void check_cmd(AHCIState *s, int port);
 static int handle_cmd(AHCIState *s, int port, uint8_t slot);
 static void ahci_reset_port(AHCIState *s, int port);
-static bool ahci_write_fis_d2h(AHCIDevice *ad);
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i);
 static void ahci_init_d2h(AHCIDevice *ad);
 static int ahci_dma_prepare_buf(const IDEDMA *dma, int32_t limit);
 static bool ahci_map_clb_address(AHCIDevice *ad);
@@ -618,7 +618,7 @@ static void ahci_init_d2h(AHCIDevice *ad)
 return;
 }
 
-if (ahci_write_fis_d2h(ad)) {
+if (ahci_write_fis_d2h(ad, true)) {
 ad->init_d2h_sent = true;
 /* We're emulating receiving the first Reg H2D Fis from the device;
  * Update the SIG register, but otherwise proceed as normal. */
@@ -850,7 +850,7 @@ static void ahci_write_fis_pio(AHCIDevice *ad, uint16_t 
len, bool pio_fis_i)
 }
 }
 
-static bool ahci_write_fis_d2h(AHCIDevice *ad)
+static bool ahci_write_fis_d2h(AHCIDevice *ad, bool d2h_fis_i)
 {
 AHCIPortRegs *pr = >port_regs;
 uint8_t *d2h_fis;
@@ -864,7 +864,7 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 d2h_fis = >res_fis[RES_FIS_RFIS];
 
 d2h_fis[0] = SATA_FIS_TYPE_REGISTER_D2H;
-d2h_fis[1] = (1 << 6); /* interrupt bit */
+d2h_fis[1] = d2h_fis_i ? (1 << 6) : 0; /* interrupt bit */
 d2h_fis[2] = s->status;
 d2h_fis[3] = s->error;
 
@@ -890,7 +890,10 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad)
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_TFES);
 }
 
-ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+if (d2h_fis_i) {
+ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
+}
+
 return true;
 }
 
@@ -1120,6 +1123,8 @@ static void process_ncq_command(AHCIState *s, int port, 
const uint8_t *cmd_fis,
 return;
 }
 
+ahci_write_fis_d2h(ad, false);
+
 ncq_tfs->used = 1;
 ncq_tfs->drive = ad;
 ncq_tfs->slot = slot;
@@ -1506,7 +1511,7 @@ static void ahci_cmd_done(const IDEDMA *dma)
 }
 
 /* update d2h status */
-ahci_write_fis_d2h(ad);
+ahci_write_fis_d2h(ad, true);
 
 if (ad->port_regs.cmd_issue && !ad->check_bh) {
 ad->check_bh = qemu_bh_new(ahci_check_cmd_bh, ad);
-- 
2.40.0




[PATCH 0/9] misc AHCI cleanups

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

Hello John,

Here comes some misc AHCI cleanups.

Most are related to error handling.

Please review.

(I'm also working on a second series which will add support for
READ LOG EXT and READ LOG DMA EXT, but I will send that one out
once it is ready.)


Kind regards,
Niklas


Niklas Cassel (9):
  hw/ide/ahci: remove stray backslash
  hw/ide/core: set ERR_STAT in unsupported command completion
  hw/ide/ahci: write D2H FIS on when processing NCQ command
  hw/ide/ahci: simplify and document PxCI handling
  hw/ide/ahci: PxCI should not get cleared when ERR_STAT is set
  hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared
  hw/ide/ahci: trigger either error IRQ or regular IRQ, not both
  hw/ide/ahci: fix ahci_write_fis_sdb()
  hw/ide/ahci: fix broken SError handling

 hw/ide/ahci.c | 111 +++---
 hw/ide/core.c |   2 +-
 2 files changed, 80 insertions(+), 33 deletions(-)

-- 
2.40.0




[PATCH 2/9] hw/ide/core: set ERR_STAT in unsupported command completion

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

Currently, the first time sending an unsupported command
(e.g. READ LOG DMA EXT) will not have ERR_STAT set in the completion.
Sending the unsupported command again, will correctly have ERR_STAT set.

When ide_cmd_permitted() returns false, it calls ide_abort_command().
ide_abort_command() first calls ide_transfer_stop(), which will call
ide_transfer_halt() and ide_cmd_done(), after that ide_abort_command()
sets ERR_STAT in status.

ide_cmd_done() for AHCI will call ahci_write_fis_d2h() which writes the
current status in the FIS, and raises an IRQ. (The status here will not
have ERR_STAT set!).

Thus, we cannot call ide_transfer_stop() before setting ERR_STAT, as
ide_transfer_stop() will result in the FIS being written and an IRQ
being raised.

The reason why it works the second time, is that ERR_STAT will still
be set from the previous command, so when writing the FIS, the
completion will correctly have ERR_STAT set.

Set ERR_STAT before writing the FIS (calling cmd_done), so that we will
raise an error IRQ correctly when receiving an unsupported command.

Signed-off-by: Niklas Cassel 
---
 hw/ide/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/core.c b/hw/ide/core.c
index 45d14a25e9..c144d1155d 100644
--- a/hw/ide/core.c
+++ b/hw/ide/core.c
@@ -531,9 +531,9 @@ BlockAIOCB *ide_issue_trim(
 
 void ide_abort_command(IDEState *s)
 {
-ide_transfer_stop(s);
 s->status = READY_STAT | ERR_STAT;
 s->error = ABRT_ERR;
+ide_transfer_stop(s);
 }
 
 static void ide_set_retry(IDEState *s)
-- 
2.40.0




[PATCH 7/9] hw/ide/ahci: trigger either error IRQ or regular IRQ, not both

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

According to AHCI 1.3.1, 5.3.8.1 RegFIS:Entry, if ERR_STAT is set,
we jump to state ERR:FatalTaskfile, which will raise a TFES IRQ
unconditionally, regardless if the I bit is set in the FIS or not.

Thus, we should never raise a normal IRQ after having sent an error
IRQ.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 2a59d0e0f5..d88961b4c0 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -891,11 +891,10 @@ static bool ahci_write_fis_d2h(AHCIDevice *ad, bool 
d2h_fis_i)
 pr->tfdata = (ad->port.ifs[0].error << 8) |
 ad->port.ifs[0].status;
 
+/* TFES IRQ is always raised if ERR_STAT is set, regardless of I bit. */
 if (d2h_fis[2] & ERR_STAT) {
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_TFES);
-}
-
-if (d2h_fis_i) {
+} else if (d2h_fis_i) {
 ahci_trigger_irq(ad->hba, ad, AHCI_PORT_IRQ_BIT_DHRS);
 }
 
-- 
2.40.0




[PATCH 6/9] hw/ide/ahci: PxSACT and PxCI is cleared when PxCMD.ST is cleared

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

According to AHCI 1.3.1 definition of PxSACT:
This field is cleared when PxCMD.ST is written from a '1' to a '0' by
software. This field is not cleared by a COMRESET or a software reset.

According to AHCI 1.3.1 definition of PxCI:
This field is also cleared when PxCMD.ST is written from a '1' to a '0'
by software.

Clearing PxCMD.ST is part of the error recovery procedure, see
AHCI 1.3.1, section "6.2 Error Recovery".

If we don't clear PxCI on error recovery, the previous command will
incorrectly still be marked as pending after error recovery.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 366929132b..2a59d0e0f5 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -329,6 +329,11 @@ static void ahci_port_write(AHCIState *s, int port, int 
offset, uint32_t val)
 ahci_check_irq(s);
 break;
 case AHCI_PORT_REG_CMD:
+if ((pr->cmd & PORT_CMD_START) && !(val & PORT_CMD_START)) {
+pr->scr_act = 0;
+pr->cmd_issue = 0;
+}
+
 /* Block any Read-only fields from being set;
  * including LIST_ON and FIS_ON.
  * The spec requires to set ICC bits to zero after the ICC change
-- 
2.40.0




[PATCH 1/9] hw/ide/ahci: remove stray backslash

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

This backslash obviously does not belong here, so remove it.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index 55902e1df7..a36e3fb77c 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -690,7 +690,7 @@ static void ahci_reset_port(AHCIState *s, int port)
 
 s->dev[port].port_state = STATE_RUN;
 if (ide_state->drive_kind == IDE_CD) {
-ahci_set_signature(d, SATA_SIGNATURE_CDROM);\
+ahci_set_signature(d, SATA_SIGNATURE_CDROM);
 ide_state->status = SEEK_STAT | WRERR_STAT | READY_STAT;
 } else {
 ahci_set_signature(d, SATA_SIGNATURE_DISK);
-- 
2.40.0




[PATCH 8/9] hw/ide/ahci: fix ahci_write_fis_sdb()

2023-04-28 Thread Niklas Cassel
From: Niklas Cassel 

When there is an error, we need to raise a TFES error irq, see AHCI 1.3.1,
5.3.13.1 SDB:Entry.

If ERR_STAT is set, we jump to state ERR:FatalTaskfile, which will raise
a TFES IRQ unconditionally, regardless if the I bit is set in the FIS or
not.

Thus, we should never raise a normal IRQ after having sent an error IRQ.

It is valid to signal successfully completed commands as finished in the
same SDB FIS that generates the error IRQ. The important thing is that
commands that did not complete successfully (e.g. commands that were
aborted, do not get the finished bit set).

Before this commit, there was never a TFES IRQ raised on NCQ error.

Signed-off-by: Niklas Cassel 
---
 hw/ide/ahci.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/hw/ide/ahci.c b/hw/ide/ahci.c
index d88961b4c0..4950d3575e 100644
--- a/hw/ide/ahci.c
+++ b/hw/ide/ahci.c
@@ -806,8 +806,14 @@ static void ahci_write_fis_sdb(AHCIState *s, 
NCQTransferState *ncq_tfs)
 pr->scr_act &= ~ad->finished;
 ad->finished = 0;
 
-/* Trigger IRQ if interrupt bit is set (which currently, it always is) */
-if (sdb_fis->flags & 0x40) {
+/*
+ * TFES IRQ is always raised if ERR_STAT is set, regardless of I bit.
+ * If ERR_STAT is not set, trigger SDBS IRQ if interrupt bit is set
+ * (which currently, it always is).
+ */
+if (sdb_fis->status & ERR_STAT) {
+ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_TFES);
+} else if (sdb_fis->flags & 0x40) {
 ahci_trigger_irq(s, ad, AHCI_PORT_IRQ_BIT_SDBS);
 }
 }
-- 
2.40.0




Re: [PATCH] hw/nvme: remove param zoned.auto_transition

2022-09-09 Thread Niklas Cassel
On Fri, Aug 12, 2022 at 01:01:37PM +0200, Niklas Cassel wrote:
> The intention of the Zoned Namespace Command Set Specification was
> never to make an automatic zone transition optional.
> 
> Excerpt from the nvmexpress.org zns mailing list:
> """
> A question came up internally on the differences between ZNS and ZAC/ZBC
> that asked about when a controller should transitions a specific zone in
> the Implicitly Opened state to Closed state.
> 
> For example, consider a ZNS SSD that supports a max of 20 active zones,
> and a max of 10 open zones, which has the following actions occur:
> 
> First, the host writes to ten empty zones, thereby transitioning 10 zones
> to the Implicitly Opened state.
> 
> Second, the host issues a write to an 11th empty zone.
> 
> Given that state, my understanding of the second part is that the ZNS SSD
> chooses one of the previously 10 zones, and transition the chosen zone to
> the Closed state, and then proceeds to write to the new zone which also
> implicitly transition it from the Empty state to the Impl. Open state.
> After this, there would be 11 active zones in total, 10 in impl. Open
> state, and one in closed state.
> 
> The above assumes that a ZNS SSD will always transition an implicitly
> opened zone to closed state when required to free up resources when
> another zone is opened. However, it isn’t strictly said in the ZNS spec.
> 
> The paragraph that should cover it is defined in section
> 2.1.1.4.1 – Managing Resources:
> The controller may transition zones in the ZSIO:Implicitly Opened state
> to the ZSC:Closed state for resource management purposes.
> 
> However, it doesn’t say “when” it should occur. Thus, as the text stand,
> it could be misinterpreted that the controller shouldn’t do close a zone
> to make room for a new zone. The issue with this, is that it makes the
> point of having implicitly managed zones moot.
> 
> The ZAC/ZBC specs is more specific and clarifies when a zone should be
> closed. I think it would be natural to the same here.
> """
> 
> While the Zoned Namespace Command Set Specification hasn't received an
> errata yet, it is quite clear that the intention was that an automatic
> zone transition was never supposed to be optional, as then the whole
> point of having implictly open zones would be pointless. Therefore,
> remove the param zoned.auto_transition, as this was never supposed to
> be controller implementation specific.
> 
> Signed-off-by: Niklas Cassel 
> ---

Gentle ping.


Kind regards,
Niklas

[PATCH] hw/nvme: remove param zoned.auto_transition

2022-08-12 Thread Niklas Cassel via
The intention of the Zoned Namespace Command Set Specification was
never to make an automatic zone transition optional.

Excerpt from the nvmexpress.org zns mailing list:
"""
A question came up internally on the differences between ZNS and ZAC/ZBC
that asked about when a controller should transitions a specific zone in
the Implicitly Opened state to Closed state.

For example, consider a ZNS SSD that supports a max of 20 active zones,
and a max of 10 open zones, which has the following actions occur:

First, the host writes to ten empty zones, thereby transitioning 10 zones
to the Implicitly Opened state.

Second, the host issues a write to an 11th empty zone.

Given that state, my understanding of the second part is that the ZNS SSD
chooses one of the previously 10 zones, and transition the chosen zone to
the Closed state, and then proceeds to write to the new zone which also
implicitly transition it from the Empty state to the Impl. Open state.
After this, there would be 11 active zones in total, 10 in impl. Open
state, and one in closed state.

The above assumes that a ZNS SSD will always transition an implicitly
opened zone to closed state when required to free up resources when
another zone is opened. However, it isn’t strictly said in the ZNS spec.

The paragraph that should cover it is defined in section
2.1.1.4.1 – Managing Resources:
The controller may transition zones in the ZSIO:Implicitly Opened state
to the ZSC:Closed state for resource management purposes.

However, it doesn’t say “when” it should occur. Thus, as the text stand,
it could be misinterpreted that the controller shouldn’t do close a zone
to make room for a new zone. The issue with this, is that it makes the
point of having implicitly managed zones moot.

The ZAC/ZBC specs is more specific and clarifies when a zone should be
closed. I think it would be natural to the same here.
"""

While the Zoned Namespace Command Set Specification hasn't received an
errata yet, it is quite clear that the intention was that an automatic
zone transition was never supposed to be optional, as then the whole
point of having implictly open zones would be pointless. Therefore,
remove the param zoned.auto_transition, as this was never supposed to
be controller implementation specific.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 35 ---
 hw/nvme/nvme.h |  1 -
 2 files changed, 12 insertions(+), 24 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba0564..b8c8075301 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -34,7 +34,6 @@
  *  aerl=,aer_max_queued=, \
  *  mdts=,vsl=, \
  *  zoned.zasl=, \
- *  zoned.auto_transition=, \
  *  sriov_max_vfs= \
  *  sriov_vq_flexible= \
  *  sriov_vi_flexible= \
@@ -106,11 +105,6 @@
  *   the minimum memory page size (CAP.MPSMIN). The default value is 0 (i.e.
  *   defaulting to the value of `mdts`).
  *
- * - `zoned.auto_transition`
- *   Indicates if zones in zone state implicitly opened can be automatically
- *   transitioned to zone state closed for resource management purposes.
- *   Defaults to 'on'.
- *
  * - `sriov_max_vfs`
  *   Indicates the maximum number of PCIe virtual functions supported
  *   by the controller. The default value is 0. Specifying a non-zero value
@@ -1867,8 +1861,8 @@ enum {
 NVME_ZRM_ZRWA = 1 << 1,
 };
 
-static uint16_t nvme_zrm_open_flags(NvmeCtrl *n, NvmeNamespace *ns,
-NvmeZone *zone, int flags)
+static uint16_t nvme_zrm_open_flags(NvmeNamespace *ns, NvmeZone *zone,
+int flags)
 {
 int act = 0;
 uint16_t status;
@@ -1880,9 +1874,7 @@ static uint16_t nvme_zrm_open_flags(NvmeCtrl *n, 
NvmeNamespace *ns,
 /* fallthrough */
 
 case NVME_ZONE_STATE_CLOSED:
-if (n->params.auto_transition_zones) {
-nvme_zrm_auto_transition_zone(ns);
-}
+nvme_zrm_auto_transition_zone(ns);
 status = nvme_zns_check_resources(ns, act, 1,
   (flags & NVME_ZRM_ZRWA) ? 1 : 0);
 if (status) {
@@ -1925,10 +1917,9 @@ static uint16_t nvme_zrm_open_flags(NvmeCtrl *n, 
NvmeNamespace *ns,
 }
 }
 
-static inline uint16_t nvme_zrm_auto(NvmeCtrl *n, NvmeNamespace *ns,
- NvmeZone *zone)
+static inline uint16_t nvme_zrm_auto(NvmeNamespace *ns, NvmeZone *zone)
 {
-return nvme_zrm_open_flags(n, ns, zone, NVME_ZRM_AUTO);
+return nvme_zrm_open_flags(ns, zone, NVME_ZRM_AUTO);
 }
 
 static void nvme_advance_zone_wp(NvmeNamespace *ns, NvmeZone *zone,
@@ -3065,7 +3056,7 @@ static uint16_t nvme_copy(NvmeCtrl *n, NvmeRequest *req)
 goto invalid;
 }
 
-status = nvme_zrm_auto(n, ns, iocb->zone);
+status = nvme_zrm_auto(ns, iocb->zone);
 if (status) {
 goto

Re: [PATCH] hw/nvme: force nvme-ns param 'shared' to false if no nvme-subsys node

2022-06-28 Thread Niklas Cassel
On Tue, Jun 28, 2022 at 02:22:09PM +0200, Niklas Cassel via wrote:

Hello Peter,

It seems that mailman configuration on qemu-devel is rewriting the
"From:" field to "From: Niklas Cassel via "

If found this old thread about the same issue:
https://qemu-devel.nongnu.narkive.com/6hm8Fbvz/mailing-list-vs-dmarc-and-microsoft-com-s-p-reject-policy

Which says that this can happen when using p=reject policy.

However, doing a bcc to another of my personal addresses,
it looks like the SKIM/SPF/DMARC is all passing,
and non of them seem to have p=reject:

ARC-Authentication-Results: i=1; mx.google.com;
   dkim=pass header.i=@wdc.com header.s=dkim.wdc.com header.b=TTwPBUcS;
   spf=pass (google.com: domain of prvs=171ad38db=niklas.cas...@wdc.com 
designates 216.71.153.141 as permitted sender) 
smtp.mailfrom="prvs=171ad38db=niklas.cas...@wdc.com";
   dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=wdc.com
Return-Path: 
Received: from esa3.hgst.iphmx.com (esa3.hgst.iphmx.com. [216.71.153.141])
by mx.google.com with ESMTPS id 
sd33-20020a1709076e2100b00711f20051f2si13306645ejc.697.2022.06.28.05.22.25
for 
(version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128);
Tue, 28 Jun 2022 05:22:26 -0700 (PDT)
Received-SPF: pass (google.com: domain of prvs=171ad38db=niklas.cas...@wdc.com 
designates 216.71.153.141 as permitted sender) client-ip=216.71.153.141;
Authentication-Results: mx.google.com;
   dkim=pass header.i=@wdc.com header.s=dkim.wdc.com header.b=TTwPBUcS;
   spf=pass (google.com: domain of prvs=171ad38db=niklas.cas...@wdc.com 
designates 216.71.153.141 as permitted sender) 
smtp.mailfrom="prvs=171ad38db=niklas.cas...@wdc.com";
   dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE)


Any idea why mailmain is rewriting the "From:" field for my messages?


Kind regards,
Niklas


[PATCH] hw/nvme: force nvme-ns param 'shared' to false if no nvme-subsys node

2022-06-28 Thread Niklas Cassel via
Since commit 916b0f0b5264 ("hw/nvme: change nvme-ns 'shared' default")
the default value of nvme-ns param 'shared' is set to true, regardless
if there is a nvme-subsys node or not.

On a system without a nvme-subsys node, a namespace will never be able
to be attached to more than one controller, so for this configuration,
it is counterintuitive for this parameter to be set by default.

Force the nvme-ns param 'shared' to false for configurations where
there is no nvme-subsys node, as the namespace will never be able to
attach to more than one controller anyway.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ns.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 870c3ca1a2..62a1f97be0 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -546,6 +546,8 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 int i;
 
 if (!n->subsys) {
+/* If no subsys, the ns cannot be attached to more than one ctrl. */
+ns->params.shared = false;
 if (ns->params.detached) {
 error_setg(errp, "detached requires that the nvme device is "
"linked to an nvme-subsys device");
-- 
2.36.1




[PATCH] hw/nvme: fix example serial in documentation

2022-06-27 Thread Niklas Cassel via
The serial prop on the controller is actually describing the nvme
subsystem serial, which has to be identical for all controllers within
the same nvme subsystem.

This is enforced since commit a859eb9f8f64 ("hw/nvme: enforce common
serial per subsystem").

Fix the documentation, so that people copying the qemu command line
example won't get an error on qemu start.

Signed-off-by: Niklas Cassel 
---
 docs/system/devices/nvme.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/system/devices/nvme.rst b/docs/system/devices/nvme.rst
index aba253304e..30f841ef62 100644
--- a/docs/system/devices/nvme.rst
+++ b/docs/system/devices/nvme.rst
@@ -104,8 +104,8 @@ multipath I/O.
 .. code-block:: console
 
-device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
-   -device nvme,serial=a,subsys=nvme-subsys-0
-   -device nvme,serial=b,subsys=nvme-subsys-0
+   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
+   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
 
 This will create an NVM subsystem with two controllers. Having controllers
 linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:
-- 
2.36.1




[PATCH v2 3/4] hw/nvme: add support for ratified TP4084

2022-06-27 Thread Niklas Cassel via
TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
as ready independently from the controller.

When CC.CRIME is 0 (default), things behave as before, all namespaces
are ready when CSTS.RDY gets set to 1.

When CC.CRIME is 1, the controller will become ready when CSTS.RDY gets
set to 1, but commands accessing a namespace are allowed to return
"Namespace Not Ready" or "Admin Command Media Not Ready".
After CRTO.CRWMT amount of time, if the namespace has not yet been
marked ready, the status codes also need to have the DNR bit set.

Add a new "ready_delay" namespace device parameter, in order to emulate
different ready latencies for namespaces.

Once a namespace is ready, it will set the NRDY bit in the I/O Command
Set Independent Identify Namespace Data Structure, and then send out a
Namespace Attribute Changed event.

This new "ready_delay" is supported on controllers not part of a NVMe
subsystem. The reasons are many. One problem is that multiple controllers
can have different CC.CRIME modes running. Another problem is the extra
locking needed. The third problem is when to actually clear NRDY. If we
assume that a namespace clears NRDY when it no longer has any controller
online for that namespace. The problem then is that Linux will reset the
controllers one by one during probe time. The reset goes so fast so that
there is no time when all controllers are in reset at the same time, so
NRDY will never get cleared. (The controllers are enabled by SeaBIOS by
default.) We could introduce a reset_time param, but this would only
increase the chances that all controllers are in reset at the same time.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c   | 123 +--
 hw/nvme/ns.c |  18 +++
 hw/nvme/nvme.h   |   6 +++
 hw/nvme/trace-events |   1 +
 include/block/nvme.h |  60 -
 5 files changed, 204 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 8ca824ea14..5404f67480 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -88,6 +88,12 @@
  *   completion when there are no outstanding AERs. When the maximum number of
  *   enqueued events are reached, subsequent events will be dropped.
  *
+ * - `crwmt`
+ *   This is the Controller Ready With Media Timeout (CRWMT) field that is
+ *   defined in the CRTO register. This specifies the worst-case time that host
+ *   software should wait for the controller and all attached namespaces to
+ *   become ready. The value is in units of 500 milliseconds.
+ *
  * - `mdts`
  *   Indicates the maximum data transfer size for a command that transfers data
  *   between host-accessible memory and the controller. The value is specified
@@ -157,6 +163,11 @@
  *   namespace will be available in the subsystem but not attached to any
  *   controllers.
  *
+ * - `ready_delay`
+ *   This parameter specifies the amount of time that the namespace should wait
+ *   before being marked ready. Only applicable if CC.CRIME is set by the user.
+ *   The value is in units of 500 milliseconds (to be consistent with `crwmt`).
+ *
  * Setting `zoned` to true selects Zoned Command Set at the namespace.
  * In this case, the following namespace properties are available to configure
  * zoned operation:
@@ -216,6 +227,8 @@
 #define NVME_VF_RES_GRANULARITY 1
 #define NVME_VF_OFFSET 0x1
 #define NVME_VF_STRIDE 1
+#define NVME_DEFAULT_CRIMT 0xa
+#define NVME_DEFAULT_CRWMT 0xf
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -4188,6 +4201,10 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 return NVME_INVALID_OPCODE | NVME_DNR;
 }
 
+if (!(ns->id_indep_ns.nstat & NVME_NSTAT_NRDY)) {
+return NVME_NS_NOT_READY;
+}
+
 if (ns->status) {
 return ns->status;
 }
@@ -4791,6 +4808,27 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
NvmeRequest *req, bool active)
 return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
+static uint16_t nvme_identify_cs_indep_ns(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeNamespace *ns;
+NvmeIdentify *c = (NvmeIdentify *)>cmd;
+uint32_t nsid = le32_to_cpu(c->nsid);
+
+trace_pci_nvme_identify_cs_indep_ns(nsid);
+
+if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+ns = nvme_ns(n, nsid);
+if (unlikely(!ns)) {
+return nvme_rpt_empty_id_struct(n, req);
+}
+
+return nvme_c2h(n, (uint8_t *)>id_indep_ns, sizeof(NvmeIdNsCsIndep),
+req);
+}
+
 static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req,
 bool attached)
 {
@@ -5081,6 +5119,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_identify_ns(n, req, true);
 case NVME_ID_CNS_NS_PRESENT:
 return nvme_identify_ns(n, req, false);
+case NVME_ID_CNS_CS_IN

[PATCH v2 2/4] hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace

2022-06-27 Thread Niklas Cassel via
Each NvmeNamespace can be used by serveral controllers,
but a NvmeNamespace can at most belong to a single NvmeSubsystem.
Store a pointer to the NvmeSubsystem, if the namespace was realized
with a NvmeSubsystem.

This will be used by a follow up patch.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ns.c   | 1 +
 hw/nvme/nvme.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 870c3ca1a2..8bee3e8b3b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -559,6 +559,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 if (!qdev_set_parent_bus(dev, >bus.parent_bus, errp)) {
 return;
 }
+ns->subsys = subsys;
 }
 
 if (nvme_ns_setup(ns, errp)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 0711b9748c..5487e2db40 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -167,6 +167,7 @@ typedef struct NvmeNamespace {
 int32_t nr_active_zones;
 
 NvmeNamespaceParams params;
+NvmeSubsystem *subsys;
 
 struct {
 uint32_t err_rec;
-- 
2.36.1




[PATCH v2 4/4] hw/nvme: add new never_ready parameter to test the DNR bit

2022-06-27 Thread Niklas Cassel via
Since we verify that "ready_delay" parameter has to be smaller than CRWMT,
we know that the namespace will always become ready.
Therefore the "Namespace Not Ready" status code will never have the DNR
bit set.

Add a new parameter "never_ready" that can be used to emulate a namespace
that never gets ready, such that the DNR bit gets set after CRWMT amount
of time.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 28 +++-
 hw/nvme/ns.c   |  1 +
 hw/nvme/nvme.h |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 5404f67480..5f98d4778d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -168,6 +168,12 @@
  *   before being marked ready. Only applicable if CC.CRIME is set by the user.
  *   The value is in units of 500 milliseconds (to be consistent with `crwmt`).
  *
+ * - `never_ready`
+ *   This parameter specifies that a namespace should never be marked as ready.
+ *   When `crwmt` amount of time has passed after enabling the controller,
+ *   status code "Namespace Not Ready" will have the DNR bit set. If specified
+ *   together with `ready_delay`, `never_ready` will take precedence.
+ *
  * Setting `zoned` to true selects Zoned Command Set at the namespace.
  * In this case, the following namespace properties are available to configure
  * zoned operation:
@@ -4156,6 +4162,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, 
NvmeRequest *req)
 return status;
 }
 
+static bool nvme_ready_has_passed_timeout(NvmeCtrl *n)
+{
+int64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+int64_t elapsed_time = current_time - n->cc_enable_timestamp;
+
+return elapsed_time > n->params.crwmt * 500;
+}
+
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeNamespace *ns;
@@ -4202,7 +4216,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 }
 
 if (!(ns->id_indep_ns.nstat & NVME_NSTAT_NRDY)) {
-return NVME_NS_NOT_READY;
+uint16_t ret = NVME_NS_NOT_READY;
+if (ns->params.never_ready && nvme_ready_has_passed_timeout(n)) {
+ret |= NVME_DNR;
+}
+return ret;
 }
 
 if (ns->status) {
@@ -5616,6 +5634,10 @@ static void nvme_set_ready_or_start_timer(NvmeCtrl *n, 
NvmeNamespace *ns)
 {
 int64_t expire_time;
 
+if (ns->params.never_ready) {
+return;
+}
+
 if (!NVME_CC_CRIME(ldl_le_p(>bar.cc)) || ns->params.ready_delay == 0) {
 ns->id_indep_ns.nstat |= NVME_NSTAT_NRDY;
 return;
@@ -6346,6 +6368,7 @@ static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType 
rst)
 }
 }
 
+n->cc_enable_timestamp = 0;
 n->aer_queued = 0;
 n->aer_mask = 0;
 n->outstanding_aers = 0;
@@ -6389,6 +6412,8 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n)
 
 nvme_ns_shutdown(ns);
 }
+
+n->cc_enable_timestamp = 0;
 }
 
 static void nvme_ctrl_per_ns_action_on_start(NvmeCtrl *n)
@@ -6506,6 +6531,7 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 NVME_CAP_SET_TO(cap, new_cap_timeout);
 stq_le_p(>bar.cap, cap);
 
+n->cc_enable_timestamp = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 n->page_bits = page_bits;
 n->page_size = page_size;
 n->max_prp_ents = n->page_size / sizeof(uint64_t);
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index c1d70183c4..fc12b4e0d3 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -661,6 +661,7 @@ static Property nvme_ns_props[] = {
 DEFINE_PROP_BOOL("eui64-default", NvmeNamespace, params.eui64_default,
  false),
 DEFINE_PROP_UINT16("ready_delay", NvmeNamespace, params.ready_delay, 0),
+DEFINE_PROP_BOOL("never_ready", NvmeNamespace, params.never_ready, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 9d2f13dfdb..292b1acf15 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -127,6 +127,7 @@ typedef struct NvmeNamespaceParams {
 uint64_t zrwafg;
 
 uint16_t ready_delay;
+bool never_ready;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -451,6 +452,7 @@ typedef struct NvmeCtrl {
 int cq_pending;
 uint64_thost_timestamp; /* Timestamp sent by the host 
*/
 uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
+uint64_tcc_enable_timestamp;/* QEMU clock time */
 uint64_tstarttime_ms;
 uint16_ttemperature;
 uint8_t smart_critical_warning;
-- 
2.36.1




[PATCH v2 0/4] hw/nvme: add support for TP4084

2022-06-27 Thread Niklas Cassel via
Hello there,

considering that Linux v5.19 will include support for NVMe TP4084:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/nvme/host/core.c?id=354201c53e61e493017b15327294b0c8ab522d69

I thought that it might be nice to have QEMU support for the same.

TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
as ready independently from the controller.

When CC.CRIME is 0 (default), things behave as before, all namespaces
are ready when CSTS.RDY gets set to 1.

Add a new "ready_delay" namespace device parameter, in order to emulate
different ready latencies for namespaces when CC.CRIME is 1.

The patch series also adds a "crwmt" controller parameter, in order to
be able to expose the worst case timeout that the host should wait for
all namespaces to become ready.


Example qemu cmd line for the new options:

# delay in s (20s)
NS1_DELAY_S=20
# convert to units of 500ms
NS1_DELAY=$((NS1_DELAY_S*2))

# delay in s (60s)
NS2_DELAY_S=60
# convert to units of 500ms
NS2_DELAY=$((NS2_DELAY_S*2))

# timeout in s (120s)
CRWMT_S=120
# convert to units of 500ms
CRWMT=$((CRWMT_S*2))

 -device nvme,serial=deadbeef,crwmt=$CRWMT \
 -drive file=$NS1_DATA,id=nvm-1,format=raw,if=none \
 -device nvme-ns,drive=nvm-1,ready_delay=$NS1_DELAY \
 -drive file=$NS2_DATA,id=nvm-2,format=raw,if=none \
 -device nvme-ns,drive=nvm-2,ready_delay=$NS2_DELAY \


Changes since v1:
-Rebased on nvme-next
-Set id_indep_ns->nmic if ns->params.shared in patch 3/4.


Niklas Cassel (4):
  hw/nvme: claim NVMe 2.0 compliance
  hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace
  hw/nvme: add support for ratified TP4084
  hw/nvme: add new never_ready parameter to test the DNR bit

 hw/nvme/ctrl.c   | 151 +--
 hw/nvme/ns.c |  20 ++
 hw/nvme/nvme.h   |   9 +++
 hw/nvme/trace-events |   1 +
 include/block/nvme.h |  60 -
 5 files changed, 236 insertions(+), 5 deletions(-)

-- 
2.36.1




[PATCH v2 1/4] hw/nvme: claim NVMe 2.0 compliance

2022-06-27 Thread Niklas Cassel via
CRMS.CRWMS bit shall be set to 1 on controllers compliant with versions
later than NVMe 1.4.

The first version later than NVMe 1.4 is NVMe 2.0

Let's claim compliance with NVMe 2.0 such that a follow up patch can
set the CRMS.CRWMS bit.

This is needed since CC.CRIME is only writable when both CRMS.CRIMS
and CRMS.CRWMS is set.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index f8ec4a7be3..8ca824ea14 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -204,7 +204,7 @@
 
 #define NVME_MAX_IOQPAIRS 0x
 #define NVME_DB_SIZE  4
-#define NVME_SPEC_VER 0x00010400
+#define NVME_SPEC_VER 0x0002
 #define NVME_CMB_BIR 2
 #define NVME_PMR_BIR 4
 #define NVME_TEMPERATURE 0x143
-- 
2.36.1




[PATCH 4/4] hw/nvme: add new never_ready parameter to test the DNR bit

2022-06-07 Thread Niklas Cassel via
Since we verify that "ready_delay" parameter has to be smaller than CRWMT,
we know that the namespace will always become ready.
Therefore the "Namespace Not Ready" status code will never have the DNR
bit set.

Add a new parameter "never_ready" that can be used to emulate a namespace
that never gets ready, such that the DNR bit gets set after CRWMT amount
of time.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 28 +++-
 hw/nvme/ns.c   |  1 +
 hw/nvme/nvme.h |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 66d96714c3..5ec22ff13d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -134,6 +134,12 @@
  *   before being marked ready. Only applicable if CC.CRIME is set by the user.
  *   The value is in units of 500 milliseconds (to be consistent with `crwmt`).
  *
+ * - `never_ready`
+ *   This parameter specifies that a namespace should never be marked as ready.
+ *   When `crwmt` amount of time has passed after enabling the controller,
+ *   status code "Namespace Not Ready" will have the DNR bit set. If specified
+ *   together with `ready_delay`, `never_ready` will take precedence.
+ *
  * Setting `zoned` to true selects Zoned Command Set at the namespace.
  * In this case, the following namespace properties are available to configure
  * zoned operation:
@@ -4118,6 +4124,14 @@ static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, 
NvmeRequest *req)
 return status;
 }
 
+static bool nvme_ready_has_passed_timeout(NvmeCtrl *n)
+{
+int64_t current_time = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+int64_t elapsed_time = current_time - n->cc_enable_timestamp;
+
+return elapsed_time > n->params.crwmt * 500;
+}
+
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
 NvmeNamespace *ns;
@@ -4164,7 +4178,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 }
 
 if (!(ns->id_indep_ns.nstat & NVME_NSTAT_NRDY)) {
-return NVME_NS_NOT_READY;
+uint16_t ret = NVME_NS_NOT_READY;
+if (ns->params.never_ready && nvme_ready_has_passed_timeout(n)) {
+ret |= NVME_DNR;
+}
+return ret;
 }
 
 if (ns->status) {
@@ -5537,6 +,10 @@ static void nvme_set_ready_or_start_timer(NvmeCtrl *n, 
NvmeNamespace *ns)
 {
 int64_t expire_time;
 
+if (ns->params.never_ready) {
+return;
+}
+
 if (!NVME_CC_CRIME(ldl_le_p(>bar.cc)) || ns->params.ready_delay == 0) {
 ns->id_indep_ns.nstat |= NVME_NSTAT_NRDY;
 return;
@@ -5979,6 +6001,7 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
 n->aer_queued = 0;
 n->outstanding_aers = 0;
 n->qs_created = false;
+n->cc_enable_timestamp = 0;
 }
 
 static void nvme_ctrl_shutdown(NvmeCtrl *n)
@@ -6000,6 +6023,8 @@ static void nvme_ctrl_shutdown(NvmeCtrl *n)
 
 nvme_ns_shutdown(ns);
 }
+
+n->cc_enable_timestamp = 0;
 }
 
 static void nvme_ctrl_per_ns_action_on_start(NvmeCtrl *n)
@@ -6109,6 +6134,7 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 NVME_CAP_SET_TO(cap, new_cap_timeout);
 stq_le_p(>bar.cap, cap);
 
+n->cc_enable_timestamp = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 n->page_bits = page_bits;
 n->page_size = page_size;
 n->max_prp_ents = n->page_size / sizeof(uint64_t);
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index c4e9f0e5c8..89c31658de 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -658,6 +658,7 @@ static Property nvme_ns_props[] = {
 DEFINE_PROP_BOOL("eui64-default", NvmeNamespace, params.eui64_default,
  false),
 DEFINE_PROP_UINT16("ready_delay", NvmeNamespace, params.ready_delay, 0),
+DEFINE_PROP_BOOL("never_ready", NvmeNamespace, params.never_ready, false),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index c9934d0097..6ff9725f21 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -122,6 +122,7 @@ typedef struct NvmeNamespaceParams {
 uint64_t zrwafg;
 
 uint16_t ready_delay;
+bool never_ready;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
@@ -436,6 +437,7 @@ typedef struct NvmeCtrl {
 int cq_pending;
 uint64_thost_timestamp; /* Timestamp sent by the host 
*/
 uint64_ttimestamp_set_qemu_clock_ms;/* QEMU clock time */
+uint64_tcc_enable_timestamp;/* QEMU clock time */
 uint64_tstarttime_ms;
 uint16_ttemperature;
 uint8_t smart_critical_warning;
-- 
2.36.1




[PATCH 3/4] hw/nvme: add support for ratified TP4084

2022-06-07 Thread Niklas Cassel via
TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
as ready independently from the controller.

When CC.CRIME is 0 (default), things behave as before, all namespaces
are ready when CSTS.RDY gets set to 1.

When CC.CRIME is 1, the controller will become ready when CSTS.RDY gets
set to 1, but commands accessing a namespace are allowed to return
"Namespace Not Ready" or "Admin Command Media Not Ready".
After CRTO.CRWMT amount of time, if the namespace has not yet been
marked ready, the status codes also need to have the DNR bit set.

Add a new "ready_delay" namespace device parameter, in order to emulate
different ready latencies for namespaces.

Once a namespace is ready, it will set the NRDY bit in the I/O Command
Set Independent Identify Namespace Data Structure, and then send out a
Namespace Attribute Changed event.

This new "ready_delay" is supported on controllers not part of a NVMe
subsystem. The reasons are many. One problem is that multiple controllers
can have different CC.CRIME modes running. Another problem is the extra
locking needed. The third problem is when to actually clear NRDY. If we
assume that a namespace clears NRDY when it no longer has any controller
online for that namespace. The problem then is that Linux will reset the
controllers one by one during probe time. The reset goes so fast so that
there is no time when all controllers are in reset at the same time, so
NRDY will never get cleared. (The controllers are enabled by SeaBIOS by
default.) We could introduce a reset_time param, but this would only
increase the chances that all controllers are in reset at the same time.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c   | 123 +--
 hw/nvme/ns.c |  15 ++
 hw/nvme/nvme.h   |   6 +++
 hw/nvme/trace-events |   1 +
 include/block/nvme.h |  60 -
 5 files changed, 201 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 91469834b0..66d96714c3 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -83,6 +83,12 @@
  *   completion when there are no outstanding AERs. When the maximum number of
  *   enqueued events are reached, subsequent events will be dropped.
  *
+ * - `crwmt`
+ *   This is the Controller Ready With Media Timeout (CRWMT) field that is
+ *   defined in the CRTO register. This specifies the worst-case time that host
+ *   software should wait for the controller and all attached namespaces to
+ *   become ready. The value is in units of 500 milliseconds.
+ *
  * - `mdts`
  *   Indicates the maximum data transfer size for a command that transfers data
  *   between host-accessible memory and the controller. The value is specified
@@ -123,6 +129,11 @@
  *   namespace will be available in the subsystem but not attached to any
  *   controllers.
  *
+ * - `ready_delay`
+ *   This parameter specifies the amount of time that the namespace should wait
+ *   before being marked ready. Only applicable if CC.CRIME is set by the user.
+ *   The value is in units of 500 milliseconds (to be consistent with `crwmt`).
+ *
  * Setting `zoned` to true selects Zoned Command Set at the namespace.
  * In this case, the following namespace properties are available to configure
  * zoned operation:
@@ -176,6 +187,8 @@
 #define NVME_TEMPERATURE_CRITICAL 0x175
 #define NVME_NUM_FW_SLOTS 1
 #define NVME_DEFAULT_MAX_ZA_SIZE (128 * KiB)
+#define NVME_DEFAULT_CRIMT 0xa
+#define NVME_DEFAULT_CRWMT 0xf
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
 do { \
@@ -4150,6 +4163,10 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 return NVME_INVALID_OPCODE | NVME_DNR;
 }
 
+if (!(ns->id_indep_ns.nstat & NVME_NSTAT_NRDY)) {
+return NVME_NS_NOT_READY;
+}
+
 if (ns->status) {
 return ns->status;
 }
@@ -4746,6 +4763,27 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
NvmeRequest *req, bool active)
 return NVME_INVALID_CMD_SET | NVME_DNR;
 }
 
+static uint16_t nvme_identify_cs_indep_ns(NvmeCtrl *n, NvmeRequest *req)
+{
+NvmeNamespace *ns;
+NvmeIdentify *c = (NvmeIdentify *)>cmd;
+uint32_t nsid = le32_to_cpu(c->nsid);
+
+trace_pci_nvme_identify_cs_indep_ns(nsid);
+
+if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
+return NVME_INVALID_NSID | NVME_DNR;
+}
+
+ns = nvme_ns(n, nsid);
+if (unlikely(!ns)) {
+return nvme_rpt_empty_id_struct(n, req);
+}
+
+return nvme_c2h(n, (uint8_t *)>id_indep_ns, sizeof(NvmeIdNsCsIndep),
+req);
+}
+
 static uint16_t nvme_identify_ctrl_list(NvmeCtrl *n, NvmeRequest *req,
 bool attached)
 {
@@ -5005,6 +5043,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest 
*req)
 return nvme_identify_ns(n, req, true);
 case NVME_ID_CNS_NS_PRESENT:
 return nvme_identify_ns(n, req, fa

[PATCH 1/4] hw/nvme: claim NVMe 2.0 compliance

2022-06-07 Thread Niklas Cassel via
CRMS.CRWMS bit shall be set to 1 on controllers compliant with versions
later than NVMe 1.4.

The first version later than NVMe 1.4 is NVMe 2.0

Let's claim compliance with NVMe 2.0 such that a follow up patch can
set the CRMS.CRWMS bit.

This is needed since CC.CRIME is only writable when both CRMS.CRIMS
and CRMS.CRWMS is set.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 1e6e0fcad9..91469834b0 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -168,7 +168,7 @@
 
 #define NVME_MAX_IOQPAIRS 0x
 #define NVME_DB_SIZE  4
-#define NVME_SPEC_VER 0x00010400
+#define NVME_SPEC_VER 0x0002
 #define NVME_CMB_BIR 2
 #define NVME_PMR_BIR 4
 #define NVME_TEMPERATURE 0x143
-- 
2.36.1




[PATCH 2/4] hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace

2022-06-07 Thread Niklas Cassel via
Each NvmeNamespace can be used by serveral controllers,
but a NvmeNamespace can at most belong to a single NvmeSubsystem.
Store a pointer to the NvmeSubsystem, if the namespace was realized
with a NvmeSubsystem.

This will be used by a follow up patch.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ns.c   | 1 +
 hw/nvme/nvme.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 1b9c9d1156..07759a973b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -559,6 +559,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
 if (!qdev_set_parent_bus(dev, >bus.parent_bus, errp)) {
 return;
 }
+ns->subsys = subsys;
 }
 
 if (nvme_ns_setup(ns, errp)) {
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index e41771604f..32333d0c89 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -162,6 +162,7 @@ typedef struct NvmeNamespace {
 int32_t nr_active_zones;
 
 NvmeNamespaceParams params;
+NvmeSubsystem *subsys;
 
 struct {
 uint32_t err_rec;
-- 
2.36.1




[PATCH 0/4] hw/nvme: add support for TP4084

2022-06-07 Thread Niklas Cassel via
Hello there,

considering that Linux v5.19-rc1 is out which includes support for
NVMe TP4084:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/nvme/host/core.c?id=354201c53e61e493017b15327294b0c8ab522d69

I thought that it might be nice to have QEMU support for the same.

TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
as ready independently from the controller.

When CC.CRIME is 0 (default), things behave as before, all namespaces
are ready when CSTS.RDY gets set to 1.

Add a new "ready_delay" namespace device parameter, in order to emulate
different ready latencies for namespaces when CC.CRIME is 1.

The patch series also adds a "crwmt" controller parameter, in order to
be able to expose the worst case timeout that the host should wait for
all namespaces to become ready.


Example qemu cmd line for the new options:

# delay in s (20s)
NS1_DELAY_S=20
# convert to units of 500ms
NS1_DELAY=$((NS1_DELAY_S*2))

# delay in s (60s)
NS2_DELAY_S=60
# convert to units of 500ms
NS2_DELAY=$((NS2_DELAY_S*2))

# timeout in s (120s)
CRWMT_S=120
# convert to units of 500ms
CRWMT=$((CRWMT_S*2))

 -device nvme,serial=deadbeef,crwmt=$CRWMT \
 -drive file=$NS1_DATA,id=nvm-1,format=raw,if=none \
 -device nvme-ns,drive=nvm-1,ready_delay=$NS1_DELAY \
 -drive file=$NS2_DATA,id=nvm-2,format=raw,if=none \
 -device nvme-ns,drive=nvm-2,ready_delay=$NS2_DELAY \


Niklas Cassel (4):
  hw/nvme: claim NVMe 2.0 compliance
  hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace
  hw/nvme: add support for ratified TP4084
  hw/nvme: add new never_ready parameter to test the DNR bit

 hw/nvme/ctrl.c   | 151 +--
 hw/nvme/ns.c |  17 +
 hw/nvme/nvme.h   |   9 +++
 hw/nvme/trace-events |   1 +
 include/block/nvme.h |  60 -
 5 files changed, 233 insertions(+), 5 deletions(-)

-- 
2.36.1




Re: [PATCH] hw/nvme: add param to control auto zone transitioning to zone state closed

2021-06-07 Thread Niklas Cassel
On Mon, Jun 07, 2021 at 11:54:02AM +0200, Klaus Jensen wrote:
> On Jun  1 07:30, Niklas Cassel wrote:
> > On Mon, May 31, 2021 at 09:39:20PM +0200, Klaus Jensen wrote:
> > > On May 31 15:42, Niklas Cassel wrote:
> > > > On Fri, May 28, 2021 at 01:22:38PM +0200, Klaus Jensen wrote:
> > > > > On May 28 11:05, Niklas Cassel wrote:
> > > > > > From: Niklas Cassel 
> > > > > >
> > > > > > In the Zoned Namespace Command Set Specification, chapter
> > > > > > 2.5.1 Managing resources
> > > > > >
> > > > > > "The controller may transition zones in the ZSIO:Implicitly Opened 
> > > > > > state
> > > > > > to the ZSC:Closed state for resource management purposes."
> > > > > >
> > > > > > The word may in this sentence means that automatically transitioning
> > > > > > an implicitly opened zone to closed is completely optional.
> > > > > >
> > > > > > Add a new parameter so that the user can control if this automatic
> > > > > > transitioning should be performed or not.
> > > > > >
> > > > > > Being able to control this can help with verifying that e.g. a 
> > > > > > user-space
> > > > > > program behaves properly even without this optional ZNS feature.
> > > > > >
> > > > > > The default value is set to true, in order to not change the 
> > > > > > existing
> > > > > > behavior.
> > > > > >
> > > > > > Signed-off-by: Niklas Cassel 
> > > > > > ---
> > > > > > hw/nvme/ctrl.c | 9 -
> > > > > > hw/nvme/ns.c   | 2 ++
> > > > > > hw/nvme/nvme.h | 1 +
> > > > > > 3 files changed, 11 insertions(+), 1 deletion(-)
> > > > > >
> > > > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > > > index 40a7efcea9..d00f0297a5 100644
> > > > > > --- a/hw/nvme/ctrl.c
> > > > > > +++ b/hw/nvme/ctrl.c
> > > > > > @@ -141,6 +141,11 @@
> > > > > >  *
> > > > > >  * zoned.cross_read=
> > > > > >  * Setting this property to true enables Read Across Zone 
> > > > > > Boundaries.
> > > > > > + *
> > > > > > + * zoned.auto_transition= > > > > > default: true>
> > > > > > + * Indicates if zones in zone state implicitly opened can 
> > > > > > be
> > > > > > + * automatically transitioned to zone state closed for 
> > > > > > resource
> > > > > > + * management purposes.
> > > > > >  */
> > > > > >
> > > > > > #include "qemu/osdep.h"
> > > > > > @@ -1699,7 +1704,9 @@ static uint16_t 
> > > > > > nvme_zrm_open_flags(NvmeNamespace *ns, NvmeZone *zone,
> > > > > > /* fallthrough */
> > > > > >
> > > > > > case NVME_ZONE_STATE_CLOSED:
> > > > > > -nvme_zrm_auto_transition_zone(ns);
> > > > > > +if (ns->params.auto_transition_zones) {
> > > > > > +nvme_zrm_auto_transition_zone(ns);
> > > > > > +}
> > > > > > status = nvme_aor_check(ns, act, 1);
> > > > > > if (status) {
> > > > > > return status;
> > > > > > diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> > > > > > index 3fec9c6273..31dee43d30 100644
> > > > > > --- a/hw/nvme/ns.c
> > > > > > +++ b/hw/nvme/ns.c
> > > > > > @@ -531,6 +531,8 @@ static Property nvme_ns_props[] = {
> > > > > >params.max_open_zones, 0),
> > > > > > DEFINE_PROP_UINT32("zoned.descr_ext_size", NvmeNamespace,
> > > > > >params.zd_extension_size, 0),
> > > > > > +DEFINE_PROP_BOOL("zoned.auto_transition", NvmeNamespace,
> > > > > > + params.auto_transition_zones, true),
> > > > > > DEFINE_PROP_END_OF_LIST(),
> > > > > > };
> > > > > >
> > > > > > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme

Re: [PATCH] hw/nvme: add param to control auto zone transitioning to zone state closed

2021-06-01 Thread Niklas Cassel
On Mon, May 31, 2021 at 09:39:20PM +0200, Klaus Jensen wrote:
> On May 31 15:42, Niklas Cassel wrote:
> > On Fri, May 28, 2021 at 01:22:38PM +0200, Klaus Jensen wrote:
> > > On May 28 11:05, Niklas Cassel wrote:
> > > > From: Niklas Cassel 
> > > >
> > > > In the Zoned Namespace Command Set Specification, chapter
> > > > 2.5.1 Managing resources
> > > >
> > > > "The controller may transition zones in the ZSIO:Implicitly Opened state
> > > > to the ZSC:Closed state for resource management purposes."
> > > >
> > > > The word may in this sentence means that automatically transitioning
> > > > an implicitly opened zone to closed is completely optional.
> > > >
> > > > Add a new parameter so that the user can control if this automatic
> > > > transitioning should be performed or not.
> > > >
> > > > Being able to control this can help with verifying that e.g. a 
> > > > user-space
> > > > program behaves properly even without this optional ZNS feature.
> > > >
> > > > The default value is set to true, in order to not change the existing
> > > > behavior.
> > > >
> > > > Signed-off-by: Niklas Cassel 
> > > > ---
> > > > hw/nvme/ctrl.c | 9 -
> > > > hw/nvme/ns.c   | 2 ++
> > > > hw/nvme/nvme.h | 1 +
> > > > 3 files changed, 11 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > > > index 40a7efcea9..d00f0297a5 100644
> > > > --- a/hw/nvme/ctrl.c
> > > > +++ b/hw/nvme/ctrl.c
> > > > @@ -141,6 +141,11 @@
> > > >  *
> > > >  * zoned.cross_read=
> > > >  * Setting this property to true enables Read Across Zone 
> > > > Boundaries.
> > > > + *
> > > > + * zoned.auto_transition= > > > default: true>
> > > > + * Indicates if zones in zone state implicitly opened can be
> > > > + * automatically transitioned to zone state closed for resource
> > > > + * management purposes.
> > > >  */
> > > >
> > > > #include "qemu/osdep.h"
> > > > @@ -1699,7 +1704,9 @@ static uint16_t nvme_zrm_open_flags(NvmeNamespace 
> > > > *ns, NvmeZone *zone,
> > > > /* fallthrough */
> > > >
> > > > case NVME_ZONE_STATE_CLOSED:
> > > > -nvme_zrm_auto_transition_zone(ns);
> > > > +if (ns->params.auto_transition_zones) {
> > > > +nvme_zrm_auto_transition_zone(ns);
> > > > +}
> > > > status = nvme_aor_check(ns, act, 1);
> > > > if (status) {
> > > > return status;
> > > > diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> > > > index 3fec9c6273..31dee43d30 100644
> > > > --- a/hw/nvme/ns.c
> > > > +++ b/hw/nvme/ns.c
> > > > @@ -531,6 +531,8 @@ static Property nvme_ns_props[] = {
> > > >params.max_open_zones, 0),
> > > > DEFINE_PROP_UINT32("zoned.descr_ext_size", NvmeNamespace,
> > > >params.zd_extension_size, 0),
> > > > +DEFINE_PROP_BOOL("zoned.auto_transition", NvmeNamespace,
> > > > + params.auto_transition_zones, true),
> > > > DEFINE_PROP_END_OF_LIST(),
> > > > };
> > > >
> > > > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> > > > index 81a35cda14..bd86054db2 100644
> > > > --- a/hw/nvme/nvme.h
> > > > +++ b/hw/nvme/nvme.h
> > > > @@ -100,6 +100,7 @@ typedef struct NvmeNamespaceParams {
> > > > uint32_t max_active_zones;
> > > > uint32_t max_open_zones;
> > > > uint32_t zd_extension_size;
> > > > +bool     auto_transition_zones;
> > > > } NvmeNamespaceParams;
> > > >
> > > > typedef struct NvmeNamespace {
> > > > --
> > > > 2.31.1
> > > >
> > > 
> > > Looks good Niklas!
> > > 
> > > Reviewed-by: Klaus Jensen 
> > 
> > In reality, it is the controller that does the auto transitioning.
> > 
> > In theory, one namespace could be attached to two different controllers,
> > and I guess, in that case, it depends on if the controller that we used
> > 

Re: [PATCH] hw/nvme: add param to control auto zone transitioning to zone state closed

2021-05-31 Thread Niklas Cassel
On Fri, May 28, 2021 at 01:22:38PM +0200, Klaus Jensen wrote:
> On May 28 11:05, Niklas Cassel wrote:
> > From: Niklas Cassel 
> > 
> > In the Zoned Namespace Command Set Specification, chapter
> > 2.5.1 Managing resources
> > 
> > "The controller may transition zones in the ZSIO:Implicitly Opened state
> > to the ZSC:Closed state for resource management purposes."
> > 
> > The word may in this sentence means that automatically transitioning
> > an implicitly opened zone to closed is completely optional.
> > 
> > Add a new parameter so that the user can control if this automatic
> > transitioning should be performed or not.
> > 
> > Being able to control this can help with verifying that e.g. a user-space
> > program behaves properly even without this optional ZNS feature.
> > 
> > The default value is set to true, in order to not change the existing
> > behavior.
> > 
> > Signed-off-by: Niklas Cassel 
> > ---
> > hw/nvme/ctrl.c | 9 -
> > hw/nvme/ns.c   | 2 ++
> > hw/nvme/nvme.h | 1 +
> > 3 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> > index 40a7efcea9..d00f0297a5 100644
> > --- a/hw/nvme/ctrl.c
> > +++ b/hw/nvme/ctrl.c
> > @@ -141,6 +141,11 @@
> >  *
> >  * zoned.cross_read=
> >  * Setting this property to true enables Read Across Zone 
> > Boundaries.
> > + *
> > + * zoned.auto_transition= > true>
> > + * Indicates if zones in zone state implicitly opened can be
> > + * automatically transitioned to zone state closed for resource
> > + * management purposes.
> >  */
> > 
> > #include "qemu/osdep.h"
> > @@ -1699,7 +1704,9 @@ static uint16_t nvme_zrm_open_flags(NvmeNamespace 
> > *ns, NvmeZone *zone,
> > /* fallthrough */
> > 
> > case NVME_ZONE_STATE_CLOSED:
> > -nvme_zrm_auto_transition_zone(ns);
> > +if (ns->params.auto_transition_zones) {
> > +nvme_zrm_auto_transition_zone(ns);
> > +}
> > status = nvme_aor_check(ns, act, 1);
> > if (status) {
> > return status;
> > diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
> > index 3fec9c6273..31dee43d30 100644
> > --- a/hw/nvme/ns.c
> > +++ b/hw/nvme/ns.c
> > @@ -531,6 +531,8 @@ static Property nvme_ns_props[] = {
> >params.max_open_zones, 0),
> > DEFINE_PROP_UINT32("zoned.descr_ext_size", NvmeNamespace,
> >params.zd_extension_size, 0),
> > +DEFINE_PROP_BOOL("zoned.auto_transition", NvmeNamespace,
> > + params.auto_transition_zones, true),
> > DEFINE_PROP_END_OF_LIST(),
> > };
> > 
> > diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> > index 81a35cda14..bd86054db2 100644
> > --- a/hw/nvme/nvme.h
> > +++ b/hw/nvme/nvme.h
> > @@ -100,6 +100,7 @@ typedef struct NvmeNamespaceParams {
> > uint32_t max_active_zones;
> > uint32_t max_open_zones;
> > uint32_t zd_extension_size;
> > +bool auto_transition_zones;
> > } NvmeNamespaceParams;
> > 
> > typedef struct NvmeNamespace {
> > -- 
> > 2.31.1
> > 
> 
> Looks good Niklas!
> 
> Reviewed-by: Klaus Jensen 

In reality, it is the controller that does the auto transitioning.

In theory, one namespace could be attached to two different controllers,
and I guess, in that case, it depends on if the controller that we used
when doing the write supports auto transitioning or not, that determines
if a zone will be auto transitioned or not.

If we were to change this to be a parameter of the controller instead
of a parameter of the namespace, we would require to refactor a lot of
code in the regular write path. As we currently don't have any NvmeRequest
object in nvme_zrm_open_flags().

Thoughts?


Kind regards,
Niklas


[PATCH] hw/nvme: add param to control auto zone transitioning to zone state closed

2021-05-28 Thread Niklas Cassel
From: Niklas Cassel 

In the Zoned Namespace Command Set Specification, chapter
2.5.1 Managing resources

"The controller may transition zones in the ZSIO:Implicitly Opened state
to the ZSC:Closed state for resource management purposes."

The word may in this sentence means that automatically transitioning
an implicitly opened zone to closed is completely optional.

Add a new parameter so that the user can control if this automatic
transitioning should be performed or not.

Being able to control this can help with verifying that e.g. a user-space
program behaves properly even without this optional ZNS feature.

The default value is set to true, in order to not change the existing
behavior.

Signed-off-by: Niklas Cassel 
---
 hw/nvme/ctrl.c | 9 -
 hw/nvme/ns.c   | 2 ++
 hw/nvme/nvme.h | 1 +
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 40a7efcea9..d00f0297a5 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -141,6 +141,11 @@
  *
  * zoned.cross_read=
  * Setting this property to true enables Read Across Zone Boundaries.
+ *
+ * zoned.auto_transition=
+ * Indicates if zones in zone state implicitly opened can be
+ * automatically transitioned to zone state closed for resource
+ * management purposes.
  */
 
 #include "qemu/osdep.h"
@@ -1699,7 +1704,9 @@ static uint16_t nvme_zrm_open_flags(NvmeNamespace *ns, 
NvmeZone *zone,
 /* fallthrough */
 
 case NVME_ZONE_STATE_CLOSED:
-nvme_zrm_auto_transition_zone(ns);
+if (ns->params.auto_transition_zones) {
+nvme_zrm_auto_transition_zone(ns);
+}
 status = nvme_aor_check(ns, act, 1);
 if (status) {
 return status;
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 3fec9c6273..31dee43d30 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -531,6 +531,8 @@ static Property nvme_ns_props[] = {
params.max_open_zones, 0),
 DEFINE_PROP_UINT32("zoned.descr_ext_size", NvmeNamespace,
params.zd_extension_size, 0),
+DEFINE_PROP_BOOL("zoned.auto_transition", NvmeNamespace,
+ params.auto_transition_zones, true),
 DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 81a35cda14..bd86054db2 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -100,6 +100,7 @@ typedef struct NvmeNamespaceParams {
 uint32_t max_active_zones;
 uint32_t max_open_zones;
 uint32_t zd_extension_size;
+bool auto_transition_zones;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
-- 
2.31.1



Re: [PATCH] hw/block/nvme: remove description for zoned.append_size_limit

2021-03-30 Thread Niklas Cassel
On Tue, Mar 23, 2021 at 12:20:32PM +0100, Klaus Jensen wrote:
> On Mar 23 11:18, Niklas Cassel wrote:
> > From: Niklas Cassel 
> > 
> > The description was originally removed in commit 578d914b263c
> > ("hw/block/nvme: align zoned.zasl with mdts") together with the removal
> > of the zoned.append_size_limit parameter itself.
> > 
> > However, it was (most likely accidentally), re-added in commit
> > f7dcd31885cb ("hw/block/nvme: add non-mdts command size limit for verify").
> > 
> > Remove the description again, since the parameter it describes,
> > zoned.append_size_limit, no longer exists.
> > 
> > Signed-off-by: Niklas Cassel 
> > ---
> >  hw/block/nvme.c | 8 
> >  1 file changed, 8 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 6842b01ab5..205d3ec944 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -91,14 +91,6 @@
> >   *   the minimum memory page size (CAP.MPSMIN). The default value is 0 
> > (i.e.
> >   *   defaulting to the value of `mdts`).
> >   *
> > - * - `zoned.append_size_limit`
> > - *   The maximum I/O size in bytes that is allowed in Zone Append command.
> > - *   The default is 128KiB. Since internally this this value is maintained 
> > as
> > - *   ZASL = log2( / ), some values assigned
> > - *   to this property may be rounded down and result in a lower maximum ZA
> > - *   data size being in effect. By setting this property to 0, users can 
> > make
> > - *   ZASL to be equal to MDTS. This property only affects zoned namespaces.
> > - *
> >   * nvme namespace device parameters
> >   * 
> >   * - `subsys`
> > -- 
> > 2.30.2
> 
> Argh. Thanks Niklas, queing it up for fixes.
> 
> Reviewed-by: Klaus Jensen 

I don't see it in nvme-fixes yet.

Did it get stuck in purgatory? ;)


Kind regards,
Niklas


[PATCH] hw/block/nvme: remove description for zoned.append_size_limit

2021-03-23 Thread Niklas Cassel
From: Niklas Cassel 

The description was originally removed in commit 578d914b263c
("hw/block/nvme: align zoned.zasl with mdts") together with the removal
of the zoned.append_size_limit parameter itself.

However, it was (most likely accidentally), re-added in commit
f7dcd31885cb ("hw/block/nvme: add non-mdts command size limit for verify").

Remove the description again, since the parameter it describes,
zoned.append_size_limit, no longer exists.

Signed-off-by: Niklas Cassel 
---
 hw/block/nvme.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 6842b01ab5..205d3ec944 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -91,14 +91,6 @@
  *   the minimum memory page size (CAP.MPSMIN). The default value is 0 (i.e.
  *   defaulting to the value of `mdts`).
  *
- * - `zoned.append_size_limit`
- *   The maximum I/O size in bytes that is allowed in Zone Append command.
- *   The default is 128KiB. Since internally this this value is maintained as
- *   ZASL = log2( / ), some values assigned
- *   to this property may be rounded down and result in a lower maximum ZA
- *   data size being in effect. By setting this property to 0, users can make
- *   ZASL to be equal to MDTS. This property only affects zoned namespaces.
- *
  * nvme namespace device parameters
  * 
  * - `subsys`
-- 
2.30.2



Re: [RFC PATCH V3 8/8] hw/block/nvme: Add Identify Active Namespace ID List

2021-01-21 Thread Niklas Cassel
On Thu, Jan 21, 2021 at 06:58:19AM +0900, Minwoo Im wrote:
> > Hello Minwoo,
> > 
> > By introducing a detached parameter,
> > you are also implicitly making the following
> > NVMe commands no longer be spec compliant:
> > 
> > NVME_ID_CNS_NS, NVME_ID_CNS_CS_NS,
> > NVME_ID_CNS_NS_ACTIVE_LIST, NVME_ID_CNS_CS_NS_ACTIVE_LIST
> > 
> > When these commands are called on a detached namespace,
> > they should usually return a zero filled data struct.
> 
> Agreed.
> 
> > Dmitry and I had a patch on V8 on the ZNS series
> > that tried to fix some the existing NVMe commands
> > to be spec compliant, by handling detached namespaces
> > properly. In the end, in order to make it easier to
> > get the ZNS series accepted, we decided to drop the
> > detached related stuff from the series.
> > 
> > Feel free to look at that patch for inspiration:
> > https://github.com/dmitry-fomichev/qemu/commit/251c0ffee5149c739b1347811fa7e32a1c36bf7c
> 
> I've seen this patch and as Klaus said, only thing patches need go with
> is to put some of nvme_ns_is_attached() helper among the Identify
> handlers.
> 
> > I'm not sure if you want to modify all the functions that
> > our patch modifies, but I think that you should at least
> > modify the following nvme functions:
> > 
> > nvme_identify_ns()
> > nvme_identify_ns_csi()
> > nvme_identify_nslist()
> > nvme_identify_nslist_csi()
> 
> Yes, pretty makes sense to update them.  But now it seems like
> 'attach/detach' scheme should have been separated out of this series
> which just introduced the multi-path for controllers and namespace
> sharing.  I will drop this 'detach' scheme out of this series and make
> another series to support all of the Identify you mentioned above
> cleanly.

Hello Minwoo,

thank you for putting in work on this!

Kind regards,
Niklas


Re: [RFC PATCH V3 8/8] hw/block/nvme: Add Identify Active Namespace ID List

2021-01-20 Thread Niklas Cassel
On Wed, Jan 20, 2021 at 02:01:47AM +0900, Minwoo Im wrote:
> Spec v1.4b 6.1.4 "Active and Inactive NSID Types" says:
> 
> "Active NSIDs for a controller refer to namespaces that are attached to
> that controller. Allocated NSIDs that are inactive for a controller refer
> to namespaces that are not attached to that controller."
> 
> This patch introduced for Identify Active Namespace ID List (CNS 02h).
> 
> Signed-off-by: Minwoo Im 
> ---
>  hw/block/nvme.c | 39 ++-
>  1 file changed, 34 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 2b2c07b36c2b..7247167b0ee6 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -2883,6 +2883,39 @@ static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, 
> NvmeRequest *req)
>  return NVME_INVALID_FIELD | NVME_DNR;
>  }
>  
> +static uint16_t nvme_identify_nslist_active(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeNamespace *ns;
> +NvmeIdentify *c = (NvmeIdentify *)>cmd;
> +uint32_t min_nsid = le32_to_cpu(c->nsid);
> +uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +static const int data_len = sizeof(list);
> +uint32_t *list_ptr = (uint32_t *)list;
> +int i, j = 0;
> +
> +if (min_nsid >= NVME_NSID_BROADCAST - 1) {
> +return NVME_INVALID_NSID | NVME_DNR;
> +}
> +
> +for (i = 1; i <= n->num_namespaces; i++) {
> +ns = nvme_ns(n, i);
> +if (!ns || ns->params.nsid <= min_nsid) {
> +continue;
> +}
> +
> +if (!nvme_ns_is_attached(n, ns)) {
> +continue;
> +}
> +
> +list_ptr[j++] = cpu_to_le32(ns->params.nsid);
> +if (j == data_len / sizeof(uint32_t)) {
> +break;
> +}
> +}
> +
> +return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
>  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>  {
>  NvmeNamespace *ns;
> @@ -2914,10 +2947,6 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, 
> NvmeRequest *req)
>  continue;
>  }
>  
> -if (!nvme_ns_is_attached(n, ns)) {
> -continue;
> -}
> -
>  list_ptr[j++] = cpu_to_le32(ns->params.nsid);
>  if (j == data_len / sizeof(uint32_t)) {
>  break;
> @@ -3045,7 +3074,7 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest 
> *req)
>  case NVME_ID_CNS_CS_CTRL:
>  return nvme_identify_ctrl_csi(n, req);
>  case NVME_ID_CNS_NS_ACTIVE_LIST:
> - /* fall through */
> + return nvme_identify_nslist_active(n, req);
>  case NVME_ID_CNS_NS_PRESENT_LIST:
>  return nvme_identify_nslist(n, req);
>  case NVME_ID_CNS_CS_NS_ACTIVE_LIST:
> -- 
> 2.17.1
> 
> 

Hello Minwoo,

By introducing a detached parameter,
you are also implicitly making the following
NVMe commands no longer be spec compliant:

NVME_ID_CNS_NS, NVME_ID_CNS_CS_NS,
NVME_ID_CNS_NS_ACTIVE_LIST, NVME_ID_CNS_CS_NS_ACTIVE_LIST

When these commands are called on a detached namespace,
they should usually return a zero filled data struct.

Dmitry and I had a patch on V8 on the ZNS series
that tried to fix some the existing NVMe commands
to be spec compliant, by handling detached namespaces
properly. In the end, in order to make it easier to
get the ZNS series accepted, we decided to drop the
detached related stuff from the series.

Feel free to look at that patch for inspiration:
https://github.com/dmitry-fomichev/qemu/commit/251c0ffee5149c739b1347811fa7e32a1c36bf7c

I'm not sure if you want to modify all the functions that
our patch modifies, but I think that you should at least
modify the following nvme functions:

nvme_identify_ns()
nvme_identify_ns_csi()
nvme_identify_nslist()
nvme_identify_nslist_csi()

So they handle detached namespaces correctly for both:
NVME_ID_CNS_NS, NVME_ID_CNS_CS_NS,
NVME_ID_CNS_NS_ACTIVE_LIST, NVME_ID_CNS_CS_NS_ACTIVE_LIST
as well as for:
NVME_ID_CNS_NS_PRESENT, NVME_ID_CNS_CS_NS_PRESENT,
NVME_ID_CNS_NS_PRESENT_LIST, NVME_ID_CNS_CS_NS_PRESENT_LIST


Kind regards,
Niklas


Re: [PATCH 0/2] hw/block/nvme: zoned fixes

2021-01-20 Thread Niklas Cassel
On Tue, Jan 19, 2021 at 02:54:58PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Patch [1/2] fixes the zone append bug reported by Niklas. [2/2]
> refactors the zone write check function to return status codes in a
> different order if there are multiple zone write violations that apply.
> 
> Klaus Jensen (2):
>   hw/block/nvme: fix zone boundary check for append
>   hw/block/nvme: refactor the logic for zone write checks
> 
>  hw/block/nvme.c   | 89 +--
>  hw/block/trace-events |  5 +++
>  2 files changed, 48 insertions(+), 46 deletions(-)
> 
> -- 
> 2.30.0
> 
> 

For the series:
Tested-by: Niklas Cassel 


NVMe ZNS zone append past zone size?

2021-01-19 Thread Niklas Cassel
Hello all,


When testing with the ZNS code that is in nvme-next,
I can zone append, targeting the first zone by specifying zslba 0,
and then just put that call it in a while loop, it will
manage to fill up not just zone0, but the whole drive.

Since zslba is defined as:
"Zone Start Logical Block Address (ZSLBA): This field indicates the 64-bit
address of the lowest logical block of the zone in which the data and
metadata, if applicable, associated with this command is to be stored."

Should an append that specifies zslba 0 (== zone 0), be allowed
to write into zone 1 (and beyond).

According to 2.3.1.1 "Writing in Sequential Write Required Zones",
we should get either a "Zone Boundary Error" or "Zone Is Full" error,
depending on zone 0 write pointer, combined with how many LBAs we try
to append.


Looking at the code, I think that this has to be handled in
either nvme_check_zone_write() or nvme_advance_zone_wp().

Considering that Dmitry and Klaus were discussing when to advance
the write pointer, etc, for several of the patch series revisions,
I think it is better to leave a potential fix to you guys.


Kind regards,
Niklas


Re: [PATCH v9 08/12] hw/block/nvme: Support Zoned Namespace Command Set

2020-11-06 Thread Niklas Cassel
On Thu, Nov 05, 2020 at 11:53:38AM +0900, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> Define trace events where needed in newly introduced code.
> 
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> Subsequent commits in this series add ZDE support and checks for
> active and open zone limits.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 
> Reviewed-by: Niklas Cassel 
> ---
>  hw/block/nvme-ns.h|  54 +++
>  hw/block/nvme.h   |   8 +
>  hw/block/nvme-ns.c| 173 
>  hw/block/nvme.c   | 971 +-
>  hw/block/trace-events |  18 +-
>  5 files changed, 1209 insertions(+), 15 deletions(-)
> 

(snip)

> +static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeCmd *cmd = (NvmeCmd *)>cmd;
> +NvmeNamespace *ns = req->ns;
> +/* cdw12 is zero-based number of dwords to return. Convert to bytes */
> +uint32_t data_size = (le32_to_cpu(cmd->cdw12) + 1) << 2;
> +uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +uint32_t zone_idx, zra, zrasf, partial;
> +uint64_t max_zones, nr_zones = 0;
> +uint16_t ret;
> +uint64_t slba;
> +NvmeZoneDescr *z;
> +NvmeZone *zs;
> +NvmeZoneReportHeader *header;
> +void *buf, *buf_p;
> +size_t zone_entry_sz;
> +
> +req->status = NVME_SUCCESS;
> +
> +ret = nvme_get_mgmt_zone_slba_idx(ns, cmd, , _idx);
> +if (ret) {
> +return ret;
> +}
> +
> +zra = dw13 & 0xff;
> +if (zra != NVME_ZONE_REPORT) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +zrasf = (dw13 >> 8) & 0xff;
> +if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +if (data_size < sizeof(NvmeZoneReportHeader)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +ret = nvme_map_dptr(n, data_size, req);

nvme_map_dptr() call was not here in v8 patch set.

On v7 I commented that you were missing a call to nvme_check_mdts().
I think you still need to call nvme_check_mdts in order to verify
that data_size < mdts, no?

This function already has a call do nvme_dma(). nvme_dma() already
calls nvme_map_dptr().
If you use nvme_dma(), you cannot use nvme_map_dptr().

It will call nvme_map_addr() (which calls qemu_sglist_add()) on the
same buffer twice, causing the qsg->size to be twice what the user
sent in. Which will cause the:

if (unlikely(residual)) {

check in nvme_dma() to fail.


Looking at nvme_read()/nvme_write(), they use nvme_map_dptr()
(without any call to nvme_dma()), and then use dma_blk_read() or
dma_blk_write(). (and they both call nvme_check_mdts())


Kind regards,
Niklas

> +if (ret) {
> +return ret;
> +}
> +
> +partial = (dw13 >> 16) & 0x01;
> +
> +zone_entry_sz = sizeof(NvmeZoneDescr);
> +
> +max_zones = (data_size - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
> +buf = g_malloc0(data_size);
> +
> +header = (NvmeZoneReportHeader *)buf;
> +buf_p = buf + sizeof(NvmeZoneReportHeader);
> +
> +while (zone_idx < ns->num_zones && nr_zones < max_zones) {
> +

Re: [PATCH v8 07/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-30 Thread Niklas Cassel
On Fri, Oct 30, 2020 at 11:32:38AM +0900, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> Define trace events where needed in newly introduced code.
> 
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> Subsequent commits in this series add ZDE support and checks for
> active and open zone limits.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 
> Reviewed-by: Niklas Cassel 

(snip)

> +static uint16_t nvme_zone_mgmt_recv(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeCmd *cmd = (NvmeCmd *)>cmd;
> +NvmeNamespace *ns = req->ns;
> +/* cdw12 is zero-based number of dwords to return. Convert to bytes */
> +uint32_t len = (le32_to_cpu(cmd->cdw12) + 1) << 2;
> +uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +uint32_t zone_idx, zra, zrasf, partial;
> +uint64_t max_zones, nr_zones = 0;
> +uint16_t ret;
> +uint64_t slba;
> +NvmeZoneDescr *z;
> +NvmeZone *zs;
> +NvmeZoneReportHeader *header;
> +void *buf, *buf_p;
> +size_t zone_entry_sz;
> +
> +req->status = NVME_SUCCESS;
> +
> +ret = nvme_get_mgmt_zone_slba_idx(ns, cmd, , _idx);
> +if (ret) {
> +return ret;
> +}
> +
> +if (len < sizeof(NvmeZoneReportHeader)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}

Just like nvme_read() and nvme_write(), nvme_zone_mgmt_recv()
has to do something like:

+ret = nvme_check_mdts(n, len);
+if (ret) {
+trace_pci_nvme_err_mdts(nvme_cid(req), len);
+return ret;
+}
+

To see that we are not exceeding MDTS.


Kind regards,
Niklas


> +
> +zra = dw13 & 0xff;
> +if (!(zra == NVME_ZONE_REPORT || zra == NVME_ZONE_REPORT_EXTENDED)) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +if (zra == NVME_ZONE_REPORT_EXTENDED) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +zrasf = (dw13 >> 8) & 0xff;
> +if (zrasf > NVME_ZONE_REPORT_OFFLINE) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +partial = (dw13 >> 16) & 0x01;
> +
> +zone_entry_sz = sizeof(NvmeZoneDescr);
> +
> +max_zones = (len - sizeof(NvmeZoneReportHeader)) / zone_entry_sz;
> +buf = g_malloc0(len);
> +
> +header = (NvmeZoneReportHeader *)buf;
> +buf_p = buf + sizeof(NvmeZoneReportHeader);
> +
> +while (zone_idx < ns->num_zones && nr_zones < max_zones) {
> +zs = >zone_array[zone_idx];
> +
> +if (!nvme_zone_matches_filter(zrasf, zs)) {
> +zone_idx++;
> +continue;
> +}
> +
> +z = (NvmeZoneDescr *)buf_p;
> +buf_p += sizeof(NvmeZoneDescr);
> +nr_zones++;
> +
> +z->zt = zs->d.zt;
> +z->zs = zs->d.zs;
> +z->zcap = cpu_to_le64(zs->d.zcap);
> +z->zslba = cpu_to_le64(zs->d.zslba);
> +z->za = zs->d.za;
> +
> +if (nvme_wp_is_valid(zs)) {
> +z->wp = cpu_to_le64(zs->d.wp);
> +} else {
> +z->wp = cpu_to_le64(~0ULL);
> +}
> +
> +zone_idx++;
> +}
> +
> +if (!partial) {
> +for (; zone_idx < ns->num_zones; zone_idx++) {
> +zs = >zone_array[zone_idx];
> +if (nvme_zone_matches_filter(zrasf, zs)) {
> +nr_zones++;
> +}
> +}
> +}
> +header->nr_zones = cpu_to_le64(nr_zones);
> +
> +ret = nvme_dma(n, (uint8_t *)buf, len, DMA_DIRECTION_FROM_DEVICE, req);
> +
> +g_free(buf);
> +
> +return ret;
> +}


Re: [PATCH v7 00/11] hw/block/nvme: Support Namespace Types and Zoned Namespace Command Set

2020-10-19 Thread Niklas Cassel
On Mon, Oct 19, 2020 at 11:17:15AM +0900, Dmitry Fomichev wrote:

(snip)

> 
> Dmitry Fomichev (9):
>   hw/block/nvme: Add Commands Supported and Effects log
>   hw/block/nvme: Generate namespace UUIDs
>   hw/block/nvme: Support Zoned Namespace Command Set
>   hw/block/nvme: Introduce max active and open zone limits
>   hw/block/nvme: Support Zone Descriptor Extensions
>   hw/block/nvme: Add injection of Offline/Read-Only zones
>   hw/block/nvme: Document zoned parameters in usage text
>   hw/block/nvme: Separate read and write handlers
>   hw/block/nvme: Merge nvme_write_zeroes() with nvme_write()
> 
> Niklas Cassel (2):
>   hw/block/nvme: Add support for Namespace Types
>   hw/block/nvme: Support allocated CNS command variants
> 
>  block/nvme.c  |2 +-
>  hw/block/nvme-ns.c|  295 
>  hw/block/nvme-ns.h|  109 +++
>  hw/block/nvme.c   | 1550 ++---
>  hw/block/nvme.h   |9 +
>  hw/block/trace-events |   36 +-
>  include/block/nvme.h  |  201 +-
>  7 files changed, 2078 insertions(+), 124 deletions(-)
> 
> -- 
> 2.21.0
> 

Thank you Dmitry, this version was easier to review.

Except for a missing
/* fall through */ comment in nvme_cmd_effects().
(in the "hw/block/nvme: Add Commands Supported and Effects log" patch.)

For the whole series:
Reviewed-by: Niklas Cassel 


Re: [PATCH v6 03/11] hw/block/nvme: Add support for Namespace Types

2020-10-14 Thread Niklas Cassel
On Wed, Oct 14, 2020 at 06:42:04AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel 
> 
> Define the structures and constants required to implement
> Namespace Types support.
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Dmitry Fomichev 
> ---

(snip)

> @@ -2090,6 +2199,27 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>  n->bar.cc = 0;
>  }
>  
> +static void nvme_select_ns_iocs(NvmeCtrl *n)
> +{
> +NvmeNamespace *ns;
> +int i;
> +
> +for (i = 1; i <= n->num_namespaces; i++) {
> +ns = nvme_ns(n, i);
> +if (!ns) {
> +continue;
> +}
> +ns->iocs = nvme_cse_iocs_none;
> +switch (ns->csi) {
> +case NVME_CSI_NVM:
> +if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
> +ns->iocs = nvme_cse_iocs_nvm;
> +}
> +break;
> +}
> +}
> +}
> +
>  static int nvme_start_ctrl(NvmeCtrl *n)
>  {
>  uint32_t page_bits = NVME_CC_MPS(n->bar.cc) + 12;
> @@ -2188,6 +2318,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>  
>  QTAILQ_INIT(>aer_queue);
>  
> +nvme_select_ns_iocs(n);
> +
>  return 0;
>  }
>  
> @@ -2655,7 +2787,6 @@ int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace 
> *ns, Error **errp)
>  trace_pci_nvme_register_namespace(nsid);
>  
>  n->namespaces[nsid - 1] = ns;
> -ns->iocs = nvme_cse_iocs_nvm;
>  
>  return 0;
>  }

Considering how tightly coupled the three above diffs are with the
Commands Supported and Effects log, and since patch 1 already adds
the ns->iocs checking in nvme_admin_cmd() and nvme_io_cmd(),
and since these three diffs are not really related to NS types,
I think they should be moved to patch 1.

It really helps the reviewer if both the ns->iocs assignment
and checking is done in the same patch, and introduced as early
as possible. And since this code is needed/valid simply to check
if ADMIN_ONLY is selected (even before NS Types were introduced),
I don't see any reason not to introduce them in to patch 1
together with the other ns->iocs stuff.

(We were always able to select a I/O Command Set using CC.CSS
(Admin only/None, or NVM), NS types simply introduced the ability
to select/enable more than one command set at the same time.)


Kind regards,
Niklas


Re: [PATCH v6 01/11] hw/block/nvme: Add Commands Supported and Effects log

2020-10-14 Thread Niklas Cassel
On Tue, Oct 13, 2020 at 05:50:34PM -0700, Keith Busch wrote:
> On Wed, Oct 14, 2020 at 06:42:02AM +0900, Dmitry Fomichev wrote:
> > +{
> > +NvmeEffectsLog log = {};
> > +uint32_t *dst_acs = log.acs, *dst_iocs = log.iocs;
> > +uint32_t trans_len;
> > +int i;
> > +
> > +trace_pci_nvme_cmd_supp_and_effects_log_read();
> > +
> > +if (off >= sizeof(log)) {
> > +trace_pci_nvme_err_invalid_effects_log_offset(off);
> > +return NVME_INVALID_FIELD | NVME_DNR;
> > +}
> > +
> > +for (i = 0; i < 256; i++) {
> > +dst_acs[i] = nvme_cse_acs[i];
> > +}
> > +
> > +for (i = 0; i < 256; i++) {
> > +dst_iocs[i] = nvme_cse_iocs_nvm[i];
> > +}
> 
> You're just copying the array, so let's do it like this:
> 
> memcpy(log.acs, nvme_cse_acs, sizeof(nvme_cse_acs));
> memcpy(log.iocs, nvme_cse_iocs_nvm, sizeof(nvme_cse_iocs_nvm));
> 
> I think you also need to check
> 
> if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY)
> 
> before copying iocs.

So it seems Dmitry actually does this in the Namespace Types patch:


@@ -1051,10 +1054,6 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest 
*req)
 trace_pci_nvme_io_cmd(nvme_cid(req), nsid, nvme_sqid(req),
   req->cmd.opcode, nvme_io_opc_str(req->cmd.opcode));

-if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_ADMIN_ONLY) {
-return NVME_INVALID_OPCODE | NVME_DNR;
-}
-
 if (!nvme_nsid_valid(n, nsid)) {
 return NVME_INVALID_NSID | NVME_DNR;
 }
@@ -1333,8 +1332,10 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t 
buf_len,
 dst_acs[i] = nvme_cse_acs[i];
 }

-for (i = 0; i < 256; i++) {
-dst_iocs[i] = nvme_cse_iocs_nvm[i];
+if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
+for (i = 0; i < 256; i++) {
+dst_iocs[i] = nvme_cse_iocs_nvm[i];
+}
 }


Which later in the ZNS patch is changed to:

@@ -1335,13 +2178,31 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, uint32_t 
buf_len,
 return NVME_INVALID_FIELD | NVME_DNR;
 }

+switch (NVME_CC_CSS(n->bar.cc)) {
+case NVME_CC_CSS_NVM:
+src_iocs = nvme_cse_iocs_nvm;
+break;
+case NVME_CC_CSS_CSI:
+switch (csi) {
+case NVME_CSI_NVM:
+src_iocs = nvme_cse_iocs_nvm;
+break;
+case NVME_CSI_ZONED:
+src_iocs = nvme_cse_iocs_zoned;
+break;
+}
+/* fall through */
+case NVME_CC_CSS_ADMIN_ONLY:
+break;
+}
+
 for (i = 0; i < 256; i++) {
 dst_acs[i] = nvme_cse_acs[i];
 }

-if (NVME_CC_CSS(n->bar.cc) != NVME_CC_CSS_ADMIN_ONLY) {
+if (src_iocs) {
 for (i = 0; i < 256; i++) {
-dst_iocs[i] = nvme_cse_iocs_nvm[i];
+dst_iocs[i] = src_iocs[i];
 }
 }


Perhaps the nvme_io_cmd() if-statement removal from the NS types patch +
the switch from the ZNS patch (with out the NVME_CSI_ZONED) could be moved
to this patch instead?

The middle version of adding "+if (NVME_CC_CSS(n->bar.cc) != 
NVME_CC_CSS_ADMIN_ONLY) {"
can then be dropped from the NS types patch, since it is not part of the final 
code anyway.


Kind regards,
Niklas


Re: [PATCH v6 05/11] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-14 Thread Niklas Cassel
On Wed, Oct 14, 2020 at 06:42:06AM +0900, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Define values and structures that are needed to support Zoned
> Namespace Command Set (NVMe TP 4053) in PCI NVMe controller emulator.
> Define trace events where needed in newly introduced code.
> 
> In order to improve scalability, all open, closed and full zones
> are organized in separate linked lists. Consequently, almost all
> zone operations don't require scanning of the entire zone array
> (which potentially can be quite large) - it is only necessary to
> enumerate one or more zone lists.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> Subsequent commits in this series add ZDE support and checks for
> active and open zone limits.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 

(snip)

> @@ -2260,6 +3155,11 @@ static void nvme_select_ns_iocs(NvmeCtrl *n)
>  ns->iocs = nvme_cse_iocs_nvm;
>  }
>  break;
> +case NVME_CSI_ZONED:
> +if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
> +ns->iocs = nvme_cse_iocs_zoned;
> +}
> +break;
>  }
>  }
>  }

Who knows how this whole command set mess is supposed to work,
since e.g. the Key Value Command Set assigns opcodes for new commands
(Delete, Exist) with a opcode values (0x10,0x14) smaller than the
current highest opcode value (0x15) in the NVM Command Set,
while those opcodes (0x10,0x14) are reserved in the NVM Command Set.

At least for Zoned Command Set, they defined the new commands
(Zone Mgmt Send, Zone Mgmt Recv) to opcode values (0x79,0x7a)
that are higher than the current highest opcode value in the
NVM Command Set.

So since we know that the Zoned Command Set is a strict superset of
the NVM Command Set, I guess it might be nice to do something like:

case NVME_CSI_ZONED:
if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_CSI) {
ns->iocs = nvme_cse_iocs_zoned;
} else if (NVME_CC_CSS(n->bar.cc) == NVME_CC_CSS_NVM) {
ns->iocs = nvme_cse_iocs_nvm;
}
break;


Since I assume that the spec people intended reads/writes
to a ZNS namespace to still be possible when CC_CSS == NVM,
but who knows?


Kind regards,
Niklas


Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set

2020-10-05 Thread Niklas Cassel
On Sun, Oct 04, 2020 at 11:57:07PM +, Dmitry Fomichev wrote:
> On Wed, 2020-09-30 at 14:50 +0000, Niklas Cassel wrote:
> > On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> > > The emulation code has been changed to advertise NVM Command Set when
> > > "zoned" device property is not set (default) and Zoned Namespace
> > > Command Set otherwise.
> > > 
> > > Handlers for three new NVMe commands introduced in Zoned Namespace
> > > Command Set specification are added, namely for Zone Management
> > > Receive, Zone Management Send and Zone Append.
> > > 
> > > Device initialization code has been extended to create a proper
> > > configuration for zoned operation using device properties.
> > > 
> > > Read/Write command handler is modified to only allow writes at the
> > > write pointer if the namespace is zoned. For Zone Append command,
> > > writes implicitly happen at the write pointer and the starting write
> > > pointer value is returned as the result of the command. Write Zeroes
> > > handler is modified to add zoned checks that are identical to those
> > > done as a part of Write flow.
> > > 
> > > The code to support for Zone Descriptor Extensions is not included in
> > > this commit and ZDES 0 is always reported. A later commit in this
> > > series will add ZDE support.
> > > 
> > > This commit doesn't yet include checks for active and open zone
> > > limits. It is assumed that there are no limits on either active or
> > > open zones.
> > > 
> > > Signed-off-by: Niklas Cassel 
> > > Signed-off-by: Hans Holmberg 
> > > Signed-off-by: Ajay Joshi 
> > > Signed-off-by: Chaitanya Kulkarni 
> > > Signed-off-by: Matias Bjorling 
> > > Signed-off-by: Aravind Ramesh 
> > > Signed-off-by: Shin'ichiro Kawasaki 
> > > Signed-off-by: Adam Manzanares 
> > > Signed-off-by: Dmitry Fomichev 
> > > ---
> > >  block/nvme.c |   2 +-
> > >  hw/block/nvme-ns.c   | 185 -
> > >  hw/block/nvme-ns.h   |   6 +-
> > >  hw/block/nvme.c  | 872 +--
> > >  include/block/nvme.h |   6 +-
> > >  5 files changed, 1033 insertions(+), 38 deletions(-)
> > > 
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index 05485fdd11..7a513c9a17 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c

(snip)

> > 
> > Please read my comment on nvme_identify_nslist_csi() before reading
> > this comment.
> > 
> > At least for this function, the specification is clear:
> > 
> > "If the host requests a data structure for an I/O Command Set that the
> > controller does not support, the controller shall abort the command with
> > a status of Invalid Field in Command."
> > 
> > If the controller supports the I/O command set == if the Command Set bit
> > is set in the data struct returned by the nvme_identify_cmd_set(),
> > so here we should do something like:
> > 
> > } else if (->csi == NVME_CSI_ZONED && ctrl_has_zns_namespaces()) {
> > ...
> > }
> > 
> 
> With this commit, the controller supports ZNS command set regardless of
> the number of attached ZNS namespaces. It could be zero, but the controller
> still supports it. I think it would be better not to change the behavior
> of this command to depend on whether there are any ZNS namespaces added
> or not.

Ok, always having ZNS Command Set support, regardless if a user defines
a zoned namespace on the QEMU command line or not, does simplify things.

But then in nvme_identify_cmd_set(), you need to call
NVME_SET_CSI(*list, NVME_CSI_ZONED) unconditionally.

(Right now you loop though all namespaces, and only set the support bit
if you find a zoned namespace.)

> > Like I wrote in my review comment in the patch that added support for the 
> > new
> > allocated CNS values, I prefer if we remove this for-loop completely, and
> > simply set attached = true in nvme_ns_setup()/nvme_ns_init() instead.
> > 
> > (I was considering if we should set attach = true in nvme_zoned_init_ns(),
> > but because nvme_ns_setup()/nvme_ns_init() is called for all namespaces,
> > including ZNS namespaces, I don't think that any additional code in
> > nvme_zoned_init_ns() is warranted.)
> 
> I think CC.CSS value is not available during namespace setup and if we
> assign active flag in nvme_zoned_ns_setup(), zoned namespaces may end up
> being active even if NVM Only command set is selected. So keeping this loop
> seems like a good idea.

It is true that CC.CSS is not yet available during namespace setup,
but since the controller itself will never detach namespaces based on
CC.CSS, why are we dependant on CC.CSS being available?

Sure, once someone implements namespace management, they will need
to read if a certain namespace is attached or detached from some
persistent state, perhaps in the zone meta-data file, and set
attached boolean in nvme_ns_init() accordingly, but I still don't see
any dependance on CC.CSS even when namespace management is implemented.



Kind regards,
Niklas


Re: [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces

2020-10-05 Thread Niklas Cassel
On Sun, Oct 04, 2020 at 11:54:13PM +, Dmitry Fomichev wrote:
> On Wed, 2020-09-30 at 13:50 +0000, Niklas Cassel wrote:
> > On Mon, Sep 28, 2020 at 11:35:20AM +0900, Dmitry Fomichev wrote:
> > > From: Niklas Cassel 
> > > 
> > > In NVMe, a namespace is active if it exists and is attached to the
> > > controller.
> > > 
> > > CAP.CSS (together with the I/O Command Set data structure) defines what
> > > command sets are supported by the controller.
> > > 
> > > CC.CSS (together with Set Profile) can be set to enable a subset of the
> > > available command sets. The namespaces belonging to a disabled command set
> > > will not be able to attach to the controller, and will thus be inactive.
> > > 
> > > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > > marked as inactive.
> > > 
> > > The identify namespace, the identify namespace CSI specific, and the 
> > > namespace
> > > list commands have two different versions, one that only shows active
> > > namespaces, and the other version that shows existing namespaces, 
> > > regardless
> > > of whether the namespace is attached or not.
> > > 
> > > Add an attached member to struct NvmeNamespace, and implement the missing 
> > > CNS
> > > commands.
> > > 
> > > The added functionality will also simplify the implementation of namespace
> > > management in the future, since namespace management can also attach and
> > > detach namespaces.
> > 
> > Following my previous discussion with Klaus,
> > I think we need to rewrite this commit message completely:
> > 
> > Subject: hw/block/nvme: Add support for allocated CNS command variants
> > 
> > Many CNS commands have "allocated" command variants.
> > These includes a namespace as long as it is allocated
> > (i.e. a namespace is included regardless if it is active (attached)
> > or not.)
> > 
> > While these commands are optional (they are mandatory for controllers
> > supporting the namespace attachment command), our QEMU implementation
> > is more complete by actually providing support for these CNS values.
> > 
> > However, since our QEMU model currently does not support the namespace
> > attachment command, these new allocated CNS commands will return the same
> > result as the active CNS command variants.
> > 
> > In NVMe, a namespace is active if it exists and is attached to the
> > controller.
> > 
> > CAP.CSS (together with the I/O Command Set data structure) defines what
> > command sets are supported by the controller.
> > 
> > CC.CSS (together with Set Profile) can be set to enable a subset of the
> > available command sets.
> > 
> > Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
> > will still be attached (and thus marked as active).
> > Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
> > will still be attached (and thus marked as active).
> > 
> > However, any operation from a disabled command set will result in a
> > Invalid Command Opcode.
> > 
> > Add an attached struct member for struct NvmeNamespace,
> > so that we lay the foundation for namespace attachment
> > support. Also implement logic in the new CNS values to
> > include/exclude namespaces based on this new struct member.
> > The only thing missing hooking up the actual Namespace Attachment
> > command opcode, which allows a user to toggle the attached
> > variable per namespace. The reason for not hooking up this
> > command completely is because the NVMe specification
> > requires that the namespace managment command is supported
> > if the namespacement attachment command is supported.
> > 
> 

(snip)

> > > @@ -2276,6 +2304,22 @@ static int nvme_start_ctrl(NvmeCtrl *n)
> > >  nvme_init_sq(>admin_sq, n, n->bar.asq, 0, 0,
> > >   NVME_AQA_ASQS(n->bar.aqa) + 1);
> > >  
> > > +for (i = 1; i <= n->num_namespaces; i++) {
> > > +ns = nvme_ns(n, i);
> > > +if (!ns) {
> > > +continue;
> > > +}
> > > +ns->params.attached = false;
> > > +switch (ns->params.csi) {
> > > +case NVME_CSI_NVM:
> > > +if (NVME_CC_CSS(n->bar.cc) == CSS_NVM_ONLY ||
> > > +NVME_CC_CSS(n->bar.cc) == CSS_CSI) {
> > > +ns->params.attached = true;
> > > +

Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types

2020-10-01 Thread Niklas Cassel
On Thu, Oct 01, 2020 at 08:59:31AM -0700, Keith Busch wrote:
> On Thu, Oct 01, 2020 at 03:50:35PM +0000, Niklas Cassel wrote:
> > On Thu, Oct 01, 2020 at 09:29:22AM -0600, Keith Busch wrote:
> > > On Thu, Oct 01, 2020 at 11:22:46AM +0000, Niklas Cassel wrote:
> > > > On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> > > > > From: Niklas Cassel 
> > > > > @@ -,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr 
> > > > > offset, uint64_t data,
> > > > >  break;
> > > > >  case 0x14:  /* CC */
> > > > >  trace_pci_nvme_mmio_cfg(data & 0x);
> > > > > +
> > > > > +if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> > > > > +if (NVME_CC_EN(n->bar.cc)) {
> > > > 
> > > > I just saw this print when doing controller reset on a live system.
> > > > 
> > > > Added a debug print:
> > > > nvme_write_bar WRITING: 0x0 previous: 0x464061
> > > > 
> > > > so the second if-statement has to be:
> > > > 
> > > > if (NVME_CC_EN(n->bar.cc) && NVME_CC_EN(data)) {
> > > > 
> > > > Sorry for introducing the bug in the first place.
> > > 
> > > No worries.
> > > 
> > > I don't think the check should be here at all, really. The only check for 
> > > valid
> > > CSS should be in nvme_start_ctrl(), which I posted yesterday.
> > 
> > The reasoning for this additional check is this:
> > 
> > From CC.CC register description:
> > 
> > "This field shall only be changed when the controller
> > is disabled (CC.EN is cleared to ‘0’)."
> > 
> > In the QEMU model, we have functions, e.g. nvme_cmd_effects(),
> > that uses n->bar.cc "at runtime".
> > 
> > So I don't think that simply checking for valid CSS in
> > nvme_start_ctrl() is sufficient.
> > 
> > Thoughts?
> 
> The qemu controller accepts host register writes only for valid enable
> and shutdown  bit transitions. Or at least it should. If not, then we
> need to fix that, but that's not specific to the CSS bits.

I simply added the second if-statement, (if (NVME_CC_EN(n->bar.cc))),
the rest of the NVME_CC_CSS was written by someone else.

But I see your point, all of this code:

if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
if (NVME_CC_EN(n->bar.cc)) {
NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
   "changing selected command set when enabled");
} else {
switch (NVME_CC_CSS(data)) {
case CSS_NVM_ONLY:
trace_pci_nvme_css_nvm_cset_selected_by_host(data &
 0x);
break;
case CSS_CSI:
NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
trace_pci_nvme_css_all_csets_sel_by_host(data &
 0x);
break;
case CSS_ADMIN_ONLY:
break;
default:
NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
   "unknown value in CC.CSS field");
}
}
}

should simply be dropped.

No need to call NVME_SET_CC_CSS() explicitly.

CC.CSS bit will be set futher down in this function anyway:

if (NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc)) {
n->bar.cc = data;


Kind regards,
Niklas

Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types

2020-10-01 Thread Niklas Cassel
On Thu, Oct 01, 2020 at 09:29:22AM -0600, Keith Busch wrote:
> On Thu, Oct 01, 2020 at 11:22:46AM +0000, Niklas Cassel wrote:
> > On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> > > From: Niklas Cassel 
> > > @@ -,6 +2328,30 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr 
> > > offset, uint64_t data,
> > >  break;
> > >  case 0x14:  /* CC */
> > >  trace_pci_nvme_mmio_cfg(data & 0x);
> > > +
> > > +if (NVME_CC_CSS(data) != NVME_CC_CSS(n->bar.cc)) {
> > > +if (NVME_CC_EN(n->bar.cc)) {
> > 
> > I just saw this print when doing controller reset on a live system.
> > 
> > Added a debug print:
> > nvme_write_bar WRITING: 0x0 previous: 0x464061
> > 
> > so the second if-statement has to be:
> > 
> > if (NVME_CC_EN(n->bar.cc) && NVME_CC_EN(data)) {
> > 
> > Sorry for introducing the bug in the first place.
> 
> No worries.
> 
> I don't think the check should be here at all, really. The only check for 
> valid
> CSS should be in nvme_start_ctrl(), which I posted yesterday.

The reasoning for this additional check is this:

From CC.CC register description:

"This field shall only be changed when the controller
is disabled (CC.EN is cleared to ‘0’)."

In the QEMU model, we have functions, e.g. nvme_cmd_effects(),
that uses n->bar.cc "at runtime".

So I don't think that simply checking for valid CSS in
nvme_start_ctrl() is sufficient.

Thoughts?


Kind regards,
Niklas

>  
> > > +NVME_GUEST_ERR(pci_nvme_err_change_css_when_enabled,
> > > +   "changing selected command set when 
> > > enabled");
> > > +} else {
> > > +switch (NVME_CC_CSS(data)) {
> > > +case CSS_NVM_ONLY:
> > > +trace_pci_nvme_css_nvm_cset_selected_by_host(data &
> > > + 
> > > 0x);
> > > +break;
> > > +case CSS_CSI:
> > > +NVME_SET_CC_CSS(n->bar.cc, CSS_CSI);
> > > +trace_pci_nvme_css_all_csets_sel_by_host(data & 
> > > 0x);
> > > +break;
> > > +case CSS_ADMIN_ONLY:
> > > +break;
> > > +default:
> > > +NVME_GUEST_ERR(pci_nvme_ub_unknown_css_value,
> > > +   "unknown value in CC.CSS field");
> > > +}
> > > +}
> > > +}

Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types

2020-10-01 Thread Niklas Cassel
On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel 
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Dmitry Fomichev 
> ---
>  hw/block/nvme-ns.c |   3 +
>  hw/block/nvme.c| 210 +
>  2 files changed, 175 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index bbd7879492..31b7f986c3 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -40,6 +40,9 @@ static void nvme_ns_init(NvmeNamespace *ns)
>  
>  id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
>  
> +ns->params.csi = NVME_CSI_NVM;
> +qemu_uuid_generate(>params.uuid); /* TODO make UUIDs persistent */
> +
>  /* no thin provisioning */
>  id_ns->ncap = id_ns->nsze;
>  id_ns->nuse = id_ns->ncap;
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 29fa005fa2..4ec1ddc90a 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1495,6 +1495,13 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, 
> NvmeRequest *req)
>  return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_rpt_empty_id_struct(NvmeCtrl *n, NvmeRequest *req)
> +{
> +uint8_t id[NVME_IDENTIFY_DATA_SIZE] = {};
> +
> +return nvme_dma(n, id, sizeof(id), DMA_DIRECTION_FROM_DEVICE, req);
> +}
> +
>  static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  {
>  trace_pci_nvme_identify_ctrl();
> @@ -1503,11 +1510,23 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, 
> NvmeRequest *req)
>  DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeIdentify *c = (NvmeIdentify *)>cmd;
> +
> +trace_pci_nvme_identify_ctrl_csi(c->csi);
> +
> +if (c->csi == NVME_CSI_NVM) {
> +return nvme_rpt_empty_id_struct(n, req);
> +}
> +
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
>  {
>  NvmeNamespace *ns;
>  NvmeIdentify *c = (NvmeIdentify *)>cmd;
> -NvmeIdNs *id_ns, inactive = { 0 };
>  uint32_t nsid = le32_to_cpu(c->nsid);
>  
>  trace_pci_nvme_identify_ns(nsid);
> @@ -1518,23 +1537,46 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
> NvmeRequest *req)
>  
>  ns = nvme_ns(n, nsid);
>  if (unlikely(!ns)) {
> -id_ns = 
> -} else {
> -id_ns = >id_ns;
> +return nvme_rpt_empty_id_struct(n, req);
>  }
>  
> -return nvme_dma(n, (uint8_t *)id_ns, sizeof(NvmeIdNs),
> +return nvme_dma(n, (uint8_t *)>id_ns, sizeof(NvmeIdNs),
>  DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> +static uint16_t nvme_identify_ns_csi(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeNamespace *ns;
> +NvmeIdentify *c = (NvmeIdentify *)>cmd;
> +uint32_t nsid = le32_to_cpu(c->nsid);
> +
> +trace_pci_nvme_identify_ns_csi(nsid, c->csi);
> +
> +if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> +return NVME_INVALID_NSID | NVME_DNR;
> +}
> +
> +ns = nvme_ns(n, nsid);
> +if (unlikely(!ns)) {
> +return nvme_rpt_empty_id_struct(n, req);
> +}
> +
> +if (c->csi == NVME_CSI_NVM) {
> +return nvme_rpt_empty_id_struct(n, req);
> +}
> +
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
>  static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeRequest *req)
>  {
> +NvmeNamespace *ns;
>  NvmeIdentify *c = (NvmeIdentify *)>cmd;
> -static const int data_len = NVME_IDENTIFY_DATA_SIZE;
>  uint32_t min_nsid = le32_to_cpu(c->nsid);
> -uint32_t *list;
> -uint16_t ret;
> -int j = 0;
> +uint8_t list[NVME_IDENTIFY_DATA_SIZE] = {};
> +static const int data_len = sizeof(list);
> +uint32_t *list_ptr = (uint32_t *)list;
> +int i, j = 0;
>  
>  trace_pci_nvme_identify_nslist(min_nsid);
>  
> @@ -1548,48 +1590,76 @@ static uint16_t nvme_identify_nslis

Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set

2020-09-30 Thread Niklas Cassel
On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support for Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 
> ---
>  block/nvme.c |   2 +-
>  hw/block/nvme-ns.c   | 185 -
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c  | 872 +--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c

(snip)

> @@ -1326,11 +2060,20 @@ static uint16_t nvme_cmd_effects(NvmeCtrl *n, 
> uint32_t buf_len,
>  acs[NVME_ADM_CMD_GET_LOG_PAGE] = NVME_CMD_EFFECTS_CSUPP;
>  acs[NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFFECTS_CSUPP;
>  
> -iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> -  NVME_CMD_EFFECTS_LBCC;
> -iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | NVME_CMD_EFFECTS_LBCC;
> -iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +if (NVME_CC_CSS(n->bar.cc) != CSS_ADMIN_ONLY) {
> +iocs[NVME_CMD_FLUSH] = NVME_CMD_EFFECTS_CSUPP | 
> NVME_CMD_EFFECTS_LBCC;
> +iocs[NVME_CMD_WRITE_ZEROES] = NVME_CMD_EFFECTS_CSUPP |
> +  NVME_CMD_EFFECTS_LBCC;
> +iocs[NVME_CMD_WRITE] = NVME_CMD_EFFECTS_CSUPP | 
> NVME_CMD_EFFECTS_LBCC;
> +iocs[NVME_CMD_READ] = NVME_CMD_EFFECTS_CSUPP;
> +}
> +
> +if (csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) == CSS_CSI) {

Actually, intead of naming the helper function, ctrl_has_zns_namespaces(),
a better name might be ctrl_has_zns_support()

Since this is what is used to set the bit in nvme_identify_cmd_set(),

Then, I think that this should be:

if (ctrl_has_zns_support() && csi == NVME_CSI_ZONED && NVME_CC_CSS(n->bar.cc) 
== CSS_CSI) {


> +iocs[NVME_CMD_ZONE_APPEND] = NVME_CMD_EFFECTS_CSUPP |
> + NVME_CMD_EFFECTS_LBCC;
> +iocs[NVME_CMD_ZONE_MGMT_SEND] = NVME_CMD_EFFECTS_CSUPP;
> +iocs[NVME_CMD_ZONE_MGMT_RECV] = NVME_CMD_EFFECTS_CSUPP;
> +}
>  
>  trans_len = MIN(sizeof(cmd_eff_log) - off, buf_len);
>  





Re: [PATCH v5 09/14] hw/block/nvme: Support Zoned Namespace Command Set

2020-09-30 Thread Niklas Cassel
On Mon, Sep 28, 2020 at 11:35:23AM +0900, Dmitry Fomichev wrote:
> The emulation code has been changed to advertise NVM Command Set when
> "zoned" device property is not set (default) and Zoned Namespace
> Command Set otherwise.
> 
> Handlers for three new NVMe commands introduced in Zoned Namespace
> Command Set specification are added, namely for Zone Management
> Receive, Zone Management Send and Zone Append.
> 
> Device initialization code has been extended to create a proper
> configuration for zoned operation using device properties.
> 
> Read/Write command handler is modified to only allow writes at the
> write pointer if the namespace is zoned. For Zone Append command,
> writes implicitly happen at the write pointer and the starting write
> pointer value is returned as the result of the command. Write Zeroes
> handler is modified to add zoned checks that are identical to those
> done as a part of Write flow.
> 
> The code to support for Zone Descriptor Extensions is not included in
> this commit and ZDES 0 is always reported. A later commit in this
> series will add ZDE support.
> 
> This commit doesn't yet include checks for active and open zone
> limits. It is assumed that there are no limits on either active or
> open zones.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Hans Holmberg 
> Signed-off-by: Ajay Joshi 
> Signed-off-by: Chaitanya Kulkarni 
> Signed-off-by: Matias Bjorling 
> Signed-off-by: Aravind Ramesh 
> Signed-off-by: Shin'ichiro Kawasaki 
> Signed-off-by: Adam Manzanares 
> Signed-off-by: Dmitry Fomichev 
> ---
>  block/nvme.c |   2 +-
>  hw/block/nvme-ns.c   | 185 -
>  hw/block/nvme-ns.h   |   6 +-
>  hw/block/nvme.c  | 872 +--
>  include/block/nvme.h |   6 +-
>  5 files changed, 1033 insertions(+), 38 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 05485fdd11..7a513c9a17 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -333,7 +333,7 @@ static inline int nvme_translate_error(const NvmeCqe *c)
>  {
>  uint16_t status = (le16_to_cpu(c->status) >> 1) & 0xFF;
>  if (status) {
> -trace_nvme_error(le32_to_cpu(c->result),
> +trace_nvme_error(le32_to_cpu(c->result32),
>   le16_to_cpu(c->sq_head),
>   le16_to_cpu(c->sq_id),
>   le16_to_cpu(c->cid),
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 31b7f986c3..6d9dc9205b 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -33,14 +33,14 @@ static void nvme_ns_init(NvmeNamespace *ns)
>  NvmeIdNs *id_ns = >id_ns;
>  
>  if (blk_get_flags(ns->blkconf.blk) & BDRV_O_UNMAP) {
> -ns->id_ns.dlfeat = 0x9;
> +ns->id_ns.dlfeat = 0x8;

You seem to change something that is NVM namespace specific here, why?
If it is indeed needed, I assume that this change should be in a separate
patch.

>  }
>  
>  id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
>  
>  id_ns->nsze = cpu_to_le64(nvme_ns_nlbas(ns));
>  
> -ns->params.csi = NVME_CSI_NVM;
> +ns->csi = NVME_CSI_NVM;
>  qemu_uuid_generate(>params.uuid); /* TODO make UUIDs persistent */
>  
>  /* no thin provisioning */
> @@ -73,7 +73,162 @@ static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace 
> *ns, Error **errp)
>  }
>  
>  lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> -ns->id_ns.lbaf[lba_index].ds = 31 - clz32(n->conf.logical_block_size);
> +ns->id_ns.lbaf[lba_index].ds = 31 - 
> clz32(ns->blkconf.logical_block_size);
> +
> +return 0;
> +}
> +
> +/*
> + * Add a zone to the tail of a zone list.
> + */
> +void nvme_add_zone_tail(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
> +{
> +uint32_t idx = (uint32_t)(zone - ns->zone_array);
> +
> +assert(nvme_zone_not_in_list(zone));
> +
> +if (!zl->size) {
> +zl->head = zl->tail = idx;
> +zone->next = zone->prev = NVME_ZONE_LIST_NIL;
> +} else {
> +ns->zone_array[zl->tail].next = idx;
> +zone->prev = zl->tail;
> +zone->next = NVME_ZONE_LIST_NIL;
> +zl->tail = idx;
> +}
> +zl->size++;
> +}
> +
> +/*
> + * Remove a zone from a zone list. The zone must be linked in the list.
> + */
> +void nvme_remove_zone(NvmeNamespace *ns, NvmeZoneList *zl, NvmeZone *zone)
> +{
> +uint32_t idx = (uint32_t)(zone - ns->zone_array);
> +
> +assert(!nvme_zone_not_in_list(zone));
> +
> +--zl->size;
> +if (zl-&g

Re: [PATCH v5 06/14] hw/block/nvme: Add support for active/inactive namespaces

2020-09-30 Thread Niklas Cassel
On Mon, Sep 28, 2020 at 11:35:20AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel 
> 
> In NVMe, a namespace is active if it exists and is attached to the
> controller.
> 
> CAP.CSS (together with the I/O Command Set data structure) defines what
> command sets are supported by the controller.
> 
> CC.CSS (together with Set Profile) can be set to enable a subset of the
> available command sets. The namespaces belonging to a disabled command set
> will not be able to attach to the controller, and will thus be inactive.
> 
> E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> marked as inactive.
> 
> The identify namespace, the identify namespace CSI specific, and the namespace
> list commands have two different versions, one that only shows active
> namespaces, and the other version that shows existing namespaces, regardless
> of whether the namespace is attached or not.
> 
> Add an attached member to struct NvmeNamespace, and implement the missing CNS
> commands.
> 
> The added functionality will also simplify the implementation of namespace
> management in the future, since namespace management can also attach and
> detach namespaces.

Following my previous discussion with Klaus,
I think we need to rewrite this commit message completely:

Subject: hw/block/nvme: Add support for allocated CNS command variants

Many CNS commands have "allocated" command variants.
These includes a namespace as long as it is allocated
(i.e. a namespace is included regardless if it is active (attached)
or not.)

While these commands are optional (they are mandatory for controllers
supporting the namespace attachment command), our QEMU implementation
is more complete by actually providing support for these CNS values.

However, since our QEMU model currently does not support the namespace
attachment command, these new allocated CNS commands will return the same
result as the active CNS command variants.

In NVMe, a namespace is active if it exists and is attached to the
controller.

CAP.CSS (together with the I/O Command Set data structure) defines what
command sets are supported by the controller.

CC.CSS (together with Set Profile) can be set to enable a subset of the
available command sets.

Even if a user configures CC.CSS to e.g. Admin only, NVM namespaces
will still be attached (and thus marked as active).
Similarly, if a user configures CC.CSS to e.g. NVM, ZNS namespaces
will still be attached (and thus marked as active).

However, any operation from a disabled command set will result in a
Invalid Command Opcode.

Add an attached struct member for struct NvmeNamespace,
so that we lay the foundation for namespace attachment
support. Also implement logic in the new CNS values to
include/exclude namespaces based on this new struct member.
The only thing missing hooking up the actual Namespace Attachment
command opcode, which allows a user to toggle the attached
variable per namespace. The reason for not hooking up this
command completely is because the NVMe specification
requires that the namespace managment command is supported
if the namespacement attachment command is supported.


> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Dmitry Fomichev 
> ---
>  hw/block/nvme-ns.h   |  1 +
>  hw/block/nvme.c  | 60 ++--
>  include/block/nvme.h | 20 +--
>  3 files changed, 65 insertions(+), 16 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index cca23bc0b3..acdb76f058 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -22,6 +22,7 @@
>  typedef struct NvmeNamespaceParams {
>  uint32_t nsid;
>  uint8_t  csi;
> +bool attached;
>  QemuUUID uuid;
>  } NvmeNamespaceParams;
>  
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 4ec1ddc90a..63ad03d6d6 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c

We need to add an additional check in nvme_io_cmd()
that returns Invalid Command Opcode when CC.CSS == Admin only.

> @@ -1523,7 +1523,8 @@ static uint16_t nvme_identify_ctrl_csi(NvmeCtrl *n, 
> NvmeRequest *req)
>  return NVME_INVALID_FIELD | NVME_DNR;
>  }
>  
> -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
> +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req,
> + bool only_active)
>  {
>  NvmeNamespace *ns;
>  NvmeIdentify *c = (NvmeIdentify *)>cmd;
> @@ -1540,11 +1541,16 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
> NvmeRequest *req)
>  return nvme_rpt_empty_id_struct(n, req);
>  }
>  
> +if (only_active && !ns->params.attached) {
> +return nvme_rpt_empty_id_struct(n, req);
> +}
> +
>  return nvme_dma(n, (u

Re: [PATCH v5 05/14] hw/block/nvme: Add support for Namespace Types

2020-09-30 Thread Niklas Cassel
On Mon, Sep 28, 2020 at 11:35:19AM +0900, Dmitry Fomichev wrote:
> From: Niklas Cassel 
> 
> Namespace Types introduce a new command set, "I/O Command Sets",
> that allows the host to retrieve the command sets associated with
> a namespace. Introduce support for the command set and enable
> detection for the NVM Command Set.
> 
> The new workflows for identify commands rely heavily on zero-filled
> identify structs. E.g., certain CNS commands are defined to return
> a zero-filled identify struct when an inactive namespace NSID
> is supplied.
> 
> Add a helper function in order to avoid code duplication when
> reporting zero-filled identify structures.
> 
> Signed-off-by: Niklas Cassel 
> Signed-off-by: Dmitry Fomichev 
> ---
>  hw/block/nvme-ns.c |   3 +
>  hw/block/nvme.c| 210 +
>  2 files changed, 175 insertions(+), 38 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index bbd7879492..31b7f986c3 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c

(snip)

> @@ -1597,12 +1667,31 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl 
> *n, NvmeRequest *req)
>   * Namespace Identification Descriptor. Add a very basic Namespace UUID
>   * here.
>   */
> -ns_descrs->uuid.hdr.nidt = NVME_NIDT_UUID;
> -ns_descrs->uuid.hdr.nidl = NVME_NIDL_UUID;
> -stl_be_p(_descrs->uuid.v, nsid);
> +desc = list_ptr;
> +desc->nidt = NVME_NIDT_UUID;
> +desc->nidl = NVME_NIDL_UUID;
> +list_ptr += sizeof(*desc);
> +memcpy(list_ptr, ns->params.uuid.data, NVME_NIDL_UUID);
> +list_ptr += NVME_NIDL_UUID;
>  
> -return nvme_dma(n, list, NVME_IDENTIFY_DATA_SIZE,
> -DMA_DIRECTION_FROM_DEVICE, req);
> +desc = list_ptr;
> +desc->nidt = NVME_NIDT_CSI;
> +desc->nidl = NVME_NIDL_CSI;
> +list_ptr += sizeof(*desc);
> +*(uint8_t *)list_ptr = NVME_CSI_NVM;

I think that we should use ns->csi/ns->params.csi here rather than
NVME_CSI_NVM.
You do this change in a later patch, but I think it is more correct
to do it here already. (No reason not to, since ns->csi/ns->params.csi
should be set to NVME_CSI_NVM for NVM namespace already in this patch.)

> +
> +return nvme_dma(n, list, data_len, DMA_DIRECTION_FROM_DEVICE, req);
> +}


Re: [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces

2020-09-24 Thread Niklas Cassel
On Thu, Sep 24, 2020 at 08:55:24PM +0200, Klaus Jensen wrote:
> On Sep 24 18:17, Niklas Cassel wrote:
> > On Thu, Sep 24, 2020 at 02:12:03PM +0200, Klaus Jensen wrote:
> > > On Sep 24 03:20, Dmitry Fomichev wrote:
> > > > From: Niklas Cassel 
> > > > 
> > > > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > > > marked as inactive.
> > > > 
> > > 
> > > Hmm. I'm not convinced that this is correct. Can you reference the spec?
> > > 
> > 
> > CC.CSS can only be changed when the controller is disabled.
> 
> Right. I think I see you point. While the controller is disabled, the
> host obviously cannot even see what namespaces are available, so the
> controller is free to only expose (aka, attach) the namespaces that
> makes sense for the value of CC.CSS.
> 
> OK then, the patch is good :)

That was my thought, that the controller internally would
detach unsupported namespaces (even if the controller didn't
expose namespace management capabilities to the user).

This was how I assumed that things worked, but if we should
follow the spec strictly, we should do like you suggested
and keep them attached, and return the proper error code,
on non-admin commands.

Thank you for improving my understanding.
Considering that CC.CSS can only be changed when the controller
is disabled, I still kind of wished that the spec said that
unsupported namespaces would be automatically detached.


Kind regards,
Niklas


Re: [PATCH v4 06/14] hw/block/nvme: Add support for active/inactive namespaces

2020-09-24 Thread Niklas Cassel
On Thu, Sep 24, 2020 at 02:12:03PM +0200, Klaus Jensen wrote:
> On Sep 24 03:20, Dmitry Fomichev wrote:
> > From: Niklas Cassel 
> > 
> > E.g., if the user sets CC.CSS to Admin Only, NVM namespaces should be
> > marked as inactive.
> > 
> 
> Hmm. I'm not convinced that this is correct. Can you reference the spec?
> 

"If the user sets CC.CSS to Admin Only, NVM namespaces should be marked as 
inactive."

After reading the spec several times, I agree that this statement is false,
although I really wish it wasn't :)



My interpretation was based on:

From CC.CSS:
"If bit 44 is set to ‘1’ in the Command Sets Supported (CSS) field, then the
value 111b indicates that only the Admin Command Set is supported and that no
I/O Command Set or I/O Command Set Specific Admin commands are supported."

The NVM Command Set is a Command Set, so I assumed that since the Command
Set was not supported, trying to do something like CNS 00h (Identify Namespace),
would return an zero-filled struct. (Since the namespace belongs to a
Command Set that is not supported.)

To me it seems silly that a namespace can be "active" at the same time
as the Command Set that the namespace belongs to is not enabled,
but that seems like how the spec is written...


Additionally in TP4056 section 5.19 it says:
"If an attempt is made to attach a namespace to a controller that supports the
corresponding I/O Command Set and the corresponding I/O Command Set is not
enabled by the I/O Command Set profile feature, then the command shall be
aborted with a status of I/O Command Set Not Enabled."

But if you already have a e.g. a Key Value namespace attached, and boot up with
e.g. CC.CSS = NVM, or CC.CSS = Admin, you will have a namespace belonging to a
disabled command set attached (and "active" :p).
But if you have no namespaces attached, boot up with CC.CSS = NVM or
CC.CSS = Admin, you will not be allowed to attach your Key Value namespace.
(5.19 says nothing about CC.CSS, but let's assume it isn't supposed to be
allowed.)
Will you be allowed to attach a ZNS namespace? (Since CC.CSS != ALL Selected,
I would assume that is is not supposed to be allowed.)
Now, will you be allowed to attach a NVM namespace when CC.CSS = Admin?
(I assume that the intention is that you should.)


CC.CSS can only be changed when the controller is disabled.
I assume that you can attach NVM namespaces even when CC.CSS = Admin.
(Attaching namespaces != NVM is probably not allowed when
CC.CSS != ALL Selected.)
Attaching namespaces are persistent. Even if you restart with CC.CSS = Admin,
they will still be attached, and active.
(Although you might not be able to do anything with them... see the next
section.)
You can not change to a profile (using Set Profile) that lacks a command set
that you are currently using. This change is while the controller is enabled,
so I guess it is ok that this check is stricter than CC.CSS.

Things would have been way simpler if the controller just deattached
non-supported namespaces internally at power on...


> My /interpretation/ (because the spec is vague on this point) is that
> with TP 4056, if the host writes 0x0 to CC.CSS, you will (should) just
> see Invalid Command Opcode for namespaces not supporting the NVM command
> set since we are operating in a backward compatible way.

If a controller is booted with CC.CSS = NVM Command Set:

We have an attached Key Value namespace. (It will be set as active.)
I think that we can agree that any IOCS Key Value specific command
will fail.

We have an attached Zoned Namespace. (It will be set as active.)
Sure, any IOCS Zoned Namespace Command specific command will fail.
Your interpretation is that it will allow NVM Commands, since a Zoned Namespace
implements the opcodes of the NVM Command Set.

Each Command Set has a table with the opcodes that it supports.
In e.g. Key Value Command Set, opcode 01h means "Store command"
but in NVM Command Set, opcode 01h means "Write command".

These commands are totally different, and uses completely different dwords.
So I don't think that it is obvious that CC.CSS = NVM, should allow
"NVM reserved opcodes" for certain Command Sets, but not in others.

Like so many other things in NVMe, this will probably be vendor specific.
For devices only supporting NVM + ZNS, it will probably support NVM
Command Set commands even when CC.CSS = NVM, but it's probably not
something that can be guarantee to be true for all devices supporting
a Command Set that extends the NVM Command Set.


Kind regards,
Niklas

Re: [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions

2020-06-30 Thread Niklas Cassel
On Tue, Jun 30, 2020 at 06:57:16AM +0200, Klaus Jensen wrote:
> On Jun 18 06:34, Dmitry Fomichev wrote:
> > From: Niklas Cassel 
> > 
> > Define the structures and constants required to implement
> > Namespace Types support.
> > 
> > Signed-off-by: Niklas Cassel 
> > Signed-off-by: Dmitry Fomichev 
> > ---
> >  hw/block/nvme.h  |  3 ++
> >  include/block/nvme.h | 75 +---
> >  2 files changed, 73 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 4f0dac39ae..4fd155c409 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
> >  
> >  typedef struct NvmeNamespace {
> >  NvmeIdNsid_ns;
> > +uint32_tnsid;
> > +uint8_t csi;
> > +QemuUUIDuuid;
> >  } NvmeNamespace;
> >  
> >  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 6a58bac0c2..5a1e5e137c 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -50,6 +50,11 @@ enum NvmeCapMask {
> >  CAP_PMR_MASK   = 0x1,
> >  };
> >  
> > +enum NvmeCapCssBits {
> > +CAP_CSS_NVM= 0x01,
> > +CAP_CSS_CSI_SUPP   = 0x40,
> > +};
> > +
> >  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> >  #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)& CAP_CQR_MASK)
> >  #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)& CAP_AMS_MASK)
> > @@ -101,6 +106,12 @@ enum NvmeCcMask {
> >  CC_IOCQES_MASK  = 0xf,
> >  };
> >  
> > +enum NvmeCcCss {
> > +CSS_NVM_ONLY= 0,
> > +CSS_ALL_NSTYPES = 6,
> 
> Maybe we could call this CSS_CSI, since it just specifies that one or
> more command sets are supported, not that ALL namespace types are
> supported.

The enum name here is CcCss, so this represents CC.CSS,
which specifies which Command Sets to enable,
not which Command Sets that are supported.

(Supported Command Sets are defined by CAP.CSS and the
I/O Command Set data structure.)

So it indeed says to enable ALL command sets supported by the
controller. (Although for the CSI case, you need to check the
I/O Command Set data structure to know what is actually supported.)


However, I agree, the name CSS_ALL_NSTYPES is a bit misleading.
ALL_SUPPORTED_CSI would have been a more precise name.
However, simply naming it CSS_CSI, like you suggest, is more intuitive,
and is what we use in the Linux kernel patches, so let's use that :)


Kind regards,
Niklas

> 
> Otherwise,
> Reviewed-by: Klaus Jensen 
> 
> > +CSS_ADMIN_ONLY  = 7,
> > +};
> > +
> >  #define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK)
> >  #define NVME_CC_CSS(cc)((cc >> CC_CSS_SHIFT)& CC_CSS_MASK)
> >  #define NVME_CC_MPS(cc)((cc >> CC_MPS_SHIFT)& CC_MPS_MASK)
> > @@ -109,6 +120,21 @@ enum NvmeCcMask {
> >  #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
> >  #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
> >  
> > +#define NVME_SET_CC_EN(cc, val) \
> > +(cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
> > +#define NVME_SET_CC_CSS(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
> > +#define NVME_SET_CC_MPS(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
> > +#define NVME_SET_CC_AMS(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
> > +#define NVME_SET_CC_SHN(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
> > +#define NVME_SET_CC_IOSQES(cc, val) \
> > +(cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
> > +#define NVME_SET_CC_IOCQES(cc, val) \
> > +(cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
> > +
> >  enum NvmeCstsShift {
> >  CSTS_RDY_SHIFT  = 0,
> >  CSTS_CFS_SHIFT  = 1,
> > @@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
> >  uint64_trsvd2[2];
> >  uint64_tprp1;
> >  uint64_tprp2;
> > -uint32_tcns;
> > -uint32_trsvd11[5];
> > +uint8_t cns;
> > +uint8_t rsvd4;
> > +uint16_tctrlid;
> > +uint16_tnvmsetid;
> > +uint8_t 

Re: [PATCH 00/10] hw/block/nvme: namespace types and zoned namespaces

2020-06-30 Thread Niklas Cassel
On Tue, Jun 30, 2020 at 12:01:29PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Hi all,

Hello Klaus,

> 
> This series adds support for TP 4056 ("Namespace Types") and TP 4053
> ("Zoned Namespaces") and is an alternative implementation to the one
> submitted by Dmitry[1].
> 
> While I don't want this to end up as a discussion about the merits of
> each version, I want to point out a couple of differences from Dmitry's
> version. At a glance, my version
> 
>   * builds on my patch series that adds fairly complete NVMe v1.4
> mandatory support, as well as nice-to-have feature such as SGLs,
> multiple namespaces and mostly just overall clean up. This finally
> brings the nvme device into a fairly compliant state on which we can
> add new features. I've tried hard to get these compliance and
> clean-up patches merged for a long time (in parallel with developing
> the emulation of NST and ZNS) and I would be really sad to see them
> by-passed since they have already been through many iterations and
> already carries Acked- and Reviewed-by's for the bulk of the
> patches. I think the nvme device is already in a "frankenstate" wrt.
> the implemented nvme version and the features it currently supports,
> so I think this kind of cleanup is long overdue.
> 
>   * uses an attached blockdev and standard blk_aio for persistent zone
> info. This is the same method used in our patches for Write
> Uncorrectable and (separate and extended lba) metadata support, but
> I've left those optional features out for now to ease the review
> process.
> 
>   * relies on the universal dulbe support added in ("hw/block/nvme: add
> support for dulbe") and sparse images for handling reads in gaps
> (above write pointer and below ZSZE); that is - the size of the
> underlying blockdev is in terms of ZSZE, not ZCAP
> 
>   * the controller uses timers to autonomously finish zones (wrt. FRL)

AFAICT, Dmitry's patches does this as well.

> 
> I've been on paternity leave for a month, so I havn't been around to
> review Dmitry's patches, but I have started that process now. I would
> also be happy to work with Dmitry & Friends on merging our versions to
> get the best of both worlds if it makes sense.
> 
> This series and all preparatory patch sets (the ones I've been posting
> yesterday and today) are available on my GitHub[2]. Unfortunately
> Patchew got screwed up in the middle of me sending patches and it never
> picked up v2 of "hw/block/nvme: support multiple namespaces" because it
> was getting late and I made a mistake with the CC's. So my posted series
> don't apply according to Patchew, but they actually do if you follow the
> Based-on's (... or just grab [2]).
> 
> 
>   [1]: Message-Id: <20200617213415.22417-1-dmitry.fomic...@wdc.com>
>   [2]: https://github.com/birkelund/qemu/tree/for-master/nvme
> 
> 
> Based-on: <20200630043122.1307043-1-...@irrelevant.dk>
> ("[PATCH 0/3] hw/block/nvme: bump to v1.4")

Is this the only patch series that this series depends on?

In the beginning of the cover letter, you mentioned
"NVMe v1.4 mandatory support", "SGLs", "multiple namespaces",
and "and mostly just overall clean up".

> 
> Klaus Jensen (10):
>   hw/block/nvme: support I/O Command Sets
>   hw/block/nvme: add zns specific fields and types
>   hw/block/nvme: add basic read/write for zoned namespaces
>   hw/block/nvme: add the zone management receive command
>   hw/block/nvme: add the zone management send command
>   hw/block/nvme: add the zone append command
>   hw/block/nvme: track and enforce zone resources
>   hw/block/nvme: allow open to close transitions by controller
>   hw/block/nvme: allow zone excursions
>   hw/block/nvme: support reset/finish recommended limits
> 
>  block/nvme.c  |6 +-
>  hw/block/nvme-ns.c|  397 +-
>  hw/block/nvme-ns.h|  148 +++-
>  hw/block/nvme.c   | 1676 +++--
>  hw/block/nvme.h   |   76 +-
>  hw/block/trace-events |   43 +-
>  include/block/nvme.h  |  252 ++-
>  7 files changed, 2469 insertions(+), 129 deletions(-)
> 
> -- 
> 2.27.0
> 

I think that you have done a great job getting the NVMe
driver out of a frankenstate, and made it compliant with
a proper spec (NVMe 1.4).

I'm also a big fan of the refactoring so that the driver
handles more than one namespace, and the new bus model.

I know that you first sent your
"nvme: support NVMe v1.3d, SGLs and multiple namespaces"
patch series July, last year.

Looking at your outstanding patch series on patchwork:
https://patchwork.kernel.org/project/qemu-devel/list/?submitter=188679

(Feel free to correct me if I have misunderstood anything.)

I see that these are related to your patch series from July last year:
hw/block/nvme: bump to v1.3
hw/block/nvme: support scatter gather lists
hw/block/nvme: support multiple namespaces
hw/block/nvme: bump to v1.4


This patch series seems minor and could probably 

Re: [PATCH v2 05/18] hw/block/nvme: Introduce the Namespace Types definitions

2020-06-30 Thread Niklas Cassel
On Mon, Jun 29, 2020 at 07:12:47PM -0700, Alistair Francis wrote:
> On Wed, Jun 17, 2020 at 2:47 PM Dmitry Fomichev  
> wrote:
> >
> > From: Niklas Cassel 
> >
> > Define the structures and constants required to implement
> > Namespace Types support.
> >
> > Signed-off-by: Niklas Cassel 
> > Signed-off-by: Dmitry Fomichev 
> > ---
> >  hw/block/nvme.h  |  3 ++
> >  include/block/nvme.h | 75 +---
> >  2 files changed, 73 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 4f0dac39ae..4fd155c409 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -63,6 +63,9 @@ typedef struct NvmeCQueue {
> >
> >  typedef struct NvmeNamespace {
> >  NvmeIdNsid_ns;
> > +uint32_tnsid;
> > +uint8_t csi;
> > +QemuUUIDuuid;
> >  } NvmeNamespace;
> >
> >  static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 6a58bac0c2..5a1e5e137c 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -50,6 +50,11 @@ enum NvmeCapMask {
> >  CAP_PMR_MASK   = 0x1,
> >  };
> >
> > +enum NvmeCapCssBits {
> > +CAP_CSS_NVM= 0x01,
> > +CAP_CSS_CSI_SUPP   = 0x40,
> > +};
> > +
> >  #define NVME_CAP_MQES(cap)  (((cap) >> CAP_MQES_SHIFT)   & CAP_MQES_MASK)
> >  #define NVME_CAP_CQR(cap)   (((cap) >> CAP_CQR_SHIFT)& CAP_CQR_MASK)
> >  #define NVME_CAP_AMS(cap)   (((cap) >> CAP_AMS_SHIFT)& CAP_AMS_MASK)
> > @@ -101,6 +106,12 @@ enum NvmeCcMask {
> >  CC_IOCQES_MASK  = 0xf,
> >  };
> >
> > +enum NvmeCcCss {
> > +CSS_NVM_ONLY= 0,
> > +CSS_ALL_NSTYPES = 6,
> > +CSS_ADMIN_ONLY  = 7,
> > +};
> > +
> >  #define NVME_CC_EN(cc) ((cc >> CC_EN_SHIFT) & CC_EN_MASK)
> >  #define NVME_CC_CSS(cc)((cc >> CC_CSS_SHIFT)& CC_CSS_MASK)
> >  #define NVME_CC_MPS(cc)((cc >> CC_MPS_SHIFT)& CC_MPS_MASK)
> > @@ -109,6 +120,21 @@ enum NvmeCcMask {
> >  #define NVME_CC_IOSQES(cc) ((cc >> CC_IOSQES_SHIFT) & CC_IOSQES_MASK)
> >  #define NVME_CC_IOCQES(cc) ((cc >> CC_IOCQES_SHIFT) & CC_IOCQES_MASK)
> >
> > +#define NVME_SET_CC_EN(cc, val) \
> > +(cc |= (uint32_t)((val) & CC_EN_MASK) << CC_EN_SHIFT)
> > +#define NVME_SET_CC_CSS(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_CSS_MASK) << CC_CSS_SHIFT)
> > +#define NVME_SET_CC_MPS(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_MPS_MASK) << CC_MPS_SHIFT)
> > +#define NVME_SET_CC_AMS(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_AMS_MASK) << CC_AMS_SHIFT)
> > +#define NVME_SET_CC_SHN(cc, val)\
> > +(cc |= (uint32_t)((val) & CC_SHN_MASK) << CC_SHN_SHIFT)
> > +#define NVME_SET_CC_IOSQES(cc, val) \
> > +(cc |= (uint32_t)((val) & CC_IOSQES_MASK) << CC_IOSQES_SHIFT)
> > +#define NVME_SET_CC_IOCQES(cc, val) \
> > +(cc |= (uint32_t)((val) & CC_IOCQES_MASK) << CC_IOCQES_SHIFT)
> > +
> >  enum NvmeCstsShift {
> >  CSTS_RDY_SHIFT  = 0,
> >  CSTS_CFS_SHIFT  = 1,
> > @@ -482,10 +508,41 @@ typedef struct NvmeIdentify {
> >  uint64_trsvd2[2];
> >  uint64_tprp1;
> >  uint64_tprp2;
> > -uint32_tcns;
> > -uint32_trsvd11[5];
> > +uint8_t cns;
> > +uint8_t rsvd4;
> > +uint16_tctrlid;
> 
> Shouldn't this be CNTID?

>From the NVMe spec:
https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf

Figure 241:
Controller  Identifier  (CNTID)

So you are correct, this is the official abbreviation.

I guess that I tried wanted to keep it in sync with Linux:
https://github.com/torvalds/linux/blob/master/include/linux/nvme.h#L974

Which uses ctrlid.


Looking further at the NVMe spec:
In Figure 247 (Identify Controller Data Structure) they use other names
for fields:

Controller  ID  (CNTLID)
Controller Attributes (CTRATT)

I can understand if they want to have unique names for fields, but it
seems like they have trouble deciding how to abbreviate controller :)

Personally I think that ctrlid is more obvious that we are talking about
a controller and not a count. But I'm fine regardless.


Kind regards,
Niklas

> 
> Alistair
> 
> > +uint16_tnvmsetid;
> > +uint8_t rsvd3;
> >