Re: [PATCH] hw/nvme: Add support for setting the MQES for the NVMe emulation

2024-04-04 Thread Keith Busch
On Thu, Apr 04, 2024 at 01:04:18PM +0100, John Berg wrote:
> The MQES field in the CAP register describes the Maximum Queue Entries
> Supported for the IO queues of an NVMe controller. Adding a +1 to the
> value in this field results in the total queue size. A full queue is
> when a queue of size N contains N - 1 entries, and the minimum queue
> size is 2. Thus the lowest MQES value is 1.
> 
> This patch adds the new mqes property to the NVMe emulation which allows
> a user to specify the maximum queue size by setting this property. This
> is useful as it enables testing of NVMe controllers where the MQES is
> relatively small. The smallest NVMe queue size supported in NVMe is 2
> submission and completion entries, which means that the smallest legal
> mqes value is 1.
> 
> The following example shows how the mqes can be set for the NVMe
> emulation:
> 
> -drive id=nvme0,if=none,file=nvme.img,format=raw
> -device nvme,drive=nvme0,serial=foo,mqes=1
> 
> If the mqes property is not provided then the default mqes will still be
> 0x7ff (the queue size is 2048 entries).

Looks good. I had to double check where nvme_create_sq() was getting its
limit from when processing the host command, and sure enough it's
directly from the register field.
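
In other words, the check ends up being roughly this shape (a sketch only, not
the literal hw/nvme code; "entries" is the 1's-based queue size the host asked
for):

    /* Sketch: CAP.MQES is a 0's-based value in bits 15:0, so the largest
     * legal queue has MQES + 1 entries and the smallest legal queue has 2. */
    static bool queue_size_is_valid(uint64_t cap, uint32_t entries)
    {
        uint32_t mqes = cap & 0xffff;

        return entries >= 2 && entries <= mqes + 1;
    }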

Reviewed-by: Keith Busch 



Re: [PATCH] hw/nvme: fix endianness issue for shadow doorbells

2023-07-18 Thread Keith Busch
On Tue, Jul 18, 2023 at 12:35:12PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> In commit 2fda0726e514 ("hw/nvme: fix missing endian conversions for
> doorbell buffers"), we fixed shadow doorbells for big-endian guests
> running on little endian hosts. But I did not fix little-endian guests
> on big-endian hosts. Fix this.
> 
> Solves issue #1765.
> 
> Fixes: 3f7fe8de3d49 ("hw/nvme: Implement shadow doorbell buffer support")
> Cc: qemu-sta...@nongnu.org
> Reported-by: Thomas Huth 
> Signed-off-by: Klaus Jensen 

Looks good!

Reviewed-by: Keith Busch 
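
FWIW, the shape of the conversion is symmetric; a rough sketch (the helper
names here are made up, not the actual hw/nvme functions):

    /* Sketch: shadow doorbell and eventidx buffers are little-endian in guest
     * memory, so convert on the write path and the read path alike. */
    static void eventidx_write(PCIDevice *pci, dma_addr_t addr, uint32_t v)
    {
        uint32_t le = cpu_to_le32(v);

        pci_dma_write(pci, addr, &le, sizeof(le));
    }

    static uint32_t shadow_db_read(PCIDevice *pci, dma_addr_t addr)
    {
        uint32_t le;

        pci_dma_read(pci, addr, &le, sizeof(le));

        return le32_to_cpu(le);
    }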



Re: [PATCH v2 5/5] hw/nvme: flexible data placement emulation

2023-02-17 Thread Keith Busch
On Fri, Feb 17, 2023 at 01:07:43PM +0100, Jesper Devantier wrote:
> +static void nvme_do_write_fdp(NvmeCtrl *n, NvmeRequest *req, uint64_t slba,
> +  uint32_t nlb)
> +{
> +NvmeNamespace *ns = req->ns;
> +NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +uint64_t data_size = nvme_l2b(ns, nlb);
> +uint32_t dw12 = le32_to_cpu(req->cmd.cdw12);
> +uint8_t dtype = (dw12 >> 20) & 0xf;
> +uint16_t pid = le16_to_cpu(rw->dspec);
> +uint16_t ph, rg, ruhid;
> +NvmeReclaimUnit *ru;
> +
> +if (dtype != NVME_DIRECTIVE_DATA_PLACEMENT
> +|| !nvme_parse_pid(ns, pid, &ph, &rg)) {

Style nit, the "||" ought to go in the previous line.

> +ph = 0;
> +rg = 0;
> +}
> +
> +ruhid = ns->fdp.phs[ph];
> +ru = &ns->endgrp->fdp.ruhs[ruhid].rus[rg];
> +
> +nvme_fdp_stat_inc(&ns->endgrp->fdp.hbmw, data_size);
> +nvme_fdp_stat_inc(&ns->endgrp->fdp.mbmw, data_size);
> +
> +//trace_pci_nvme_fdp_ruh_write(ruh->rgid, ruh->ruhid, ruh->nlb_ruamw, nlb);
> +
> +while (nlb) {
> +if (nlb < ru->ruamw) {
> +ru->ruamw -= nlb;
> +break;
> +}
> +
> +nlb -= ru->ruamw;
> +//trace_pci_nvme_fdp_ruh_change(ruh->rgid, ruh->ruhid);

Please use the trace points if you find them useful, otherwise just delete
them instead of committing commented out code.

Beyond that, looks good! For the series:

Reviewed-by: Keith Busch 



Re: [PATCH 5/5] hw/nvme: flexible data placement emulation

2023-02-16 Thread Keith Busch
On Thu, Feb 16, 2023 at 05:48:06PM +0100, Jesper Devantier wrote:
> +static bool nvme_ns_init_fdp(NvmeNamespace *ns, Error **errp)
> +{
> +NvmeEnduranceGroup *endgrp = ns->endgrp;
> +NvmeRuHandle *ruh;
> +uint8_t lbafi = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> +unsigned int *ruhid, *ruhids;
> +char *r, *p, *token;
> +uint16_t *ph;
> +
> +if (!ns->params.fdp.ruhs) {
> +ns->fdp.nphs = 1;
> +ph = ns->fdp.phs = g_new(uint16_t, 1);
> +
> +ruh = nvme_find_ruh_by_attr(endgrp, NVME_RUHA_CTRL, ph);
> +if (!ruh) {
> +ruh = nvme_find_ruh_by_attr(endgrp, NVME_RUHA_UNUSED, ph);
> +if (!ruh) {
> +error_setg(errp, "no unused reclaim unit handles left");
> +return false;
> +}
> +
> +ruh->ruha = NVME_RUHA_CTRL;
> +ruh->lbafi = lbafi;
> +ruh->ruamw = endgrp->fdp.runs >> ns->lbaf.ds;
> +
> +for (uint16_t rg = 0; rg < endgrp->fdp.nrg; rg++) {
> +ruh->rus[rg].ruamw = ruh->ruamw;
> +}
> +} else if (ruh->lbafi != lbafi) {
> +error_setg(errp, "lba format index of controller assigned "
> +   "reclaim unit handle does not match namespace lba "
> +   "format index");
> +return false;
> +}
> +
> +return true;
> +}
> +
> +ruhid = ruhids = g_new0(unsigned int, endgrp->fdp.nruh);
> +r = p = strdup(ns->params.fdp.ruhs);
> +
> +/* parse the reclaim unit handle identifiers */
> +while ((token = qemu_strsep(, ";")) != NULL) {
> +if (++ns->fdp.nphs == endgrp->fdp.nruh) {

Since a namespace can't have more than 128 placement handles, and the endurance
group can have more, I think the 128 limit needs to be checked here.
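
E.g., roughly this (a sketch; the 128 is written out here rather than given a
named constant):

    /* Sketch: cap the per-namespace handle count at both the endurance
     * group's NRUH and the 128 placement handles a namespace may reference. */
    if (++ns->fdp.nphs == endgrp->fdp.nruh || ns->fdp.nphs > 128) {
        error_setg(errp, "too many placement handles");
        free(r);
        return false;
    }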

> +error_setg(errp, "too many placement handles");
> +free(r);
> +return false;
> +}
> +
> +if (qemu_strtoui(token, NULL, 0, ruhid++) < 0) {
> +error_setg(errp, "cannot parse reclaim unit handle identifier");
> +free(r);
> +return false;
> +}
> +}



Re: [PATCH 5/5] hw/nvme: flexible data placement emulation

2023-02-16 Thread Keith Busch
This mostly looks fine. I need to clarify some spec decisions, though.

By default, the FDP feature is disabled: "The default value of this Feature shall
be 0h." You can't change the value as long as namespaces exist within the
group, so FDP requires NS Mgmt be supported if you're going to enable it. The
spec, however, doesn't seem to explicitly say that, and NS Mgmt isn't currently
supported in this controller.

We can look past that for this implementation to allow static settings even if
that doesn't perfectly align with this feature (NS Mgmt is kind of difficult
here). In order to match what the spec says is possible though, we can't have a
namespace with more than 128 placement handles since that's the most you can
provide when you create the namespace.

Does that make sense, or am I totally misunderstanding the details?



Re: [PATCH 4/5] hw/nvme: basic directives support

2023-02-16 Thread Keith Busch
On Thu, Feb 16, 2023 at 06:35:27PM +0100, Klaus Jensen wrote:
> On Thu, Feb 16, 2023, at 18:23, Keith Busch wrote:
> > On Thu, Feb 16, 2023 at 05:48:05PM +0100, Jesper Devantier wrote:
> >> +enum NvmeDirective {
> >> +NVME_DIRECTIVE_SUPPORTED = 0x0,
> >> +NVME_DIRECTIVE_ENABLED   = 0x1,
> >> +};
> >
> > What's this?
> 
> That’s a left-over from my rebase. I’ll fix that one up.

Okay, other than that, this one looks good.



Re: [PATCH 4/5] hw/nvme: basic directives support

2023-02-16 Thread Keith Busch
On Thu, Feb 16, 2023 at 05:48:05PM +0100, Jesper Devantier wrote:
> +enum NvmeDirective {
> +NVME_DIRECTIVE_SUPPORTED = 0x0,
> +NVME_DIRECTIVE_ENABLED   = 0x1,
> +};

What's this?



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch
On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 12:44 PM Keith Busch  wrote:
> >
> > Further up, it says the "interrupt gateway" is responsible for
> > forwarding new interrupt requests while the level remains asserted, but
> > it doesn't look like anything is handling that, which essentially turns
> > this into an edge interrupt. Am I missing something, or is this really
> > not being handled?
> 
> Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMUs
> internal GPIO lines to trigger an interrupt. So with the current setup
> we only support edge triggered interrupts.

Thanks for confirming!

Klaus,
I think we can justify introducing a work-around in the emulated device
now. My previous proposal with pci_irq_pulse() is no good since it does
assert+deassert, but it needs to be the other way around, so please
don't consider that one.

Also, we ought to revisit the intms/intmc usage in the linux driver for
threaded interrupts.



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch
On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 9:07 AM Keith Busch  wrote:
> > ---
> > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> > index c2dfacf028..f8f7af08dc 100644
> > --- a/hw/intc/sifive_plic.c
> > +++ b/hw/intc/sifive_plic.c
> > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr 
> > addr, unsigned size)
> >  uint32_t max_irq = sifive_plic_claimed(plic, addrid);
> >
> >  if (max_irq) {
> > -sifive_plic_set_pending(plic, max_irq, false);
> >  sifive_plic_set_claimed(plic, max_irq, true);
> >  }
> >
> 
> This change isn't correct. The PLIC spec
> (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
> states:
> 
> """
> On receiving a claim message, the PLIC core will atomically determine
> the ID of the highest-priority pending interrupt for the target and
> then clear down the corresponding source’s IP bit
> """
> 
> which is what we are doing here. We are clearing the pending interrupt
> inside the PLIC

Thanks for the link. That's very helpful.

If you're familiar with this area, could you possibly clear up this part
of that spec?

"
On receiving an interrupt completion message, if the interrupt is
level-triggered and the interrupt is still asserted, a new interrupt
request will be forwarded to the PLIC core.
"

Further up, it says the "interrupt gateway" is responsible for
forwarding new interrupt requests while the level remains asserted, but
it doesn't look like anything is handling that, which essentially turns
this into an edge interrupt. Am I missing something, or is this really
not being handled?
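
To make the question concrete, I would have expected the completion path to do
something like the following (pure sketch; the level-tracking fields are
hypothetical, only the set_pending/set_claimed helpers are from sifive_plic.c):

    /* Sketch: on an interrupt completion, a level-triggered source that is
     * still asserted should be made pending again. */
    static void plic_complete_irq(SiFivePLICState *plic, uint32_t irq)
    {
        sifive_plic_set_claimed(plic, irq, false);

        if (plic->level_triggered[irq] && plic->source_level[irq]) {
            sifive_plic_set_pending(plic, irq, true);
        }
    }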



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch
On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote:
> On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote:
> > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck  wrote:
> > > Anyway - any idea what to do to help figuring out what is happening ?
> > > Add tracing support to pci interrupt handling, maybe ?
> > 
> > For intermittent bugs, I like recording the QEMU session under
> > rr (using its chaos mode to provoke the failure if necessary) to
> > get a recording that I can debug and re-debug at leisure. Usually
> > you want to turn on/add tracing to help with this, and if the
> > failure doesn't hit early in bootup then you might need to
> > do a QEMU snapshot just before point-of-failure so you can
> > run rr only on the short snapshot-to-failure segment.
> > 
> > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> > 
> > This gives you a debugging session from the QEMU side's perspective,
> > of course -- assuming you know what the hardware is supposed to do
> > you hopefully wind up with either "the guest software did X,Y,Z
> > and we incorrectly did A" or else "the guest software did X,Y,Z,
> > the spec says A is the right/a permitted thing but the guest got confused".
> > If it's the latter then you have to look at the guest as a separate
> > code analysis/debug problem.
> 
> Here's what I got, though I'm way out of my depth here.
> 
> It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the
> interrupt after its first handling, which I think is expected. After
> claiming, QEMU masks the pending interrupt, lowering the level, though
> the device that raised it never deasserted.

I'm not sure if this is correct, but this is what I'm coming up with and
appears to fix the problem on my setup. The hardware that sets the
pending interrupt is going to clear it, so I don't see why the interrupt
controller is automatically clearing it when the host claims it.

---
diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c2dfacf028..f8f7af08dc 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
 uint32_t max_irq = sifive_plic_claimed(plic, addrid);
 
 if (max_irq) {
-sifive_plic_set_pending(plic, max_irq, false);
 sifive_plic_set_claimed(plic, max_irq, true);
 }
 
--



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch
Klaus,

This isn't going to help your issue, but there are at least two legacy
irq bugs in the nvme qemu implementation.

1. The admin queue breaks if you start with legacy and later initialize
msix.

2. The legacy vector is shared among all queues, but it's being
deasserted when the first queue's doorbell makes it empty. It should
remain asserted if any cq is non-empty.

I'll send you some patches for those later. Still working on the real
problem.
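
The shape of the fix for 2 is roughly the following (a sketch, not the
eventual patch): only deassert once every CQ sharing the INTx line has been
drained.

    /* Sketch: before dropping the shared INTx line, check whether any other
     * completion queue still has entries the host has not consumed. */
    static bool nvme_intx_still_needed(NvmeCtrl *n, NvmeCQueue *cq)
    {
        for (int i = 0; i <= n->params.max_ioqpairs; i++) {
            NvmeCQueue *other = n->cq[i];

            if (other && other != cq && other->head != other->tail) {
                return true;
            }
        }

        return false;
    }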



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-18 Thread Keith Busch
On Wed, Jan 18, 2023 at 03:04:06PM +, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 19:21, Guenter Roeck  wrote:
> > Anyway - any idea what to do to help figuring out what is happening ?
> > Add tracing support to pci interrupt handling, maybe ?
> 
> For intermittent bugs, I like recording the QEMU session under
> rr (using its chaos mode to provoke the failure if necessary) to
> get a recording that I can debug and re-debug at leisure. Usually
> you want to turn on/add tracing to help with this, and if the
> failure doesn't hit early in bootup then you might need to
> do a QEMU snapshot just before point-of-failure so you can
> run rr only on the short snapshot-to-failure segment.
> 
> https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> 
> This gives you a debugging session from the QEMU side's perspective,
> of course -- assuming you know what the hardware is supposed to do
> you hopefully wind up with either "the guest software did X,Y,Z
> and we incorrectly did A" or else "the guest software did X,Y,Z,
> the spec says A is the right/a permitted thing but the guest got confused".
> If it's the latter then you have to look at the guest as a separate
> code analysis/debug problem.

Here's what I got, though I'm way out of my depth here.

It looks like Linux kernel's fasteoi for RISC-V's PLIC claims the
interrupt after its first handling, which I think is expected. After
claiming, QEMU masks the pending interrupt, lowering the level, though
the device that raised it never deasserted.



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Keith Busch
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled

I finally got a riscv setup recreating this, and I am not sure how
you're getting a "completion polled" here. I find that the polling from
here progresses the request state to IDLE, so the timeout handler moves
on to aborting because it's expecting COMPLETED. The driver should not
be doing that, so I'll fix that up while also investigating why the
interrupt isn't retriggering as expected.



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-16 Thread Keith Busch
On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.

That's assuming completions are deferred to a bottom half. We don't do
that by default in Linux nvme, though you can ask the driver to do that
if you want.
 
> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
>
> No changes are required in the QEMU hw/nvme interrupt logic.

Yeah, I can see why: it forces the irq line to deassert then assert,
just like we had forced to happen within the device side patches. Still,
none of that is supposed to be necessary, but this idea of using these
registers is probably fine.

>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +   struct nvme_queue *nvmeq = data;
> +   struct nvme_dev *dev = nvmeq->dev;
> +   u32 mask = 1 << nvmeq->cq_vector;
> +   irqreturn_t ret = IRQ_NONE;
> +   DEFINE_IO_COMP_BATCH(iob);
> +
> +   writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +   if (nvme_poll_cq(nvmeq, &iob)) {
> +   if (!rq_list_empty(iob.req_list))
> +   nvme_pci_complete_batch(&iob);
> +   ret = IRQ_HANDLED;
> +   }
> +
> +   writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +   return ret;
> +}

If threaded interrupts are used, you'll want to do the masking in
nvme_irq_check(), then clear it in the threaded handler instead of doing
both in the same callback.
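
Roughly like this (untested sketch, reusing the helpers from the hunk above):

    /* Sketch: mask in the hard-irq handler, unmask only after the threaded
     * handler has drained the completion queue. */
    static irqreturn_t nvme_irq_check(int irq, void *data)
    {
        struct nvme_queue *nvmeq = data;

        if (nvme_cqe_pending(nvmeq)) {
            writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMS);
            return IRQ_WAKE_THREAD;
        }

        return IRQ_NONE;
    }

    static irqreturn_t nvme_irq_thread(int irq, void *data)
    {
        struct nvme_queue *nvmeq = data;
        irqreturn_t ret = IRQ_NONE;
        DEFINE_IO_COMP_BATCH(iob);

        if (nvme_poll_cq(nvmeq, &iob)) {
            if (!rq_list_empty(iob.req_list))
                nvme_pci_complete_batch(&iob);
            ret = IRQ_HANDLED;
        }

        /* unmask only once the CQ has been drained */
        writel(1 << nvmeq->cq_vector, nvmeq->dev->bar + NVME_REG_INTMC);

        return ret;
    }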

> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
> struct nvme_queue *nvmeq = data;
> DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
> struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> int nr = nvmeq->dev->ctrl.instance;
> +   irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> 
> if (use_threaded_interrupts) {
> return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -   nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +   handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> } else {
> -   return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +   return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> }
>  }
> 
> 





Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-13 Thread Keith Busch
On Fri, Jan 13, 2023 at 12:32:29PM +, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 08:55, Klaus Jensen  wrote:
> >
> > +CC qemu pci maintainers
> >
> > Michael, Marcel,
> >
> > Do you have any comments on this thread? As you can see one solution is
> > to simply deassert prior to asserting, the other is to reintroduce a
> > pci_irq_pulse(). Both seem to solve the issue.
> 
> Both seem to be missing any analysis of "this is what is
> happening, this is where we differ from hardware, this
> is why this is the correct fix". We shouldn't put in
> random "this seems to happen to cause the guest to boot"
> fixes, please.

It looks like these are being treated as edge instead of level
interrupts so the work-around is to create more edges. I would
definitely prefer a real fix, whether that's in the kernel or CPU
emulation or somewhere else, but I'm not sure I'll have time to see it
to completion.

FWIW, x86 seems to handle legacy IRQs with NVMe as expected. It's
actually easy enough for the DEASSERT to take so long that kernel
reports the irq as "spurious" because it's spinning too often on the
level.



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
> 
>   diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>   index 03760ddeae8c..0fc46dcb9ec4 100644
>   --- i/hw/nvme/ctrl.c
>   +++ w/hw/nvme/ctrl.c
>   @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>return;
>}
>if (~intms & n->irq_status) {
>   +pci_irq_deassert(&n->parent_obj);
>pci_irq_assert(&n->parent_obj);
>} else {
>pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.
> 
> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
> machine/board(s) are you testing?

Could you try the below?

---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2c85de4700..521c3c80c1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 }
 }
 
+static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)
+{
+if (!cq->irq_enabled) {
+return;
+}
+
+if (msix_enabled(&(n->parent_obj))) {
+msix_notify(&(n->parent_obj), cq->vector);
+return;
+}
+
+pci_irq_pulse(&n->parent_obj);
+}
+
 static void nvme_req_clear(NvmeRequest *req)
 {
 req->ns = NULL;
@@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 }
 
 nvme_irq_deassert(n, cq);
+} else {
+/*
+ * Retrigger the irq just to make sure the host has no excuse for
+ * not knowing there's more work to complete on this CQ.
+ */
+nvme_irq_pulse(n, cq);
 }
 } else {
 /* Submission queue doorbell write */
--



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > > 
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> > 
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
> > 
> 
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
> 
>   diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
>   index 03760ddeae8c..0fc46dcb9ec4 100644
>   --- i/hw/nvme/ctrl.c
>   +++ w/hw/nvme/ctrl.c
>   @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>return;
>}
>if (~intms & n->irq_status) {
>   +pci_irq_deassert(&n->parent_obj);
>pci_irq_assert(&n->parent_obj);
>} else {
>pci_irq_deassert(&n->parent_obj);
> 
> 
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
> 
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.

Yep, that looks good. It's sounding like something with the CPU irq
handling is treating the level interrupt as edge triggered.



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-12 Thread Keith Busch
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> 
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts has also been a source of several issues in the
> past).

Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.



Re: [PATCH v4 1/2] hw/nvme: Support for Namespaces Management from guest OS - create-ns

2022-12-28 Thread Keith Busch
On Wed, Dec 28, 2022 at 01:41:40PM -0600, Jonathan Derrick wrote:
> +static uint16_t nvme_ns_mgmt(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeCtrl *n_p = NULL; /* primary controller */
> +NvmeIdCtrl *id = &n->id_ctrl;
> +NvmeNamespace *ns;
> +NvmeIdNsMgmt id_ns = {};
> +uint8_t flags = req->cmd.flags;
> +uint32_t nsid = le32_to_cpu(req->cmd.nsid);
> +uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
> +uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
> +uint8_t sel = dw10 & 0xf;
> +uint8_t csi = (dw11 >> 24) & 0xf;
> +uint16_t i;
> +uint16_t ret;
> +Error *local_err = NULL;
> +
> +trace_pci_nvme_ns_mgmt(nvme_cid(req), nsid, sel, csi, NVME_CMD_FLAGS_PSDT(flags));
> +
> +if (!(le16_to_cpu(id->oacs) & NVME_OACS_NS_MGMT)) {
> +return NVME_NS_ATTACH_MGMT_NOTSPRD | NVME_DNR;
> +}
> +
> +if (n->cntlid && !n->subsys) {
> +error_setg(_err, "Secondary controller without subsystem");
> +return NVME_NS_ATTACH_MGMT_NOTSPRD | NVME_DNR;

Leaking local_err. Any time you call error_setg(), the error needs to be
reported or freed at some point.
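
For reference, the usual shape (a sketch, mirroring the quoted lines):

    /* Sketch: an error created with error_setg() must be consumed exactly
     * once, either reported (error_report_err() also frees it) or freed
     * with error_free(). */
    if (n->cntlid && !n->subsys) {
        Error *local_err = NULL;

        error_setg(&local_err, "Secondary controller without subsystem");
        error_report_err(local_err);
        return NVME_NS_ATTACH_MGMT_NOTSPRD | NVME_DNR;
    }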

> +}
> +
> +n_p = n->subsys->ctrls[0];
> +
> +switch (sel) {
> +case NVME_NS_MANAGEMENT_CREATE:
> +switch (csi) {
> +case NVME_CSI_NVM:

The following case is sufficiently large that the implementation
should be its own function.

> +if (nsid) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +ret = nvme_h2c(n, (uint8_t *)&id_ns, sizeof(id_ns), req);
> +if (ret) {
> +return ret;
> +}
> +
> +uint64_t nsze = le64_to_cpu(id_ns.nsze);
> +uint64_t ncap = le64_to_cpu(id_ns.ncap);

Please don't mix declarations with code; declare these local variables
at the top of the scope.

> +
> +if (ncap > nsze) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +} else if (ncap != nsze) {
> +return NVME_THIN_PROVISION_NOTSPRD | NVME_DNR;
> +}
> +
> +nvme_validate_flbas(id_ns.flbas, &local_err);
> +if (local_err) {
> +error_report_err(local_err);
> +return NVME_INVALID_FORMAT | NVME_DNR;
> +}
> +
> +for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> +if (nvme_ns(n_p, (uint32_t)i) || nvme_subsys_ns(n_p->subsys, (uint32_t)i)) {
> +continue;
> +}
> +break;
> +}
> +
> +
> +if (i > le32_to_cpu(n_p->id_ctrl.nn) || i > NVME_MAX_NAMESPACES) {
> +   return NVME_NS_IDNTIFIER_UNAVAIL | NVME_DNR;
> +}
> +nsid = i;
> +
> +/* create ns here */
> +ns = nvme_ns_mgmt_create(n, nsid, &id_ns, &local_err);
> +if (!ns || local_err) {
> +if (local_err) {
> +error_report_err(local_err);
> +}
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +if (nvme_cfg_update(n, ns->size, NVME_NS_ALLOC_CHK)) {
> +/* place for delete-ns */
> +error_setg(_err, "Insufficient capacity, an orphaned 
> ns[%"PRIu32"] will be left behind", nsid);
> +return NVME_NS_INSUFFICIENT_CAPAC | NVME_DNR;

Leaked local_err.

> +}
> +(void)nvme_cfg_update(n, ns->size, NVME_NS_ALLOC);
> +if (nvme_cfg_save(n)) {
> +(void)nvme_cfg_update(n, ns->size, NVME_NS_DEALLOC);
> +/* place for delete-ns */
> +error_setg(_err, "Cannot save conf file, an orphaned 
> ns[%"PRIu32"] will be left behind", nsid);
> +return NVME_INVALID_FIELD | NVME_DNR;

Another leaked local_err.

> +}
> +req->cqe.result = cpu_to_le32(nsid);
> +break;
> +case NVME_CSI_ZONED:
> +/* fall through for now */
> +default:
> +return NVME_INVALID_FIELD | NVME_DNR;
> + }
> +break;
> +default:
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +return NVME_SUCCESS;
> +}



Re: [PATCH v4 2/4] hw/nvme: rename shadow doorbell related trace events

2022-12-14 Thread Keith Busch
On Mon, Dec 12, 2022 at 12:44:07PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Rename the trace events related to writing the event index and reading
> the doorbell value to make it more clear that the event is associated
> with an actual update (write or read respectively).
> 
> Reviewed-by: Philippe Mathieu-Daudé 
> Signed-off-by: Klaus Jensen 

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH v4 3/4] hw/nvme: fix missing endian conversions for doorbell buffers

2022-12-14 Thread Keith Busch
On Mon, Dec 12, 2022 at 12:44:08PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The eventidx and doorbell value are not handling endianness correctly.
> Fix this.
> 
> Fixes: 3f7fe8de3d49 ("hw/nvme: Implement shadow doorbell buffer support")
> Cc: qemu-sta...@nongnu.org
> Reported-by: Guenter Roeck 
> Signed-off-by: Klaus Jensen 

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH v4 4/4] hw/nvme: fix missing cq eventidx update

2022-12-14 Thread Keith Busch
On Mon, Dec 12, 2022 at 12:44:09PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Prior to reading the shadow doorbell cq head, we have to update the
> eventidx. Otherwise, we risk that the driver will skip an mmio doorbell
> write. This happens on riscv64, as reported by Guenter.
> 
> Adding the missing update to the cq eventidx fixes the issue.
> 
> Fixes: 3f7fe8de3d49 ("hw/nvme: Implement shadow doorbell buffer support")
> Cc: qemu-sta...@nongnu.org
> Cc: qemu-ri...@nongnu.org
> Reported-by: Guenter Roeck 
> Signed-off-by: Klaus Jensen 


Looks good.

Reviewed-by: Keith Busch 



Re: [PULL for-7.2 0/5] hw/nvme fixes

2022-12-04 Thread Keith Busch
On Sun, Dec 04, 2022 at 11:06:13AM -0500, Stefan Hajnoczi wrote:
> On Thu, 1 Dec 2022 at 11:50, Klaus Jensen  wrote:
> >
> > From: Klaus Jensen 
> >
> > Hi,
> >
> > The following changes since commit c4ffd91aba1c3d878e99a3e7ba8aad4826728ece:
> >
> >   Update VERSION for v7.2.0-rc3 (2022-11-29 18:15:26 -0500)
> >
> > are available in the Git repository at:
> >
> >   git://git.infradead.org/qemu-nvme.git tags/nvme-next-pull-request
> 
> Hi Klaus,
> Please send pull requests with an https:// URI in the future.

Is this a new requirement? Our public git server doesn't support https.



Re: [PATCH for-7.2 1/5] hw/nvme: fix aio cancel in format

2022-11-22 Thread Keith Busch
On Tue, Nov 22, 2022 at 09:13:44AM +0100, Klaus Jensen wrote:
> There are several bugs in the async cancel code for the Format command.
> 
> Firstly, cancelling a format operation neglects to set iocb->ret as well
> as clearing the iocb->aiocb after cancelling the underlying aiocb which
> causes the aio callback to ignore the cancellation. Trivial fix.
> 
> Secondly, and worse, because the request is queued up for posting to the
> CQ in a bottom half, if the cancellation is due to the submission queue
> being deleted (which calls blk_aio_cancel), the req structure is
> deallocated in nvme_del_sq prior to the bottom half being scheduled.
> 
> Fix this by simply removing the bottom half, there is no reason to defer
> it anyway.

I thought for sure I'd find a reason deferred execution was needed, but
hey, it looks perfectly fine with this change!
 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index ac3885ce5079..26b53469328f 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -5756,14 +5756,15 @@ typedef struct NvmeFormatAIOCB {
>  uint8_t pil;
>  } NvmeFormatAIOCB;

I think you can remove the QEMUBH member from the above struct with this
patch.

Otherwise looks good.

Reviewed-by: Keith Busch 



Re: [PATCH v3 1/2] hw/nvme: Support for Namespaces Management from guest OS - create-ns

2022-10-27 Thread Keith Busch
On Thu, Oct 27, 2022 at 01:00:45PM -0500, Jonathan Derrick wrote:
> +Parameters:
> +
> +``auto-ns-path=``
> +  If specified indicates a support for dynamic management of nvme namespaces
> +  by means of nvme create-ns command. This path points
> +  to the storage area for backend images must exist. Additionally it requires
> +  that parameter `ns-subsys` must be specified whereas parameter `drive`
> +  must not. The legacy namespace backend is disabled, instead, a pair of
> +  files 'nvme__ns_.cfg' and 'nvme__ns_.img'
> +  will refer to respective namespaces. The create-ns, attach-ns
> +  and detach-ns commands, issued at the guest side, will make changes to
> +  those files accordingly.
> +  For each namespace exists an image file in raw format and a config file
> +  containing namespace parameters and state of the attachment allowing QEMU
> +  to configure namespaces accordingly during start up. If for instance an
> +  image file has a size of 0 bytes this will be interpreted as non existent
> +  namespace. Issuing create-ns command will change the status in the config
> +  files and will re-size the image file accordingly so the image file
> +  will be associated with the respective namespace. The main config file
> +  nvme__ctrl.cfg keeps the track of allocated capacity to the
> +  namespaces within the nvme controller.
> +  As it is the case of a typical hard drive, backend images together with
> +  config files need to be created. For this reason the qemu-img tool has
> +  been extended by adding createns command.
> +
> +   qemu-img createns {-S  -C }
> + [-N ] {}
> +
> +  Parameters:
> +  -S and -C and  are mandatory, `-S` must match `serial` parameter
> +  and  must match `auto-ns-path` parameter of "-device nvme,..."
> +  specification.
> +  -N is optional, if specified it will set a limit to the number of potential
> +  namespaces and will reduce the number of backend images and config files
> +  accordingly. As a default, a set of images of 0 bytes size and default
> +  config files for 256 namespaces will be created, a total of 513 files.
> +
> +Please note that ``nvme-ns`` device is not required to support of dynamic
> +namespaces management feature. It is not prohibited to assign a such device 
> to
> +``nvme`` device specified to support dynamic namespace management if one has
> +an use case to do so, however, it will only coexist and be out of the scope 
> of
> +Namespaces Management. NsIds will be consistently managed, creation 
> (create-ns)
> +of a namespace will not allocate the NsId already being taken. If ``nvme-ns``
> +device conflicts with previously created one by create-ns (the same NsId),
> +it will break QEMU's start up.

By requiring the controller's serial number up-front, does this mean we
can't share dynamic namespaces with other controllers in the subsystem?

> +static inline char *create_fmt_name(const char *fmt, char *ns_directory, 
> char *serial, uint32_t nsid, Error **errp)
> +{
> +char *file_name = NULL;
> +Error *local_err = NULL;
> +
> +storage_path_check(ns_directory, serial, errp);
> +if (local_err) {

How is 'local_err' ever *not* NULL here? Are you meaning to check
"*errp" instead?

> +error_propagate(errp, local_err);
> +} else {
> +file_name = g_strdup_printf(fmt, ns_directory, serial, nsid);
> +}
> +
> +return file_name;
> +}
> +
> +static inline char *create_cfg_name(char *ns_directory, char *serial, 
> uint32_t nsid, Error **errp)
> +{
> +return create_fmt_name(NS_FILE_FMT NS_CFG_EXT, ns_directory, serial, 
> nsid, errp);
> +}
> +
> +
> +static inline char *create_image_name(char *ns_directory, char *serial, 
> uint32_t nsid, Error **errp)
> +{
> +return create_fmt_name(NS_FILE_FMT NS_IMG_EXT, ns_directory, serial, 
> nsid, errp);
> +}
> +
> +static inline int nsid_cfg_save(char *ns_directory, char *serial, QDict 
> *ns_cfg, uint32_t nsid)
> +{
> +GString *json = NULL;
> +char *filename;
> +FILE *fp;
> +int ret = 0;
> +Error *local_err = NULL;
> +
> +json = qobject_to_json_pretty(QOBJECT(ns_cfg), false);
> +
> +if (strlen(json->str) + 2 /* '\n'+'\0' */ > NS_CFG_MAXSIZE) {
> +error_setg(_err, "ns-cfg allowed max size %d exceeded", 
> NS_CFG_MAXSIZE);

I find this whole "local_err" control flow unpleasant to follow. The
local_error gets set in this above condition only to be overwritten in
the very next step without ever being used? Surely that can't be right.

I'm just picking on this one example here, but the pattern appears to
repeat. I think this would be easier to read if the error conditions
took 'goto' paths to unwind so that the good path doesn't require such
deep 'if/else' nesting.
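
I.e., something with this shape (a sketch; error_report() is used directly
here just to keep it short):

    /* Sketch: error paths jump to a single unwind label; the good path
     * stays flat and each resource is released exactly once. */
    static int nsid_cfg_save(char *ns_directory, char *serial, QDict *ns_cfg,
                             uint32_t nsid)
    {
        GString *json = qobject_to_json_pretty(QOBJECT(ns_cfg), false);
        char *filename = NULL;
        FILE *fp = NULL;
        int ret = -1;

        if (strlen(json->str) + 2 > NS_CFG_MAXSIZE) {
            error_report("ns-cfg allowed max size %d exceeded", NS_CFG_MAXSIZE);
            goto out;
        }

        filename = create_cfg_name(ns_directory, serial, nsid, NULL);
        if (!filename) {
            goto out;
        }

        fp = fopen(filename, "w");
        if (!fp) {
            error_report("open %s: %s", filename, strerror(errno));
            goto out;
        }

        /* ... write json->str, check the result ... */
        ret = 0;
    out:
        if (fp) {
            fclose(fp);
        }
        g_free(filename);
        g_string_free(json, true);
        return ret;
    }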

> +}
> +
> +filename = create_cfg_name(ns_directory, serial, nsid, &local_err);
> +if (!local_err) {
> +fp = fopen(filename, "w");
> +if (fp == NULL) {
> +error_setg(_err, "open %s: %s", filename,
> + 

Re: [PATCH] hw/nvme: reenable cqe batching

2022-10-20 Thread Keith Busch
On Thu, Oct 20, 2022 at 01:35:38PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Commit 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell
> updates") had the unintended effect of disabling batching of CQEs.
> 
> This patch changes the sq/cq timers to bottom halfs and instead of
> calling nvme_post_cqes() immediately (causing an interrupt per cqe), we
> defer the call.

Nice change, looks good! Timers never did seem to be the best fit for
this.

Reviewed-by: Keith Busch 



[PATCHv3 1/2] block: move bdrv_qiov_is_aligned to file-posix

2022-09-29 Thread Keith Busch
From: Keith Busch 

There is only one user of bdrv_qiov_is_aligned(), so move the alignment
function there and make it static.

Signed-off-by: Keith Busch 
---
 block/file-posix.c   | 21 +
 block/io.c   | 21 -
 include/block/block-io.h |  1 -
 3 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..e3f3de2780 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2061,6 +2061,27 @@ static int coroutine_fn raw_thread_pool_submit(BlockDriverState *bs,
 return thread_pool_submit_co(pool, func, arg);
 }
 
+/*
+ * Check if all memory in this vector is sector aligned.
+ */
+static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
+{
+int i;
+size_t alignment = bdrv_min_mem_align(bs);
+IO_CODE();
+
+for (i = 0; i < qiov->niov; i++) {
+if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
+return false;
+}
+if (qiov->iov[i].iov_len % alignment) {
+return false;
+}
+}
+
+return true;
+}
+
 static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
                                    uint64_t bytes, QEMUIOVector *qiov, int type)
 {
diff --git a/block/io.c b/block/io.c
index 0a8cbefe86..96edc7f7cb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3236,27 +3236,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
 return mem;
 }
 
-/*
- * Check if all memory in this vector is sector aligned.
- */
-bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
-{
-int i;
-size_t alignment = bdrv_min_mem_align(bs);
-IO_CODE();
-
-for (i = 0; i < qiov->niov; i++) {
-if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
-return false;
-}
-if (qiov->iov[i].iov_len % alignment) {
-return false;
-}
-}
-
-return true;
-}
-
 void bdrv_io_plug(BlockDriverState *bs)
 {
 BdrvChild *child;
diff --git a/include/block/block-io.h b/include/block/block-io.h
index fd25ffa9be..492f95fc05 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -150,7 +150,6 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size);
 void *qemu_blockalign0(BlockDriverState *bs, size_t size);
 void *qemu_try_blockalign(BlockDriverState *bs, size_t size);
 void *qemu_try_blockalign0(BlockDriverState *bs, size_t size);
-bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
 
 void bdrv_enable_copy_on_read(BlockDriverState *bs);
 void bdrv_disable_copy_on_read(BlockDriverState *bs);
-- 
2.30.2




[PATCHv3 0/2] qemu direct io alignment fix

2022-09-29 Thread Keith Busch
From: Keith Busch 

Changes from v2:

  Split the patch so that the function move is separate from the
  functional change. This makes it immediately obvious what criteria is
  changing. (Kevin Wolf)

  Added received Tested-by tag in the changelog. 

Keith Busch (2):
  block: move bdrv_qiov_is_aligned to file-posix
  block: use the request length for iov alignment

 block/file-posix.c   | 22 ++
 block/io.c   | 21 -
 include/block/block-io.h |  1 -
 3 files changed, 22 insertions(+), 22 deletions(-)

-- 
2.30.2




[PATCHv3 2/2] block: use the request length for iov alignment

2022-09-29 Thread Keith Busch
From: Keith Busch 

An iov length needs to be aligned to the logical block size, which may
be larger than the memory alignment.

Tested-by: Jens Axboe 
Signed-off-by: Keith Busch 
---
 block/file-posix.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index e3f3de2780..af994aba2b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2068,13 +2068,14 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
 {
 int i;
 size_t alignment = bdrv_min_mem_align(bs);
+size_t len = bs->bl.request_alignment;
 IO_CODE();
 
 for (i = 0; i < qiov->niov; i++) {
 if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
 return false;
 }
-if (qiov->iov[i].iov_len % alignment) {
+if (qiov->iov[i].iov_len % len) {
 return false;
 }
 }
-- 
2.30.2




Re: [PATCHv2] block: use the request length for iov alignment

2022-09-29 Thread Keith Busch
On Thu, Sep 29, 2022 at 07:59:50PM +0200, Kevin Wolf wrote:
> Am 29.09.2022 um 18:09 hat Keith Busch geschrieben:
> > On Fri, Sep 23, 2022 at 08:34:51AM -0700, Keith Busch wrote:
> > > 
> > > An iov length needs to be aligned to the logical block size, which may
> > > be larger than the memory alignment. And since this is only used with
> > > file-posix backing storage, move the alignment function to there, where
> > > the value of the request_alignment is known to be the file's logical
> > > block size.
> > 
> > Any objections to this version? This is fixing real bug reports that
> > may become more frequent without this patch.
> 
> I think it is okay. Splitting it in two patches would have been nicer
> (one for moving code, one for making the change), but it's small enough
> that I can ignore that. I'll probably merge it tomorrow.

I agree that splitting makes the functional change stand out, otherwise a
casual look may mistake the patch for a simple function move. I'll send you a
new version.



Re: [PATCHv2] block: use the request length for iov alignment

2022-09-29 Thread Keith Busch
On Fri, Sep 23, 2022 at 08:34:51AM -0700, Keith Busch wrote:
> 
> An iov length needs to be aligned to the logical block size, which may
> be larger than the memory alignment. And since this is only used with
> file-posix backing storage, move the alignment function to there, where
> the value of the request_alignment is known to be the file's logical
> block size.

Any objections to this version? This is fixing real bug reports that may become
more frequent without this patch.



[PATCHv2] block: use the request length for iov alignment

2022-09-23 Thread Keith Busch
From: Keith Busch 

An iov length needs to be aligned to the logical block size, which may
be larger than the memory alignment. And since this is only used with
file-posix backing storage, move the alignment function to there, where
the value of the request_alignment is known to be the file's logical
block size.

Signed-off-by: Keith Busch 
---
v1->v2:

  Relocate the function to the only caller so that irrelevant
  constraints don't need to be considered.

 block/file-posix.c   | 22 ++
 block/io.c   | 21 -
 include/block/block-io.h |  1 -
 3 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 48cd096624..af994aba2b 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -2061,6 +2061,28 @@ static int coroutine_fn raw_thread_pool_submit(BlockDriverState *bs,
 return thread_pool_submit_co(pool, func, arg);
 }
 
+/*
+ * Check if all memory in this vector is sector aligned.
+ */
+static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
+{
+int i;
+size_t alignment = bdrv_min_mem_align(bs);
+size_t len = bs->bl.request_alignment;
+IO_CODE();
+
+for (i = 0; i < qiov->niov; i++) {
+if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
+return false;
+}
+if (qiov->iov[i].iov_len % len) {
+return false;
+}
+}
+
+return true;
+}
+
 static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset,
                                    uint64_t bytes, QEMUIOVector *qiov, int type)
 {
diff --git a/block/io.c b/block/io.c
index 0a8cbefe86..96edc7f7cb 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3236,27 +3236,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size)
 return mem;
 }
 
-/*
- * Check if all memory in this vector is sector aligned.
- */
-bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
-{
-int i;
-size_t alignment = bdrv_min_mem_align(bs);
-IO_CODE();
-
-for (i = 0; i < qiov->niov; i++) {
-if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
-return false;
-}
-if (qiov->iov[i].iov_len % alignment) {
-return false;
-}
-}
-
-return true;
-}
-
 void bdrv_io_plug(BlockDriverState *bs)
 {
 BdrvChild *child;
diff --git a/include/block/block-io.h b/include/block/block-io.h
index fd25ffa9be..492f95fc05 100644
--- a/include/block/block-io.h
+++ b/include/block/block-io.h
@@ -150,7 +150,6 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size);
 void *qemu_blockalign0(BlockDriverState *bs, size_t size);
 void *qemu_try_blockalign(BlockDriverState *bs, size_t size);
 void *qemu_try_blockalign0(BlockDriverState *bs, size_t size);
-bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov);
 
 void bdrv_enable_copy_on_read(BlockDriverState *bs);
 void bdrv_disable_copy_on_read(BlockDriverState *bs);
-- 
2.30.2




Re: [PATCH] block: use the request length for iov alignment

2022-09-20 Thread Keith Busch
On Wed, Sep 14, 2022 at 11:36:14AM +0100, Kevin Wolf wrote:
> Am 13.09.2022 um 15:12 hat Keith Busch geschrieben:
> > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote:
> > > From: Keith Busch 
> > > 
> > > An iov length needs to be aligned to the logical block size, which may
> > > be larger than the memory alignment.
> > 
> > [cc'ing some other interested folks]
> > 
> > Any thoughts on this patch? It is fixing an observed IO error  when running
> > virtio-blk with the default 512b logical block size backed by a drive 
> > formatted
> > with 4k logical block.
> 
> I need to take a real look after KVM Forum, but my first thought was
> that we might be overloading request_alignment with multiple meanings
> now (file offset alignment and memory address alignment), and the values
> just happen to be the same for files on Linux.
> 
> Did you consider a separate iov_alignment or similar and intentionally
> decided against it, or is it something you just didn't think about?

I've looked again at the request_alignment usage, and I think that value is
exactly what we want. This alignment check is only used with O_DIRECT backing
file descriptors, and the request_alignment in that case looks like it will
always be the logical block size.

Linux direct-io can accept arbitrary memory address offsets based on the
backing file's internal block limits, but the individual vector lengths do need
to be LBA granularity.

Just in case, though, I could amend this so that the alignment is the larger of
request_alignment and the existing criteria, though I don't see how
request_alignment would ever be the smaller.

size_t len = max(bs->bl.request_alignment, alignment);

> > > @@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, 
> > > QEMUIOVector *qiov)
> > >  {
> > >  int i;
> > >  size_t alignment = bdrv_min_mem_align(bs);
> > > +size_t len = bs->bl.request_alignment;
> > >  IO_CODE();
> > >  
> > >  for (i = 0; i < qiov->niov; i++) {
> > >  if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
> > >  return false;
> > >  }
> > > -if (qiov->iov[i].iov_len % alignment) {
> > > +if (qiov->iov[i].iov_len % len) {
> > >  return false;
> > >  }
> > >  }
> > > -- 



Re: [PATCH] block: use the request length for iov alignment

2022-09-15 Thread Keith Busch
On Wed, Sep 14, 2022 at 11:36:14AM +0100, Kevin Wolf wrote:
> Am 13.09.2022 um 15:12 hat Keith Busch geschrieben:
> > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote:
> > > From: Keith Busch 
> > > 
> > > An iov length needs to be aligned to the logical block size, which may
> > > be larger than the memory alignment.
> > 
> > [cc'ing some other interested folks]
> > 
> > Any thoughts on this patch? It is fixing an observed IO error  when running
> > virtio-blk with the default 512b logical block size backed by a drive 
> > formatted
> > with 4k logical block.
> 
> I need to take a real look after KVM Forum, but my first thought was
> that we might be overloading request_alignment with multiple meanings
> now (file offset alignment and memory address alignment), and the values
> just happen to be the same for files on Linux.
> 
> Did you consider a separate iov_alignment or similar and intentionally
> decided against it, or is it something you just didn't think about?

I thought the request_alignment was indicating the minimum block length, so I
did not consider an additional separate value. If it can be overloaded, then
yes, I can certainly take on that idea.



Re: [PATCH] block: use the request length for iov alignment

2022-09-13 Thread Keith Busch
On Tue, Sep 13, 2022 at 03:20:23PM +0100, Damien Le Moal wrote:
> On 2022/09/13 15:12, Keith Busch wrote:
> > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote:
> >> From: Keith Busch 
> >>
> >> An iov length needs to be aligned to the logical block size, which may
> >> be larger than the memory alignment.
> > 
> > [cc'ing some other interested folks]
> > 
> > Any thoughts on this patch? It is fixing an observed IO error  when running
> > virtio-blk with the default 512b logical block size backed by a drive 
> > formatted
> > with 4k logical block.
> 
> The patch looks OK to me, but having virtio expose a 512B LBA size for a
> backing
> device that has 4K LBAs will break all IOs if caching is turned off (direct 
> IOs
> case), even if this patch is applied. No ?

Oh, as to why that type of setup "works" with O_DIRECT, when the check below
returns 'false', qemu allocates a bounce buffer. We want that to happen if the
guest's virtio driver tries to read/write 512b. The lengths just need to be
checked against the backing store's block size instead of the memory address
alignment.
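
The caller-side idea, roughly (a sketch; submit_aligned()/submit_aligned_buf()
stand in for the real submission path):

    /* Sketch: a misaligned guest vector falls back to an aligned bounce
     * buffer instead of failing the O_DIRECT request. */
    static int prw_with_bounce(BlockDriverState *bs, uint64_t offset,
                               uint64_t bytes, QEMUIOVector *qiov, int type)
    {
        void *buf;
        int ret;

        if (bdrv_qiov_is_aligned(bs, qiov)) {
            return submit_aligned(bs, offset, bytes, qiov, type);
        }

        buf = qemu_try_blockalign(bs, bytes);
        if (!buf) {
            return -ENOMEM;
        }

        if (type == QEMU_AIO_WRITE) {
            qemu_iovec_to_buf(qiov, 0, buf, bytes);
        }

        ret = submit_aligned_buf(bs, offset, bytes, buf, type);

        if (type == QEMU_AIO_READ && ret >= 0) {
            qemu_iovec_from_buf(qiov, 0, buf, bytes);
        }

        qemu_vfree(buf);
        return ret;
    }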

> >> @@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, 
> >> QEMUIOVector *qiov)
> >>  {
> >>  int i;
> >>  size_t alignment = bdrv_min_mem_align(bs);
> >> +size_t len = bs->bl.request_alignment;
> >>  IO_CODE();
> >>  
> >>  for (i = 0; i < qiov->niov; i++) {
> >>  if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
> >>  return false;
> >>  }
> >> -if (qiov->iov[i].iov_len % alignment) {
> >> +if (qiov->iov[i].iov_len % len) {
> >>  return false;
> >>  }
> >>  }



Re: [PATCH] block: use the request length for iov alignment

2022-09-13 Thread Keith Busch
On Tue, Sep 13, 2022 at 03:20:23PM +0100, Damien Le Moal wrote:
> On 2022/09/13 15:12, Keith Busch wrote:
> > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote:
> >> From: Keith Busch 
> >>
> >> An iov length needs to be aligned to the logical block size, which may
> >> be larger than the memory alignment.
> > 
> > [cc'ing some other interested folks]
> > 
> > Any thoughts on this patch? It is fixing an observed IO error  when running
> > virtio-blk with the default 512b logical block size backed by a drive 
> > formatted
> > with 4k logical block.
> 
> The patch looks OK to me, but having virtio expose a 512B LBA size for a
> backing
> device that has 4K LBAs will break all IOs if caching is turned off (direct 
> IOs
> case), even if this patch is applied. No ?

Yeah, it's just the default. I don't think anyone would do that on purpose
since it's so sub-optimal from a performance standpoint. As a follow-up patch,
perhaps if the logical_block_size parameter is not provided, we should default
to the backing device's block size?



Re: [PATCH] block: use the request length for iov alignment

2022-09-13 Thread Keith Busch
On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote:
> From: Keith Busch 
> 
> An iov length needs to be aligned to the logical block size, which may
> be larger than the memory alignment.

[cc'ing some other interested folks]

Any thoughts on this patch? It is fixing an observed IO error when running
virtio-blk with the default 512b logical block size backed by a drive formatted
with a 4k logical block size.

> ---
>  block/io.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/block/io.c b/block/io.c
> index 0a8cbefe86..296d4b49a7 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, 
> QEMUIOVector *qiov)
>  {
>  int i;
>  size_t alignment = bdrv_min_mem_align(bs);
> +size_t len = bs->bl.request_alignment;
>  IO_CODE();
>  
>  for (i = 0; i < qiov->niov; i++) {
>  if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
>  return false;
>  }
> -if (qiov->iov[i].iov_len % alignment) {
> +if (qiov->iov[i].iov_len % len) {
>  return false;
>  }
>  }
> -- 
> 2.30.2
> 



[PATCH] block: use the request length for iov alignment

2022-09-08 Thread Keith Busch
From: Keith Busch 

An iov length needs to be aligned to the logical block size, which may
be larger than the memory alignment.

Signed-off-by: Keith Busch 
---
 block/io.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/block/io.c b/block/io.c
index 0a8cbefe86..296d4b49a7 100644
--- a/block/io.c
+++ b/block/io.c
@@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov)
 {
 int i;
 size_t alignment = bdrv_min_mem_align(bs);
+size_t len = bs->bl.request_alignment;
 IO_CODE();
 
 for (i = 0; i < qiov->niov; i++) {
 if ((uintptr_t) qiov->iov[i].iov_base % alignment) {
 return false;
 }
-if (qiov->iov[i].iov_len % alignment) {
+if (qiov->iov[i].iov_len % len) {
 return false;
 }
 }
-- 
2.30.2




Re: [PATCH v2 2/3] hw/nvme: use KVM irqfd when available

2022-08-26 Thread Keith Busch
On Fri, Aug 26, 2022 at 05:45:21PM +0200, Klaus Jensen wrote:
> On Aug 26 09:34, Keith Busch wrote:
> > On Fri, Aug 26, 2022 at 11:12:04PM +0800, Jinhao Fan wrote:
> > > Use KVM's irqfd to send interrupts when possible. This approach is
> > > thread safe. Moreover, it does not have the inter-thread communication
> > > overhead of plain event notifiers since handler callback are called
> > > in the same system call as irqfd write.
> > > 
> > > Signed-off-by: Jinhao Fan 
> > > Signed-off-by: Klaus Jensen 
> > 
> > No idea what's going on here... This one is causing the following assert
> > failure with --enable-kvm:
> > 
> >   qemu-system-x86_64: ../accel/kvm/kvm-all.c:1781: 
> > kvm_irqchip_commit_routes: Assertion `ret == 0' failed.
> > 
> > I find it calls KVM_SET_GSI_ROUTING ioctl with gsi set to 
> > KVM_IRQ_ROUTING_MSI,
> > and linux kernel returns EINVAL in that case. It's never set that way 
> > without
> > this patch. Am I the only one seeing this?
> 
> Argh, sorry, I threw that patch together a bit too quickly. I was just
> so pumped because I believed I had solved the issue hehe.
> 
> Are you missing the ioeventfd=on and irq-eventfd=on parameters by any
> chance? Without those I'm also getting an assertion, but a different one

I had not enabled those yet. This was purely a regression test with my
previously working parameters for a sanity check.

If I enable those new nvme parameters, then it is successful.



Re: [PATCH v2 2/3] hw/nvme: use KVM irqfd when available

2022-08-26 Thread Keith Busch
On Fri, Aug 26, 2022 at 11:12:04PM +0800, Jinhao Fan wrote:
> Use KVM's irqfd to send interrupts when possible. This approach is
> thread safe. Moreover, it does not have the inter-thread communication
> overhead of plain event notifiers since handler callback are called
> in the same system call as irqfd write.
> 
> Signed-off-by: Jinhao Fan 
> Signed-off-by: Klaus Jensen 

No idea what's going on here... This one is causing the following assert
failure with --enable-kvm:

  qemu-system-x86_64: ../accel/kvm/kvm-all.c:1781: kvm_irqchip_commit_routes: Assertion `ret == 0' failed.

I find it calls the KVM_SET_GSI_ROUTING ioctl with gsi set to KVM_IRQ_ROUTING_MSI,
and the linux kernel returns EINVAL in that case. It's never set that way without
this patch. Am I the only one seeing this?



Re: [PATCH] hw/nvme: Add helper functions for qid-db conversion

2022-08-02 Thread Keith Busch
On Wed, Aug 03, 2022 at 09:46:05AM +0800, Jinhao Fan wrote:
> at 4:54 PM, Klaus Jensen  wrote:
> 
> > I am unsure if the compiler will transform that division into the shift
> > if it can infer that the divisor is a power of two (it most likely
> > will be able to).
> > 
> > But I see no reason to have a potential division here when we can do
> > without and to me it is just as readable when you know the definition of
> > DSTRD is `2 ^ (2 + DSTRD)`.
> 
> OK. I will send a new patch with shifts instead of divisions. BTW, why do we
> want to avoid divisions?

Integer division is at least an order of magnitude more CPU cycles than a
shift. Some archs are worse than others, but historically we go out of our way
to avoid them in a hot path, so shifting is a more familiar coding pattern.

Compilers typically implement division as a shift if you're dividing by a
power of two integer constant expression (ICE).

This example here isn't an ICE, but it is a shifted constant power-of-two. I
wrote up a simple test to see what my compiler does with that, and it looks
like gcc will properly optimize it, but only if compiled with '-O3'.
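
For illustration, a minimal sketch (not from the original thread) of that kind
of test: dividing an offset by the doorbell stride, 2 ^ (2 + DSTRD), versus
shifting explicitly. The helper names are made up:

#include <stdint.h>

uint32_t qid_by_division(uint64_t db_offset, uint32_t dstrd)
{
    /* divisor is a power of two, but not an integer constant expression */
    return db_offset / (4 << dstrd);
}

uint32_t qid_by_shift(uint64_t db_offset, uint32_t dstrd)
{
    /* always compiles down to a shift, at any optimization level */
    return db_offset >> (2 + dstrd);
}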



Re: [PATCH for-7.1 0/3] hw/nvme: misc ioeventfd fixes

2022-07-28 Thread Keith Busch
On Thu, Jul 28, 2022 at 10:25:19AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> A set of fixes/changes to the ioeventfd support.

Series looks good.

Reviewed-by: Keith Busch 



Re: [PATCH] hw/nvme: Add options to override hardcoded values

2022-07-13 Thread Keith Busch
On Wed, Jul 13, 2022 at 09:11:41PM +0200, Mauricio Sandt wrote:
> On 13/07/2022 20:48, Keith Busch wrote:
> > I guess I'm missing the bigger picture here. You are supposed to be able to
> > retrieve these fields with ioctl's, so not sure what this has to do with
> > malware. Why does the firmware revision matter to this program?
> Oh I'm sorry, I forgot to explain properly. Malware usually checks if it is
> being run in a sandbox environment like a VM, and if it detects such a
> sandbox, it doesn't run or doesn't unleash its full potential. This makes my
> life as a researcher much harder.
> 
> Hiding the VM by overriding the model, firmware, and nqn strings to either
> random values or names of existing hardware in the hypervisor is a much
> cleaner solution than intercepting the IOCTLs in the VM and changing the
> result with a kernel driver.

IIUC, this program is trying to avoid being studied, and uses indicators like
nvme firmware to help determine if it is running in such an environment. If so,
I suspect defeating all possible indicators will be a fun and time consuming
process. :)



Re: [PATCH] hw/nvme: Add options to override hardcoded values

2022-07-13 Thread Keith Busch
On Wed, Jul 13, 2022 at 08:06:26PM +0200, Mauricio Sandt wrote:
> My specific use case that required this patch is a piece of malware that used
> several IOCTLs to read model, firmware, and nqn from the NVMe attached to the
> VM. Modifying that info at the hypervisor level was a much better approach
> than loading an (unsigned) driver into windows to intercept said IOCTLs.
> Having this patch in the official qemu version would help me a lot in my test
> lab, and I'm pretty sure it would also help other people.

I guess I'm missing the bigger picture here. You are supposed to be able to
retrieve these fields with ioctl's, so not sure what this has to do with
malware. Why does the firmware revision matter to this program?
 
> I guess there could be a warning about potential problems with drivers in the
> description for model, firmware, and nqn, but I haven't experienced any
> issues when changing the values (for all of them). If you encountered any
> problems, how did they manifest?

Older qemu nvme wasn't reliably generating unique identifiers, so Linux applies
a quirk to ignore them. They are reliable now, so the quirk can be made
firmware-specific when the version number updates with the next release.



Re: [PATCH] hw/nvme: Add options to override hardcoded values

2022-07-13 Thread Keith Busch
On Sun, Jun 12, 2022 at 12:35:09AM +0200, Mauricio Sandt wrote:
> This small patch is the result of some recent malware research I did
> in a QEMU VM. The malware used multiple ways of querying info from
> the VM disk and I needed a clean way to change those values from the
> hypervisor.
> 
> I believe this functionality could be useful to more people from multiple
> fields, sometimes you just want to change some default values and having
> them hardcoded in the sourcecode makes that much harder.
> 
> This patch adds three config parameters to the nvme device, all of them
> are optional to not break anything. If any of them are not specified,
> the previous (before this patch) default is used.
> 
> -model - This takes a string and sets it as the devices model name.
> If you don't specify this parameter, the default is "QEMU NVMe Ctrl".
> 
> -firmware - The firmware version string, max 8 ascii characters.
> The default is whatever `QEMU_VERSION` evaluates to.
> 
> -nqn_override - Allows to set a custom nqn on the nvme device.
> Only used if there is no subsystem. This string should be in the same
> format as the default "nqn.2019-08.org.qemu:...", but there is no
> validation for that. Its up to the user to provide a valid string.

I guess the nqn can be user tunable just like it is when used with subsystems,
but what's the point of messing with model and firmware? That could mess with
host drivers' ability to detect what quirks it needs to apply to specific
instances of this virtual controller.



Re: [PATCH 0/4] hw/nvme: add support for TP4084

2022-06-16 Thread Keith Busch
On Thu, Jun 16, 2022 at 12:42:49PM +0200, Klaus Jensen wrote:
> On Jun  8 03:28, Niklas Cassel via wrote:
> > Hello there,
> > 
> > considering that Linux v5.19-rc1 is out which includes support for
> > NVMe TP4084:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/nvme/host/core.c?id=354201c53e61e493017b15327294b0c8ab522d69
> > 
> > I thought that it might be nice to have QEMU support for the same.
> > 
> > TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace
> > as ready independently from the controller.
> > 
> > When CC.CRIME is 0 (default), things behave as before, all namespaces
> > are ready when CSTS.RDY gets set to 1.
> > 
> > Add a new "ready_delay" namespace device parameter, in order to emulate
> > different ready latencies for namespaces when CC.CRIME is 1.
> > 
> > The patch series also adds a "crwmt" controller parameter, in order to
> > be able to expose the worst case timeout that the host should wait for
> > all namespaces to become ready.
> > 
> > 
> > Example qemu cmd line for the new options:
> > 
> > # delay in s (20s)
> > NS1_DELAY_S=20
> > # convert to units of 500ms
> > NS1_DELAY=$((NS1_DELAY_S*2))
> > 
> > # delay in s (60s)
> > NS2_DELAY_S=60
> > # convert to units of 500ms
> > NS2_DELAY=$((NS2_DELAY_S*2))
> > 
> > # timeout in s (120s)
> > CRWMT_S=120
> > # convert to units of 500ms
> > CRWMT=$((CRWMT_S*2))
> > 
> >  -device nvme,serial=deadbeef,crwmt=$CRWMT \
> >  -drive file=$NS1_DATA,id=nvm-1,format=raw,if=none \
> >  -device nvme-ns,drive=nvm-1,ready_delay=$NS1_DELAY \
> >  -drive file=$NS2_DATA,id=nvm-2,format=raw,if=none \
> >  -device nvme-ns,drive=nvm-2,ready_delay=$NS2_DELAY \
> > 
> > 
> > Niklas Cassel (4):
> >   hw/nvme: claim NVMe 2.0 compliance
> >   hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace
> >   hw/nvme: add support for ratified TP4084
> >   hw/nvme: add new never_ready parameter to test the DNR bit
> > 
> >  hw/nvme/ctrl.c   | 151 +--
> >  hw/nvme/ns.c |  17 +
> >  hw/nvme/nvme.h   |   9 +++
> >  hw/nvme/trace-events |   1 +
> >  include/block/nvme.h |  60 -
> >  5 files changed, 233 insertions(+), 5 deletions(-)
> > 
> > -- 
> > 2.36.1
> > 
> > 
> 
> Hi Niklas,
> 
> I've been going back and forth on my position on this.
> 
> I'm not straight up against it, but this only seems useful as a one-off
> patch to test the kernel support for this. Considering the limitations
> you state and the limited use case, I fear this is a little bloaty to
> carry upstream.
> 
> But I totally acknowledge that this is a horrible complicated behavior
> to implement on the driver side, so I guess we might all benefit from
> this.
> 
> Keith, do you have an opinion on this?

There are drivers utilizing the capability, so I think supporting it is fine
despite this environment not having any inherent time-to-ready delays.

This will probably be another knob that gets lots of use for initial driver
validation, then largely forgotten. But maybe it will be useful for driver unit
and regression testing in the future.



Re: [PATCH] hw/nvme: clear aen mask on reset

2022-06-03 Thread Keith Busch
On Thu, May 12, 2022 at 11:30:55AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The internally maintained AEN mask is not cleared on reset. Fix this.

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH] nvme: remove bit bucket support for now

2022-06-03 Thread Keith Busch
On Fri, Jun 03, 2022 at 10:31:14PM +0200, Klaus Jensen wrote:
> Keith,
> 
> We never merged anything to fix this. I suggest we simply revert it and
> get rid of the code entirely until *someone* comes up with a proper fix
> ;)

Yeah, that sounds right. My silly hack was okay for my Linux kernel tests, but
a proper fix isn't likely to happen in the near term. A revert is appropriate.



Re: [PATCH] hw/nvme: Fix deallocate when metadata is present

2022-06-03 Thread Keith Busch
On Fri, Jun 03, 2022 at 01:14:40PM -0600, Jonathan Derrick wrote:
> When metadata is present in the namespace and deallocates are issued, the 
> first
> deallocate could fail to zero the block range, resulting in another
> deallocation to be issued. Normally after the deallocation completes and the
> range is checked for zeroes, a deallocation is then issued for the metadata
> space. In the failure case where the range is not zeroed, deallocation is
> reissued for the block range (and followed with metadata deallocation), but 
> the
> original range deallocation task will also issue a metadata deallocation:
> 
> nvme_dsm_cb()
>   *range deallocation*
>   nvme_dsm_md_cb()
> if (nvme_block_status_all()) (range deallocation failure)
>   nvme_dsm_cb()
>   *range deallocation*
> nvme_dsm_md_cb()
>   if (nvme_block_status_all()) (no failure)
>   *metadata deallocation*
> *metadata deallocation*
> 
> This sequence results in reentry of nvme_dsm_cb() before the metadata has been
> deallocated. During reentry, the metadata is deallocated in the reentrant 
> task.
> nvme_dsm_bh() is called which deletes and sets iocb->bh to NULL. When reentry
> returns from nvme_dsm_cb(), metadata deallocation takes place again, and
> results in a null pointer dereference on the iocb->bh:

Nice, thank you for the detailed analysis. Patch looks good.

Reviewed-by: Keith Busch 



Re: [PATCH v8 00/12] hw/nvme: SR-IOV with Virtualization Enhancements

2022-05-17 Thread Keith Busch
On Tue, May 17, 2022 at 01:04:56PM +0200, Klaus Jensen wrote:
> > 
> > Should we consider this series ready to merge?
> > 
> 
> Hi Lukasz and Lukasz :)
> 
> Yes, I'm queing this up.

FWIW, this looks good to me. I was hoping to give it a test run, but I don't
think I'll get to that for another week or two, so don't wait for me if you
think you've sorted out your recent observation.



Re: [PATCH] nvme: fix bit buckets

2022-04-25 Thread Keith Busch
On Mon, Apr 25, 2022 at 11:59:30AM +0200, Klaus Jensen wrote:
> The approach is neat and simple, but I don't think it has the intended
> effect. The memory region addr is just left at 0x0, so we just end up
> with mapping that directly into the qsg and in my test setup, this
> basically does DMA to the admin completion queue which happens to be at
> 0x0.
> 
> I would have liked to handle it like we do for CMB addresses, and
> reserve some address known to the device (i.e. remapping to a local
> allocated buffer), but we can't easily do that because of the iov/qsg
> duality thingie. The dma helpers wont work with it.
> 
> For now, I think we need to just rip out the bit bucket support.

Ah crap, I think you're right. Not as simple as I thought. The idea was to have
a dummy DMA-able region. We can have a controller DMA to another controller's
CMB without any special handling, so that's kind of what I'm trying except the
region doesn't need to be tied to any particular device.

And now that you got me thinking about it, there needs to be special bit bucket
handling for local CMB as well.



[PATCH] nvme: fix bit buckets

2022-04-22 Thread Keith Busch
We can't just ignore the bit buckets since the data offsets read from
disk need to be accounted for. We could either split into multiple reads
for the actual user data requested and skip the buckets, or we could
have a placeholder for bucket data. Splitting adds too much overhead, so
just allocate a memory region to dump unwanted data.

Signed-off-by: Keith Busch 
---
This came out easier than I thought, so we can ignore my previous
feature removal patch:
  https://lists.nongnu.org/archive/html/qemu-block/2022-04/msg00398.html

 hw/nvme/ctrl.c | 9 +
 hw/nvme/nvme.h | 1 +
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 03760ddeae..711c6fac29 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -845,11 +845,11 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, NvmeSg *sg,
 trans_len = MIN(*len, dlen);
 
 if (type == NVME_SGL_DESCR_TYPE_BIT_BUCKET) {
-goto next;
+addr = n->bitBucket.addr;
+} else {
+addr = le64_to_cpu(segment[i].addr);
 }
 
-addr = le64_to_cpu(segment[i].addr);
-
 if (UINT64_MAX - addr < dlen) {
 return NVME_DATA_SGL_LEN_INVALID | NVME_DNR;
 }
@@ -859,7 +859,6 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, NvmeSg *sg,
 return status;
 }
 
-next:
 *len -= trans_len;
 }
 
@@ -6686,6 +6685,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, 
Error **errp)
 nvme_init_pmr(n, pci_dev);
 }
 
+memory_region_init(>bitBucket, OBJECT(n), NULL, 0x10);
+
 return 0;
 }
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 739c8b8f79..d59eadc69d 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -411,6 +411,7 @@ typedef struct NvmeCtrl {
 PCIDeviceparent_obj;
 MemoryRegion bar0;
 MemoryRegion iomem;
+MemoryRegion bitBucket;
 NvmeBar  bar;
 NvmeParams   params;
 NvmeBus  bus;
-- 
2.30.2




[PATCH] nvme: remove bit bucket support for now

2022-04-22 Thread Keith Busch
The emulated controller correctly accounts for not including bit buckets
in the controller-to-host data transfer; however, it doesn't correctly
account for the holes in the on-disk data offsets.

Signed-off-by: Keith Busch 
---
 hw/nvme/ctrl.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 03760ddeae..5e56191d45 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -6773,8 +6773,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice 
*pci_dev)
 id->vwc = NVME_VWC_NSID_BROADCAST_SUPPORT | NVME_VWC_PRESENT;
 
 id->ocfs = cpu_to_le16(NVME_OCFS_COPY_FORMAT_0 | NVME_OCFS_COPY_FORMAT_1);
-id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN |
-   NVME_CTRL_SGLS_BITBUCKET);
+id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN);
 
 nvme_init_subnqn(n);
 
-- 
2.30.2




Re: [PATCH 0/5] hw/nvme: fix namespace identifiers

2022-04-19 Thread Keith Busch
On Tue, Apr 19, 2022 at 02:10:34PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The namespace identifiers reported by the controller is kind of a mess.
> See [1,2].
> 
> This series should fix this for both the `-device nvme,drive=...` and
> `-device nvme-ns,...` cases.

Series looks good.

Reviewed-by: Keith Busch 



Re: [PATCH] hw/nvme: fix control flow statement

2022-04-15 Thread Keith Busch
On Fri, Apr 15, 2022 at 10:27:21PM +0300, Dmitry Tikhov wrote:
> Since there is no else after nvme_dsm_cb invocation, metadata associated
> with non-zero block range is currently zeroed. Also this behaviour leads
> to segfault since we schedule iocb->bh two times. First when entering
> nvme_dsm_cb with iocb->idx == iocb->nr and second on call stack unwinding
> by calling blk_aio_pwrite_zeroes and subsequent nvme_dsm_cb callback
> because of missing else statement.
> 
> Signed-off-by: Dmitry Tikhov 
> ---
>  hw/nvme/ctrl.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 03760ddeae..7ebd2aa326 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -2372,11 +2372,12 @@ static void nvme_dsm_md_cb(void *opaque, int ret)
>  }
>  
>  nvme_dsm_cb(iocb, 0);
> +} else {
> +iocb->aiocb = blk_aio_pwrite_zeroes(ns->blkconf.blk, nvme_moff(ns, 
> slba),
> +nvme_m2b(ns, nlb), 
> BDRV_REQ_MAY_UNMAP,
> +nvme_dsm_cb, iocb);
>  }

Instead of the 'else', just insert an early 'return;' after nvme_dsm_cb() like
the earlier condition above here. Otherwise, looks good, and thanks for the
fix.
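
In other words, roughly this shape (a sketch of the suggested structure only,
reusing the calls from the hunk above; not a tested patch):

        nvme_dsm_cb(iocb, 0);
        return;
    }

    iocb->aiocb = blk_aio_pwrite_zeroes(ns->blkconf.blk, nvme_moff(ns, slba),
                                        nvme_m2b(ns, nlb), BDRV_REQ_MAY_UNMAP,
                                        nvme_dsm_cb, iocb);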



Re: [PATCH v2 0/6] hw/nvme: enhanced protection information (64-bit guard)

2022-03-01 Thread Keith Busch
On Tue, Mar 01, 2022 at 11:44:22AM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> This adds support for one possible new protection information format
> introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard
> and 48-bit reference tag. This version does not support storage tags.
> 
> Like the CRC16 support already present, this uses a software
> implementation of CRC64 (so it is naturally pretty slow). But its good
> enough for verification purposes.
> 
> This goes hand-in-hand with the support that Keith submitted for the
> Linux kernel[1].
> 
>   [1]: 
> https://lore.kernel.org/linux-nvme/20220201190128.3075065-1-kbu...@kernel.org/

Thanks Klaus, this looks good to me.

Reviewed-by: Keith Busch 



Re: [PATCH 0/6] hw/nvme: enhanced protection information (64-bit guard)

2022-02-16 Thread Keith Busch
On Mon, Feb 14, 2022 at 01:30:23PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> This adds support for one possible new protection information format
> introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard
> and 48-bit reference tag. This version does not support storage tags.
> 
> Like the CRC16 support already present, this uses a software
> implementation of CRC64 (so it is naturally pretty slow). But its good
> enough for verification purposes.
> 
> This goes hand-in-hand with the support that Keith submitted for the
> Linux kernel[1].
> 
> [1]: 
> https://lore.kernel.org/linux-nvme/20220201190128.3075065-1-kbu...@kernel.org/

Other than comment on 6/6, series looks good to me.

Reviewed-by: Keith Busch 



Re: [PATCH 6/6] hw/nvme: 64-bit pi support

2022-02-16 Thread Keith Busch
On Mon, Feb 14, 2022 at 01:30:29PM +0100, Klaus Jensen wrote:
> @@ -384,6 +389,12 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, 
> Error **errp)
>  return -1;
>  }
>  
> +if (ns->params.pif != NVME_PI_GUARD_16 &&
> +ns->params.pif != NVME_PI_GUARD_64) {
> +error_setg(errp, "invalid 'pif'");
> +return -1;
> +}

In addition, the requested metadata size ('params.ms') should be checked
against the requested PI option. The function currently just checks
against 8 bytes, but the 64b guard requires at least 16 bytes.
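
Roughly something like this (a sketch only; 'params.pi' and the error wording
are assumptions on top of the fields shown in the patch, and error_setg() is
the usual QEMU error reporting helper):

if (ns->params.pi) {
    unsigned min_ms = (ns->params.pif == NVME_PI_GUARD_64) ? 16 : 8;

    if (ns->params.ms < min_ms) {
        error_setg(errp, "at least %u bytes of metadata required for the "
                   "selected protection information format", min_ms);
        return -1;
    }
}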

Otherwise, looks great.



Re: [PATCH for-7.0 4/4] hw/nvme: add support for zoned random write area

2022-01-26 Thread Keith Busch
On Thu, Nov 25, 2021 at 08:37:35AM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Add support for TP 4076 ("Zoned Random Write Area"), v2021.08.23
> ("Ratified").
> 
> This adds three new namespace parameters: "zoned.numzrwa" (number of
> zrwa resources, i.e. number of zones that can have a zrwa),
> "zoned.zrwas" (zrwa size in LBAs), "zoned.zrwafg" (granularity in LBAs
> for flushes).
> 
> Signed-off-by: Klaus Jensen 

Looks good, and will just need a minor update if you choose to take the
feedback from patch 2 onboard.

Reviewed-by: Keith Busch 



Re: [PATCH for-7.0 3/4] hw/nvme: add ozcs enum

2022-01-26 Thread Keith Busch
On Thu, Nov 25, 2021 at 08:37:34AM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Add enumeration for OZCS values.

Looks good. 

Reviewed-by: Keith Busch 



Re: [PATCH for-7.0 2/4] hw/nvme: add zone attribute get/set helpers

2022-01-26 Thread Keith Busch
On Thu, Nov 25, 2021 at 08:37:33AM +0100, Klaus Jensen wrote:
> @@ -295,7 +295,7 @@ static void nvme_assign_zone_state(NvmeNamespace *ns, 
> NvmeZone *zone,
>  case NVME_ZONE_STATE_READ_ONLY:
>  break;
>  default:
> -zone->d.za = 0;
> +NVME_ZA_CLEAR_ALL(zone->d.za);
>  }
>  }
>  
> @@ -3356,7 +3356,7 @@ static uint16_t nvme_set_zd_ext(NvmeNamespace *ns, 
> NvmeZone *zone)
>  return status;
>  }
>  nvme_aor_inc_active(ns);
> -zone->d.za |= NVME_ZA_ZD_EXT_VALID;
> +NVME_ZA_SET(zone->d.za, NVME_ZA_ZD_EXT_VALID);
>  nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED);
>  return NVME_SUCCESS;
>  }
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 2ee227760265..2b8b906466ab 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -1407,6 +1407,10 @@ enum NvmeZoneAttr {
>  NVME_ZA_ZD_EXT_VALID = 1 << 7,
>  };
>  
> +#define NVME_ZA_SET(za, attrs)   ((za) |= (attrs))
> +#define NVME_ZA_CLEAR(za, attrs) ((za) &= ~(attrs))
> +#define NVME_ZA_CLEAR_ALL(za)((za) = 0x0)

This doesn't really look any more helpful than open coding it. I think
it would read better to take a "struct NvmeZone" type parameter
instead, and use inline functions instead of macros.
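
Something along these lines, as a rough sketch of the shape being suggested
(the helper names are illustrative only):

static inline void nvme_zone_set_attr(NvmeZone *zone, uint8_t attrs)
{
    zone->d.za |= attrs;
}

static inline void nvme_zone_clear_attr(NvmeZone *zone, uint8_t attrs)
{
    zone->d.za &= ~attrs;
}

static inline void nvme_zone_clear_attrs(NvmeZone *zone)
{
    zone->d.za = 0;
}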



Re: [PATCH for-7.0 1/4] hw/nvme: add struct for zone management send

2022-01-26 Thread Keith Busch
On Thu, Nov 25, 2021 at 08:37:32AM +0100, Klaus Jensen wrote:
> +typedef struct QEMU_PACKED NvmeZoneSendCmd {
> +uint8_t opcode;
> +uint8_t flags;
> +uint16_tcid;
> +uint32_tnsid;
> +uint32_trsvd2[4];
> +NvmeCmdDptr dptr;
> +uint64_tslba;
> +uint32_trsvd12;
> +uint8_t zsa;
> +uint8_t zsflags[3];

This should be just a single uint8_t for zsflags, followed by 
'uint8_t rsvd[2]'.

Otherwise, looks good.

Reviewed-by: Keith Busch 



Re: [PATCH] hw/nvme: fix CVE-2021-3929

2022-01-20 Thread Keith Busch
On Thu, Jan 20, 2022 at 09:01:55AM +0100, Klaus Jensen wrote:
> +static inline bool nvme_addr_is_iomem(NvmeCtrl *n, hwaddr addr)
> +{
> +hwaddr hi, lo;
> +
> +lo = n->bar0.addr;
> +hi = lo + int128_get64(n->bar0.size);
> +
> +return addr >= lo && addr < hi;

Looks fine considering this implementation always puts CMB in an
exclusive BAR. From a spec consideration though, you can put a CMB at a
BAR0 offset. I don't think that's going to happen anytime soon here, but
it may be worth a comment noting that this function needs to be updated if
that assumption ever changes.

Reviewed-by: Keith Busch 



Re: [PATCH 2/2] hw/nvme/ctrl: Prohibit DMA accesses to devices (CVE-2021-3929)

2021-12-16 Thread Keith Busch
On Thu, Dec 16, 2021 at 06:55:10PM +0100, Philippe Mathieu-Daudé wrote:
> Async DMA requests might access MMIO regions and re-program the
> NVMe controller internal registers while DMA requests are still
> scheduled or in flight. Avoid that by prohibiting the controller
> from accessing non-memory regions.

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH 1/2] hw/nvme/ctrl: Do not ignore DMA access errors

2021-12-16 Thread Keith Busch
On Thu, Dec 16, 2021 at 06:55:09PM +0100, Philippe Mathieu-Daudé wrote:
> dma_buf_read/dma_buf_write() return a MemTxResult type.
> Do not discard it, propagate the DMA error to the caller.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH v2 08/15] hw/nvme: Implement the Function Level Reset

2021-11-16 Thread Keith Busch
On Tue, Nov 16, 2021 at 04:34:39PM +0100, Łukasz Gieryk wrote:
>  if (!pci_is_vf(>parent_obj) && n->params.sriov_max_vfs) {
> -pcie_sriov_pf_disable_vfs(>parent_obj);
> +if (rst != NVME_RESET_CONTROLLER) {
> +pcie_sriov_pf_disable_vfs(>parent_obj);

Shouldn't this be 'if (rst == NVME_RESET_FUNCTION)'?



Re: [PATCH 0/2] hw/nvme: fix namespace attachment on controller hotplug

2021-11-16 Thread Keith Busch
On Fri, Sep 24, 2021 at 09:27:40AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> First patch from Hannes fixes the subsystem registration process such
> that shared (but non-detached) namespaces are automatically attached to
> hotplugged controllers.
> 
> The second patch changes the default for 'shared' such that namespaces
> are shared by default and will thus by default be attached to hotplugged
> controllers. This adds a compat property for older machine versions and
> updates the documentation to reflect this.

Series looks good.

Reviewed-by: Keith Busch 



Re: [PATCH RFC 02/13] hw/nvme: move zns helpers and types into zoned.h

2021-09-16 Thread Keith Busch
On Tue, Sep 14, 2021 at 10:37:26PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Move ZNS related helpers and types into zoned.h. Use a common prefix
> (nvme_zoned or nvme_ns_zoned) for zns related functions.

Just a nitpick on the naming; feel free to ignore.

Since we're within NVMe specific protocol here, using that terminology
should be fine. I prefer "nvme_zns_" for the prefix.

And for function names like "nvme_zoned_zs()", the "zs" abbreviation
expands to "zone_state", so "zone" is redundant. I think
"nvme_zns_state()" is a good name for that one.



Re: [PATCH v2] hw/nvme: Return error for fused operations

2021-09-15 Thread Keith Busch
On Wed, Sep 15, 2021 at 05:43:30PM +0200, Pankaj Raghav wrote:
> Currently, FUSED operations are not supported by QEMU. As per the 1.4 SPEC,
> controller should abort the command that requested a fused operation with 
> an INVALID FIELD error code if they are not supported.
> 
> Changes from v1:
> Added FUSE flag check also to the admin cmd processing as the FUSED 
> operations are mentioned in the general SQE section in the SPEC. 

Just for future reference, the changes from previous versions should go
below the "---" line so that they don't get included in the official
changelog.

> +if (NVME_CMD_FLAGS_FUSE(req->cmd.flags)) {
> +return NVME_INVALID_FIELD;
> +}
> +
>  req->ns = ns;
>  
>  switch (req->cmd.opcode) {
> @@ -5475,6 +5479,10 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, 
> NvmeRequest *req)
>  return NVME_INVALID_FIELD | NVME_DNR;
>  }
>  
> +if (NVME_CMD_FLAGS_FUSE(req->cmd.flags)) {
> +return NVME_INVALID_FIELD;
> +}

I'm a little surprised this macro exists considering this is the first
time it's used! But this looks fine, I really hope hosts weren't
actually trying fused commands on this target, but it's good to confirm.

Reviewed-by: Keith Busch 



Re: [PATCH] hw/nvme: fix validation of ASQ and ACQ

2021-08-23 Thread Keith Busch
On Mon, Aug 23, 2021 at 02:20:18PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Address 0x0 is a valid address. Fix the admin submission and completion
> queue address validation to not error out on this.

Indeed, there are environments that can use that address. It's a host error if
the controller was enabled with invalid queue addresses anyway. The controller
only needs to verify the lower bits are clear, which we do later.

Reviewed-by: Keith Busch 



Re: [RFC PATCH v1] Adding Support for namespace management

2021-08-13 Thread Keith Busch
On Fri, Aug 13, 2021 at 03:02:22PM +0530, Naveen wrote:
> +static uint16_t nvme_identify_ns_common(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeIdNs id_ns = {};
> +
> +id_ns.nsfeat |= (0x4 | 0x10);
> +id_ns.dpc = 0x1f;
> +
> +NvmeLBAF lbaf[16] = {
> +[0] = {.ds = 9},
> +[1] = {.ds = 9, .ms = 8},
> +[2] = {.ds = 9, .ms = 16},
> +[3] = {.ds = 9, .ms = 64},
> +[4] = {.ds = 12},
> +[5] = {.ds = 12, .ms = 8},
> +[6] = {.ds = 12, .ms = 16},
> +[7] = {.ds = 12, .ms = 64},
> +};

The lbaf table is a copy of what's defined in nvme_ns_init, so it should be
defined once and reused.
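
E.g. a sketch of what the reuse could look like (the table name is made up;
the values are the ones from the hunk above and nvme_ns_init):

static const NvmeLBAF nvme_default_lbafs[16] = {
    [0] = { .ds =  9            },
    [1] = { .ds =  9, .ms =  8  },
    [2] = { .ds =  9, .ms = 16  },
    [3] = { .ds =  9, .ms = 64  },
    [4] = { .ds = 12            },
    [5] = { .ds = 12, .ms =  8  },
    [6] = { .ds = 12, .ms = 16  },
    [7] = { .ds = 12, .ms = 64  },
};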

> +
> +memcpy(_ns.lbaf, , sizeof(lbaf));
> +id_ns.nlbaf = 7;

The identify structure should be what's in common with all namespaces,
and this doesn't look complete. Just off the top of my head, missing
fields include MC and DLFEAT.

> +
> +return nvme_c2h(n, (uint8_t *)_ns, sizeof(NvmeIdNs), req);
> +}
> +
>  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req, bool active)
>  {
>  NvmeNamespace *ns;
> @@ -4453,8 +4478,10 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, 
> NvmeRequest *req, bool active)
>  
>  trace_pci_nvme_identify_ns(nsid);
>  
> -if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) {
> +if (!nvme_nsid_valid(n, nsid)) {
>  return NVME_INVALID_NSID | NVME_DNR;
> +} else if (nsid == NVME_NSID_BROADCAST) {
> +return nvme_identify_ns_common(n, req);
>  }
>  
>  ns = nvme_ns(n, nsid);
> @@ -5184,6 +5211,195 @@ static void nvme_select_iocs_ns(NvmeCtrl *n, 
> NvmeNamespace *ns)
>  }
>  }
>  
> +static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp)
> +{
> +int ret;
> +uint64_t perm, shared_perm;
> +
> +blk_get_perm(blk, , _perm);
> +
> +ret = blk_set_perm(blk, perm | BLK_PERM_RESIZE, shared_perm, errp);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +ret = blk_truncate(blk, len, false, PREALLOC_MODE_OFF, 0, errp);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +ret = blk_set_perm(blk, perm, shared_perm, errp);
> +if (ret < 0) {
> +return ret;
> +}
> +
> +return 0;
> +}
> +
> +static uint32_t nvme_allocate_nsid(NvmeCtrl *n)
> +{
> +uint32_t nsid = 0;
> +for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) {
> +continue;
> +}
> +
> +nsid = i;
> +return nsid;
> +}
> +return nsid;
> +}
> +
> +static uint16_t nvme_namespace_create(NvmeCtrl *n, NvmeRequest *req)
> +{
> +   uint32_t ret;
> +   NvmeIdNs id_ns_host;
> +   NvmeSubsystem *subsys = n->subsys;
> +   Error *err = NULL;
> +   uint8_t flbas_host;
> +   uint64_t ns_size;
> +   int lba_index;
> +   NvmeNamespace *ns;
> +   NvmeCtrl *ctrl;
> +   NvmeIdNs *id_ns;
> +
> +ret = nvme_h2c(n, (uint8_t *)_ns_host, sizeof(id_ns_host), req);
> +if (ret) {
> +return ret;
> +}

Some unusual indentation here.

> +
> +if (id_ns_host.ncap < id_ns_host.nsze) {
> +return NVME_THIN_PROVISION_NO_SUPP | NVME_DNR;
> +} else if (id_ns_host.ncap > id_ns_host.nsze) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +if (!id_ns_host.nsze) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +if (QSLIST_EMPTY(>unallocated_namespaces)) {
> +return NVME_NS_ID_UNAVAILABLE;
> +}
> +
> +ns = QSLIST_FIRST(>unallocated_namespaces);
> +id_ns = >id_ns;
> +flbas_host = (id_ns_host.flbas) & (0xF);
> +
> +if (flbas_host > id_ns->nlbaf) {
> +return NVME_INVALID_FORMAT | NVME_DNR;
> +}
> +
> +ret = nvme_ns_setup(ns, );
> +if (ret) {
> +return ret;
> +}
> +
> +id_ns->flbas = id_ns_host.flbas;
> +id_ns->dps = id_ns_host.dps;
> +id_ns->nmic = id_ns_host.nmic;
> +
> +lba_index = NVME_ID_NS_FLBAS_INDEX(id_ns->flbas);
> +ns_size = id_ns_host.nsze * ((1 << id_ns->lbaf[lba_index].ds) +
> +(id_ns->lbaf[lba_index].ms));
> +id_ns->nvmcap = ns_size;
> +
> +if (ns_size > n->id_ctrl.unvmcap) {
> +return NVME_NS_INSUFF_CAP;
> +}
> +
> +ret = nvme_blk_truncate(ns->blkconf.blk, id_ns->nvmcap, );
> +if (ret) {
> +return ret;
> +}
> +
> +ns->size = blk_getlength(ns->blkconf.blk);
> +nvme_ns_init_format(ns);
> +
> +ns->params.nsid = nvme_allocate_nsid(n);
> +if (!ns->params.nsid) {
> +return NVME_NS_ID_UNAVAILABLE;
> +}
> +subsys->namespaces[ns->params.nsid] = ns;
> +
> +for (int cntlid = 0; cntlid < ARRAY_SIZE(n->subsys->ctrls); cntlid++) {
> +ctrl = nvme_subsys_ctrl(n->subsys, cntlid);
> +if (ctrl) {
> +ctrl->id_ctrl.unvmcap -= le64_to_cpu(ns->size);
> +}
> +}
> +
> +stl_le_p(>cqe.result, ns->params.nsid);
> +QSLIST_REMOVE_HEAD(>unallocated_namespaces, entry);
> +return NVME_SUCCESS;
> +}
> +
> 

Re: [PATCH v6 1/5] hw/nvme: split pmrmsc register into upper and lower

2021-07-26 Thread Keith Busch
On Wed, Jul 21, 2021 at 09:48:32AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> The specification uses a set of 32 bit PMRMSCL and PMRMSCU registers to
> make up the 64 bit logical PMRMSC register.
> 
> Make it so.

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH v5 2/5] hw/nvme: use symbolic names for registers

2021-07-19 Thread Keith Busch
On Tue, Jul 20, 2021 at 12:46:44AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Add the NvmeBarRegs enum and use these instead of explicit register
> offsets.

Thanks, this is a very nice cleanup. For a suggested follow-up companion
patch, we should add "ASSERT_OFFSET()" checks for each register to
enforce correct positioning of the BAR offsets at build time.
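
For example, something like this (a sketch only; the ASSERT_REG_OFFSET name is
made up, QEMU_BUILD_BUG_ON is the existing QEMU build-time assert, and the
offsets are the ones from the NVMe register map):

#define ASSERT_REG_OFFSET(reg, off) \
    QEMU_BUILD_BUG_ON(offsetof(NvmeBar, reg) != (off))

ASSERT_REG_OFFSET(cap,   0x0);
ASSERT_REG_OFFSET(vs,    0x8);
ASSERT_REG_OFFSET(intms, 0xc);
ASSERT_REG_OFFSET(intmc, 0x10);
ASSERT_REG_OFFSET(cc,    0x14);
ASSERT_REG_OFFSET(csts,  0x1c);
ASSERT_REG_OFFSET(aqa,   0x24);
ASSERT_REG_OFFSET(asq,   0x28);
ASSERT_REG_OFFSET(acq,   0x30);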

Reviewed-by: Keith Busch 


> Signed-off-by: Klaus Jensen 
> Reviewed-by: Gollu Appalanaidu 
> Reviewed-by: Philippe Mathieu-Daudé 
> ---
>  include/block/nvme.h | 29 -
>  hw/nvme/ctrl.c   | 44 ++--
>  2 files changed, 50 insertions(+), 23 deletions(-)
> 
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 84053b68b987..77aae0117494 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -9,7 +9,7 @@ typedef struct QEMU_PACKED NvmeBar {
>  uint32_tcc;
>  uint8_t rsvd24[4];
>  uint32_tcsts;
> -uint32_tnssrc;
> +uint32_tnssr;
>  uint32_taqa;
>  uint64_tasq;
>  uint64_tacq;
> @@ -31,6 +31,33 @@ typedef struct QEMU_PACKED NvmeBar {
>  uint8_t css[484];
>  } NvmeBar;
>  
> +enum NvmeBarRegs {
> +NVME_REG_CAP = offsetof(NvmeBar, cap),
> +NVME_REG_VS  = offsetof(NvmeBar, vs),
> +NVME_REG_INTMS   = offsetof(NvmeBar, intms),
> +NVME_REG_INTMC   = offsetof(NvmeBar, intmc),
> +NVME_REG_CC  = offsetof(NvmeBar, cc),
> +NVME_REG_CSTS= offsetof(NvmeBar, csts),
> +NVME_REG_NSSR= offsetof(NvmeBar, nssr),
> +NVME_REG_AQA = offsetof(NvmeBar, aqa),
> +NVME_REG_ASQ = offsetof(NvmeBar, asq),
> +NVME_REG_ACQ = offsetof(NvmeBar, acq),
> +NVME_REG_CMBLOC  = offsetof(NvmeBar, cmbloc),
> +NVME_REG_CMBSZ   = offsetof(NvmeBar, cmbsz),
> +NVME_REG_BPINFO  = offsetof(NvmeBar, bpinfo),
> +NVME_REG_BPRSEL  = offsetof(NvmeBar, bprsel),
> +NVME_REG_BPMBL   = offsetof(NvmeBar, bpmbl),
> +NVME_REG_CMBMSC  = offsetof(NvmeBar, cmbmsc),
> +NVME_REG_CMBSTS  = offsetof(NvmeBar, cmbsts),
> +NVME_REG_PMRCAP  = offsetof(NvmeBar, pmrcap),
> +NVME_REG_PMRCTL  = offsetof(NvmeBar, pmrctl),
> +NVME_REG_PMRSTS  = offsetof(NvmeBar, pmrsts),
> +NVME_REG_PMREBS  = offsetof(NvmeBar, pmrebs),
> +NVME_REG_PMRSWTP = offsetof(NvmeBar, pmrswtp),
> +NVME_REG_PMRMSCL = offsetof(NvmeBar, pmrmscl),
> +NVME_REG_PMRMSCU = offsetof(NvmeBar, pmrmscu),
> +};
> +
>  enum NvmeCapShift {
>  CAP_MQES_SHIFT = 0,
>  CAP_CQR_SHIFT  = 16,
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 28299c6f3764..8c305315f41c 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -5740,7 +5740,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
>  }
>  
>  switch (offset) {
> -case 0xc:   /* INTMS */
> +case NVME_REG_INTMS:
>  if (unlikely(msix_enabled(&(n->parent_obj {
>  NVME_GUEST_ERR(pci_nvme_ub_mmiowr_intmask_with_msix,
> "undefined access to interrupt mask set"
> @@ -5752,7 +5752,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
>  trace_pci_nvme_mmio_intm_set(data & 0x, n->bar.intmc);
>  nvme_irq_check(n);
>  break;
> -case 0x10:  /* INTMC */
> +case NVME_REG_INTMC:
>  if (unlikely(msix_enabled(&(n->parent_obj {
>  NVME_GUEST_ERR(pci_nvme_ub_mmiowr_intmask_with_msix,
> "undefined access to interrupt mask clr"
> @@ -5764,7 +5764,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
>  trace_pci_nvme_mmio_intm_clr(data & 0x, n->bar.intmc);
>  nvme_irq_check(n);
>  break;
> -case 0x14:  /* CC */
> +case NVME_REG_CC:
>  trace_pci_nvme_mmio_cfg(data & 0x);
>  /* Windows first sends data, then sends enable bit */
>  if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
> @@ -5798,7 +5798,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
>  n->bar.cc = data;
>  }
>  break;
> -case 0x1c:  /* CSTS */
> +case NVME_REG_CSTS:
>  if (data & (1 << 4)) {
>  NVME_GUEST_ERR(pci_nvme_ub_mmiowr_ssreset_w1c_unsupported,
> "attempted to W1C CSTS.NSSRO"
> @@ -5809,7 +5809,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
> " of controll

Re: [RFC PATCH nvme-cli 2/2] nvme-cli/plugins/mi:add support

2021-07-09 Thread Keith Busch
> +int hal_init()
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_init();
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_open()
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_open();
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_close()
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_close();
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_i2c_write(uint8_t *data_out, uint16_t num_bytes)
> +{
> +int retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_write(data_out, num_bytes);
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}
> +
> +int hal_i2c_read(uint8_t *data_in, uint16_t num_bytes)
> +{
> +uint32_t retval = -1;
> +switch (GetSidebandInterface()) {
> +case qemu_nvme_mi:
> +retval = qemu_mi_read(data_in, num_bytes);
> +break;
> +default:
> +break;
> +}
> +return retval;
> +}

I'm really not a fan of having non-standard interfaces. If you were
going to do that, though, you should create a struct of function
pointers so that you don't need these repetitive "switch (...)"
statements.

But if we're going to have OOB MI support in tooling, they should all
use the same standard-defined interface.
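
I.e. roughly this shape (a sketch only; the struct and accessor names are
illustrative, the qemu_mi_* callbacks are the ones declared in the patch):

#include <stdint.h>

struct mi_transport_ops {
    int (*init)(void);
    int (*open)(void);
    int (*close)(void);
    int (*write)(uint8_t *data_out, uint16_t num_bytes);
    int (*read)(uint8_t *data_in, uint16_t num_bytes);
};

static const struct mi_transport_ops qemu_mi_ops = {
    .init  = qemu_mi_init,
    .open  = qemu_mi_open,
    .close = qemu_mi_close,
    .write = qemu_mi_write,
    .read  = qemu_mi_read,
};

static const struct mi_transport_ops *hal_ops(void)
{
    /* one lookup replaces the per-function switch statements */
    switch (GetSidebandInterface()) {
    case qemu_nvme_mi:
        return &qemu_mi_ops;
    default:
        return NULL;
    }
}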



Re: [RFC PATCH 1/2] hw/nvme: add mi device

2021-07-09 Thread Keith Busch
On Fri, Jul 09, 2021 at 07:25:45PM +0530, Padmakar Kalghatgi wrote:
> The following commands are tested with nvme-cli by hooking
> to the cid of the vsock as shown above and use the socket
> send/recieve commands to issue the commands and get the response.

Why sockets? Shouldn't MI targets use SMBus for that?



[PATCH] nvme: add 'zoned.zasl' to documentation

2021-06-29 Thread Keith Busch
Signed-off-by: Keith Busch 
---
 docs/system/nvme.rst | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/docs/system/nvme.rst b/docs/system/nvme.rst
index 33a15c7dbc..bff72d1c24 100644
--- a/docs/system/nvme.rst
+++ b/docs/system/nvme.rst
@@ -202,6 +202,12 @@ The namespace may be configured with additional parameters
   allows all zones to be open. If ``zoned.max_active`` is specified, this value
   must be less than or equal to that.
 
+``zoned.zasl=UINT8`` (default: ``0``)
+  Set the maximum data transfer size for the Zone Append command. Like
+  ``mdts``, the value is specified as a power of two (2^n) and is in units of
+  the minimum memory page size (CAP.MPSMIN). The default value (``0``)
+  has this property inherit the ``mdts`` value.
+
 Metadata
 
 
-- 
2.25.4




Re: [PATCH v2] hw/nvme: fix pin-based interrupt behavior (again)

2021-06-28 Thread Keith Busch
On Thu, Jun 17, 2021 at 08:55:42PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Jakub noticed[1] that, when using pin-based interrupts, the device will
> unconditionally deassert when any CQEs are acknowledged. However, the
> pin should not be deasserted if other completion queues still hold
> unacknowledged CQEs.
> 
> The bug is an artifact of commit ca247d35098d ("hw/block/nvme: fix
> pin-based interrupt behavior") which fixed one bug but introduced
> another. This is the third time someone tries to fix pin-based
> interrupts (see commit 5e9aa92eb1a5 ("hw/block: Fix pin-based interrupt
> behaviour of NVMe"))...
> 
> Third time's the charm, so fix it, again, by keeping track of how many
> CQs have unacknowledged CQEs and only deassert when all are cleared.
> 
>   [1]: <20210610114624.304681-1-jakub.jer...@kernkonzept.com>
> 
> Fixes: ca247d35098d ("hw/block/nvme: fix pin-based interrupt behavior")
> Reported-by: Jakub Jermář 
> Signed-off-by: Klaus Jensen 

Looks good.

Reviewed-by: Keith Busch 



Re: [PATCH] hw/nvme: fix missing check for PMR capability

2021-06-28 Thread Keith Busch
On Mon, Jun 07, 2021 at 11:47:57AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Qiang Liu reported that an access on an unknown address is triggered in
> memory_region_set_enabled because a check on CAP.PMRS is missing for the
> PMRCTL register write when no PMR is configured.
> 
> Cc: qemu-sta...@nongnu.org
> Fixes: 75c3c9de961d ("hw/block/nvme: disable PMR at boot up")
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/362
> Signed-off-by: Klaus Jensen 

Looks good.

Reviewed-by: Keith Busch 

> ---
>  hw/nvme/ctrl.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 0bcaf7192f99..463772602c4e 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -5583,6 +5583,10 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, 
> uint64_t data,
> "invalid write to PMRCAP register, ignored");
>  return;
>  case 0xe04: /* PMRCTL */
> +if (!NVME_CAP_PMRS(n->bar.cap)) {
> +return;
> +}
> +
>  n->bar.pmrctl = data;
>  if (NVME_PMRCTL_EN(data)) {
>  memory_region_set_enabled(>pmr.dev->mr, true);
> -- 
> 2.31.1



Re: [PATCH v2 11/11] Partially revert "hw/block/nvme: drain namespaces on sq deletion"

2021-06-28 Thread Keith Busch
On Thu, Jun 17, 2021 at 09:06:57PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> This partially reverts commit 98f84f5a4eca5c03e32fff20f246d9b4b96d6422.
> 
> Since all "multi aio" commands are now reimplemented to properly track
> the nested aiocbs, we can revert the "hack" that was introduced to make
> sure all requests we're properly drained upon sq deletion.
> 
> The revert is partial since we keep the assert that no outstanding
> requests remain on the submission queue after the explicit cancellation.
> 
> Signed-off-by: Klaus Jensen 
> ---
>  hw/nvme/ctrl.c | 19 ++-
>  1 file changed, 2 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index ec8ddb76cd39..5a1d25265a9d 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -3918,7 +3918,6 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest 
> *req)
>  NvmeSQueue *sq;
>  NvmeCQueue *cq;
>  uint16_t qid = le16_to_cpu(c->qid);
> -uint32_t nsid;
>  
>  if (unlikely(!qid || nvme_check_sqid(n, qid))) {
>  trace_pci_nvme_err_invalid_del_sq(qid);
> @@ -3930,22 +3929,8 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest 
> *req)
>  sq = n->sq[qid];
>  while (!QTAILQ_EMPTY(>out_req_list)) {
>  r = QTAILQ_FIRST(>out_req_list);
> -if (r->aiocb) {
> -blk_aio_cancel(r->aiocb);
> -}
> -}
> -
> -/*
> - * Drain all namespaces if there are still outstanding requests that we
> - * could not cancel explicitly.
> - */
> -if (!QTAILQ_EMPTY(>out_req_list)) {

Quick question: was this 'if' ever even possible to hit? The preceding
'while' loop doesn't break out until the queue is empty, so this should
have been unreachable.

> -for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) {
> -NvmeNamespace *ns = nvme_ns(n, nsid);
> -if (ns) {
> -nvme_ns_drain(ns);
> -}
> -}
> +assert(r->aiocb);
> +blk_aio_cancel(r->aiocb);
>  }
>  
>  assert(QTAILQ_EMPTY(>out_req_list));
> -- 



Re: [PATCH v2 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs

2021-06-25 Thread Keith Busch
On Thu, Jun 17, 2021 at 09:06:46PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> This series reimplements flush, dsm, copy, zone reset and format nvm to
> allow cancellation. I posted an RFC back in March ("hw/block/nvme:
> convert ad-hoc aio tracking to aiocb") and I've applied some feedback
> from Stefan and reimplemented the remaining commands.
> 
> The basic idea is to define custom AIOCBs for these commands. The custom
> AIOCB takes care of issuing all the "nested" AIOs one by one instead of
> blindly sending them off simultaneously without tracking the returned
> aiocbs.


Klaus,

This series looks good to me.

Reviewed-by: Keith Busch 



Re: [PATCH] hw/nvme: fix pin-based interrupt behavior (again)

2021-06-17 Thread Keith Busch
On Thu, Jun 17, 2021 at 12:08:20PM +0200, Klaus Jensen wrote:
>  if (cq->tail != cq->head) {
> +if (!pending) {
> +n->cq_pending++;
> +}

You should check cq->irq_enabled before incrementing cq_pending. You
don't want to leave the irq asserted when polled queues have outstanding
CQEs.
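
I.e. something like this on top of the hunk quoted above (sketch only):

    if (cq->tail != cq->head) {
        if (cq->irq_enabled && !pending) {
            n->cq_pending++;
        }
        /* ... */
    }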



Re: [PATCH v2 2/3] hw/nvme: namespace parameter for EUI-64

2021-06-14 Thread Keith Busch
On Sat, Jun 12, 2021 at 01:46:30AM +0200, Heinrich Schuchardt wrote:
> +``eui64``
> +  Set the EUI-64 of the namespace. This will be reported as a "IEEE Extended
> +  Unique Identifier" descriptor in the Namespace Identification Descriptor 
> List.

This needs to match Identify Namespace's EUI64 field, too. The
Descriptor List is really just a copy of what is reported in that
structure.



Re: [PATCH v2 1/2] hw/nvme: add support for boot partiotions

2021-06-01 Thread Keith Busch
On Tue, Jun 01, 2021 at 09:07:56PM +0200, Klaus Jensen wrote:
> On Jun  1 11:07, Keith Busch wrote:
> > On Tue, Jun 01, 2021 at 07:41:34PM +0200, Klaus Jensen wrote:
> > > On Jun  1 10:19, Keith Busch wrote:
> > > > On Tue, Jun 01, 2021 at 08:07:48PM +0530, Gollu Appalanaidu wrote:
> > > > > NVMe Boot Partitions provides an area that may be read by the host
> > > > > without initializing queues or even enabling the controller. This
> > > > > allows various platform initialization code to be stored on the NVMe
> > > > > device instead of some separate medium.
> > > > >
> > > > > This patch adds the read support for such an area, as well as support
> > > > > for updating the boot partition contents from the host through the
> > > > > FW Download and Commit commands.
> > > >
> > > > Please provide some details on what platform initialization sequence
> > > > running on QEMU is going to make use of this feature.
> > > >
> > > 
> > > I totally get your reluctance to accept useless features like device
> > > self-test and ill-supported ones like write uncorrectable.
> > > 
> > > But I think this feature qualifies just fine for the device. It is useful
> > > for embedded development and while there might not be any qemu boards that
> > > wants to use this *right now*, it allows for experimentation. And this is 
> > > a
> > > feature that actually *is* implemented by real products for embedded
> > > systems.
> > 
> > That wasn't my request, though. I am well aware of the feature and also
> > have hardware that implements it. It just sounds like you haven't
> > actually tested this feature under the protocol's intended use cases
> > inside this environment. I think that type of testing and a high level
> > description of it in the changelog ought to be part of acceptance
> > criteria.
> > 
> 
> Alright, I see.
> 
> You'd like to see this tested by defining a new board that loads firmware
> over PCIe from the device?

Yes, something like that.

When the feature was initially published, I took a very brief look at
how qemu could use it and concluded this wasn't very practical here. I
would be happy to know if there's any example platform that can use it,
though. That, to me, demonstrates sufficient value.



Re: [PATCH v2 1/2] hw/nvme: add support for boot partiotions

2021-06-01 Thread Keith Busch
On Tue, Jun 01, 2021 at 07:41:34PM +0200, Klaus Jensen wrote:
> On Jun  1 10:19, Keith Busch wrote:
> > On Tue, Jun 01, 2021 at 08:07:48PM +0530, Gollu Appalanaidu wrote:
> > > NVMe Boot Partitions provides an area that may be read by the host
> > > without initializing queues or even enabling the controller. This
> > > allows various platform initialization code to be stored on the NVMe
> > > device instead of some separate medium.
> > > 
> > > This patch adds the read support for such an area, as well as support
> > > for updating the boot partition contents from the host through the
> > > FW Download and Commit commands.
> > 
> > Please provide some details on what platform initialization sequence
> > running on QEMU is going to make use of this feature.
> > 
> 
> I totally get your reluctance to accept useless features like device
> self-test and ill-supported ones like write uncorrectable.
> 
> But I think this feature qualifies just fine for the device. It is useful
> for embedded development and while there might not be any qemu boards that
> wants to use this *right now*, it allows for experimentation. And this is a
> feature that actually *is* implemented by real products for embedded
> systems.

That wasn't my request, though. I am well aware of the feature and also
have hardware that implements it. It just sounds like you haven't
actually tested this feature under the protocol's intended use cases
inside this environment. I think that type of testing and a high level
description of it in the changelog ought to be part of acceptance
criteria.



Re: [PATCH v2 1/2] hw/nvme: add support for boot partiotions

2021-06-01 Thread Keith Busch
On Tue, Jun 01, 2021 at 08:07:48PM +0530, Gollu Appalanaidu wrote:
> NVMe Boot Partitions provides an area that may be read by the host
> without initializing queues or even enabling the controller. This
> allows various platform initialization code to be stored on the NVMe
> device instead of some separate medium.
> 
> This patch adds the read support for such an area, as well as support
> for updating the boot partition contents from the host through the
> FW Download and Commit commands.

Please provide some details on what platform initialization sequence
running on QEMU is going to make use of this feature.



Re: [PATCH 07/23] hw/block/nvme: Use definition to avoid dynamic stack allocation

2021-05-05 Thread Keith Busch
On Wed, May 05, 2021 at 06:09:10PM -0500, Eric Blake wrote:
> On 5/5/21 5:07 PM, Philippe Mathieu-Daudé wrote:
> > +Eric
> > 
> > On 5/5/21 11:22 PM, Keith Busch wrote:
> >> On Wed, May 05, 2021 at 11:10:31PM +0200, Philippe Mathieu-Daudé wrote:
> >>> The compiler isn't clever enough to figure 'SEG_CHUNK_SIZE' is
> >>> a constant! Help it by using a definition instead.
> >>
> >> I don't understand.
> > 
> > Neither do I TBH...
> > 
> >> It's labeled 'const', so any reasonable compiler
> >> will place it in the 'text' segment of the executable rather than on the
> >> stack. While that's compiler specific, is there really a compiler doing
> >> something bad with this? If not, I do prefer the 'const' here if only
> >> because it limits the symbol scope ('static const' is also preferred if
> >> that helps).
> > 
> > Using: gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC)
> > 
> > Both static+const / const trigger:
> > 
> > hw/block/nvme.c: In function ‘nvme_map_sgl’:
> > hw/block/nvme.c:818:5: error: ISO C90 forbids variable length array
> > ‘segment’ [-Werror=vla]
> >   818 | NvmeSglDescriptor segment[SEG_CHUNK_SIZE], *sgld, *last_sgld;
> >   | ^
> > cc1: all warnings being treated as errors
> 
> C99 6.7.5.2 paragraph 4:
> "If the size is an integer constant expression and the element type has
> a known constant size, the array type is not a variable length array
> type; otherwise, the array type is a variable length array type."
> 
> 6.6 paragraph 6:
> "An integer constant expression shall have integer type and shall only
> have operands that are integer constants, enumeration constants,
> character constants, sizeof expressions whose results are integer
> constants, and floating constants that are the immediate operands of
> casts. Cast operators in an integer constant expression shall only
> convert arithmetic types to integer types, except as part of an operand
> to the sizeof operator."
> 
> Notably absent from that list are 'const int' variables, which even
> though they act as constants (in that the name always represents the
> same value), do not actually qualify as such under C99 rules.  Yes, it's
> a pain.  What's more, 6.6 paragraph 10:
> 
> "An implementation may accept other forms of constant expressions."
> 
> which means it _should_ be possible for the compiler to do what we want.
>  But just because it is permitted does not make it actually work. :(

Thank you, that makes sense. In this case, we are talking about an
integer constant expression for the value of a 'const' symbol. That
seems like perfect fodder for a compiler to optimize. I understand it's
not obligated to do that, but I assumed it would.
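
For anyone curious, a quick standalone illustration of the rule (not from the
thread); with gcc -std=gnu99 -Wvla, only the first array triggers the warning:

void example(void)
{
    const int n_const = 256;
    enum { N_ENUM = 256 };

    char a[n_const];   /* VLA: in C, a 'const int' is not an integer constant expression */
    char b[N_ENUM];    /* enumeration constant: not a VLA */
    char c[256];       /* literal: not a VLA */

    (void)a; (void)b; (void)c;
}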

Anyway, Philippe, I have no objection if you want to push forward with
this series.



Re: [PATCH 07/23] hw/block/nvme: Use definition to avoid dynamic stack allocation

2021-05-05 Thread Keith Busch
On Wed, May 05, 2021 at 11:10:31PM +0200, Philippe Mathieu-Daudé wrote:
> The compiler isn't clever enough to figure 'SEG_CHUNK_SIZE' is
> a constant! Help it by using a definition instead.

I don't understand. It's labeled 'const', so any reasonable compiler
will place it in the 'text' segment of the executable rather than on the
stack. While that's compiler specific, is there really a compiler doing
something bad with this? If not, I do prefer the 'const' here if only
because it limits the symbol scope ('static const' is also preferred if
that helps).

> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  hw/block/nvme.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 5fe082ec34c..2f6d4925826 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -812,7 +812,7 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, NvmeSg *sg, 
> NvmeSglDescriptor sgl,
>   * descriptors and segment chain) than the command transfer size, so it 
> is
>   * not bounded by MDTS.
>   */
> -const int SEG_CHUNK_SIZE = 256;
> +#define SEG_CHUNK_SIZE 256
>  
>  NvmeSglDescriptor segment[SEG_CHUNK_SIZE], *sgld, *last_sgld;
>  uint64_t nsgld;
> -- 
> 2.26.3
> 



Re: [PATCH 00/14] hw(/block/)nvme: spring cleaning

2021-04-20 Thread Keith Busch
On Mon, Apr 19, 2021 at 09:27:47PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> This series consists of various clean up patches.
> 
> The final patch moves nvme emulation from hw/block to hw/nvme.

Series looks good to me.

Reviewed-by: Keith Busch 



Re: [PATCH v3] hw/block/nvme: add device self test command support

2021-04-09 Thread Keith Busch
On Wed, Mar 31, 2021 at 02:54:27PM +0530, Gollu Appalanaidu wrote:
> This is to add support for Device Self Test Command (DST) and
> DST Log Page. Refer NVM Express specification 1.4b section 5.8
> ("Device Self-test command")

Please don't write change logs that just say what you did. I can read
the code to see that. Explain why this is useful because this frankly
looks like another useless feature. We don't need to implement every
optional spec feature here. There should be a real value proposition.



Re: [PATCH] hw/block/nvme: slba equal to nsze is out of bounds if nlb is 1-based

2021-04-09 Thread Keith Busch
On Fri, Apr 09, 2021 at 01:55:01PM +0200, Klaus Jensen wrote:
> On Apr  9 20:05, Minwoo Im wrote:
> > On 21-04-09 13:14:02, Gollu Appalanaidu wrote:
> > > NSZE is the total size of the namespace in logical blocks. So the max
> > > addressable logical block is NLB minus 1. So your starting logical
> > > block is equal to NSZE it is a out of range.
> > > 
> > > Signed-off-by: Gollu Appalanaidu 
> > > ---
> > >  hw/block/nvme.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 953ec64729..be9edb1158 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -2527,7 +2527,7 @@ static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest 
> > > *req)
> > >  uint64_t slba = le64_to_cpu(range[i].slba);
> > >  uint32_t nlb = le32_to_cpu(range[i].nlb);
> > > 
> > > -if (nvme_check_bounds(ns, slba, nlb)) {
> > > +if (nvme_check_bounds(ns, slba, nlb) || slba == 
> > > ns->id_ns.nsze) {
> > 
> > This patch also looks like check the boundary about slba.  Should it be
> > also checked inside of nvme_check_bounds() ?
> 
> The catch here is that DSM is like the only command where the number of
> logical blocks is a 1s-based value. Otherwise we always have nlb > 0, which
> means that nvme_check_bounds() will always "do the right thing".
> 
> My main gripe here is that (in my mind), by definition, a "zero length
> range" does not reference any LBAs at all. So how can it result in LBA Out
> of Range?

So what's the problem? If the request is to discard 0 blocks starting
from the last block, then that's valid. Is this patch actually fixing
anything?



Re: [PATCH] hw/block/nvme: map prp fix if prp2 contains non-zero offset

2021-04-08 Thread Keith Busch
On Thu, Apr 08, 2021 at 09:53:13PM +0530, Padmakar Kalghatgi wrote:
> +/*
> + *   The first PRP list entry, pointed by PRP2 can contain
> + *   offsets. Hence, we need calculate the no of entries in
> + *   prp2 based on the offset it has.
> + */

This comment has some unnecessary spacing at the beginning.

> +nents = (n->page_size - (prp2 % n->page_size)) >> 3;

page_size is always a power of two, so let's replace the costly modulo
with:

nents = (n->page_size - (prp2 & (n->page_size - 1))) >> 3;



Re: [PATCH for-6.0 v2 0/8] hw/block/nvme: misc fixes

2021-04-05 Thread Keith Busch
On Mon, Apr 05, 2021 at 07:54:44PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> Various fixes for 6.0.
> 
> v2:
>   - "hw/block/nvme: fix handling of private namespaces"
> update documentation (Gollu)
>   - add a patch for missing copyright headers

Series looks fine.

Reviewed-by: Keith Busch 



Re: [PATCH V4 0/8] hw/block/nvme: support namespace attachment

2021-03-04 Thread Keith Busch
On Tue, Mar 02, 2021 at 10:26:09PM +0900, Minwoo Im wrote:
> Hello,
> 
> This series supports namespace attachment: attach and detach.  This is
> the fourth version of series with replacing changed namespace list to
> bitmap to indicate changed namespace IDs.
> 
> Please review.

Looks good to me.

Reviewed-by: Keith Busch 



Re: [PATCH v4 00/12] hw/block/nvme: metadata and end-to-end data protection support

2021-03-01 Thread Keith Busch
On Mon, Mar 01, 2021 at 03:00:35PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen 
> 
> This is v4 (RFC removed) of a series that adds support for metadata and
> end-to-end data protection.
> 
> First, on the subject of metadata, in v1, support was restricted to
> extended logical blocks, which was pretty trivial to implement, but
> required special initialization and broke DULBE. In v2, metadata is
> always stored contiguously at the end of the underlying block device.
> This has the advantage of not breaking DULBE since the data blocks
> remain aligned and allows bdrv_block_status to be used to determine
> allocation status. It comes at the expense of complicating the extended
> LBA emulation, but on the other hand it also gains support for metadata
> transferred as a separate buffer.
> 
> The end-to-end data protection support blew up in terms of required
> changes. This is due to the fact that a bunch of new commands has been
> added to the device since v1 (zone append, compare, copy), and they all
> require various special handling for protection information. If
> potential reviewers would like it split up into multiple patches, each
> adding pi support to one command, shout out.
> 
> The core of the series (metadata and eedp) is preceded by a set of
> patches that refactors mapping (yes, again) and tries to deal with the
> qsg/iov duality mess (maybe also again?).
> 
> Support for metadata and end-to-end data protection is all joint work
> with Gollu Appalanaidu.

Looks fine.

Reviewed-by: Keith Busch 



Re: [PATCH RFC v3 12/12] hw/block/nvme: add support for the format nvm command

2021-03-01 Thread Keith Busch
On Wed, Feb 17, 2021 at 09:26:37AM +0100, Klaus Jensen wrote:
> On Feb 16 15:16, Keith Busch wrote:
> > On Mon, Feb 15, 2021 at 12:02:40AM +0100, Klaus Jensen wrote:
> > > From: Minwoo Im 
> > > 
> > > The Format NVM admin command can format a namespace or namespaces with a
> > > different LBA size and metadata size, along with protection information
> > > types.
> > > 
> > > This patch introduces the Format NVM command with LBA format, Metadata, and
> > > Protection Information support for the device. Secure erase support is
> > > yet to be added.
> > > 
> > > The parameter checks in this patch have been adapted from
> > > Keith's old branch.
> > 
> > Oh, and here's the format command now, so my previous comment on patch
> > 11 doesn't matter.
> > 
> > > +struct nvme_aio_format_ctx {
> > > +NvmeRequest   *req;
> > > +NvmeNamespace *ns;
> > > +
> > > +/* number of outstanding write zeroes for this namespace */
> > > +int *count;
> > 
> > Shouldn't this count be the NvmeRequest's opaque value?
> 
> That is already occupied by `num_formats` which tracks formats of
> individual namespaces. `count` is for outstanding write zeroes on one
> particular namespace.

But why are they tracked separately? It looks like we only care about
the number of outstanding zero-out commands. It doesn't matter how that
number is spread across multiple namespaces.



Re: [PATCH v2 5/5] hw/block/nvme: report non-mdts command size limit for dsm

2021-03-01 Thread Keith Busch
On Mon, Mar 01, 2021 at 12:15:26PM +0100, Klaus Jensen wrote:
> On Feb 22 22:12, Klaus Jensen wrote:
> > On Feb 23 05:55, Keith Busch wrote:
> > > On Mon, Feb 22, 2021 at 07:47:59PM +0100, Klaus Jensen wrote:
> > > > +typedef struct NvmeIdCtrlNvm {
> > > > +uint8_t vsl;
> > > > +uint8_t wzsl;
> > > > +uint8_t wusl;
> > > > +uint8_t dmrl;
> > > > +uint32_tdmrsl;
> > > > +uint64_tdmsl;
> > > > +uint8_t rsvd16[4080];
> > > > +} NvmeIdCtrlNvm;
> > > 
> > > TP 4040a still displays these fields with preceding '...' indicating
> > > something comes before this. Is that just left-over from the integration
> > > for TBD offsets, or is there something that still hasn't been accounted
> > > for?
> > 
> > Good question.
> > 
> > But since the TBDs have been assigned I believe it is just a left-over.
> > I must admit that I have not cross checked this with all other TPs, but
> > AFAIK this is the only ratified TP that adds something to the
> > NVM-specific identify controller data structure.
> 
> Are you otherwise OK with this?

Yes, I compared other TPs and it appears to be set for good once an
actual numeric value is assigned, so it's okay to go here.

Reviewed-by: Keith Busch 



Re: [PATCH V2 6/7] hw/block/nvme: support namespace attachment command

2021-02-26 Thread Keith Busch
On Thu, Feb 11, 2021 at 01:09:36AM +0900, Minwoo Im wrote:
> @@ -183,6 +183,7 @@ static const uint32_t nvme_cse_acs[256] = {
>  [NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFF_CSUPP,
>  [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP,
>  [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP,
> +[NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP,

Missing NVME_CMD_EFF_NIC for the attachment command.
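
Presumably something along these lines (assuming the NIC bit is spelled
NVME_CMD_EFF_NIC like the other effects-log flags):

    [NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,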

>  };
>  
>  static const uint32_t nvme_cse_iocs_none[256];
> @@ -3766,6 +3767,62 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req)
>  return NVME_NO_COMPLETE;
>  }
>  
> +static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns);
> +static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req)
> +{
> +NvmeNamespace *ns;
> +NvmeCtrl *ctrl;
> +uint16_t list[NVME_CONTROLLER_LIST_SIZE] = {};
> +uint32_t nsid = le32_to_cpu(req->cmd.nsid);
> +uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
> +bool attach = !(dw10 & 0xf);
> +uint16_t *nr_ids = [0];
> +uint16_t *ids = [1];
> +uint16_t ret;
> +int i;
> +
> +trace_pci_nvme_ns_attachment(nvme_cid(req), dw10 & 0xf);
> +
> +ns = nvme_subsys_ns(n->subsys, nsid);
> +if (!ns) {
> +return NVME_INVALID_FIELD | NVME_DNR;
> +}
> +
> +ret = nvme_dma(n, (uint8_t *)list, 4096,
> +   DMA_DIRECTION_TO_DEVICE, req);
> +if (ret) {
> +return ret;
> +}
> +
> +if (!*nr_ids) {
> +return NVME_NS_CTRL_LIST_INVALID | NVME_DNR;
> +}
> +
> +for (i = 0; i < *nr_ids; i++) {
> +ctrl = nvme_subsys_ctrl(n->subsys, ids[i]);
> +if (!ctrl) {
> +return NVME_NS_CTRL_LIST_INVALID | NVME_DNR;
> +}
> +
> +if (attach) {
> +if (nvme_ns_is_attached(ctrl, ns)) {
> +return NVME_NS_ALREADY_ATTACHED | NVME_DNR;
> +}
> +
> +nvme_ns_attach(ctrl, ns);
> +__nvme_select_ns_iocs(ctrl, ns);
> +} else {
> +if (!nvme_ns_is_attached(ctrl, ns)) {
> +return NVME_NS_NOT_ATTACHED | NVME_DNR;
> +}
> +
> +nvme_ns_detach(ctrl, ns);
> +}
> +}
> +
> +return NVME_SUCCESS;
> +}

Every controller that has newly attached the namespace needs to emit a
Namespace Attribute Changed notice AER in order for the host to react
correctly to the command.
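
A sketch of what that could look like per attached controller, assuming the
existing AER helper and that the notice/log constants keep their current
names:

    /* constant names are assumptions -- adjust to what the tree defines */
    nvme_enqueue_event(ctrl, NVME_AER_TYPE_NOTICE,
                       NVME_AER_INFO_NOTICE_NS_ATTR_CHANGED,
                       NVME_LOG_CHANGED_NSLIST);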



Re: [PATCH 0/3] hw/block/nvme: mdts/zasl cleanup

2021-02-22 Thread Keith Busch
These look good.

Reviewed-by: Keith Busch 



Re: [PATCH v2 5/5] hw/block/nvme: report non-mdts command size limit for dsm

2021-02-22 Thread Keith Busch
On Mon, Feb 22, 2021 at 07:47:59PM +0100, Klaus Jensen wrote:
> +typedef struct NvmeIdCtrlNvm {
> +uint8_t vsl;
> +uint8_t wzsl;
> +uint8_t wusl;
> +uint8_t dmrl;
> +uint32_tdmrsl;
> +uint64_tdmsl;
> +uint8_t rsvd16[4080];
> +} NvmeIdCtrlNvm;

TP 4040a still displays these fields with preceding '...' indicating
something comes before this. Is that just left-over from the integration
for TBD offsets, or is there something that still hasn't been accounted
for?



Re: [PATCH RFC v3 00/12] hw/block/nvme: metadata and end-to-end data protection support

2021-02-17 Thread Keith Busch
On Wed, Feb 17, 2021 at 10:06:49AM +0100, Klaus Jensen wrote:
> On Feb 16 16:19, Keith Busch wrote:
> > The verify implementation looked fine, but lacking a generic backing for
> > it, it sounds to me like the use cases aren't there to justify taking on
> > this feature.
> 
> Please check my reply on the verify patch - can you elaborate on
> "generic backing"? I'm not sure I understand what you have in mind,
> API-wise.

I meant it'd be nice if the QEMU block API provided a function like
"blk_aio_pverify()" to handle the details that you're implementing in
the nvme device. As you mentioned, though, handling the protection
information part is problematic.
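
Purely hypothetical, but a signature modelled on the existing blk_aio_*
helpers might look something like:

    /* hypothetical -- no such function exists in the block layer today */
    BlockAIOCB *blk_aio_pverify(BlockBackend *blk, int64_t offset,
                                QEMUIOVector *qiov, BdrvRequestFlags flags,
                                BlockCompletionFunc *cb, void *opaque);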


