Re: [PATCH v11 02/10] block/raw: add persistent reservation in/out driver
On Mon, Sep 09, 2024 at 07:34:45PM +0800, Changqi Lu wrote:
> +static int coroutine_fn GRAPH_RDLOCK
> +raw_co_pr_register(BlockDriverState *bs, uint64_t old_key,
> +                   uint64_t new_key, BlockPrType type,
> +                   bool ptpl, bool ignore_key)
> +{
> +    return bdrv_co_pr_register(bs->file->bs, old_key, new_key,
> +                               type, ptpl, ignore_key);
> +}

The nvme parts look fine, but could you help me understand how this all
works? I was looking for something utilizing ioctls, like
IOC_PR_REGISTER for this one, chained through the file-posix block
driver. Is this only supposed to work with iscsi?
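A hedged sketch of what the review was expecting to find underneath: the Linux block layer exposes persistent reservations to userspace via ioctls from <linux/pr.h>, so a file-posix implementation of the register operation could presumably chain down to IOC_PR_REGISTER. `pr_register_key()` below is an illustrative helper, not QEMU code; it needs a PR-capable block device and sufficient privileges to actually succeed.

```c
/* Illustrative use of the kernel's block-layer PR API (<linux/pr.h>).
 * Mirrors the (old_key, new_key, ignore_key) parameters of the quoted
 * raw_co_pr_register(). Not QEMU code; names are assumptions. */
#include <stdint.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/pr.h>

static int pr_register_key(const char *dev, uint64_t old_key,
                           uint64_t new_key, int ignore_key)
{
    struct pr_registration reg;
    int fd, ret;

    fd = open(dev, O_RDWR);
    if (fd < 0) {
        return -1;  /* no device, nothing to register against */
    }

    memset(&reg, 0, sizeof(reg));
    reg.old_key = old_key;
    reg.new_key = new_key;
    reg.flags = ignore_key ? PR_FL_IGNORE_KEY : 0;

    ret = ioctl(fd, IOC_PR_REGISTER, &reg);
    close(fd);
    return ret;
}
```

Whether hw/block should route through this path or only through iscsi is exactly the open question in the thread.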
[PATCH] nvme: report id controller metadata sgl support
From: Keith Busch

The controller already supports this decoding, so just set the
ID_CTRL.SGLS field accordingly.

Signed-off-by: Keith Busch
---
 hw/nvme/ctrl.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 5b1b0cabcf..68d4c19eda 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -8299,7 +8299,8 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     id->vwc = NVME_VWC_NSID_BROADCAST_SUPPORT | NVME_VWC_PRESENT;
     id->ocfs = cpu_to_le16(NVME_OCFS_COPY_FORMAT_0 | NVME_OCFS_COPY_FORMAT_1);
-    id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN);
+    id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN |
+                           NVME_CTRL_SGLS_MPTR_SGL);

     nvme_init_subnqn(n);
--
2.43.5
Re: [PATCH] hw/nvme: Add support for setting the MQES for the NVMe emulation
On Thu, Apr 04, 2024 at 01:04:18PM +0100, John Berg wrote:
> The MQES field in the CAP register describes the Maximum Queue Entries
> Supported for the IO queues of an NVMe controller. Adding a +1 to the
> value in this field results in the total queue size. A full queue is
> when a queue of size N contains N - 1 entries, and the minimum queue
> size is 2. Thus the lowest MQES value is 1.
>
> This patch adds the new mqes property to the NVMe emulation which allows
> a user to specify the maximum queue size by setting this property. This
> is useful as it enables testing of NVMe controllers where the MQES is
> relatively small. The smallest NVMe queue size supported in NVMe is 2
> submission and completion entries, which means that the smallest legal
> mqes value is 1.
>
> The following example shows how the mqes can be set for the NVMe
> emulation:
>
>     -drive id=nvme0,if=none,file=nvme.img,format=raw
>     -device nvme,drive=nvme0,serial=foo,mqes=1
>
> If the mqes property is not provided then the default mqes will still be
> 0x7ff (the queue size is 2048 entries).

Looks good. I had to double check where nvme_create_sq() was getting its
limit from when processing the host command, and sure enough it's
directly from the register field.

Reviewed-by: Keith Busch
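The zeroes-based arithmetic described in the commit message can be sketched as plain helpers (illustrative only, not QEMU code): MQES advertises queue size minus one, and a ring of size N holds at most N - 1 entries because a full queue is distinguished from an empty one by leaving a slot free.

```c
/* MQES is zeroes-based: queue size = MQES + 1, so the minimum legal
 * queue of 2 entries is MQES = 1, and the default MQES of 0x7ff gives
 * 2048 entries. A ring of size N is full when advancing the tail would
 * collide with the head, i.e. it holds at most N - 1 entries. */
#include <stdint.h>
#include <stdbool.h>

static inline uint32_t mqes_to_qsize(uint16_t mqes)
{
    return (uint32_t)mqes + 1;
}

static inline bool sq_full(uint32_t head, uint32_t tail, uint32_t qsize)
{
    return (tail + 1) % qsize == head;
}

static inline uint32_t sq_entries(uint32_t head, uint32_t tail, uint32_t qsize)
{
    return (tail - head + qsize) % qsize;
}
```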
Re: [PATCH] hw/nvme: fix endianness issue for shadow doorbells
On Tue, Jul 18, 2023 at 12:35:12PM +0200, Klaus Jensen wrote:
> From: Klaus Jensen
>
> In commit 2fda0726e514 ("hw/nvme: fix missing endian conversions for
> doorbell buffers"), we fixed shadow doorbells for big-endian guests
> running on little endian hosts. But I did not fix little-endian guests
> on big-endian hosts. Fix this.
>
> Solves issue #1765.
>
> Fixes: 3f7fe8de3d49 ("hw/nvme: Implement shadow doorbell buffer support")
> Cc: qemu-sta...@nongnu.org
> Reported-by: Thomas Huth
> Signed-off-by: Klaus Jensen

Looks good!

Reviewed-by: Keith Busch
Re: [PATCH v2 5/5] hw/nvme: flexible data placement emulation
On Fri, Feb 17, 2023 at 01:07:43PM +0100, Jesper Devantier wrote:
> +static void nvme_do_write_fdp(NvmeCtrl *n, NvmeRequest *req, uint64_t slba,
> +                              uint32_t nlb)
> +{
> +    NvmeNamespace *ns = req->ns;
> +    NvmeRwCmd *rw = (NvmeRwCmd *)&req->cmd;
> +    uint64_t data_size = nvme_l2b(ns, nlb);
> +    uint32_t dw12 = le32_to_cpu(req->cmd.cdw12);
> +    uint8_t dtype = (dw12 >> 20) & 0xf;
> +    uint16_t pid = le16_to_cpu(rw->dspec);
> +    uint16_t ph, rg, ruhid;
> +    NvmeReclaimUnit *ru;
> +
> +    if (dtype != NVME_DIRECTIVE_DATA_PLACEMENT
> +        || !nvme_parse_pid(ns, pid, &ph, &rg)) {

Style nit, the "||" ought to go on the previous line.

> +        ph = 0;
> +        rg = 0;
> +    }
> +
> +    ruhid = ns->fdp.phs[ph];
> +    ru = &ns->endgrp->fdp.ruhs[ruhid].rus[rg];
> +
> +    nvme_fdp_stat_inc(&ns->endgrp->fdp.hbmw, data_size);
> +    nvme_fdp_stat_inc(&ns->endgrp->fdp.mbmw, data_size);
> +
> +    //trace_pci_nvme_fdp_ruh_write(ruh->rgid, ruh->ruhid, ruh->nlb_ruamw, nlb);
> +
> +    while (nlb) {
> +        if (nlb < ru->ruamw) {
> +            ru->ruamw -= nlb;
> +            break;
> +        }
> +
> +        nlb -= ru->ruamw;
> +        //trace_pci_nvme_fdp_ruh_change(ruh->rgid, ruh->ruhid);

Please use the trace points if you find them useful, otherwise just
delete them instead of committing commented out code.

Beyond that, looks good! For the series:

Reviewed-by: Keith Busch
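The reclaim-unit accounting loop quoted above can be modeled in a self-contained way: a write of nlb blocks consumes the remaining writable capacity (ruamw) of the current reclaim unit, and each time a unit is exhausted a "reclaim unit change" occurs and the remainder spills into a fresh unit. The reset-to-full step and the names below are simplified stand-ins for the hw/nvme structures, since the quoted hunk is truncated.

```c
/* Simplified model of FDP reclaim-unit consumption. Assumes
 * ruamw_full > 0. A return value > 0 is the number of reclaim unit
 * changes the write caused (what the ruh_change trace point would log). */
#include <stdint.h>

struct ru {
    uint64_t ruamw;  /* blocks writable before the unit changes */
};

static unsigned ru_write(struct ru *ru, uint64_t ruamw_full, uint64_t nlb)
{
    unsigned changes = 0;

    while (nlb) {
        if (nlb < ru->ruamw) {
            ru->ruamw -= nlb;   /* write fits in the current unit */
            break;
        }

        nlb -= ru->ruamw;
        ru->ruamw = ruamw_full; /* unit exhausted: reclaim unit change */
        changes++;
    }

    return changes;
}
```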
Re: [PATCH 5/5] hw/nvme: flexible data placement emulation
On Thu, Feb 16, 2023 at 05:48:06PM +0100, Jesper Devantier wrote:
> +static bool nvme_ns_init_fdp(NvmeNamespace *ns, Error **errp)
> +{
> +    NvmeEnduranceGroup *endgrp = ns->endgrp;
> +    NvmeRuHandle *ruh;
> +    uint8_t lbafi = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> +    unsigned int *ruhid, *ruhids;
> +    char *r, *p, *token;
> +    uint16_t *ph;
> +
> +    if (!ns->params.fdp.ruhs) {
> +        ns->fdp.nphs = 1;
> +        ph = ns->fdp.phs = g_new(uint16_t, 1);
> +
> +        ruh = nvme_find_ruh_by_attr(endgrp, NVME_RUHA_CTRL, ph);
> +        if (!ruh) {
> +            ruh = nvme_find_ruh_by_attr(endgrp, NVME_RUHA_UNUSED, ph);
> +            if (!ruh) {
> +                error_setg(errp, "no unused reclaim unit handles left");
> +                return false;
> +            }
> +
> +            ruh->ruha = NVME_RUHA_CTRL;
> +            ruh->lbafi = lbafi;
> +            ruh->ruamw = endgrp->fdp.runs >> ns->lbaf.ds;
> +
> +            for (uint16_t rg = 0; rg < endgrp->fdp.nrg; rg++) {
> +                ruh->rus[rg].ruamw = ruh->ruamw;
> +            }
> +        } else if (ruh->lbafi != lbafi) {
> +            error_setg(errp, "lba format index of controller assigned "
> +                       "reclaim unit handle does not match namespace lba "
> +                       "format index");
> +            return false;
> +        }
> +
> +        return true;
> +    }
> +
> +    ruhid = ruhids = g_new0(unsigned int, endgrp->fdp.nruh);
> +    r = p = strdup(ns->params.fdp.ruhs);
> +
> +    /* parse the reclaim unit handle identifiers */
> +    while ((token = qemu_strsep(&p, ";")) != NULL) {
> +        if (++ns->fdp.nphs == endgrp->fdp.nruh) {

Since a namespace can't have more than 128 placement handles, and the
endurance group can have more, I think the 128 limit needs to be
checked here.

> +            error_setg(errp, "too many placement handles");
> +            free(r);
> +            return false;
> +        }
> +
> +        if (qemu_strtoui(token, NULL, 0, ruhid++) < 0) {
> +            error_setg(errp, "cannot parse reclaim unit handle identifier");
> +            free(r);
> +            return false;
> +        }
> +    }
Re: [PATCH 5/5] hw/nvme: flexible data placement emulation
This mostly looks fine. I need to clarify some spec decisions, though.

By default, the FDP feature is disabled: "The default value of this
Feature shall be 0h." You can't change the value as long as namespaces
exist within the group, so FDP requires NS Mgmt be supported if you're
going to enable it. The spec, however, doesn't seem to explicitly say
that, and NS Mgmt isn't currently supported in this controller. We can
look past that for this implementation to allow static settings even if
that doesn't perfectly align with this feature (NS Mgmt is kind of
difficult here).

In order to match what the spec says is possible, though, we can't have
a namespace with more than 128 placement handles since that's the most
you can provide when you create the namespace.

Does that make sense, or am I totally misunderstanding the details?
Re: [PATCH 4/5] hw/nvme: basic directives support
On Thu, Feb 16, 2023 at 06:35:27PM +0100, Klaus Jensen wrote:
> On Thu, Feb 16, 2023, at 18:23, Keith Busch wrote:
> > On Thu, Feb 16, 2023 at 05:48:05PM +0100, Jesper Devantier wrote:
> >> +enum NvmeDirective {
> >> +    NVME_DIRECTIVE_SUPPORTED = 0x0,
> >> +    NVME_DIRECTIVE_ENABLED   = 0x1,
> >> +};
> >
> > What's this?
>
> That's a left-over from my rebase. I'll fix that one up.

Okay, other than that, this one looks good.
Re: [PATCH 4/5] hw/nvme: basic directives support
On Thu, Feb 16, 2023 at 05:48:05PM +0100, Jesper Devantier wrote:
> +enum NvmeDirective {
> +    NVME_DIRECTIVE_SUPPORTED = 0x0,
> +    NVME_DIRECTIVE_ENABLED   = 0x1,
> +};

What's this?
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 19, 2023 at 01:10:57PM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 12:44 PM Keith Busch wrote:
> >
> > Further up, it says the "interrupt gateway" is responsible for
> > forwarding new interrupt requests while the level remains asserted, but
> > it doesn't look like anything is handling that, which essentially turns
> > this into an edge interrupt. Am I missing something, or is this really
> > not being handled?
>
> Yeah, that wouldn't be handled. In QEMU the PLIC relies on QEMU's
> internal GPIO lines to trigger an interrupt. So with the current setup
> we only support edge triggered interrupts.

Thanks for confirming!

Klaus, I think we can justify introducing a work-around in the emulated
device now. My previous proposal with pci_irq_pulse() is no good since
it does assert+deassert, but it needs to be the other way around, so
please don't consider that one.

Also, we ought to revisit the intms/intmc usage in the linux driver for
threaded interrupts.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 19, 2023 at 10:41:42AM +1000, Alistair Francis wrote:
> On Thu, Jan 19, 2023 at 9:07 AM Keith Busch wrote:
> > ---
> > diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
> > index c2dfacf028..f8f7af08dc 100644
> > --- a/hw/intc/sifive_plic.c
> > +++ b/hw/intc/sifive_plic.c
> > @@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
> >             uint32_t max_irq = sifive_plic_claimed(plic, addrid);
> >
> >             if (max_irq) {
> > -                sifive_plic_set_pending(plic, max_irq, false);
> >                 sifive_plic_set_claimed(plic, max_irq, true);
> >             }
>
> This change isn't correct. The PLIC spec
> (https://github.com/riscv/riscv-plic-spec/releases/download/1.0.0_rc5/riscv-plic-1.0.0_rc5.pdf)
> states:
>
> """
> On receiving a claim message, the PLIC core will atomically determine
> the ID of the highest-priority pending interrupt for the target and
> then clear down the corresponding source's IP bit
> """
>
> which is what we are doing here. We are clearing the pending interrupt
> inside the PLIC.

Thanks for the link. That's very helpful. If you're familiar with this
area, could you possibly clear up this part of that spec?

"
On receiving an interrupt completion message, if the interrupt is
level-triggered and the interrupt is still asserted, a new interrupt
request will be forwarded to the PLIC core.
"

Further up, it says the "interrupt gateway" is responsible for
forwarding new interrupt requests while the level remains asserted, but
it doesn't look like anything is handling that, which essentially turns
this into an edge interrupt. Am I missing something, or is this really
not being handled?
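The two spec clauses being debated can be captured in a tiny state-machine model (an illustration of the discussion, not QEMU or PLIC code): claiming clears the IP bit, but on *completion* the gateway must re-forward the request if a level-triggered source is still asserted. Dropping that re-forwarding step is what turns level semantics into edge semantics.

```c
/* Minimal model of one PLIC source/gateway pair. plic_complete() contains
 * the re-forwarding step the thread concludes is missing in QEMU. */
#include <stdbool.h>

struct plic_src {
    bool level;    /* current state of the input line */
    bool pending;  /* the source's IP bit */
    bool claimed;
};

static void gateway_raise(struct plic_src *s) { s->level = true; s->pending = true; }
static void gateway_lower(struct plic_src *s) { s->level = false; }

static bool plic_claim(struct plic_src *s)
{
    if (!s->pending) {
        return false;
    }
    s->pending = false;  /* spec: claim clears the IP bit */
    s->claimed = true;
    return true;
}

static void plic_complete(struct plic_src *s)
{
    s->claimed = false;
    if (s->level) {
        s->pending = true;  /* re-forward while the line is still asserted */
    }
}
```

Without the `if (s->level)` branch, an interrupt that stays asserted across a claim/complete cycle is simply lost, which matches the observed completion timeouts.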
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Wed, Jan 18, 2023 at 09:33:05AM -0700, Keith Busch wrote:
> On Wed, Jan 18, 2023 at 03:04:06PM +0000, Peter Maydell wrote:
> > On Tue, 17 Jan 2023 at 19:21, Guenter Roeck wrote:
> > > Anyway - any idea what to do to help figuring out what is happening?
> > > Add tracing support to pci interrupt handling, maybe?
> >
> > For intermittent bugs, I like recording the QEMU session under
> > rr (using its chaos mode to provoke the failure if necessary) to
> > get a recording that I can debug and re-debug at leisure. Usually
> > you want to turn on/add tracing to help with this, and if the
> > failure doesn't hit early in bootup then you might need to
> > do a QEMU snapshot just before point-of-failure so you can
> > run rr only on the short snapshot-to-failure segment.
> >
> > https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> > https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
> >
> > This gives you a debugging session from the QEMU side's perspective,
> > of course -- assuming you know what the hardware is supposed to do
> > you hopefully wind up with either "the guest software did X,Y,Z
> > and we incorrectly did A" or else "the guest software did X,Y,Z,
> > the spec says A is the right/a permitted thing but the guest got confused".
> > If it's the latter then you have to look at the guest as a separate
> > code analysis/debug problem.
>
> Here's what I got, though I'm way out of my depth here.
>
> It looks like the Linux kernel's fasteoi for RISC-V's PLIC claims the
> interrupt after its first handling, which I think is expected. After
> claiming, QEMU masks the pending interrupt, lowering the level, though
> the device that raised it never deasserted.

I'm not sure if this is correct, but this is what I'm coming up with and
it appears to fix the problem on my setup. The hardware that sets the
pending interrupt is going to clear it, so I don't see why the interrupt
controller is automatically clearing it when the host claims it.

---
diff --git a/hw/intc/sifive_plic.c b/hw/intc/sifive_plic.c
index c2dfacf028..f8f7af08dc 100644
--- a/hw/intc/sifive_plic.c
+++ b/hw/intc/sifive_plic.c
@@ -157,7 +157,6 @@ static uint64_t sifive_plic_read(void *opaque, hwaddr addr, unsigned size)
             uint32_t max_irq = sifive_plic_claimed(plic, addrid);

             if (max_irq) {
-                sifive_plic_set_pending(plic, max_irq, false);
                 sifive_plic_set_claimed(plic, max_irq, true);
             }
--
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
Klaus,

This isn't going to help your issue, but there are at least two legacy
irq bugs in the nvme qemu implementation.

1. The admin queue breaks if started with legacy and msix is
   initialized later.
2. The legacy vector is shared among all queues, but it's being
   deasserted when the first queue's doorbell makes it empty. It should
   remain asserted if any cq is non-empty.

I'll send you some patches for those later. Still working on the real
problem.
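The second bug reduces to a simple invariant that can be sketched standalone: with pin-based interrupts the single INTx line is shared by every completion queue, so it may only be deasserted once *no* CQ has outstanding completions. The bitmask below models hw/nvme's per-CQ assertion state (the shape of `n->irq_status` and `intms`); the helper names are illustrative.

```c
/* One bit per CQ with outstanding completions; INTx follows the OR of
 * all unmasked bits, so draining one queue must only clear that queue's
 * bit and then recompute, never deassert the shared line outright. */
#include <stdint.h>
#include <stdbool.h>

static bool intx_should_assert(uint32_t irq_status, uint32_t intms)
{
    return (~intms & irq_status) != 0;
}

static uint32_t cq_drained(uint32_t irq_status, unsigned cq_vector)
{
    return irq_status & ~(1u << cq_vector);  /* clear only this CQ's bit */
}
```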
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Wed, Jan 18, 2023 at 03:04:06PM +0000, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 19:21, Guenter Roeck wrote:
> > Anyway - any idea what to do to help figuring out what is happening?
> > Add tracing support to pci interrupt handling, maybe?
>
> For intermittent bugs, I like recording the QEMU session under
> rr (using its chaos mode to provoke the failure if necessary) to
> get a recording that I can debug and re-debug at leisure. Usually
> you want to turn on/add tracing to help with this, and if the
> failure doesn't hit early in bootup then you might need to
> do a QEMU snapshot just before point-of-failure so you can
> run rr only on the short snapshot-to-failure segment.
>
> https://translatedcode.wordpress.com/2015/05/30/tricks-for-debugging-qemu-rr/
> https://translatedcode.wordpress.com/2015/07/06/tricks-for-debugging-qemu-savevm-snapshots/
>
> This gives you a debugging session from the QEMU side's perspective,
> of course -- assuming you know what the hardware is supposed to do
> you hopefully wind up with either "the guest software did X,Y,Z
> and we incorrectly did A" or else "the guest software did X,Y,Z,
> the spec says A is the right/a permitted thing but the guest got confused".
> If it's the latter then you have to look at the guest as a separate
> code analysis/debug problem.

Here's what I got, though I'm way out of my depth here.

It looks like the Linux kernel's fasteoi for RISC-V's PLIC claims the
interrupt after its first handling, which I think is expected. After
claiming, QEMU masks the pending interrupt, lowering the level, though
the device that raised it never deasserted.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
>
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
>
>     nvme nvme0: I/O 333 QID 1 timeout, completion polled

I finally got a riscv setup recreating this, and I am not sure how
you're getting a "completion polled" here. I find that the polling from
here progresses the request state to IDLE, so the timeout handler moves
on to aborting because it's expecting COMPLETED. The driver should not
be doing that, so I'll fix that up while also investigating why the
interrupt isn't retriggering as expected.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.

That's assuming completions are deferred to a bottom half. We don't do
that by default in Linux nvme, though you can ask the driver to do that
if you want.

> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
>
> No changes are required in the QEMU hw/nvme interrupt logic.

Yeah, I can see why: it forces the irq line to deassert then assert,
just like we had forced to happen within the device side patches. Still,
none of that is supposed to be necessary, but this idea of using these
registers is probably fine.

>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +	struct nvme_queue *nvmeq = data;
> +	struct nvme_dev *dev = nvmeq->dev;
> +	u32 mask = 1 << nvmeq->cq_vector;
> +	irqreturn_t ret = IRQ_NONE;
> +	DEFINE_IO_COMP_BATCH(iob);
> +
> +	writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +	if (nvme_poll_cq(nvmeq, &iob)) {
> +		if (!rq_list_empty(iob.req_list))
> +			nvme_pci_complete_batch(&iob);
> +		ret = IRQ_HANDLED;
> +	}
> +
> +	writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +	return ret;
> +}

If threaded interrupts are used, you'll want to do the masking in
nvme_irq_check(), then clear it in the threaded handler instead of doing
both in the same callback.

> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
>  	struct nvme_queue *nvmeq = data;
>  	DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
>  	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>  	int nr = nvmeq->dev->ctrl.instance;
> +	irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
>
>  	if (use_threaded_interrupts) {
>  		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +				handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>  	} else {
> -		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +		return pci_request_irq(pdev, nvmeq->cq_vector, handler,
>  				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>  	}
>  }
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Fri, Jan 13, 2023 at 12:32:29PM +0000, Peter Maydell wrote:
> On Fri, 13 Jan 2023 at 08:55, Klaus Jensen wrote:
> >
> > +CC qemu pci maintainers
> >
> > Michael, Marcel,
> >
> > Do you have any comments on this thread? As you can see, one solution is
> > to simply deassert prior to asserting, the other is to reintroduce a
> > pci_irq_pulse(). Both seem to solve the issue.
>
> Both seem to be missing any analysis of "this is what is
> happening, this is where we differ from hardware, this
> is why this is the correct fix". We shouldn't put in
> random "this seems to happen to cause the guest to boot"
> fixes, please.

It looks like these are being treated as edge instead of level
interrupts, so the work-around is to create more edges. I would
definitely prefer a real fix, whether that's in the kernel or CPU
emulation or somewhere else, but I'm not sure I'll have time to see it
to completion.

FWIW, x86 seems to handle legacy IRQs with NVMe as expected. It's
actually easy enough for the DEASSERT to take so long that the kernel
reports the irq as "spurious" because it's spinning too often on the
level.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > >
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> >
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
>
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
>
> diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> index 03760ddeae8c..0fc46dcb9ec4 100644
> --- i/hw/nvme/ctrl.c
> +++ w/hw/nvme/ctrl.c
> @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>          return;
>      }
>      if (~intms & n->irq_status) {
> +        pci_irq_deassert(&n->parent_obj);
>          pci_irq_assert(&n->parent_obj);
>      } else {
>          pci_irq_deassert(&n->parent_obj);
>
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
>
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.
>
> I'll see if I can set up a mips rootfs and test that. Guenter, what MIPS
> machine/board(s) are you testing?

Could you try the below?

---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 2c85de4700..521c3c80c1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -558,6 +558,20 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
     }
 }

+static void nvme_irq_pulse(NvmeCtrl *n, NvmeCQueue *cq)
+{
+    if (!cq->irq_enabled) {
+        return;
+    }
+
+    if (msix_enabled(&(n->parent_obj))) {
+        msix_notify(&(n->parent_obj), cq->vector);
+        return;
+    }
+
+    pci_irq_pulse(&n->parent_obj);
+}
+
 static void nvme_req_clear(NvmeRequest *req)
 {
     req->ns = NULL;
@@ -6917,6 +6931,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
             }

             nvme_irq_deassert(n, cq);
+        } else {
+            /*
+             * Retrigger the irq just to make sure the host has no excuse for
+             * not knowing there's more work to complete on this CQ.
+             */
+            nvme_irq_pulse(n, cq);
         }
     } else {
         /* Submission queue doorbell write */
--
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 06:45:55PM +0100, Klaus Jensen wrote:
> On Jan 12 09:34, Keith Busch wrote:
> > On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> > >
> > > The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> > > am wondering if there is something going on with the kernel driver (but
> > > I certainly do not rule out that hw/nvme is at fault here, since
> > > pin-based interrupts has also been a source of several issues in the
> > > past).
> >
> > Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
> > While probably not the "correct" thing to do, it has better results in
> > my testing.
>
> A simple s/pci_irq_assert/pci_irq_pulse broke the device. However,
>
> diff --git i/hw/nvme/ctrl.c w/hw/nvme/ctrl.c
> index 03760ddeae8c..0fc46dcb9ec4 100644
> --- i/hw/nvme/ctrl.c
> +++ w/hw/nvme/ctrl.c
> @@ -477,6 +477,7 @@ static void nvme_irq_check(NvmeCtrl *n)
>          return;
>      }
>      if (~intms & n->irq_status) {
> +        pci_irq_deassert(&n->parent_obj);
>          pci_irq_assert(&n->parent_obj);
>      } else {
>          pci_irq_deassert(&n->parent_obj);
>
> seems to do the trick (pulse is the other way around, assert, then
> deassert).
>
> Probably not the "correct" thing to do, but I'll take it since it seems
> to fix it. On a simple boot loop I got the timeout about 1 out of 5. I'm
> on ~20 runs now and have not encountered it.

Yep, that looks good. It's sounding like something with the CPU irq
handling is treating the level interrupt as edge triggered.
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
>
> The pin-based interrupt logic in hw/nvme seems sound enough to me, so I
> am wondering if there is something going on with the kernel driver (but
> I certainly do not rule out that hw/nvme is at fault here, since
> pin-based interrupts has also been a source of several issues in the
> past).

Does it work if you change the pci_irq_assert() back to pci_irq_pulse()?
While probably not the "correct" thing to do, it has better results in
my testing.
Re: [PATCH v4 1/2] hw/nvme: Support for Namespaces Management from guest OS - create-ns
On Wed, Dec 28, 2022 at 01:41:40PM -0600, Jonathan Derrick wrote:
> +static uint16_t nvme_ns_mgmt(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeCtrl *n_p = NULL; /* primary controller */
> +    NvmeIdCtrl *id = &n->id_ctrl;
> +    NvmeNamespace *ns;
> +    NvmeIdNsMgmt id_ns = {};
> +    uint8_t flags = req->cmd.flags;
> +    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
> +    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
> +    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
> +    uint8_t sel = dw10 & 0xf;
> +    uint8_t csi = (dw11 >> 24) & 0xf;
> +    uint16_t i;
> +    uint16_t ret;
> +    Error *local_err = NULL;
> +
> +    trace_pci_nvme_ns_mgmt(nvme_cid(req), nsid, sel, csi, NVME_CMD_FLAGS_PSDT(flags));
> +
> +    if (!(le16_to_cpu(id->oacs) & NVME_OACS_NS_MGMT)) {
> +        return NVME_NS_ATTACH_MGMT_NOTSPRD | NVME_DNR;
> +    }
> +
> +    if (n->cntlid && !n->subsys) {
> +        error_setg(&local_err, "Secondary controller without subsystem");
> +        return NVME_NS_ATTACH_MGMT_NOTSPRD | NVME_DNR;

Leaking local_err. Any time you call error_setg(), the error needs to be
reported or freed at some point.

> +    }
> +
> +    n_p = n->subsys->ctrls[0];
> +
> +    switch (sel) {
> +    case NVME_NS_MANAGEMENT_CREATE:
> +        switch (csi) {
> +        case NVME_CSI_NVM:

The following case is large enough that the implementation should be its
own function.

> +            if (nsid) {
> +                return NVME_INVALID_FIELD | NVME_DNR;
> +            }
> +
> +            ret = nvme_h2c(n, (uint8_t *)&id_ns, sizeof(id_ns), req);
> +            if (ret) {
> +                return ret;
> +            }
> +
> +            uint64_t nsze = le64_to_cpu(id_ns.nsze);
> +            uint64_t ncap = le64_to_cpu(id_ns.ncap);

Please don't mix declarations with code; declare these local variables
at the top of the scope.

> +
> +            if (ncap > nsze) {
> +                return NVME_INVALID_FIELD | NVME_DNR;
> +            } else if (ncap != nsze) {
> +                return NVME_THIN_PROVISION_NOTSPRD | NVME_DNR;
> +            }
> +
> +            nvme_validate_flbas(id_ns.flbas, &local_err);
> +            if (local_err) {
> +                error_report_err(local_err);
> +                return NVME_INVALID_FORMAT | NVME_DNR;
> +            }
> +
> +            for (i = 1; i <= NVME_MAX_NAMESPACES; i++) {
> +                if (nvme_ns(n_p, (uint32_t)i) || nvme_subsys_ns(n_p->subsys, (uint32_t)i)) {
> +                    continue;
> +                }
> +                break;
> +            }
> +
> +            if (i > le32_to_cpu(n_p->id_ctrl.nn) || i > NVME_MAX_NAMESPACES) {
> +                return NVME_NS_IDNTIFIER_UNAVAIL | NVME_DNR;
> +            }
> +            nsid = i;
> +
> +            /* create ns here */
> +            ns = nvme_ns_mgmt_create(n, nsid, &id_ns, &local_err);
> +            if (!ns || local_err) {
> +                if (local_err) {
> +                    error_report_err(local_err);
> +                }
> +                return NVME_INVALID_FIELD | NVME_DNR;
> +            }
> +
> +            if (nvme_cfg_update(n, ns->size, NVME_NS_ALLOC_CHK)) {
> +                /* place for delete-ns */
> +                error_setg(&local_err, "Insufficient capacity, an orphaned ns[%"PRIu32"] will be left behind", nsid);
> +                return NVME_NS_INSUFFICIENT_CAPAC | NVME_DNR;

Leaked local_err.

> +            }
> +            (void)nvme_cfg_update(n, ns->size, NVME_NS_ALLOC);
> +            if (nvme_cfg_save(n)) {
> +                (void)nvme_cfg_update(n, ns->size, NVME_NS_DEALLOC);
> +                /* place for delete-ns */
> +                error_setg(&local_err, "Cannot save conf file, an orphaned ns[%"PRIu32"] will be left behind", nsid);
> +                return NVME_INVALID_FIELD | NVME_DNR;

Another leaked local_err.

> +            }
> +            req->cqe.result = cpu_to_le32(nsid);
> +            break;
> +        case NVME_CSI_ZONED:
> +            /* fall through for now */
> +        default:
> +            return NVME_INVALID_FIELD | NVME_DNR;
> +        }
> +        break;
> +    default:
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    return NVME_SUCCESS;
> +}
Re: [PATCH v4 2/4] hw/nvme: rename shadow doorbell related trace events
On Mon, Dec 12, 2022 at 12:44:07PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen
>
> Rename the trace events related to writing the event index and reading
> the doorbell value to make it more clear that the event is associated
> with an actual update (write or read respectively).
>
> Reviewed-by: Philippe Mathieu-Daudé
> Signed-off-by: Klaus Jensen

Looks good.

Reviewed-by: Keith Busch
Re: [PATCH v4 3/4] hw/nvme: fix missing endian conversions for doorbell buffers
On Mon, Dec 12, 2022 at 12:44:08PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen
>
> The eventidx and doorbell value are not handling endianness correctly.
> Fix this.
>
> Fixes: 3f7fe8de3d49 ("hw/nvme: Implement shadow doorbell buffer support")
> Cc: qemu-sta...@nongnu.org
> Reported-by: Guenter Roeck
> Signed-off-by: Klaus Jensen

Looks good.

Reviewed-by: Keith Busch
Re: [PATCH v4 4/4] hw/nvme: fix missing cq eventidx update
On Mon, Dec 12, 2022 at 12:44:09PM +0100, Klaus Jensen wrote:
> From: Klaus Jensen
>
> Prior to reading the shadow doorbell cq head, we have to update the
> eventidx. Otherwise, we risk that the driver will skip an mmio doorbell
> write. This happens on riscv64, as reported by Guenter.
>
> Adding the missing update to the cq eventidx fixes the issue.
>
> Fixes: 3f7fe8de3d49 ("hw/nvme: Implement shadow doorbell buffer support")
> Cc: qemu-sta...@nongnu.org
> Cc: qemu-ri...@nongnu.org
> Reported-by: Guenter Roeck
> Signed-off-by: Klaus Jensen

Looks good.

Reviewed-by: Keith Busch
Re: [PULL for-7.2 0/5] hw/nvme fixes
On Sun, Dec 04, 2022 at 11:06:13AM -0500, Stefan Hajnoczi wrote:
> On Thu, 1 Dec 2022 at 11:50, Klaus Jensen wrote:
> >
> > From: Klaus Jensen
> >
> > Hi,
> >
> > The following changes since commit c4ffd91aba1c3d878e99a3e7ba8aad4826728ece:
> >
> >   Update VERSION for v7.2.0-rc3 (2022-11-29 18:15:26 -0500)
> >
> > are available in the Git repository at:
> >
> >   git://git.infradead.org/qemu-nvme.git tags/nvme-next-pull-request
>
> Hi Klaus,
> Please send pull requests with an https:// URI in the future.

Is this a new requirement? Our public git server doesn't support https.
Re: [PATCH for-7.2 1/5] hw/nvme: fix aio cancel in format
On Tue, Nov 22, 2022 at 09:13:44AM +0100, Klaus Jensen wrote:
> There are several bugs in the async cancel code for the Format command.
>
> Firstly, cancelling a format operation neglects to set iocb->ret as well
> as clearing the iocb->aiocb after cancelling the underlying aiocb, which
> causes the aio callback to ignore the cancellation. Trivial fix.
>
> Secondly, and worse, because the request is queued up for posting to the
> CQ in a bottom half, if the cancellation is due to the submission queue
> being deleted (which calls blk_aio_cancel), the req structure is
> deallocated in nvme_del_sq prior to the bottom half being scheduled.
>
> Fix this by simply removing the bottom half; there is no reason to defer
> it anyway.

I thought for sure I'd find a reason deferred execution was needed, but
hey, it looks perfectly fine with this change!

> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index ac3885ce5079..26b53469328f 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -5756,14 +5756,15 @@ typedef struct NvmeFormatAIOCB {
>      uint8_t pil;
>  } NvmeFormatAIOCB;

I think you can remove the QEMUBH member from the above struct with this
patch. Otherwise looks good.

Reviewed-by: Keith Busch
Re: [PATCH v3 1/2] hw/nvme: Support for Namespaces Management from guest OS - create-ns
On Thu, Oct 27, 2022 at 01:00:45PM -0500, Jonathan Derrick wrote: > +Parameters: > + > +``auto-ns-path=`` > + If specified indicates a support for dynamic management of nvme namespaces > + by means of nvme create-ns command. This path points > + to the storage area for backend images must exist. Additionally it requires > + that parameter `ns-subsys` must be specified whereas parameter `drive` > + must not. The legacy namespace backend is disabled, instead, a pair of > + files 'nvme__ns_.cfg' and 'nvme__ns_.img' > + will refer to respective namespaces. The create-ns, attach-ns > + and detach-ns commands, issued at the guest side, will make changes to > + those files accordingly. > + For each namespace exists an image file in raw format and a config file > + containing namespace parameters and state of the attachment allowing QEMU > + to configure namespaces accordingly during start up. If for instance an > + image file has a size of 0 bytes this will be interpreted as non existent > + namespace. Issuing create-ns command will change the status in the config > + files and and will re-size the image file accordingly so the image file > + will be associated with the respective namespace. The main config file > + nvme__ctrl.cfg keeps the track of allocated capacity to the > + namespaces within the nvme controller. > + As it is the case of a typical hard drive, backend images together with > + config files need to be created. For this reason the qemu-img tool has > + been extended by adding createns command. > + > + qemu-img createns {-S -C } > + [-N ] {} > + > + Parameters: > + -S and -C and are mandatory, `-S` must match `serial` parameter > + and must match `auto-ns-path` parameter of "-device nvme,..." > + specification. > + -N is optional, if specified it will set a limit to the number of potential > + namespaces and will reduce the number of backend images and config files > + accordingly. 
As a default, a set of images of 0 bytes size and default > + config files for 256 namespaces will be created, a total of 513 files. > + > +Please note that ``nvme-ns`` device is not required to support of dynamic > +namespaces management feature. It is not prohibited to assign a such device > to > +``nvme`` device specified to support dynamic namespace management if one has > +an use case to do so, however, it will only coexist and be out of the scope > of > +Namespaces Management. NsIds will be consistently managed, creation > (create-ns) > +of a namespace will not allocate the NsId already being taken. If ``nvme-ns`` > +device conflicts with previously created one by create-ns (the same NsId), > +it will break QEMU's start up. By requiring the controller's serial number up-front, does this mean we can't share dynamic namespaces with other controllers in the subsystem? > +static inline char *create_fmt_name(const char *fmt, char *ns_directory, > char *serial, uint32_t nsid, Error **errp) > +{ > +char *file_name = NULL; > +Error *local_err = NULL; > + > +storage_path_check(ns_directory, serial, errp); > +if (local_err) { How is 'local_err' ever *not* NULL here? Are you meaning to check "*errp" instead? 
> +error_propagate(errp, local_err); > +} else { > +file_name = g_strdup_printf(fmt, ns_directory, serial, nsid); > +} > + > +return file_name; > +} > + > +static inline char *create_cfg_name(char *ns_directory, char *serial, > uint32_t nsid, Error **errp) > +{ > +return create_fmt_name(NS_FILE_FMT NS_CFG_EXT, ns_directory, serial, > nsid, errp); > +} > + > + > +static inline char *create_image_name(char *ns_directory, char *serial, > uint32_t nsid, Error **errp) > +{ > +return create_fmt_name(NS_FILE_FMT NS_IMG_EXT, ns_directory, serial, > nsid, errp); > +} > + > +static inline int nsid_cfg_save(char *ns_directory, char *serial, QDict > *ns_cfg, uint32_t nsid) > +{ > +GString *json = NULL; > +char *filename; > +FILE *fp; > +int ret = 0; > +Error *local_err = NULL; > + > +json = qobject_to_json_pretty(QOBJECT(ns_cfg), false); > + > +if (strlen(json->str) + 2 /* '\n'+'\0' */ > NS_CFG_MAXSIZE) { > +error_setg(&local_err, "ns-cfg allowed max size %d exceeded", > NS_CFG_MAXSIZE); I find this whole "local_err" control flow unpleasant to follow. The local_error gets set in this above condition only to be overwritten in the very next step without ever being used? Surely that can't be right. I'm just picking on this one example here, but the pattern appears to repeat. I think this would be easier to read if the error conditions took 'goto' paths to unwind so that the good path doesn't require such deep 'if/else' nesting. > +} > + > +filename = create_cfg_name(ns_directory, serial, nsid, &local_err); > +if (!local_err) { > +fp = fopen(filename, "w"); > +if (fp == NULL) { > +error_setg(&local_err, "open %s: %s", filename, > +
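The goto-unwind pattern suggested in that review comment could look roughly like this. This is a hypothetical sketch with invented names (save_cfg_sketch, the size limit) standing in for the patch's nsid_cfg_save — not the actual patch code:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the suggested goto-unwind error handling: every failure
 * jumps to a single cleanup label instead of deepening if/else
 * nesting, so the good path reads straight down. All names here are
 * invented for illustration. */
static int save_cfg_sketch(const char *json, size_t maxsize,
                           const char *filename)
{
    FILE *fp = NULL;
    int ret = 0;

    if (strlen(json) + 2 /* '\n' + '\0' */ > maxsize) {
        ret = -1;           /* config exceeds allowed size */
        goto out;
    }

    fp = fopen(filename, "w");
    if (fp == NULL) {
        ret = -1;           /* open failed */
        goto out;
    }

    if (fputs(json, fp) == EOF) {
        ret = -1;           /* write failed */
        goto out;
    }

out:
    if (fp) {
        fclose(fp);
    }
    return ret;
}
```

With a single exit point, the cleanup (here just the fclose) lives in one place, and each error condition sets its result and bails instead of wrapping the rest of the function in another else branch.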
Re: [PATCH] hw/nvme: reenable cqe batching
On Thu, Oct 20, 2022 at 01:35:38PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Commit 2e53b0b45024 ("hw/nvme: Use ioeventfd to handle doorbell > updates") had the unintended effect of disabling batching of CQEs. > > This patch changes the sq/cq timers to bottom halfs and instead of > calling nvme_post_cqes() immediately (causing an interrupt per cqe), we > defer the call. Nice change, looks good! Timers never did seem to be the best fit for this. Reviewed-by: Keith Busch
[PATCHv3 1/2] block: move bdrv_qiov_is_aligned to file-posix
From: Keith Busch There is only one user of bdrv_qiov_is_aligned(), so move the alignment function there and make it static. Signed-off-by: Keith Busch --- block/file-posix.c | 21 + block/io.c | 21 - include/block/block-io.h | 1 - 3 files changed, 21 insertions(+), 22 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 48cd096624..e3f3de2780 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2061,6 +2061,27 @@ static int coroutine_fn raw_thread_pool_submit(BlockDriverState *bs, return thread_pool_submit_co(pool, func, arg); } +/* + * Check if all memory in this vector is sector aligned. + */ +static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) +{ +int i; +size_t alignment = bdrv_min_mem_align(bs); +IO_CODE(); + +for (i = 0; i < qiov->niov; i++) { +if ((uintptr_t) qiov->iov[i].iov_base % alignment) { +return false; +} +if (qiov->iov[i].iov_len % alignment) { +return false; +} +} + +return true; +} + static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int type) { diff --git a/block/io.c b/block/io.c index 0a8cbefe86..96edc7f7cb 100644 --- a/block/io.c +++ b/block/io.c @@ -3236,27 +3236,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size) return mem; } -/* - * Check if all memory in this vector is sector aligned.
- */ -bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) -{ -int i; -size_t alignment = bdrv_min_mem_align(bs); -IO_CODE(); - -for (i = 0; i < qiov->niov; i++) { -if ((uintptr_t) qiov->iov[i].iov_base % alignment) { -return false; -} -if (qiov->iov[i].iov_len % alignment) { -return false; -} -} - -return true; -} - void bdrv_io_plug(BlockDriverState *bs) { BdrvChild *child; diff --git a/include/block/block-io.h b/include/block/block-io.h index fd25ffa9be..492f95fc05 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -150,7 +150,6 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size); void *qemu_blockalign0(BlockDriverState *bs, size_t size); void *qemu_try_blockalign(BlockDriverState *bs, size_t size); void *qemu_try_blockalign0(BlockDriverState *bs, size_t size); -bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov); void bdrv_enable_copy_on_read(BlockDriverState *bs); void bdrv_disable_copy_on_read(BlockDriverState *bs); -- 2.30.2
[PATCHv3 0/2] qemu direct io alignment fix
From: Keith Busch Changes from v2: Split the patch so that the function move is separate from the functional change. This makes it immediately obvious which criterion is changing. (Kevin Wolf) Added the received Tested-by tag to the changelog. Keith Busch (2): block: move bdrv_qiov_is_aligned to file-posix block: use the request length for iov alignment block/file-posix.c | 22 ++ block/io.c | 21 - include/block/block-io.h | 1 - 3 files changed, 22 insertions(+), 22 deletions(-) -- 2.30.2
[PATCHv3 2/2] block: use the request length for iov alignment
From: Keith Busch An iov length needs to be aligned to the logical block size, which may be larger than the memory alignment. Tested-by: Jens Axboe Signed-off-by: Keith Busch --- block/file-posix.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/file-posix.c b/block/file-posix.c index e3f3de2780..af994aba2b 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2068,13 +2068,14 @@ static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) { int i; size_t alignment = bdrv_min_mem_align(bs); +size_t len = bs->bl.request_alignment; IO_CODE(); for (i = 0; i < qiov->niov; i++) { if ((uintptr_t) qiov->iov[i].iov_base % alignment) { return false; } -if (qiov->iov[i].iov_len % alignment) { +if (qiov->iov[i].iov_len % len) { return false; } } -- 2.30.2
Re: [PATCHv2] block: use the request length for iov alignment
On Thu, Sep 29, 2022 at 07:59:50PM +0200, Kevin Wolf wrote: > Am 29.09.2022 um 18:09 hat Keith Busch geschrieben: > > On Fri, Sep 23, 2022 at 08:34:51AM -0700, Keith Busch wrote: > > > > > > An iov length needs to be aligned to the logical block size, which may > > > be larger than the memory alignment. And since this is only used with > > > file-posix backing storage, move the alignment function to there, where > > > the value of the request_alignment is known to be the file's logical > > > block size. > > > > Any objections to this version? This is fixing real bug reports that > > may become more frequent without this patch. > > I think it is okay. Splitting it in two patches would have been nicer > (one for moving code, one for making the change), but it's small enough > that I can ignore that. I'll probably merge it tomorrow. I agree that splitting makes the functional change stand out, otherwise a casual look may mistake the patch for a simple function move. I'll send you a new version.
Re: [PATCHv2] block: use the request length for iov alignment
On Fri, Sep 23, 2022 at 08:34:51AM -0700, Keith Busch wrote: > > An iov length needs to be aligned to the logical block size, which may > be larger than the memory alignment. And since this is only used with > file-posix backing storage, move the alignment function to there, where > the value of the request_alignment is known to be the file's logical > block size. Any objections to this version? This is fixing real bug reports that may become more frequent without this patch.
[PATCHv2] block: use the request length for iov alignment
From: Keith Busch An iov length needs to be aligned to the logical block size, which may be larger than the memory alignment. And since this is only used with file-posix backing storage, move the alignment function to there, where the value of the request_alignment is known to be the file's logical block size. Signed-off-by: Keith Busch --- v1->v2: Relocate the function to the only caller so that irrelevant constraints don't need to be considered. block/file-posix.c | 22 ++ block/io.c | 21 - include/block/block-io.h | 1 - 3 files changed, 22 insertions(+), 22 deletions(-) diff --git a/block/file-posix.c b/block/file-posix.c index 48cd096624..af994aba2b 100644 --- a/block/file-posix.c +++ b/block/file-posix.c @@ -2061,6 +2061,28 @@ static int coroutine_fn raw_thread_pool_submit(BlockDriverState *bs, return thread_pool_submit_co(pool, func, arg); } +/* + * Check if all memory in this vector is sector aligned. + */ +static bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) +{ +int i; +size_t alignment = bdrv_min_mem_align(bs); +size_t len = bs->bl.request_alignment; +IO_CODE(); + +for (i = 0; i < qiov->niov; i++) { +if ((uintptr_t) qiov->iov[i].iov_base % alignment) { +return false; +} +if (qiov->iov[i].iov_len % len) { +return false; +} +} + +return true; +} + static int coroutine_fn raw_co_prw(BlockDriverState *bs, uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int type) { diff --git a/block/io.c b/block/io.c index 0a8cbefe86..96edc7f7cb 100644 --- a/block/io.c +++ b/block/io.c @@ -3236,27 +3236,6 @@ void *qemu_try_blockalign0(BlockDriverState *bs, size_t size) return mem; } -/* - * Check if all memory in this vector is sector aligned.
- */ -bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) -{ -int i; -size_t alignment = bdrv_min_mem_align(bs); -IO_CODE(); - -for (i = 0; i < qiov->niov; i++) { -if ((uintptr_t) qiov->iov[i].iov_base % alignment) { -return false; -} -if (qiov->iov[i].iov_len % alignment) { -return false; -} -} - -return true; -} - void bdrv_io_plug(BlockDriverState *bs) { BdrvChild *child; diff --git a/include/block/block-io.h b/include/block/block-io.h index fd25ffa9be..492f95fc05 100644 --- a/include/block/block-io.h +++ b/include/block/block-io.h @@ -150,7 +150,6 @@ void *qemu_blockalign(BlockDriverState *bs, size_t size); void *qemu_blockalign0(BlockDriverState *bs, size_t size); void *qemu_try_blockalign(BlockDriverState *bs, size_t size); void *qemu_try_blockalign0(BlockDriverState *bs, size_t size); -bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov); void bdrv_enable_copy_on_read(BlockDriverState *bs); void bdrv_disable_copy_on_read(BlockDriverState *bs); -- 2.30.2
Re: [PATCH] block: use the request length for iov alignment
On Wed, Sep 14, 2022 at 11:36:14AM +0100, Kevin Wolf wrote: > Am 13.09.2022 um 15:12 hat Keith Busch geschrieben: > > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote: > > > From: Keith Busch > > > > > > An iov length needs to be aligned to the logical block size, which may > > > be larger than the memory alignment. > > > > [cc'ing some other interested folks] > > > > Any thoughts on this patch? It is fixing an observed IO error when running > > virtio-blk with the default 512b logical block size backed by a drive > > formatted > > with 4k logical block. > > I need to take a real look after KVM Forum, but my first thought was > that we might be overloading request_alignment with multiple meanings > now (file offset alignment and memory address alignment), and the values > just happen to be the same for files on Linux. > > Did you consider a separate iov_alignment or similar and intentionally > decided against it, or is it something you just didn't think about? I've looked again at the request_alignment usage, and I think that value is exactly what we want. This alignment check is only used with O_DIRECT backing file descriptors, and the request_alignment in that case looks like it will always be the logical block size. Linux direct-io can accept arbitrary memory address offsets based on the backing file's internal block limits, but the individual vector lengths do need to be of LBA granularity. Just in case, though, I could amend this so that the alignment is the larger of request_alignment and the existing criteria, though I don't see how request_alignment would ever be the smaller.
size_t len = max(bs->bl.request_alignment, alignment); > > > @@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, > > > QEMUIOVector *qiov) > > > { > > > int i; > > > size_t alignment = bdrv_min_mem_align(bs); > > > +size_t len = bs->bl.request_alignment; > > > IO_CODE(); > > > > > > for (i = 0; i < qiov->niov; i++) { > > > if ((uintptr_t) qiov->iov[i].iov_base % alignment) { > > > return false; > > > } > > > -if (qiov->iov[i].iov_len % alignment) { > > > +if (qiov->iov[i].iov_len % len) { > > > return false; > > > } > > > } > > > --
Re: [PATCH] block: use the request length for iov alignment
On Wed, Sep 14, 2022 at 11:36:14AM +0100, Kevin Wolf wrote: > Am 13.09.2022 um 15:12 hat Keith Busch geschrieben: > > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote: > > > From: Keith Busch > > > > > > An iov length needs to be aligned to the logical block size, which may > > > be larger than the memory alignment. > > > > [cc'ing some other interested folks] > > > > Any thoughts on this patch? It is fixing an observed IO error when running > > virtio-blk with the default 512b logical block size backed by a drive > > formatted > > with 4k logical block. > > I need to take a real look after KVM Forum, but my first thought was > that we might be overloading request_alignment with multiple meanings > now (file offset alignment and memory address alignment), and the values > just happen to be the same for files on Linux. > > Did you consider a separate iov_alignment or similar and intentionally > decided against it, or is it something you just didn't think about? I thought the request_alignment was indicating the minimum block length, so I did not consider an additional separate value. If it can be overloaded, then yes, I can certainly take on that idea.
Re: [PATCH] block: use the request length for iov alignment
On Tue, Sep 13, 2022 at 03:20:23PM +0100, Damien Le Moal wrote: > On 2022/09/13 15:12, Keith Busch wrote: > > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote: > >> From: Keith Busch > >> > >> An iov length needs to be aligned to the logical block size, which may > >> be larger than the memory alignment. > > > > [cc'ing some other interested folks] > > > > Any thoughts on this patch? It is fixing an observed IO error when running > > virtio-blk with the default 512b logical block size backed by a drive > > formatted > > with 4k logical block. > > The patch look OK to me, but having virtio expose a 512B LBA size for a > backing > device that has 4K LBAs will break all IOs if caching is turned off (direct > IOs > case), even if this patch is applied. No ? Oh, as to why that type of setup "works" with O_DIRECT, when the check below returns 'false', qemu allocates a bounce buffer. We want that to happen if the guest's virtio driver tries to read/write 512b. The lengths just need to be checked against the backing store's block size instead of the memory address alignment. > >> @@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, > >> QEMUIOVector *qiov) > >> { > >> int i; > >> size_t alignment = bdrv_min_mem_align(bs); > >> +size_t len = bs->bl.request_alignment; > >> IO_CODE(); > >> > >> for (i = 0; i < qiov->niov; i++) { > >> if ((uintptr_t) qiov->iov[i].iov_base % alignment) { > >> return false; > >> } > >> -if (qiov->iov[i].iov_len % alignment) { > >> +if (qiov->iov[i].iov_len % len) { > >> return false; > >> } > >> }
Re: [PATCH] block: use the request length for iov alignment
On Tue, Sep 13, 2022 at 03:20:23PM +0100, Damien Le Moal wrote: > On 2022/09/13 15:12, Keith Busch wrote: > > On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote: > >> From: Keith Busch > >> > >> An iov length needs to be aligned to the logical block size, which may > >> be larger than the memory alignment. > > > > [cc'ing some other interested folks] > > > > Any thoughts on this patch? It is fixing an observed IO error when running > > virtio-blk with the default 512b logical block size backed by a drive > > formatted > > with 4k logical block. > > The patch look OK to me, but having virtio expose a 512B LBA size for a > backing > device that has 4K LBAs will break all IOs if caching is turned off (direct > IOs > case), even if this patch is applied. No ? Yeah, it's just the default. I don't think anyone would do that on purpose since it's so sub-optimal from a performance standpoint. As a follow-up patch, perhaps if the logical_block_size parameter is not provided, we should default to the backing device's block size?
Re: [PATCH] block: use the request length for iov alignment
On Thu, Sep 08, 2022 at 09:45:26AM -0700, Keith Busch wrote: > From: Keith Busch > > An iov length needs to be aligned to the logical block size, which may > be larger than the memory alignment. [cc'ing some other interested folks] Any thoughts on this patch? It is fixing an observed IO error when running virtio-blk with the default 512b logical block size backed by a drive formatted with 4k logical block. > --- > block/io.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/block/io.c b/block/io.c > index 0a8cbefe86..296d4b49a7 100644 > --- a/block/io.c > +++ b/block/io.c > @@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, > QEMUIOVector *qiov) > { > int i; > size_t alignment = bdrv_min_mem_align(bs); > +size_t len = bs->bl.request_alignment; > IO_CODE(); > > for (i = 0; i < qiov->niov; i++) { > if ((uintptr_t) qiov->iov[i].iov_base % alignment) { > return false; > } > -if (qiov->iov[i].iov_len % alignment) { > +if (qiov->iov[i].iov_len % len) { > return false; > } > } > -- > 2.30.2 >
[PATCH] block: use the request length for iov alignment
From: Keith Busch An iov length needs to be aligned to the logical block size, which may be larger than the memory alignment. Signed-off-by: Keith Busch --- block/io.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/block/io.c b/block/io.c index 0a8cbefe86..296d4b49a7 100644 --- a/block/io.c +++ b/block/io.c @@ -3243,13 +3243,14 @@ bool bdrv_qiov_is_aligned(BlockDriverState *bs, QEMUIOVector *qiov) { int i; size_t alignment = bdrv_min_mem_align(bs); +size_t len = bs->bl.request_alignment; IO_CODE(); for (i = 0; i < qiov->niov; i++) { if ((uintptr_t) qiov->iov[i].iov_base % alignment) { return false; } -if (qiov->iov[i].iov_len % alignment) { +if (qiov->iov[i].iov_len % len) { return false; } } -- 2.30.2
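For readers following the thread, a standalone mirror of the patched check (with the BlockDriverState plumbing replaced by plain parameters — the struct and function names here are invented for illustration, not QEMU code) shows why a 512-byte vector element can pass the memory-alignment test yet fail the length check against a 4k request_alignment:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for struct iovec, to keep the sketch portable. */
struct iovec_sketch {
    void *iov_base;
    size_t iov_len;
};

/* Mirror of the patched bdrv_qiov_is_aligned(): base addresses are
 * tested against the memory alignment, but element lengths are tested
 * against the request (logical block) alignment, which may be larger. */
static bool qiov_is_aligned_sketch(const struct iovec_sketch *iov, int niov,
                                   size_t mem_align, size_t request_align)
{
    for (int i = 0; i < niov; i++) {
        if ((uintptr_t)iov[i].iov_base % mem_align) {
            return false;   /* misaligned buffer address */
        }
        if (iov[i].iov_len % request_align) {
            return false;   /* length not a multiple of the block size */
        }
    }
    return true;
}
```

With mem_align of 512 and request_align of 4096 (a 4k-formatted backing file), a 512-byte element at an aligned address is now rejected, so QEMU falls back to a bounce buffer instead of issuing an O_DIRECT request the kernel would fail.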
Re: [PATCH v2 2/3] hw/nvme: use KVM irqfd when available
On Fri, Aug 26, 2022 at 05:45:21PM +0200, Klaus Jensen wrote: > On Aug 26 09:34, Keith Busch wrote: > > On Fri, Aug 26, 2022 at 11:12:04PM +0800, Jinhao Fan wrote: > > > Use KVM's irqfd to send interrupts when possible. This approach is > > > thread safe. Moreover, it does not have the inter-thread communication > > > overhead of plain event notifiers since handler callback are called > > > in the same system call as irqfd write. > > > > > > Signed-off-by: Jinhao Fan > > > Signed-off-by: Klaus Jensen > > > > No idea what's going on here... This one is causing the following assert > > failure with --enable-kvm: > > > > qemu-system-x86_64: ../accel/kvm/kvm-all.c:1781: > > kvm_irqchip_commit_routes: Assertion `ret == 0' failed. > > > > I find it calls KVM_SET_GSI_ROUTING ioctl with gsi set to > > KVM_IRQ_ROUTING_MSI, > > and linux kernel returns EINVAL in that case. It's never set that way > > without > > this patch. Am I the only one seeing this? > > Argh, sorry, I threw that patch together a bit too quickly. I was just > so pumped because I believed I had solved the issue hehe. > > Are you missing the ioeventfd=on and irq-eventfd=on parameters by any > chance? Without those I'm also getting an assertion, but a different one I had not enabled those yet. This was purely a regression test with my previously working parameters for a sanity check. If I enable those new nvme parameters, then it is successful.
Re: [PATCH v2 2/3] hw/nvme: use KVM irqfd when available
On Fri, Aug 26, 2022 at 11:12:04PM +0800, Jinhao Fan wrote: > Use KVM's irqfd to send interrupts when possible. This approach is > thread safe. Moreover, it does not have the inter-thread communication > overhead of plain event notifiers since handler callback are called > in the same system call as irqfd write. > > Signed-off-by: Jinhao Fan > Signed-off-by: Klaus Jensen No idea what's going on here... This one is causing the following assert failure with --enable-kvm: qemu-system-x86_64: ../accel/kvm/kvm-all.c:1781: kvm_irqchip_commit_routes: Assertion `ret == 0' failed. I find it calls KVM_SET_GSI_ROUTING ioctl with gsi set to KVM_IRQ_ROUTING_MSI, and linux kernel returns EINVAL in that case. It's never set that way without this patch. Am I the only one seeing this?
Re: [PATCH] hw/nvme: Add helper functions for qid-db conversion
On Wed, Aug 03, 2022 at 09:46:05AM +0800, Jinhao Fan wrote: > at 4:54 PM, Klaus Jensen wrote: > > > I am unsure if the compiler will transform that division into the shift > > if it can infer that the divisor is a power of two (it most likely > > will be able to). > > > > But I see no reason to have a potential division here when we can do > > without and to me it is just as readable when you know the definition of > > DSTRD is `2 ^ (2 + DSTRD)`. > > OK. I will send a new patch with shifts instead of divisions. BTW, why do we > want to avoid divisions? Integer division is at least an order of magnitude more CPU cycles than a shift. Some archs are worse than others, but historically we go out of our way to avoid them in a hot path, so shifting is a more familiar coding pattern. Compilers typically implement division as a shift if you're dividing by a power of two integer constant expression (ICE). This example here isn't an ICE, but it is a shifted constant power-of-two. I wrote up a simple test to see what my compiler does with that, and it looks like gcc will properly optimize it, but only if compiled with '-O3'.
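A minimal sketch of the equivalence being discussed, assuming the spec's doorbell stride of 2^(2 + DSTRD) bytes; the helper names here are made up for illustration and are not the QEMU code:

```c
#include <assert.h>
#include <stdint.h>

/* Division form: stride is (4 << DSTRD) bytes, so dividing a doorbell
 * offset by the stride yields the doorbell index. */
static uint32_t db_index_div(uint64_t db_offset, uint8_t dstrd)
{
    return db_offset / (4u << dstrd);
}

/* Shift form: since the stride is 2^(2 + DSTRD), the division is
 * exactly a right shift by (2 + DSTRD). */
static uint32_t db_index_shift(uint64_t db_offset, uint8_t dstrd)
{
    return db_offset >> (2 + dstrd);
}
```

Because `dstrd` is a runtime value here rather than a constant expression, the compiler has to prove the divisor is a power of two to make the transformation itself, which is the point of the `-O3` observation above.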
Re: [PATCH for-7.1 0/3] hw/nvme: misc ioeventfd fixes
On Thu, Jul 28, 2022 at 10:25:19AM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > A set of fixes/changes to the ioeventfd support. Series looks good. Reviewed-by: Keith Busch
Re: [PATCH] hw/nvme: Add options to override hardcoded values
On Wed, Jul 13, 2022 at 09:11:41PM +0200, Mauricio Sandt wrote: > On 13/07/2022 20:48, Keith Busch wrote: > > I guess I'm missing the bigger picture here. You are supposed to be able to > > retrieve these fields with ioctl's, so not sure what this has to do with > > malware. Why does the firmware revision matter to this program? > Oh I'm sorry, I forgot to explain properly. Malware usually checks if it is > being run in a sandbox environment like a VM, and if it detects such a > sandbox, it doesn't run or doesn't unleash its full potential. This makes my > life as a researcher much harder. > > Hiding the VM by overriding the model, firmware, and nqn strings to either > random values or names of existing hardware in the hypervisor is a much > cleaner solution than intercepting the IOCTLs in the VM and changing the > result with a kernel driver. IIUC, this program is trying to avoid being studied, and uses indicators like nvme firmware to help determine if it is running in such an environment. If so, I suspect defeating all possible indicators will be a fun and time consuming process. :)
Re: [PATCH] hw/nvme: Add options to override hardcoded values
On Wed, Jul 13, 2022 at 08:06:26PM +0200, Mauricio Sandt wrote: > My specific use case that required this patch is a piece of malware that used > several IOCTLs to read model, firmware, and nqn from the NVMe attached to the > VM. Modifying that info at the hypervisor level was a much better approach > than loading an (unsigned) driver into windows to intercept said IOCTLs. > Having this patch in the official qemu version would help me a lot in my test > lab, and I'm pretty sure it would also help other people. I guess I'm missing the bigger picture here. You are supposed to be able to retrieve these fields with ioctl's, so not sure what this has to do with malware. Why does the firmware revision matter to this program? > I guess there could be a warning about potential problems with drivers in the > description for model, firmware, and nqn, but I haven't experienced any > issues when changing the values (for all of them). If you encountered any > problems, how did they manifest? Older qemu nvme wasn't reliably generating unique identifiers, so linux quirks them to ignore. They are reliable now, so the quirk can be changed to firmware specific when the version number updates with the next release.
Re: [PATCH] hw/nvme: Add options to override hardcoded values
On Sun, Jun 12, 2022 at 12:35:09AM +0200, Mauricio Sandt wrote: > This small patch is the result of some recent malware research I did > in a QEMU VM. The malware used multiple ways of querying info from > the VM disk and I needed a clean way to change those values from the > hypervisor. > > I believe this functionality could be useful to more people from multiple > fields, sometimes you just want to change some default values and having > them hardcoded in the sourcecode makes that much harder. > > This patch adds three config parameters to the nvme device, all of them > are optional to not break anything. If any of them are not specified, > the previous (before this patch) default is used. > > -model - This takes a string and sets it as the devices model name. > If you don't specify this parameter, the default is "QEMU NVMe Ctrl". > > -firmware - The firmware version string, max 8 ascii characters. > The default is whatever `QEMU_VERSION` evaluates to. > > -nqn_override - Allows to set a custom nqn on the nvme device. > Only used if there is no subsystem. This string should be in the same > format as the default "nqn.2019-08.org.qemu:...", but there is no > validation for that. Its up to the user to provide a valid string. I guess the nqn can be user tunable just like it is when used with subsystems, but what's the point of messing with model and firmware? That could mess with host drivers' ability to detect what quirks it needs to apply to specific instances of this virtual controller.
Re: [PATCH 0/4] hw/nvme: add support for TP4084
On Thu, Jun 16, 2022 at 12:42:49PM +0200, Klaus Jensen wrote: > On Jun 8 03:28, Niklas Cassel via wrote: > > Hello there, > > > > considering that Linux v5.19-rc1 is out which includes support for > > NVMe TP4084: > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/nvme/host/core.c?id=354201c53e61e493017b15327294b0c8ab522d69 > > > > I thought that it might be nice to have QEMU support for the same. > > > > TP4084 adds a new mode, CC.CRIME, that can be used to mark a namespace > > as ready independently from the controller. > > > > When CC.CRIME is 0 (default), things behave as before, all namespaces > > are ready when CSTS.RDY gets set to 1. > > > > Add a new "ready_delay" namespace device parameter, in order to emulate > > different ready latencies for namespaces when CC.CRIME is 1. > > > > The patch series also adds a "crwmt" controller parameter, in order to > > be able to expose the worst case timeout that the host should wait for > > all namespaces to become ready. 
> > > > > > Example qemu cmd line for the new options: > > > > # delay in s (20s) > > NS1_DELAY_S=20 > > # convert to units of 500ms > > NS1_DELAY=$((NS1_DELAY_S*2)) > > > > # delay in s (60s) > > NS2_DELAY_S=60 > > # convert to units of 500ms > > NS2_DELAY=$((NS2_DELAY_S*2)) > > > > # timeout in s (120s) > > CRWMT_S=120 > > # convert to units of 500ms > > CRWMT=$((CRWMT_S*2)) > > > > -device nvme,serial=deadbeef,crwmt=$CRWMT \ > > -drive file=$NS1_DATA,id=nvm-1,format=raw,if=none \ > > -device nvme-ns,drive=nvm-1,ready_delay=$NS1_DELAY \ > > -drive file=$NS2_DATA,id=nvm-2,format=raw,if=none \ > > -device nvme-ns,drive=nvm-2,ready_delay=$NS2_DELAY \ > > > > > > Niklas Cassel (4): > > hw/nvme: claim NVMe 2.0 compliance > > hw/nvme: store a pointer to the NvmeSubsystem in the NvmeNamespace > > hw/nvme: add support for ratified TP4084 > > hw/nvme: add new never_ready parameter to test the DNR bit > > > > hw/nvme/ctrl.c | 151 +-- > > hw/nvme/ns.c | 17 + > > hw/nvme/nvme.h | 9 +++ > > hw/nvme/trace-events | 1 + > > include/block/nvme.h | 60 - > > 5 files changed, 233 insertions(+), 5 deletions(-) > > > > -- > > 2.36.1 > > > > > > Hi Niklas, > > I've been going back and forth on my position on this. > > I'm not straight up against it, but this only seems useful as a one-off > patch to test the kernel support for this. Considering the limitations > you state and the limited use case, I fear this is a little bloaty to > carry upstream. > > But I totally acknowledge that this is a horrible complicated behavior > to implement on the driver side, so I guess we might all benefit from > this. > > Keith, do you have an opinion on this? There are drivers utilizing the capability, so I think supporting it is fine despite this environment not having any inherent time-to-ready delays. This will probably be another knob that gets lots of use for initial driver validation, then largely forgotten. But maybe it will be useful for driver unit and regression testing in the future.
Re: [PATCH] hw/nvme: clear aen mask on reset
On Thu, May 12, 2022 at 11:30:55AM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > The internally maintained AEN mask is not cleared on reset. Fix this. Looks good. Reviewed-by: Keith Busch
Re: [PATCH] nvme: remove bit bucket support for now
On Fri, Jun 03, 2022 at 10:31:14PM +0200, Klaus Jensen wrote: > Keith, > > We never merged anything to fix this. I suggest we simply revert it and > get rid of the code entirely until *someone* comes up with a proper fix > ;) Yeah, that sounds right. My silly hack was okay for my Linux kernel tests, but a proper fix isn't likely to happen in the near term. A revert is appropriate.
Re: [PATCH] hw/nvme: Fix deallocate when metadata is present
On Fri, Jun 03, 2022 at 01:14:40PM -0600, Jonathan Derrick wrote: > When metadata is present in the namespace and deallocates are issued, the > first > deallocate could fail to zero the block range, resulting in another > deallocation to be issued. Normally after the deallocation completes and the > range is checked for zeroes, a deallocation is then issued for the metadata > space. In the failure case where the range is not zeroed, deallocation is > reissued for the block range (and followed with metadata deallocation), but > the > original range deallocation task will also issue a metadata deallocation: > > nvme_dsm_cb() > *range deallocation* > nvme_dsm_md_cb() > if (nvme_block_status_all()) (range deallocation failure) > nvme_dsm_cb() > *range deallocation* > nvme_dsm_md_cb() > if (nvme_block_status_all()) (no failure) > *metadata deallocation* > *metadata deallocation* > > This sequence results in reentry of nvme_dsm_cb() before the metadata has been > deallocated. During reentry, the metadata is deallocated in the reentrant > task. > nvme_dsm_bh() is called which deletes and sets iocb->bh to NULL. When reentry > returns from nvme_dsm_cb(), metadata deallocation takes place again, and > results in a null pointer dereference on the iocb->bh: Nice, thank you for the detailed analysis. Patch looks good. Reviewed-by: Keith Busch
Re: [PATCH v8 00/12] hw/nvme: SR-IOV with Virtualization Enhancements
On Tue, May 17, 2022 at 01:04:56PM +0200, Klaus Jensen wrote: > > > > Should we consider this series ready to merge? > > > > Hi Lukasz and Lukasz :) > > Yes, I'm queing this up. FWIW, this looks good to me. I was hoping to give it a test run, but I don't think I'll get to that for another week or two, so don't wait for me if you think you've sorted out your recent observation.
Re: [PATCH] nvme: fix bit buckets
On Mon, Apr 25, 2022 at 11:59:30AM +0200, Klaus Jensen wrote: > The approach is neat and simple, but I don't think it has the intended > effect. The memory region addr is just left at 0x0, so we just end up > with mapping that directly into the qsg and in my test setup, this > basically does DMA to the admin completion queue which happens to be at > 0x0. > > I would have liked to handle it like we do for CMB addresses, and > reserve some address known to the device (i.e. remapping to a local > allocated buffer), but we can't easily do that because of the iov/qsg > duality thingie. The dma helpers wont work with it. > > For now, I think we need to just rip out the bit bucket support. Ah crap, I think you're right. Not as simple as I thought. The idea was to have a dummy DMA-able region. We can have a controller DMA to another controller's CMB without any special handling, so that's kind of what I'm trying except the region doesn't need to be tied to any particular device. And now that you got me thinking about it, there needs to be special bit bucket handling for local CMB as well.
[PATCH] nvme: fix bit buckets
We can't just ignore the bit buckets since the data offsets read from disk need to be accounted for. We could either split into multiple reads for the actual user data requested and skip the buckets, or we could have a placeholder for bucket data. Splitting is too much overhead, so just allocate a memory region to dump unwanted data. Signed-off-by: Keith Busch --- This came out easier than I thought, so we can ignore my previous feature removal patch: https://lists.nongnu.org/archive/html/qemu-block/2022-04/msg00398.html hw/nvme/ctrl.c | 9 + hw/nvme/nvme.h | 1 + 2 files changed, 6 insertions(+), 4 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 03760ddeae..711c6fac29 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -845,11 +845,11 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, NvmeSg *sg, trans_len = MIN(*len, dlen); if (type == NVME_SGL_DESCR_TYPE_BIT_BUCKET) { -goto next; +addr = n->bitBucket.addr; +} else { +addr = le64_to_cpu(segment[i].addr); } -addr = le64_to_cpu(segment[i].addr); - if (UINT64_MAX - addr < dlen) { return NVME_DATA_SGL_LEN_INVALID | NVME_DNR; } @@ -859,7 +859,6 @@ static uint16_t nvme_map_sgl_data(NvmeCtrl *n, NvmeSg *sg, return status; } -next: *len -= trans_len; } @@ -6686,6 +6685,8 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp) nvme_init_pmr(n, pci_dev); } +memory_region_init(&n->bitBucket, OBJECT(n), NULL, 0x10); + return 0; } diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h index 739c8b8f79..d59eadc69d 100644 --- a/hw/nvme/nvme.h +++ b/hw/nvme/nvme.h @@ -411,6 +411,7 @@ typedef struct NvmeCtrl { PCIDeviceparent_obj; MemoryRegion bar0; MemoryRegion iomem; +MemoryRegion bitBucket; NvmeBar bar; NvmeParams params; NvmeBus bus; -- 2.30.2
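The accounting problem the commit message describes can be sketched in isolation: a bit bucket descriptor still consumes its `dlen` bytes of the on-media extent even though nothing is returned to the host, so skipping it entirely (the old `goto next`) shifts every subsequent data offset. This is an illustrative model, not the QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only: each SGL descriptor, bucket or not, advances
 * the on-media offset by dlen; only non-bucket descriptors transfer
 * data to the host. */
typedef struct {
    uint32_t dlen;
    bool bucket;
} sgl_desc;

static void account(const sgl_desc *d, int n, uint64_t *media, uint64_t *host)
{
    *media = 0;
    *host = 0;
    for (int i = 0; i < n; i++) {
        *media += d[i].dlen;        /* always consumed from the LBA range */
        if (!d[i].bucket) {
            *host += d[i].dlen;     /* only real buffers reach the host */
        }
    }
}

/* 512B of data, a 512B bucket (hole), then 1KiB of data */
static uint64_t demo_media(void)
{
    sgl_desc d[] = { {512, false}, {512, true}, {1024, false} };
    uint64_t media, host;
    account(d, 3, &media, &host);
    return media;
}

static uint64_t demo_host(void)
{
    sgl_desc d[] = { {512, false}, {512, true}, {1024, false} };
    uint64_t media, host;
    account(d, 3, &media, &host);
    return host;
}
```

Dumping the bucket bytes into a placeholder region keeps the media offset advancing correctly without splitting the read.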
[PATCH] nvme: remove bit bucket support for now
The emulated controller correctly accounts for not including bit buckets in the controller-to-host data transfer; however, it doesn't correctly account for the holes in the on-disk data offsets. Signed-off-by: Keith Busch --- hw/nvme/ctrl.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c index 03760ddeae..5e56191d45 100644 --- a/hw/nvme/ctrl.c +++ b/hw/nvme/ctrl.c @@ -6773,8 +6773,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev) id->vwc = NVME_VWC_NSID_BROADCAST_SUPPORT | NVME_VWC_PRESENT; id->ocfs = cpu_to_le16(NVME_OCFS_COPY_FORMAT_0 | NVME_OCFS_COPY_FORMAT_1); -id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN | - NVME_CTRL_SGLS_BITBUCKET); +id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN); nvme_init_subnqn(n); -- 2.30.2
Re: [PATCH 0/5] hw/nvme: fix namespace identifiers
On Tue, Apr 19, 2022 at 02:10:34PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > The namespace identifiers reported by the controller is kind of a mess. > See [1,2]. > > This series should fix this for both the `-device nvme,drive=...` and > `-device nvme-ns,...` cases. Series looks good. Reviewed-by: Keith Busch
Re: [PATCH] hw/nvme: fix control flow statement
On Fri, Apr 15, 2022 at 10:27:21PM +0300, Dmitry Tikhov wrote: > Since there is no else after nvme_dsm_cb invocation, metadata associated > with non-zero block range is currently zeroed. Also this behaviour leads > to segfault since we schedule iocb->bh two times. First when entering > nvme_dsm_cb with iocb->idx == iocb->nr and second on call stack unwinding > by calling blk_aio_pwrite_zeroes and subsequent nvme_dsm_cb callback > because of missing else statement. > > Signed-off-by: Dmitry Tikhov > --- > hw/nvme/ctrl.c | 7 --- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index 03760ddeae..7ebd2aa326 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -2372,11 +2372,12 @@ static void nvme_dsm_md_cb(void *opaque, int ret) > } > > nvme_dsm_cb(iocb, 0); > +} else { > +iocb->aiocb = blk_aio_pwrite_zeroes(ns->blkconf.blk, nvme_moff(ns, > slba), > +nvme_m2b(ns, nlb), > BDRV_REQ_MAY_UNMAP, > +nvme_dsm_cb, iocb); > } Instead of the 'else', just insert an early 'return;' after nvme_dsm_cb() like the earlier condition above here. Otherwise, looks good, and thanks for the fix.
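The shape of the suggested fix, reduced to a toy model (hypothetical; the real callback takes an iocb and issues asynchronous I/O): exactly one of the two paths may run, and an early return makes that explicit without needing an `else`.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (not the QEMU code): nvme_dsm_md_cb() must either hand the
 * range back to nvme_dsm_cb() and return early, or zero the metadata
 * region -- never both. Returns true when the metadata-zeroing path
 * runs. */
static bool dsm_md_cb(bool range_not_zeroed)
{
    if (range_not_zeroed) {
        /* reissue the data deallocation for this range... */
        return false;   /* early return; no fall-through to the zeroing */
    }

    /* ...otherwise schedule blk_aio_pwrite_zeroes() on the metadata */
    return true;
}
```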
Re: [PATCH v2 0/6] hw/nvme: enhanced protection information (64-bit guard)
On Tue, Mar 01, 2022 at 11:44:22AM +0100, Klaus Jensen wrote: > From: Klaus Jensen > > This adds support for one possible new protection information format > introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard > and 48-bit reference tag. This version does not support storage tags. > > Like the CRC16 support already present, this uses a software > implementation of CRC64 (so it is naturally pretty slow). But its good > enough for verification purposes. > > This goes hand-in-hand with the support that Keith submitted for the > Linux kernel[1]. > > [1]: > https://lore.kernel.org/linux-nvme/20220201190128.3075065-1-kbu...@kernel.org/ Thanks Klaus, this looks good to me. Reviewed-by: Keith Busch
Re: [PATCH 0/6] hw/nvme: enhanced protection information (64-bit guard)
On Mon, Feb 14, 2022 at 01:30:23PM +0100, Klaus Jensen wrote: > From: Klaus Jensen > > This adds support for one possible new protection information format > introduced in TP4068 (and integrated in NVMe 2.0): the 64-bit CRC guard > and 48-bit reference tag. This version does not support storage tags. > > Like the CRC16 support already present, this uses a software > implementation of CRC64 (so it is naturally pretty slow). But its good > enough for verification purposes. > > This goes hand-in-hand with the support that Keith submitted for the > Linux kernel[1]. > > [1]: > https://lore.kernel.org/linux-nvme/20220201190128.3075065-1-kbu...@kernel.org/ Other than comment on 6/6, series looks good to me. Reviewed-by: Keith Busch
Re: [PATCH 6/6] hw/nvme: 64-bit pi support
On Mon, Feb 14, 2022 at 01:30:29PM +0100, Klaus Jensen wrote: > @@ -384,6 +389,12 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, > Error **errp) > return -1; > } > > +if (ns->params.pif != NVME_PI_GUARD_16 && > +ns->params.pif != NVME_PI_GUARD_64) { > +error_setg(errp, "invalid 'pif'"); > +return -1; > +} In addition, the requested metadata size ('params.ms') should be checked against the requested PI option. The function currently just checks against 8 bytes, but the 64b guard requires at least 16 bytes. Otherwise, looks great.
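A minimal sketch of the extra constraint being requested (the `pif` values and names here are illustrative assumptions, not the actual QEMU check): the 16-bit guard format needs 8 bytes of metadata, while the 64-bit guard tuple needs at least 16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative protection-information format values. */
enum { PI_GUARD_16 = 0, PI_GUARD_64 = 2 };

/* Sketch: validate the requested metadata size against the requested
 * PI guard format. */
static bool pi_ms_valid(uint8_t pif, uint8_t ms)
{
    switch (pif) {
    case PI_GUARD_16:
        return ms >= 8;     /* 16-bit guard PI tuple is 8 bytes */
    case PI_GUARD_64:
        return ms >= 16;    /* 64-bit guard PI tuple is 16 bytes */
    default:
        return false;
    }
}
```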
Re: [PATCH for-7.0 4/4] hw/nvme: add support for zoned random write area
On Thu, Nov 25, 2021 at 08:37:35AM +0100, Klaus Jensen wrote: > From: Klaus Jensen > > Add support for TP 4076 ("Zoned Random Write Area"), v2021.08.23 > ("Ratified"). > > This adds three new namespace parameters: "zoned.numzrwa" (number of > zrwa resources, i.e. number of zones that can have a zrwa), > "zoned.zrwas" (zrwa size in LBAs), "zoned.zrwafg" (granularity in LBAs > for flushes). > > Signed-off-by: Klaus Jensen Looks good, and will just need a minor update if you choose to take the feedback from patch 2 onboard. Reviewed-by: Keith Busch
Re: [PATCH for-7.0 3/4] hw/nvme: add ozcs enum
On Thu, Nov 25, 2021 at 08:37:34AM +0100, Klaus Jensen wrote: > From: Klaus Jensen > > Add enumeration for OZCS values. Looks good. Reviewed-by: Keith Busch
Re: [PATCH for-7.0 2/4] hw/nvme: add zone attribute get/set helpers
On Thu, Nov 25, 2021 at 08:37:33AM +0100, Klaus Jensen wrote: > @@ -295,7 +295,7 @@ static void nvme_assign_zone_state(NvmeNamespace *ns, > NvmeZone *zone, > case NVME_ZONE_STATE_READ_ONLY: > break; > default: > -zone->d.za = 0; > +NVME_ZA_CLEAR_ALL(zone->d.za); > } > } > > @@ -3356,7 +3356,7 @@ static uint16_t nvme_set_zd_ext(NvmeNamespace *ns, > NvmeZone *zone) > return status; > } > nvme_aor_inc_active(ns); > -zone->d.za |= NVME_ZA_ZD_EXT_VALID; > +NVME_ZA_SET(zone->d.za, NVME_ZA_ZD_EXT_VALID); > nvme_assign_zone_state(ns, zone, NVME_ZONE_STATE_CLOSED); > return NVME_SUCCESS; > } > diff --git a/include/block/nvme.h b/include/block/nvme.h > index 2ee227760265..2b8b906466ab 100644 > --- a/include/block/nvme.h > +++ b/include/block/nvme.h > @@ -1407,6 +1407,10 @@ enum NvmeZoneAttr { > NVME_ZA_ZD_EXT_VALID = 1 << 7, > }; > > +#define NVME_ZA_SET(za, attrs) ((za) |= (attrs)) > +#define NVME_ZA_CLEAR(za, attrs) ((za) &= ~(attrs)) > +#define NVME_ZA_CLEAR_ALL(za)((za) = 0x0) This doesn't really look any more helpful than open coding it. I think it would appear better to take a "struct NvmeZone" type parameter instead, and use inline functions instead of macros.
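A sketch of the suggested alternative, with a hypothetical pared-down `NvmeZone` so it compiles stand-alone: inline functions taking the zone itself, rather than macros over a raw field.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, pared-down NvmeZone -- just enough to compile. */
typedef struct NvmeZone {
    struct {
        uint8_t za;
    } d;
} NvmeZone;

static inline void nvme_zone_set_attrs(NvmeZone *zone, uint8_t attrs)
{
    zone->d.za |= attrs;
}

static inline void nvme_zone_clear_attrs(NvmeZone *zone, uint8_t attrs)
{
    zone->d.za &= ~attrs;
}

static inline void nvme_zone_clear_all_attrs(NvmeZone *zone)
{
    zone->d.za = 0;
}

/* Exercise the helpers; returns the final attribute byte. */
static uint8_t demo(void)
{
    NvmeZone z = { .d = { .za = 0 } };

    nvme_zone_set_attrs(&z, 1 << 7);   /* e.g. a ZD_EXT_VALID-style bit */
    nvme_zone_set_attrs(&z, 0x3);
    nvme_zone_clear_attrs(&z, 1 << 7);
    return z.d.za;                     /* only the low two bits remain */
}
```

Compared with the macros, the compiler now type-checks the zone argument, and the call sites read as operations on a zone rather than on a bare byte.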
Re: [PATCH for-7.0 1/4] hw/nvme: add struct for zone management send
On Thu, Nov 25, 2021 at 08:37:32AM +0100, Klaus Jensen wrote: > +typedef struct QEMU_PACKED NvmeZoneSendCmd { > +uint8_t opcode; > +uint8_t flags; > +uint16_tcid; > +uint32_tnsid; > +uint32_trsvd2[4]; > +NvmeCmdDptr dptr; > +uint64_tslba; > +uint32_trsvd12; > +uint8_t zsa; > +uint8_t zsflags[3]; This should be just a single uint8_t for zsflags, followed by 'uint8_t rsvd[2]'. Otherwise, looks good. Reviewed-by: Keith Busch
Re: [PATCH] hw/nvme: fix CVE-2021-3929
On Thu, Jan 20, 2022 at 09:01:55AM +0100, Klaus Jensen wrote: > +static inline bool nvme_addr_is_iomem(NvmeCtrl *n, hwaddr addr) > +{ > +hwaddr hi, lo; > + > +lo = n->bar0.addr; > +hi = lo + int128_get64(n->bar0.size); > + > +return addr >= lo && addr < hi; Looks fine considering this implementation always puts CMB in an exclusive BAR. From a spec consideration though, you can put a CMB at a BAR0 offset. I don't think that's going to happen anytime soon here, but it may be worth a comment noting that this function needs to be updated if that assumption ever changes. Reviewed-by: Keith Busch
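The quoted check is a plain half-open interval test; modeled stand-alone (with `hwaddr` reduced to a 64-bit integer, which is an assumption for illustration) it looks like this:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;   /* stand-in for QEMU's hwaddr */

/* Half-open interval test: addr is in [bar0_addr, bar0_addr + size). */
static bool addr_in_bar0(hwaddr bar0_addr, uint64_t bar0_size, hwaddr addr)
{
    hwaddr lo = bar0_addr;
    hwaddr hi = lo + bar0_size;

    return addr >= lo && addr < hi;
}
```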
Re: [PATCH 2/2] hw/nvme/ctrl: Prohibit DMA accesses to devices (CVE-2021-3929)
On Thu, Dec 16, 2021 at 06:55:10PM +0100, Philippe Mathieu-Daudé wrote: > Async DMA requests might access MMIO regions and re-program the > NVMe controller internal registers while DMA requests are still > scheduled or in flight. Avoid that by prohibing the controller > to access non-memories regions. Looks good. Reviewed-by: Keith Busch
Re: [PATCH 1/2] hw/nvme/ctrl: Do not ignore DMA access errors
On Thu, Dec 16, 2021 at 06:55:09PM +0100, Philippe Mathieu-Daudé wrote: > dma_buf_read/dma_buf_write() return a MemTxResult type. > Do not discard it, propagate the DMA error to the caller. > > Signed-off-by: Philippe Mathieu-Daudé Looks good. Reviewed-by: Keith Busch
Re: [PATCH v2 08/15] hw/nvme: Implement the Function Level Reset
On Tue, Nov 16, 2021 at 04:34:39PM +0100, Łukasz Gieryk wrote: > if (!pci_is_vf(&n->parent_obj) && n->params.sriov_max_vfs) { > -pcie_sriov_pf_disable_vfs(&n->parent_obj); > +if (rst != NVME_RESET_CONTROLLER) { > +pcie_sriov_pf_disable_vfs(&n->parent_obj); Shouldn't this be 'if (rst == NVME_RESET_FUNCTION)'?
Re: [PATCH 0/2] hw/nvme: fix namespace attachment on controller hotplug
On Fri, Sep 24, 2021 at 09:27:40AM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > First patch from Hannes fixes the subsystem registration process such > that shared (but non-detached) namespaces are automatically attached to > hotplugged controllers. > > The second patch changes the default for 'shared' such that namespaces > are shared by default and will thus by default be attached to hotplugged > controllers. This adds a compat property for older machine versions and > updates the documentation to reflect this. Series looks good. Reviewed-by: Keith Busch
Re: [PATCH RFC 02/13] hw/nvme: move zns helpers and types into zoned.h
On Tue, Sep 14, 2021 at 10:37:26PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Move ZNS related helpers and types into zoned.h. Use a common prefix > (nvme_zoned or nvme_ns_zoned) for zns related functions. Just a few nitpicks on the naming, which you can feel free to ignore. Since we're within NVMe specific protocol here, using that terminology should be fine. I prefer "nvme_zns_" for the prefix. And for function names like "nvme_zoned_zs()", the "zs" abbreviation expands to "zone_state", so "zone" is redundant. I think "nvme_zns_state()" is a good name for that one.
Re: [PATCH v2] hw/nvme: Return error for fused operations
On Wed, Sep 15, 2021 at 05:43:30PM +0200, Pankaj Raghav wrote: > Currently, FUSED operations are not supported by QEMU. As per the 1.4 SPEC, > controller should abort the command that requested a fused operation with > an INVALID FIELD error code if they are not supported. > > Changes from v1: > Added FUSE flag check also to the admin cmd processing as the FUSED > operations are mentioned in the general SQE section in the SPEC. Just for future reference, the changes from previous versions should go below the "---" line so that they don't get included in the official changelog. > +if (NVME_CMD_FLAGS_FUSE(req->cmd.flags)) { > +return NVME_INVALID_FIELD; > +} > + > req->ns = ns; > > switch (req->cmd.opcode) { > @@ -5475,6 +5479,10 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, > NvmeRequest *req) > return NVME_INVALID_FIELD | NVME_DNR; > } > > +if (NVME_CMD_FLAGS_FUSE(req->cmd.flags)) { > +return NVME_INVALID_FIELD; > +} I'm a little surprised this macro exists considering this is the first time it's used! But this looks fine, I really hope hosts weren't actually trying fused commands on this target, but it's good to confirm. Reviewed-by: Keith Busch
Re: [PATCH] hw/nvme: fix validation of ASQ and ACQ
On Mon, Aug 23, 2021 at 02:20:18PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Address 0x0 is a valid address. Fix the admin submission and completion > queue address validation to not error out on this. Indeed, there are environments that can use that address. It's a host error if the controller was enabled with invalid queue addresses anyway. The controller only needs to verify the lower bits are clear, which we do later. Reviewed-by: Keith Busch
Re: [RFC PATCH v1] Adding Support for namespace management
On Fri, Aug 13, 2021 at 03:02:22PM +0530, Naveen wrote: > +static uint16_t nvme_identify_ns_common(NvmeCtrl *n, NvmeRequest *req) > +{ > +NvmeIdNs id_ns = {}; > + > +id_ns.nsfeat |= (0x4 | 0x10); > +id_ns.dpc = 0x1f; > + > +NvmeLBAF lbaf[16] = { > +[0] = {.ds = 9}, > +[1] = {.ds = 9, .ms = 8}, > +[2] = {.ds = 9, .ms = 16}, > +[3] = {.ds = 9, .ms = 64}, > +[4] = {.ds = 12}, > +[5] = {.ds = 12, .ms = 8}, > +[6] = {.ds = 12, .ms = 16}, > +[7] = {.ds = 12, .ms = 64}, > +}; Since the lbaf is a copy of what's defined in nvme_ns_init, so should be defined for reuse. > + > +memcpy(&id_ns.lbaf, &lbaf, sizeof(lbaf)); > +id_ns.nlbaf = 7; The identify structure should be what's in common with all namespaces, and this doesn't look complete. Just off the top of my head, missing fields include MC and DLFEAT. > + > +return nvme_c2h(n, (uint8_t *)&id_ns, sizeof(NvmeIdNs), req); > +} > + > static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req, bool active) > { > NvmeNamespace *ns; > @@ -4453,8 +4478,10 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, > NvmeRequest *req, bool active) > > trace_pci_nvme_identify_ns(nsid); > > -if (!nvme_nsid_valid(n, nsid) || nsid == NVME_NSID_BROADCAST) { > +if (!nvme_nsid_valid(n, nsid)) { > return NVME_INVALID_NSID | NVME_DNR; > +} else if (nsid == NVME_NSID_BROADCAST) { > +return nvme_identify_ns_common(n, req); > } > > ns = nvme_ns(n, nsid); > @@ -5184,6 +5211,195 @@ static void nvme_select_iocs_ns(NvmeCtrl *n, > NvmeNamespace *ns) > } > } > > +static int nvme_blk_truncate(BlockBackend *blk, size_t len, Error **errp) > +{ > +int ret; > +uint64_t perm, shared_perm; > + > +blk_get_perm(blk, &perm, &shared_perm); > + > +ret = blk_set_perm(blk, perm | BLK_PERM_RESIZE, shared_perm, errp); > +if (ret < 0) { > +return ret; > +} > + > +ret = blk_truncate(blk, len, false, PREALLOC_MODE_OFF, 0, errp); > +if (ret < 0) { > +return ret; > +} > + > +ret = blk_set_perm(blk, perm, shared_perm, errp); > +if (ret < 0) { > +return ret; > +} > + > +return 
0; > +} > + > +static uint32_t nvme_allocate_nsid(NvmeCtrl *n) > +{ > +uint32_t nsid = 0; > +for (int i = 1; i <= NVME_MAX_NAMESPACES; i++) { > +if (nvme_ns(n, i) || nvme_subsys_ns(n->subsys, i)) { > +continue; > +} > + > +nsid = i; > +return nsid; > +} > +return nsid; > +} > + > +static uint16_t nvme_namespace_create(NvmeCtrl *n, NvmeRequest *req) > +{ > + uint32_t ret; > + NvmeIdNs id_ns_host; > + NvmeSubsystem *subsys = n->subsys; > + Error *err = NULL; > + uint8_t flbas_host; > + uint64_t ns_size; > + int lba_index; > + NvmeNamespace *ns; > + NvmeCtrl *ctrl; > + NvmeIdNs *id_ns; > + > +ret = nvme_h2c(n, (uint8_t *)&id_ns_host, sizeof(id_ns_host), req); > +if (ret) { > +return ret; > +} Some unusual indentation here. > + > +if (id_ns_host.ncap < id_ns_host.nsze) { > +return NVME_THIN_PROVISION_NO_SUPP | NVME_DNR; > +} else if (id_ns_host.ncap > id_ns_host.nsze) { > +return NVME_INVALID_FIELD | NVME_DNR; > +} > + > +if (!id_ns_host.nsze) { > +return NVME_INVALID_FIELD | NVME_DNR; > +} > + > +if (QSLIST_EMPTY(&subsys->unallocated_namespaces)) { > +return NVME_NS_ID_UNAVAILABLE; > +} > + > +ns = QSLIST_FIRST(&subsys->unallocated_namespaces); > +id_ns = &ns->id_ns; > +flbas_host = (id_ns_host.flbas) & (0xF); > + > +if (flbas_host > id_ns->nlbaf) { > +return NVME_INVALID_FORMAT | NVME_DNR; > +} > + > +ret = nvme_ns_setup(ns, &err); > +if (ret) { > +return ret; > +} > + > +id_ns->flbas = id_ns_host.flbas; > +id_ns->dps = id_ns_host.dps; > +id_ns->nmic = id_ns_host.nmic; > + > +lba_index = NVME_ID_NS_FLBAS_INDEX(id_ns->flbas); > +ns_size = id_ns_host.nsze * ((1 << id_ns->lbaf[lba_index].ds) + > +(id_ns->lbaf[lba_index].ms)); > +id_ns->nvmcap = ns_size; > + > +if (ns_size > n->id_ctrl.unvmcap) { > +return NVME_NS_INSUFF_CAP; > +} > + > +ret = nvme_blk_truncate(ns->blkconf.blk, id_ns->nvmcap, &err); > +if (ret) { > +return ret; > +} > + > +ns->size = blk_getlength(ns->blkconf.blk); > +nvme_ns_init_format(ns); > + > +ns->params.nsid = nvme_allocate_nsid(n); > +if 
(!ns->params.nsid) { > +return NVME_NS_ID_UNAVAILABLE; > +} > +subsys->namespaces[ns->params.nsid] = ns; > + > +for (int cntlid = 0; cntlid < ARRAY_SIZE(n->subsys->ctrls); cntlid++) { > +ctrl = nvme_subsys_ctrl(n->subsys, cntlid); > +if (ctrl) { > +ctrl->id_ctrl.unvmcap -= le64_to_cpu(ns->size); > +} > +} > + > +stl_le_p(&req->cqe.result, ns->params.nsid); > +QSLIST_REMOVE_HEAD(&subsys->unalloc
Re: [PATCH v6 1/5] hw/nvme: split pmrmsc register into upper and lower
On Wed, Jul 21, 2021 at 09:48:32AM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > The specification uses a set of 32 bit PMRMSCL and PMRMSCU registers to > make up the 64 bit logical PMRMSC register. > > Make it so. Looks good. Reviewed-by: Keith Busch
Re: [PATCH v5 2/5] hw/nvme: use symbolic names for registers
On Tue, Jul 20, 2021 at 12:46:44AM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Add the NvmeBarRegs enum and use these instead of explicit register > offsets. Thanks, this is a very nice cleanup. For a suggested follow-up companion patch, we should add "ASSERT_OFFSET()" checks for each register to enforce correct positioning of the BAR offsets at build time. Reviewed-by: Keith Busch > Signed-off-by: Klaus Jensen > Reviewed-by: Gollu Appalanaidu > Reviewed-by: Philippe Mathieu-Daudé > --- > include/block/nvme.h | 29 - > hw/nvme/ctrl.c | 44 ++-- > 2 files changed, 50 insertions(+), 23 deletions(-) > > diff --git a/include/block/nvme.h b/include/block/nvme.h > index 84053b68b987..77aae0117494 100644 > --- a/include/block/nvme.h > +++ b/include/block/nvme.h > @@ -9,7 +9,7 @@ typedef struct QEMU_PACKED NvmeBar { > uint32_tcc; > uint8_t rsvd24[4]; > uint32_tcsts; > -uint32_tnssrc; > +uint32_tnssr; > uint32_taqa; > uint64_tasq; > uint64_tacq; > @@ -31,6 +31,33 @@ typedef struct QEMU_PACKED NvmeBar { > uint8_t css[484]; > } NvmeBar; > > +enum NvmeBarRegs { > +NVME_REG_CAP = offsetof(NvmeBar, cap), > +NVME_REG_VS = offsetof(NvmeBar, vs), > +NVME_REG_INTMS = offsetof(NvmeBar, intms), > +NVME_REG_INTMC = offsetof(NvmeBar, intmc), > +NVME_REG_CC = offsetof(NvmeBar, cc), > +NVME_REG_CSTS= offsetof(NvmeBar, csts), > +NVME_REG_NSSR= offsetof(NvmeBar, nssr), > +NVME_REG_AQA = offsetof(NvmeBar, aqa), > +NVME_REG_ASQ = offsetof(NvmeBar, asq), > +NVME_REG_ACQ = offsetof(NvmeBar, acq), > +NVME_REG_CMBLOC = offsetof(NvmeBar, cmbloc), > +NVME_REG_CMBSZ = offsetof(NvmeBar, cmbsz), > +NVME_REG_BPINFO = offsetof(NvmeBar, bpinfo), > +NVME_REG_BPRSEL = offsetof(NvmeBar, bprsel), > +NVME_REG_BPMBL = offsetof(NvmeBar, bpmbl), > +NVME_REG_CMBMSC = offsetof(NvmeBar, cmbmsc), > +NVME_REG_CMBSTS = offsetof(NvmeBar, cmbsts), > +NVME_REG_PMRCAP = offsetof(NvmeBar, pmrcap), > +NVME_REG_PMRCTL = offsetof(NvmeBar, pmrctl), > +NVME_REG_PMRSTS = offsetof(NvmeBar, pmrsts), > +NVME_REG_PMREBS = 
offsetof(NvmeBar, pmrebs), > +NVME_REG_PMRSWTP = offsetof(NvmeBar, pmrswtp), > +NVME_REG_PMRMSCL = offsetof(NvmeBar, pmrmscl), > +NVME_REG_PMRMSCU = offsetof(NvmeBar, pmrmscu), > +}; > + > enum NvmeCapShift { > CAP_MQES_SHIFT = 0, > CAP_CQR_SHIFT = 16, > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index 28299c6f3764..8c305315f41c 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -5740,7 +5740,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, > uint64_t data, > } > > switch (offset) { > -case 0xc: /* INTMS */ > +case NVME_REG_INTMS: > if (unlikely(msix_enabled(&(n->parent_obj { > NVME_GUEST_ERR(pci_nvme_ub_mmiowr_intmask_with_msix, > "undefined access to interrupt mask set" > @@ -5752,7 +5752,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, > uint64_t data, > trace_pci_nvme_mmio_intm_set(data & 0x, n->bar.intmc); > nvme_irq_check(n); > break; > -case 0x10: /* INTMC */ > +case NVME_REG_INTMC: > if (unlikely(msix_enabled(&(n->parent_obj { > NVME_GUEST_ERR(pci_nvme_ub_mmiowr_intmask_with_msix, > "undefined access to interrupt mask clr" > @@ -5764,7 +5764,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, > uint64_t data, > trace_pci_nvme_mmio_intm_clr(data & 0x, n->bar.intmc); > nvme_irq_check(n); > break; > -case 0x14: /* CC */ > +case NVME_REG_CC: > trace_pci_nvme_mmio_cfg(data & 0x); > /* Windows first sends data, then sends enable bit */ > if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) && > @@ -5798,7 +5798,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, > uint64_t data, > n->bar.cc = data; > } > break; > -case 0x1c: /* CSTS */ > +case NVME_REG_CSTS: > if (data & (1 << 4)) { > NVME_GUEST_ERR(pci_nvme_ub_mmiowr_ssreset_w1c_unsupported, > "attempted to W1C CSTS.NSSRO" > @@ -5809,7 +5809,7 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, > uint64_t data, > " of controll
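The build-time "ASSERT_OFFSET()" follow-up suggested in the review above could look roughly like this, using C11 `_Static_assert` in place of QEMU's build-bug wrappers and a trimmed, hypothetical slice of the register layout (offsets per the NVMe specification):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Trimmed, hypothetical slice of the NVMe BAR layout; offsets follow
 * the specification's register map. */
typedef struct NvmeBarDemo {
    uint64_t cap;       /* 0x00 */
    uint32_t vs;        /* 0x08 */
    uint32_t intms;     /* 0x0c */
    uint32_t intmc;     /* 0x10 */
    uint32_t cc;        /* 0x14 */
} NvmeBarDemo;

/* Any accidental reordering or padding breaks the build, not runtime. */
_Static_assert(offsetof(NvmeBarDemo, vs) == 0x8, "VS register offset");
_Static_assert(offsetof(NvmeBarDemo, intms) == 0xc, "INTMS register offset");
_Static_assert(offsetof(NvmeBarDemo, cc) == 0x14, "CC register offset");
```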
Re: [RFC PATCH nvme-cli 2/2] nvme-cli/plugins/mi:add support
> +int hal_init() > +{ > +int retval = -1; > +switch (GetSidebandInterface()) { > +case qemu_nvme_mi: > +retval = qemu_mi_init(); > +break; > +default: > +break; > +} > +return retval; > +} > + > +int hal_open() > +{ > +int retval = -1; > +switch (GetSidebandInterface()) { > +case qemu_nvme_mi: > +retval = qemu_mi_open(); > +break; > +default: > +break; > +} > +return retval; > +} > + > +int hal_close() > +{ > +int retval = -1; > +switch (GetSidebandInterface()) { > +case qemu_nvme_mi: > +retval = qemu_mi_close(); > +break; > +default: > +break; > +} > +return retval; > +} > + > +int hal_i2c_write(uint8_t *data_out, uint16_t num_bytes) > +{ > +int retval = -1; > +switch (GetSidebandInterface()) { > +case qemu_nvme_mi: > +retval = qemu_mi_write(data_out, num_bytes); > +break; > +default: > +break; > +} > +return retval; > +} > + > +int hal_i2c_read(uint8_t *data_in, uint16_t num_bytes) > +{ > +uint32_t retval = -1; > +switch (GetSidebandInterface()) { > +case qemu_nvme_mi: > +retval = qemu_mi_read(data_in, num_bytes); > +break; > +default: > +break; > +} > +return retval; > +} I'm really not a fan of having non-standard interfaces. If you were going to do that, though, you should create a struct of function pointers so that you don't need these repetitive "switch (...)" statements. But if we're going to have OOB MI support in tooling, they should all use the same standard-defined interface.
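The struct-of-function-pointers shape being suggested, sketched with stub backends (all names here are illustrative, not from the patch):

```c
#include <assert.h>
#include <stdint.h>

/* One ops table per sideband backend; illustrative names only. */
typedef struct sideband_ops {
    int (*init)(void);
    int (*open)(void);
    int (*close)(void);
    int (*write)(uint8_t *data, uint16_t len);
    int (*read)(uint8_t *data, uint16_t len);
} sideband_ops;

/* Stub qemu-mi backend standing in for the real implementation. */
static int qemu_mi_init(void)  { return 0; }
static int qemu_mi_open(void)  { return 0; }
static int qemu_mi_close(void) { return 0; }
static int qemu_mi_write(uint8_t *data, uint16_t len) { (void)data; return len; }
static int qemu_mi_read(uint8_t *data, uint16_t len)  { (void)data; return len; }

static const sideband_ops qemu_mi_ops = {
    .init  = qemu_mi_init,
    .open  = qemu_mi_open,
    .close = qemu_mi_close,
    .write = qemu_mi_write,
    .read  = qemu_mi_read,
};

/* Selected once (e.g. from GetSidebandInterface()) instead of
 * switching on every call. */
static const sideband_ops *ops = &qemu_mi_ops;

/* Every hal_*() wrapper collapses to one indirect call. */
static int hal_init(void) { return ops->init(); }
static int hal_i2c_write(uint8_t *data, uint16_t len) { return ops->write(data, len); }
```

Adding a new backend then means filling in one table, not touching five switch statements.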
Re: [RFC PATCH 1/2] hw/nvme: add mi device
On Fri, Jul 09, 2021 at 07:25:45PM +0530, Padmakar Kalghatgi wrote: > The following commands are tested with nvme-cli by hooking > to the cid of the vsock as shown above and use the socket > send/recieve commands to issue the commands and get the response. Why sockets? Shouldn't MI targets use SMBus for that?
[PATCH] nvme: add 'zoned.zasl' to documentation
Signed-off-by: Keith Busch --- docs/system/nvme.rst | 6 ++ 1 file changed, 6 insertions(+) diff --git a/docs/system/nvme.rst b/docs/system/nvme.rst index 33a15c7dbc..bff72d1c24 100644 --- a/docs/system/nvme.rst +++ b/docs/system/nvme.rst @@ -202,6 +202,12 @@ The namespace may be configured with additional parameters allows all zones to be open. If ``zoned.max_active`` is specified, this value must be less than or equal to that. +``zoned.zasl=UINT8`` (default: ``0``) + Set the maximum data transfer size for the Zone Append command. Like + ``mdts``, the value is specified as a power of two (2^n) and is in units of + the minimum memory page size (CAP.MPSMIN). The default value (``0``) + has this property inherit the ``mdts`` value. + Metadata -- 2.25.4
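For reference, a possible invocation using the documented parameter (illustrative only: the image path and serial are placeholders, and `zoned.zasl` is a namespace parameter, so it goes on an `nvme-ns` device attached to the controller):

```shell
# Cap Zone Append transfers at 2^1 = 2 minimum memory pages,
# independent of mdts. Paths and serial are placeholders.
qemu-system-x86_64 \
    -drive id=nvm0,if=none,file=zns.img,format=raw \
    -device nvme,serial=deadbeef \
    -device nvme-ns,drive=nvm0,zoned=true,zoned.zasl=1
```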
Re: [PATCH v2] hw/nvme: fix pin-based interrupt behavior (again)
On Thu, Jun 17, 2021 at 08:55:42PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Jakub noticed[1] that, when using pin-based interrupts, the device will > unconditionally deasssert when any CQEs are acknowledged. However, the > pin should not be deasserted if other completion queues still holds > unacknowledged CQEs. > > The bug is an artifact of commit ca247d35098d ("hw/block/nvme: fix > pin-based interrupt behavior") which fixed one bug but introduced > another. This is the third time someone tries to fix pin-based > interrupts (see commit 5e9aa92eb1a5 ("hw/block: Fix pin-based interrupt > behaviour of NVMe"))... > > Third time's the charm, so fix it, again, by keeping track of how many > CQs have unacknowledged CQEs and only deassert when all are cleared. > > [1]: <20210610114624.304681-1-jakub.jer...@kernkonzept.com> > > Fixes: ca247d35098d ("hw/block/nvme: fix pin-based interrupt behavior") > Reported-by: Jakub Jermář > Signed-off-by: Klaus Jensen Looks good. Reviewed-by: Keith Busch
Re: [PATCH] hw/nvme: fix missing check for PMR capability
On Mon, Jun 07, 2021 at 11:47:57AM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Qiang Liu reported that an access on an unknown address is triggered in > memory_region_set_enabled because a check on CAP.PMRS is missing for the > PMRCTL register write when no PMR is configured. > > Cc: qemu-sta...@nongnu.org > Fixes: 75c3c9de961d ("hw/block/nvme: disable PMR at boot up") > Resolves: https://gitlab.com/qemu-project/qemu/-/issues/362 > Signed-off-by: Klaus Jensen Looks good. Reviewed-by: Keith Busch > --- > hw/nvme/ctrl.c | 4 > 1 file changed, 4 insertions(+) > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index 0bcaf7192f99..463772602c4e 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -5583,6 +5583,10 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, > uint64_t data, > "invalid write to PMRCAP register, ignored"); > return; > case 0xe04: /* PMRCTL */ > +if (!NVME_CAP_PMRS(n->bar.cap)) { > +return; > +} > + > n->bar.pmrctl = data; > if (NVME_PMRCTL_EN(data)) { > memory_region_set_enabled(&n->pmr.dev->mr, true); > -- > 2.31.1
Re: [PATCH v2 11/11] Partially revert "hw/block/nvme: drain namespaces on sq deletion"
On Thu, Jun 17, 2021 at 09:06:57PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > This partially reverts commit 98f84f5a4eca5c03e32fff20f246d9b4b96d6422. > > Since all "multi aio" commands are now reimplemented to properly track > the nested aiocbs, we can revert the "hack" that was introduced to make > sure all requests we're properly drained upon sq deletion. > > The revert is partial since we keep the assert that no outstanding > requests remain on the submission queue after the explicit cancellation. > > Signed-off-by: Klaus Jensen > --- > hw/nvme/ctrl.c | 19 ++- > 1 file changed, 2 insertions(+), 17 deletions(-) > > diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c > index ec8ddb76cd39..5a1d25265a9d 100644 > --- a/hw/nvme/ctrl.c > +++ b/hw/nvme/ctrl.c > @@ -3918,7 +3918,6 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest > *req) > NvmeSQueue *sq; > NvmeCQueue *cq; > uint16_t qid = le16_to_cpu(c->qid); > -uint32_t nsid; > > if (unlikely(!qid || nvme_check_sqid(n, qid))) { > trace_pci_nvme_err_invalid_del_sq(qid); > @@ -3930,22 +3929,8 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest > *req) > sq = n->sq[qid]; > while (!QTAILQ_EMPTY(&sq->out_req_list)) { > r = QTAILQ_FIRST(&sq->out_req_list); > -if (r->aiocb) { > -blk_aio_cancel(r->aiocb); > -} > -} > - > -/* > - * Drain all namespaces if there are still outstanding requests that we > - * could not cancel explicitly. > - */ > -if (!QTAILQ_EMPTY(&sq->out_req_list)) { Quick question: was this 'if' ever even possible to hit? The preceding 'while' loop doesn't break out until the queue is empty, so this should have been unreachable. > -for (nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) { > -NvmeNamespace *ns = nvme_ns(n, nsid); > -if (ns) { > -nvme_ns_drain(ns); > -} > -} > +assert(r->aiocb); > +blk_aio_cancel(r->aiocb); > } > > assert(QTAILQ_EMPTY(&sq->out_req_list)); > --
Re: [PATCH v2 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
On Thu, Jun 17, 2021 at 09:06:46PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > This series reimplements flush, dsm, copy, zone reset and format nvm to > allow cancellation. I posted an RFC back in March ("hw/block/nvme: > convert ad-hoc aio tracking to aiocb") and I've applied some feedback > from Stefan and reimplemented the remaining commands. > > The basic idea is to define custom AIOCBs for these commands. The custom > AIOCB takes care of issuing all the "nested" AIOs one by one instead of > blindly sending them off simultaneously without tracking the returned > aiocbs. Klaus, This series looks good to me. Reviewed-by: Keith Busch
Re: [PATCH] hw/nvme: fix pin-based interrupt behavior (again)
On Thu, Jun 17, 2021 at 12:08:20PM +0200, Klaus Jensen wrote: > if (cq->tail != cq->head) { > +if (!pending) { > +n->cq_pending++; > +} You should check cq->irq_enabled before incrementing cq_pending. You don't want to leave the irq asserted when polled queues have outstanding CQEs.
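The accounting being asked for, reduced to a toy predicate (hypothetical names; the real code manipulates an NvmeCQueue): a queue only contributes to `cq_pending` when it holds unacknowledged CQEs *and* has its interrupt enabled, so polled queues never keep the pin asserted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical reduction of the completion-queue state involved. */
typedef struct {
    int head;
    int tail;
    bool irq_enabled;
} cq_t;

/* How much the controller's cq_pending counter should move when this
 * CQ posts completions: polled (irq-disabled) queues never contribute,
 * and an already-counted queue is not recounted. */
static int cq_pending_delta(const cq_t *cq, bool already_pending)
{
    if (cq->tail != cq->head && cq->irq_enabled && !already_pending) {
        return 1;
    }
    return 0;
}
```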
Re: [PATCH v2 2/3] hw/nvme: namespace parameter for EUI-64
On Sat, Jun 12, 2021 at 01:46:30AM +0200, Heinrich Schuchardt wrote: > +``eui64`` > + Set the EUI-64 of the namespace. This will be reported as a "IEEE Extended > + Unique Identifier" descriptor in the Namespace Identification Descriptor > List. This needs to match Identify Namespace's EUI64 field, too. The Descriptor List is really just a copy of what is reported in that structure.
Re: [PATCH v2 1/2] hw/nvme: add support for boot partiotions
On Tue, Jun 01, 2021 at 09:07:56PM +0200, Klaus Jensen wrote: > On Jun 1 11:07, Keith Busch wrote: > > On Tue, Jun 01, 2021 at 07:41:34PM +0200, Klaus Jensen wrote: > > > On Jun 1 10:19, Keith Busch wrote: > > > > On Tue, Jun 01, 2021 at 08:07:48PM +0530, Gollu Appalanaidu wrote: > > > > > NVMe Boot Partitions provides an area that may be read by the host > > > > > without initializing queues or even enabling the controller. This > > > > > allows various platform initialization code to be stored on the NVMe > > > > > device instead of some separete medium. > > > > > > > > > > This patch adds the read support for such an area, as well as support > > > > > for updating the boot partition contents from the host through the > > > > > FW Download and Commit commands. > > > > > > > > Please provide some details on what platform initilization sequence > > > > running on QEMU is going to make use of this feature. > > > > > > > > > > I totally get your reluctance to accept useless features like device > > > self-test and ill-supported ones like write uncorrectable. > > > > > > But I think this feature qualifies just fine for the device. It is useful > > > for embedded development and while there might not be any qemu boards that > > > wants to use this *right now*, it allows for experimentation. And this is > > > a > > > feature that actually *is* implemented by real products for embedded > > > systems. > > > > That wasn't my request, though. I am well aware of the feature and also > > have hardware that implements it. It just sounds like you haven't > > actually tested this feature under the protocol's intended use cases > > inside this environment. I think that type of testing and a high level > > description of it in the changelog ought to be part of acceptance > > criteria. > > > > Alright, I see. > > You'd like to see this tested by defining a new board that loads firmware > over PCIe from the device? Yes, something like that. 
When the feature was initially published, I took a very brief look at how qemu could use it and concluded this wasn't very practical here. I would be happy to know if there's any example platform that can use it, though. That, to me, would demonstrate sufficient value.
Re: [PATCH v2 1/2] hw/nvme: add support for boot partiotions
On Tue, Jun 01, 2021 at 07:41:34PM +0200, Klaus Jensen wrote: > On Jun 1 10:19, Keith Busch wrote: > > On Tue, Jun 01, 2021 at 08:07:48PM +0530, Gollu Appalanaidu wrote: > > > NVMe Boot Partitions provides an area that may be read by the host > > > without initializing queues or even enabling the controller. This > > > allows various platform initialization code to be stored on the NVMe > > > device instead of some separete medium. > > > > > > This patch adds the read support for such an area, as well as support > > > for updating the boot partition contents from the host through the > > > FW Download and Commit commands. > > > > Please provide some details on what platform initilization sequence > > running on QEMU is going to make use of this feature. > > > > I totally get your reluctance to accept useless features like device > self-test and ill-supported ones like write uncorrectable. > > But I think this feature qualifies just fine for the device. It is useful > for embedded development and while there might not be any qemu boards that > wants to use this *right now*, it allows for experimentation. And this is a > feature that actually *is* implemented by real products for embedded > systems. That wasn't my request, though. I am well aware of the feature and also have hardware that implements it. It just sounds like you haven't actually tested this feature under the protocol's intended use cases inside this environment. I think that type of testing and a high level description of it in the changelog ought to be part of acceptance criteria.
Re: [PATCH v2 1/2] hw/nvme: add support for boot partiotions
On Tue, Jun 01, 2021 at 08:07:48PM +0530, Gollu Appalanaidu wrote: > NVMe Boot Partitions provides an area that may be read by the host > without initializing queues or even enabling the controller. This > allows various platform initialization code to be stored on the NVMe > device instead of some separete medium. > > This patch adds the read support for such an area, as well as support > for updating the boot partition contents from the host through the > FW Download and Commit commands. Please provide some details on what platform initialization sequence running on QEMU is going to make use of this feature.
Re: [PATCH 07/23] hw/block/nvme: Use definition to avoid dynamic stack allocation
On Wed, May 05, 2021 at 06:09:10PM -0500, Eric Blake wrote: > On 5/5/21 5:07 PM, Philippe Mathieu-Daudé wrote: > > +Eric > > > > On 5/5/21 11:22 PM, Keith Busch wrote: > >> On Wed, May 05, 2021 at 11:10:31PM +0200, Philippe Mathieu-Daudé wrote: > >>> The compiler isn't clever enough to figure 'SEG_CHUNK_SIZE' is > >>> a constant! Help it by using a definitions instead. > >> > >> I don't understand. > > > > Neither do I TBH... > > > >> It's labeled 'const', so any reasonable compiler > >> will place it in the 'text' segment of the executable rather than on the > >> stack. While that's compiler specific, is there really a compiler doing > >> something bad with this? If not, I do prefer the 'const' here if only > >> because it limits the symbol scope ('static const' is also preferred if > >> that helps). > > > > Using: gcc version 10.2.1 20201125 (Red Hat 10.2.1-9) (GCC) > > > > Both static+const / const trigger: > > > > hw/block/nvme.c: In function ‘nvme_map_sgl’: > > hw/block/nvme.c:818:5: error: ISO C90 forbids variable length array > > ‘segment’ [-Werror=vla] > > 818 | NvmeSglDescriptor segment[SEG_CHUNK_SIZE], *sgld, *last_sgld; > > | ^ > > cc1: all warnings being treated as errors > > C99 6.7.5.2 paragraph 4: > "If the size is an integer constant expression and the element type has > a known constant size, the array type is not a variable length array > type; otherwise, the array type is a variable length array type." > > 6.6 paragraph 6: > "An integer constant expression shall have integer type and shall only > have operands that are integer constants, enumeration constants, > character constants, sizeof expressions whose results are integer > constants, and floating constants that are the immediate operands of > casts. Cast operators in an integer constant expression shall only > convert arithmetic types to integer types, except as part of an operand > to the sizeof operator." 
> > Notably absent from that list are 'const int' variables, which even > though they act as constants (in that the name always represents the > same value), do not actually qualify as such under C99 rules. Yes, it's > a pain. What's more, 6.6 paragraph 10: > > "An implementation may accept other forms of constant expressions." > > which means it _should_ be possible for the compiler to do what we want. > But just because it is permitted does not make it actually work. :( Thank you, that makes sense. In this case, we are talking about an integer constant expression for the value of a 'const' symbol. That seems like perfect fodder for a compiler to optimize. I understand it's not obligated to do that, but I assumed it would. Anyway, Philippe, I have no objection if you want to push forward with this series.
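The C99 rule quoted above can be shown in a few lines (names here are illustrative): a `const int` is not an integer constant expression, so an array sized by it is formally a VLA and trips `-Werror=vla`, while a `#define` or `enum` constant sizes an ordinary array.

```c
#include <assert.h>

#define SEG_CHUNK_SIZE_DEF 256          /* true constant expression */
enum { SEG_CHUNK_SIZE_ENUM = 256 };     /* also a constant expression */

static int demo(void)
{
    const int seg_chunk_const = 256;
    /* int vla[seg_chunk_const];  -- formally a VLA under C99 6.7.5.2,
     * even though the value never changes; -Wvla warns here. */
    (void)seg_chunk_const;

    int b[SEG_CHUNK_SIZE_DEF];   /* ordinary constant-size array */
    int c[SEG_CHUNK_SIZE_ENUM];  /* ordinary constant-size array */
    return (int)(sizeof(b) / sizeof(b[0]) + sizeof(c) / sizeof(c[0]));
}
```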
Re: [PATCH 07/23] hw/block/nvme: Use definition to avoid dynamic stack allocation
On Wed, May 05, 2021 at 11:10:31PM +0200, Philippe Mathieu-Daudé wrote: > The compiler isn't clever enough to figure 'SEG_CHUNK_SIZE' is > a constant! Help it by using a definitions instead. I don't understand. It's labeled 'const', so any reasonable compiler will place it in the 'text' segment of the executable rather than on the stack. While that's compiler specific, is there really a compiler doing something bad with this? If not, I do prefer the 'const' here if only because it limits the symbol scope ('static const' is also preferred if that helps). > Signed-off-by: Philippe Mathieu-Daudé > --- > hw/block/nvme.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > index 5fe082ec34c..2f6d4925826 100644 > --- a/hw/block/nvme.c > +++ b/hw/block/nvme.c > @@ -812,7 +812,7 @@ static uint16_t nvme_map_sgl(NvmeCtrl *n, NvmeSg *sg, > NvmeSglDescriptor sgl, > * descriptors and segment chain) than the command transfer size, so it > is > * not bounded by MDTS. > */ > -const int SEG_CHUNK_SIZE = 256; > +#define SEG_CHUNK_SIZE 256 > > NvmeSglDescriptor segment[SEG_CHUNK_SIZE], *sgld, *last_sgld; > uint64_t nsgld; > -- > 2.26.3 >
Re: [PATCH 00/14] hw(/block/)nvme: spring cleaning
On Mon, Apr 19, 2021 at 09:27:47PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > This series consists of various clean up patches. > > The final patch moves nvme emulation from hw/block to hw/nvme. Series looks good to me. Reviewed-by: Keith Busch
Re: [PATCH v3] hw/block/nvme: add device self test command support
On Wed, Mar 31, 2021 at 02:54:27PM +0530, Gollu Appalanaidu wrote: > This is to add support for Device Self Test Command (DST) and > DST Log Page. Refer NVM Express specification 1.4b section 5.8 > ("Device Self-test command") Please don't write change logs that just say what you did. I can read the code to see that. Explain why this is useful because this frankly looks like another useless feature. We don't need to implement every optional spec feature here. There should be a real value proposition.
Re: [PATCH] hw/block/nvme: slba equal to nsze is out of bounds if nlb is 1-based
On Fri, Apr 09, 2021 at 01:55:01PM +0200, Klaus Jensen wrote: > On Apr 9 20:05, Minwoo Im wrote: > > On 21-04-09 13:14:02, Gollu Appalanaidu wrote: > > > NSZE is the total size of the namespace in logical blocks. So the max > > > addressable logical block is NLB minus 1. So your starting logical > > > block is equal to NSZE it is a out of range. > > > > > > Signed-off-by: Gollu Appalanaidu > > > --- > > > hw/block/nvme.c | 2 +- > > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c > > > index 953ec64729..be9edb1158 100644 > > > --- a/hw/block/nvme.c > > > +++ b/hw/block/nvme.c > > > @@ -2527,7 +2527,7 @@ static uint16_t nvme_dsm(NvmeCtrl *n, NvmeRequest > > > *req) > > > uint64_t slba = le64_to_cpu(range[i].slba); > > > uint32_t nlb = le32_to_cpu(range[i].nlb); > > > > > > -if (nvme_check_bounds(ns, slba, nlb)) { > > > +if (nvme_check_bounds(ns, slba, nlb) || slba == > > > ns->id_ns.nsze) { > > > > This patch also looks like check the boundary about slba. Should it be > > also checked inside of nvme_check_bounds() ? > > The catch here is that DSM is like the only command where the number of > logical blocks is a 1s-based value. Otherwise we always have nlb > 0, which > means that nvme_check_bounds() will always "do the right thing". > > My main gripe here is that (in my mind), by definition, a "zero length > range" does not reference any LBAs at all. So how can it result in LBA Out > of Range? So what's the problem? If the request is to discard 0 blocks starting from the last block, then that's valid. Is this patch actually fixing anything?
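A toy bounds check illustrates the disagreement above (a sketch, not QEMU's actual `nvme_check_bounds()`): with nlb == 0 the range [slba, slba) references no LBAs, so slba == nsze passes a naive end-of-range check, which is exactly the case Keith argues is valid.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical end-of-range check: the range [slba, slba + nlb) must
 * not extend past the namespace size.  A zero-length range at nsze
 * touches no blocks and therefore passes. */
static bool in_bounds(uint64_t slba, uint32_t nlb, uint64_t nsze)
{
    return slba + nlb <= nsze;
}
```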
Re: [PATCH] hw/block/nvme: map prp fix if prp2 contains non-zero offset
On Thu, Apr 08, 2021 at 09:53:13PM +0530, Padmakar Kalghatgi wrote: > +/* > + * The first PRP list entry, pointed by PRP2 can contain > + * offsets. Hence, we need calculate the no of entries in > + * prp2 based on the offset it has. > + */ This comment has some unnecessary spacing at the beginning. > +nents = (n->page_size - (prp2 % n->page_size)) >> 3; page_size is always a power of two, so let's replace the costly modulo with: nents = (n->page_size - (prp2 & (n->page_size - 1))) >> 3;
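The suggested expression can be sketched as a small helper (the function itself is illustrative; only the arithmetic comes from the review): given the byte offset of prp2 within its page, it yields the number of 8-byte PRP entries remaining in that page, with the bitwise AND standing in for the modulo since page_size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-byte PRP entries that fit between prp2 and the end of
 * its page.  page_size must be a power of two, so (prp2 & (page_size
 * - 1)) is the offset within the page; the shift by 3 divides by the
 * PRP entry size. */
static uint32_t prp2_nents(uint64_t prp2, uint32_t page_size)
{
    return (uint32_t)((page_size - (prp2 & (page_size - 1))) >> 3);
}
```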
Re: [PATCH for-6.0 v2 0/8] hw/block/nvme: misc fixes
On Mon, Apr 05, 2021 at 07:54:44PM +0200, Klaus Jensen wrote: > From: Klaus Jensen > > Various fixes for 6.0. > > v2: > - "hw/block/nvme: fix handling of private namespaces" > update documentation (Gollu) > - add a patch for missing copyright headers Series looks fine. Reviewed-by: Keith Busch
Re: [PATCH V4 0/8] hw/block/nvme: support namespace attachment
On Tue, Mar 02, 2021 at 10:26:09PM +0900, Minwoo Im wrote: > Hello, > > This series supports namespace attachment: attach and detach. This is > the fourth version of series with replacing changed namespace list to > bitmap to indicate changed namespace IDs. > > Please review. Looks good to me. Reviewed-by: Keith Busch
Re: [PATCH v4 00/12] hw/block/nvme: metadata and end-to-end data protection support
On Mon, Mar 01, 2021 at 03:00:35PM +0100, Klaus Jensen wrote: > From: Klaus Jensen > > This is v4 (RFC removed) of a series that adds support for metadata and > end-to-end data protection. > > First, on the subject of metadata, in v1, support was restricted to > extended logical blocks, which was pretty trivial to implement, but > required special initialization and broke DULBE. In v2, metadata is > always stored continuously at the end of the underlying block device. > This has the advantage of not breaking DULBE since the data blocks > remains aligned and allows bdrv_block_status to be used to determinate > allocation status. It comes at the expense of complicating the extended > LBA emulation, but on the other hand it also gains support for metadata > transfered as a separate buffer. > > The end-to-end data protection support blew up in terms of required > changes. This is due to the fact that a bunch of new commands has been > added to the device since v1 (zone append, compare, copy), and they all > require various special handling for protection information. If > potential reviewers would like it split up into multiple patches, each > adding pi support to one command, shout out. > > The core of the series (metadata and eedp) is preceeded by a set of > patches that refactors mapping (yes, again) and tries to deal with the > qsg/iov duality mess (maybe also again?). > > Support fro metadata and end-to-end data protection is all joint work > with Gollu Appalanaidu. Looks fine. Reviewed-by: Keith Busch
Re: [PATCH RFC v3 12/12] hw/block/nvme: add support for the format nvm command
On Wed, Feb 17, 2021 at 09:26:37AM +0100, Klaus Jensen wrote: > On Feb 16 15:16, Keith Busch wrote: > > On Mon, Feb 15, 2021 at 12:02:40AM +0100, Klaus Jensen wrote: > > > From: Minwoo Im > > > > > > Format NVM admin command can make a namespace or namespaces to be > > > with different LBA size and metadata size with protection information > > > types. > > > > > > This patch introduces Format NVM command with LBA format, Metadata, and > > > Protection Information for the device. The secure erase operation things > > > are yet to be added. > > > > > > The parameter checks inside of this patch has been referred from > > > Keith's old branch. > > > > Oh, and here's the format command now, so my previous comment on patch > > 11 doesn't matter. > > > > > +struct nvme_aio_format_ctx { > > > +NvmeRequest *req; > > > +NvmeNamespace *ns; > > > + > > > +/* number of outstanding write zeroes for this namespace */ > > > +int *count; > > > > Shouldn't this count be the NvmeRequest's opaque value? > > That is already occupied by `num_formats` which tracks formats of > individual namespaces. `count` is for outstanding write zeroes on one > particular namespace. But why are they tracked separately? It looks like we only care about the number of outstanding zero-out commands. It doesn't matter how that number is spread across multiple namespaces.
Re: [PATCH v2 5/5] hw/block/nvme: report non-mdts command size limit for dsm
On Mon, Mar 01, 2021 at 12:15:26PM +0100, Klaus Jensen wrote: > On Feb 22 22:12, Klaus Jensen wrote: > > On Feb 23 05:55, Keith Busch wrote: > > > On Mon, Feb 22, 2021 at 07:47:59PM +0100, Klaus Jensen wrote: > > > > +typedef struct NvmeIdCtrlNvm { > > > > +uint8_t vsl; > > > > +uint8_t wzsl; > > > > +uint8_t wusl; > > > > +uint8_t dmrl; > > > > +uint32_tdmrsl; > > > > +uint64_tdmsl; > > > > +uint8_t rsvd16[4080]; > > > > +} NvmeIdCtrlNvm; > > > > > > TP 4040a still displays these fields with preceding '...' indicating > > > something comes before this. Is that just left-over from the integration > > > for TBD offsets, or is there something that still hasn't been accounted > > > for? > > > > Good question. > > > > But since the TBDs have been assigned I believe it is just a left-over. > > I must admit that I have not cross checked this with all other TPs, but > > AFAIK this is the only ratified TP that adds something to the > > NVM-specific identify controller data structure. > > Are you otherwise OK with this? Yes, I compared other TP's and it appears to be set for good once an actual numeric value is assigned, so okay to go here. Reviewed-by: Keith Busch
Re: [PATCH V2 6/7] hw/block/nvme: support namespace attachment command
On Thu, Feb 11, 2021 at 01:09:36AM +0900, Minwoo Im wrote: > @@ -183,6 +183,7 @@ static const uint32_t nvme_cse_acs[256] = { > [NVME_ADM_CMD_SET_FEATURES] = NVME_CMD_EFF_CSUPP, > [NVME_ADM_CMD_GET_FEATURES] = NVME_CMD_EFF_CSUPP, > [NVME_ADM_CMD_ASYNC_EV_REQ] = NVME_CMD_EFF_CSUPP, > +[NVME_ADM_CMD_NS_ATTACHMENT]= NVME_CMD_EFF_CSUPP, Missing NVME_CMD_EFF_NIC for the attachment command. > }; > > static const uint32_t nvme_cse_iocs_none[256]; > @@ -3766,6 +3767,62 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req) > return NVME_NO_COMPLETE; > } > > +static void __nvme_select_ns_iocs(NvmeCtrl *n, NvmeNamespace *ns); > +static uint16_t nvme_ns_attachment(NvmeCtrl *n, NvmeRequest *req) > +{ > +NvmeNamespace *ns; > +NvmeCtrl *ctrl; > +uint16_t list[NVME_CONTROLLER_LIST_SIZE] = {}; > +uint32_t nsid = le32_to_cpu(req->cmd.nsid); > +uint32_t dw10 = le32_to_cpu(req->cmd.cdw10); > +bool attach = !(dw10 & 0xf); > +uint16_t *nr_ids = &list[0]; > +uint16_t *ids = &list[1]; > +uint16_t ret; > +int i; > + > +trace_pci_nvme_ns_attachment(nvme_cid(req), dw10 & 0xf); > + > +ns = nvme_subsys_ns(n->subsys, nsid); > +if (!ns) { > +return NVME_INVALID_FIELD | NVME_DNR; > +} > + > +ret = nvme_dma(n, (uint8_t *)list, 4096, > + DMA_DIRECTION_TO_DEVICE, req); > +if (ret) { > +return ret; > +} > + > +if (!*nr_ids) { > +return NVME_NS_CTRL_LIST_INVALID | NVME_DNR; > +} > + > +for (i = 0; i < *nr_ids; i++) { > +ctrl = nvme_subsys_ctrl(n->subsys, ids[i]); > +if (!ctrl) { > +return NVME_NS_CTRL_LIST_INVALID | NVME_DNR; > +} > + > +if (attach) { > +if (nvme_ns_is_attached(ctrl, ns)) { > +return NVME_NS_ALREADY_ATTACHED | NVME_DNR; > +} > + > +nvme_ns_attach(ctrl, ns); > +__nvme_select_ns_iocs(ctrl, ns); > +} else { > +if (!nvme_ns_is_attached(ctrl, ns)) { > +return NVME_NS_NOT_ATTACHED | NVME_DNR; > +} > + > +nvme_ns_detach(ctrl, ns); > +} > +} > + > +return NVME_SUCCESS; > +} Every controller that has newly attached the namespace needs to emit the Namespace Notify AER in order for the host 
to react correctly to the command.
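The review point can be modeled abstractly (a sketch under stated assumptions: notify_ns_changed() stands in for QEMU's real AER plumbing, and the names below are illustrative, not the upstream helpers): a successful attach must notify every controller in the supplied list, not just return success.

```c
#include <assert.h>
#include <stdint.h>

static int events_sent;

/* Placeholder for queueing a Namespace Attribute Changed notice AER
 * on one controller; the real code would go through the controller's
 * async-event machinery. */
static void notify_ns_changed(uint16_t ctrl_id)
{
    (void)ctrl_id;
    events_sent++;
}

/* Attach a namespace to each listed controller, emitting one
 * notification per affected controller. */
static void ns_attach_all(const uint16_t *ids, uint16_t nr_ids)
{
    for (uint16_t i = 0; i < nr_ids; i++) {
        /* ... attach the namespace to controller ids[i] ... */
        notify_ns_changed(ids[i]);
    }
}
```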
Re: [PATCH 0/3] hw/block/nvme: mdts/zasl cleanup
These look good. Reviewed-by: Keith Busch