Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
>
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
>
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled

I finally got a riscv setup recreating this, and I am not sure how you're getting a "completion polled" here. I find that the polling from here progresses the request state to IDLE, so the timeout handler moves on to aborting because it's expecting COMPLETED. The driver should not be doing that, so I'll fix that up while also investigating why the interrupt isn't retriggering as expected.
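The race described above can be made concrete with a tiny illustrative model (hypothetical names, loosely modeled on the blk-mq request states, not the actual driver code): polling from the timeout handler completes the request and immediately recycles it to IDLE, so a handler that only treats an exact COMPLETED state as "already done" wrongly proceeds to abort.

```c
#include <assert.h>

/* Simplified request lifecycle, loosely modeled on blk-mq states. */
enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

/* Polling the completion queue from the timeout handler completes the
 * request and immediately recycles it back to IDLE. */
static enum rq_state poll_cq(enum rq_state s)
{
    return (s == RQ_IN_FLIGHT) ? RQ_IDLE : s;
}

/* Buggy check: only an exact COMPLETE counts as "done", so a request
 * that polling already recycled to IDLE still looks abortable. */
static int buggy_should_abort(enum rq_state s)
{
    return s != RQ_COMPLETE;
}

/* Fixed check: anything that is no longer in flight has been handled
 * one way or another and must not be aborted. */
static int fixed_should_abort(enum rq_state s)
{
    return s == RQ_IN_FLIGHT;
}
```

With this model, a request completed via polling trips the buggy check but not the fixed one, which matches the behavior Keith describes.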
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Tue, Jan 17, 2023 at 04:18:14PM +, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 16:10, Guenter Roeck wrote:
> >
> > On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > > to mask interrupts on the controller while processing CQEs. While not
> > > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > > reduce the risk of spurious and/or missed interrupts.
> > >
> > > That's assuming completions are deferred to a bottom half. We don't do
> > > that by default in Linux nvme, though you can ask the driver to do that
> > > if you want.
> > >
> > > > With the patch below, running 100 boot iterations, no timeouts were
> > > > observed on QEMU emulated riscv64 or mips64.
> > > >
> > > > No changes are required in the QEMU hw/nvme interrupt logic.
> > >
> > > Yeah, I can see why: it forces the irq line to deassert then assert,
> > > just like we had forced to happen within the device side patches. Still,
> > > none of that is supposed to be necessary, but this idea of using these
> > > registers is probably fine.
> >
> > There is still no answer why this would be necessary in the first place,
> > on either side. In my opinion, unless someone can confirm that the problem
> > is seen with real hardware, we should assume that it happens on the qemu
> > side and address it there.
>
> Sure, but that means identifying what the divergence
> between QEMU's implementation and the hardware is first. I don't
> want a fudged fix in QEMU's code any more than you want one in
> the kernel's driver code :-)

I actually prefer it in qemu because that means I can test nvme support on all active LTS releases of the Linux kernel, but that is POV and secondary.

This has been broken ever since I started testing nvme support with qemu, so it doesn't make much of a difference if fixing the problem for good takes a bit longer. Plus, I run my own version of qemu anyway, so carrying the fix (hack) in qemu doesn't make much of a difference for me.

Anyway - any idea what to do to help figuring out what is happening? Add tracing support to pci interrupt handling, maybe?

Guenter
Re: [PATCH] blkdebug: ignore invalid rules in non-coroutine context
Am 17.01.2023 um 16:44 hat Kevin Wolf geschrieben:
> Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> > blkdebug events can be called from either non-coroutine or coroutine
> > contexts. However, suspend actions only make sense from within
> > a coroutine. Currently, using those actions would lead to an abort() in
> > qemu_coroutine_yield() ("Co-routine is yielding to no one"). Catch them
> > and print an error instead.
> >
> > Signed-off-by: Paolo Bonzini
> > ---
> >  block.c                          |  2 +-
> >  block/blkdebug.c                 | 10 ++++++++--
> >  include/block/block-io.h         |  2 +-
> >  include/block/block_int-common.h |  3 ++-
> >  4 files changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/block.c b/block.c
> > index 3f2bd128570e..49c66475c73e 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -6334,7 +6334,7 @@ BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
> >      return drv->bdrv_get_specific_stats(bs);
> >  }
> >
> > -void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> > +void coroutine_mixed_fn bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> >  {
> >      IO_CODE();
> >      if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
> > diff --git a/block/blkdebug.c b/block/blkdebug.c
> > index 4265ca125e25..ce297961b7db 100644
> > --- a/block/blkdebug.c
> > +++ b/block/blkdebug.c
> > @@ -31,6 +31,7 @@
> >  #include "block/qdict.h"
> >  #include "qemu/module.h"
> >  #include "qemu/option.h"
> > +#include "qemu/error-report.h"
> >  #include "qapi/qapi-visit-block-core.h"
> >  #include "qapi/qmp/qdict.h"
> >  #include "qapi/qmp/qlist.h"
> > @@ -837,7 +838,7 @@ static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
> >      }
> >  }
> >
> > -static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> > +static void coroutine_mixed_fn blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> >  {
> >      BDRVBlkdebugState *s = bs->opaque;
> >      struct BlkdebugRule *rule, *next;
> > @@ -855,7 +856,12 @@ static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> >      }
> >
> >      while (actions_count[ACTION_SUSPEND] > 0) {
> > -        qemu_coroutine_yield();
> > +        if (qemu_in_coroutine()) {
> > +            qemu_coroutine_yield();
> > +        } else {
> > +            error_report("Non-coroutine event %s cannot suspend\n",
> > +                         BlkdebugEvent_lookup.array[event]);
>
> error_report() already adds a newline, so we shouldn't have an "\n"
> here.
>
> > +        }
> >          actions_count[ACTION_SUSPEND]--;
> >      }
> >  }
>
> Thanks, fixed this up and applied to the block branch.

In fact, this conflicts with a patch in my series:

[PATCH v2 13/14] block: Convert bdrv_debug_event() to co_wrapper_mixed

Resolving the conflict essentially reverts this one because after that patch it actually is a coroutine_fn. So I may just drop this one again.

Kevin
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Tue, 17 Jan 2023 at 16:10, Guenter Roeck wrote:
>
> On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > to mask interrupts on the controller while processing CQEs. While not
> > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > reduce the risk of spurious and/or missed interrupts.
> >
> > That's assuming completions are deferred to a bottom half. We don't do
> > that by default in Linux nvme, though you can ask the driver to do that
> > if you want.
> >
> > > With the patch below, running 100 boot iterations, no timeouts were
> > > observed on QEMU emulated riscv64 or mips64.
> > >
> > > No changes are required in the QEMU hw/nvme interrupt logic.
> >
> > Yeah, I can see why: it forces the irq line to deassert then assert,
> > just like we had forced to happen within the device side patches. Still,
> > none of that is supposed to be necessary, but this idea of using these
> > registers is probably fine.
>
> There is still no answer why this would be necessary in the first place,
> on either side. In my opinion, unless someone can confirm that the problem
> is seen with real hardware, we should assume that it happens on the qemu
> side and address it there.

Sure, but that means identifying what the divergence between QEMU's implementation and the hardware is first. I don't want a fudged fix in QEMU's code any more than you want one in the kernel's driver code :-)

thanks
-- PMM
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > to mask interrupts on the controller while processing CQEs. While not
> > required by the spec, it is *recommended* in setups not using MSI-X to
> > reduce the risk of spurious and/or missed interrupts.
>
> That's assuming completions are deferred to a bottom half. We don't do
> that by default in Linux nvme, though you can ask the driver to do that
> if you want.
>
> > With the patch below, running 100 boot iterations, no timeouts were
> > observed on QEMU emulated riscv64 or mips64.
> >
> > No changes are required in the QEMU hw/nvme interrupt logic.
>
> Yeah, I can see why: it forces the irq line to deassert then assert,
> just like we had forced to happen within the device side patches. Still,
> none of that is supposed to be necessary, but this idea of using these
> registers is probably fine.

There is still no answer why this would be necessary in the first place, on either side. In my opinion, unless someone can confirm that the problem is seen with real hardware, we should assume that it happens on the qemu side and address it there.

Guenter

> >  static irqreturn_t nvme_irq(int irq, void *data)
> > +{
> > +	struct nvme_queue *nvmeq = data;
> > +	struct nvme_dev *dev = nvmeq->dev;
> > +	u32 mask = 1 << nvmeq->cq_vector;
> > +	irqreturn_t ret = IRQ_NONE;
> > +	DEFINE_IO_COMP_BATCH(iob);
> > +
> > +	writel(mask, dev->bar + NVME_REG_INTMS);
> > +
> > +	if (nvme_poll_cq(nvmeq, &iob)) {
> > +		if (!rq_list_empty(iob.req_list))
> > +			nvme_pci_complete_batch(&iob);
> > +		ret = IRQ_HANDLED;
> > +	}
> > +
> > +	writel(mask, dev->bar + NVME_REG_INTMC);
> > +
> > +	return ret;
> > +}
>
> If threaded interrupts are used, you'll want to do the masking in
> nvme_irq_check(), then clear it in the threaded handler instead of doing
> both in the same callback.
>
> > +static irqreturn_t nvme_irq_msix(int irq, void *data)
> >  {
> >  	struct nvme_queue *nvmeq = data;
> >  	DEFINE_IO_COMP_BATCH(iob);
> > @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
> >  {
> >  	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> >  	int nr = nvmeq->dev->ctrl.instance;
> > +	irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> >
> >  	if (use_threaded_interrupts) {
> >  		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> > -				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> > +				handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> >  	} else {
> > -		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> > +		return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> >  				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> >  	}
> >  }
Re: completion timeouts with pin-based interrupts in QEMU hw/nvme
On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
[ ... ]
>
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.
>
> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
>
> No changes are required in the QEMU hw/nvme interrupt logic.
>

Yes, but isn't that technically similar to dropping the interrupt request and raising it again, or in other words pulsing the interrupt? I still don't understand why the (level triggered) interrupt pin would require pulsing in the first place.

Thanks,
Guenter

> diff --git i/drivers/nvme/host/pci.c w/drivers/nvme/host/pci.c
> index b13baccedb4a..75f6b87c4c3f 100644
> --- i/drivers/nvme/host/pci.c
> +++ w/drivers/nvme/host/pci.c
> @@ -1128,6 +1128,27 @@ static inline int nvme_poll_cq(struct nvme_queue *nvmeq,
>  }
>
>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +	struct nvme_queue *nvmeq = data;
> +	struct nvme_dev *dev = nvmeq->dev;
> +	u32 mask = 1 << nvmeq->cq_vector;
> +	irqreturn_t ret = IRQ_NONE;
> +	DEFINE_IO_COMP_BATCH(iob);
> +
> +	writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +	if (nvme_poll_cq(nvmeq, &iob)) {
> +		if (!rq_list_empty(iob.req_list))
> +			nvme_pci_complete_batch(&iob);
> +		ret = IRQ_HANDLED;
> +	}
> +
> +	writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +	return ret;
> +}
> +
> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
>  	struct nvme_queue *nvmeq = data;
>  	DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
>  	struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
>  	int nr = nvmeq->dev->ctrl.instance;
> +	irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
>
>  	if (use_threaded_interrupts) {
>  		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -				nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +				handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>  	} else {
> -		return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +		return pci_request_irq(pdev, nvmeq->cq_vector, handler,
>  				NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
>  	}
>  }
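Why would INTMS/INTMC help on a level-triggered line? A toy model (purely illustrative, not the QEMU or kernel code) treats the pin level as pending && !masked: masking at irq entry forces a deassert, and unmasking at exit re-asserts the line if anything is still pending, so a fresh edge is generated even if the emulated device never toggled the pin itself.

```c
#include <assert.h>

/* Toy model of a level-triggered interrupt pin: the line is high
 * whenever a completion is pending and the vector is not masked. */
struct pin {
    int pending; /* device has unserviced completions */
    int masked;  /* bit set via the (modeled) INTMS register */
};

static int level(const struct pin *p)
{
    return p->pending && !p->masked;
}

/* Masked handler: INTMS on entry, INTMC on exit. Returns the line
 * level after unmask, i.e. whether a new edge was generated. */
static int handle_masked(struct pin *p, int completion_arrives_midway)
{
    p->masked = 1;              /* write INTMS: line deasserts */
    assert(level(p) == 0);
    p->pending = 0;             /* reap all currently visible CQEs */
    if (completion_arrives_midway)
        p->pending = 1;         /* new CQE posted while reaping */
    p->masked = 0;              /* write INTMC: line re-asserts if pending */
    return level(p);
}

/* Convenience wrapper: start from an asserted line, run the handler. */
static int run(int completion_arrives_midway)
{
    struct pin p = { 1, 0 };
    return handle_masked(&p, completion_arrives_midway);
}
```

In this model, a completion that lands mid-handler leaves the line re-asserted after INTMC, which is the forced deassert/assert sequence Keith refers to.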
Re: [PATCH] block: remove bdrv_coroutine_enter
Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> It has only one caller---inline it and remove the function.
>
> Signed-off-by: Paolo Bonzini

Thanks, applied to the block branch.

Kevin
Re: [PATCH] qemu-io: do not reinvent the blk_pwrite_zeroes wheel
Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> qemu-io's do_co_pwrite_zeroes is reinventing the coroutine wrapper
> blk_pwrite_zeroes. Just use the real thing directly.
>
> Signed-off-by: Paolo Bonzini

Thanks, applied to the block branch.

Kevin
Re: [PATCH] blkdebug: ignore invalid rules in non-coroutine context
Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> blkdebug events can be called from either non-coroutine or coroutine
> contexts. However, suspend actions only make sense from within
> a coroutine. Currently, using those actions would lead to an abort() in
> qemu_coroutine_yield() ("Co-routine is yielding to no one"). Catch them
> and print an error instead.
>
> Signed-off-by: Paolo Bonzini
> ---
>  block.c                          |  2 +-
>  block/blkdebug.c                 | 10 ++++++++--
>  include/block/block-io.h         |  2 +-
>  include/block/block_int-common.h |  3 ++-
>  4 files changed, 12 insertions(+), 5 deletions(-)
>
> diff --git a/block.c b/block.c
> index 3f2bd128570e..49c66475c73e 100644
> --- a/block.c
> +++ b/block.c
> @@ -6334,7 +6334,7 @@ BlockStatsSpecific *bdrv_get_specific_stats(BlockDriverState *bs)
>      return drv->bdrv_get_specific_stats(bs);
>  }
>
> -void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> +void coroutine_mixed_fn bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
>  {
>      IO_CODE();
>      if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index 4265ca125e25..ce297961b7db 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -31,6 +31,7 @@
>  #include "block/qdict.h"
>  #include "qemu/module.h"
>  #include "qemu/option.h"
> +#include "qemu/error-report.h"
>  #include "qapi/qapi-visit-block-core.h"
>  #include "qapi/qmp/qdict.h"
>  #include "qapi/qmp/qlist.h"
> @@ -837,7 +838,7 @@ static void process_rule(BlockDriverState *bs, struct BlkdebugRule *rule,
>      }
>  }
>
> -static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> +static void coroutine_mixed_fn blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
>  {
>      BDRVBlkdebugState *s = bs->opaque;
>      struct BlkdebugRule *rule, *next;
> @@ -855,7 +856,12 @@ static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
>      }
>
>      while (actions_count[ACTION_SUSPEND] > 0) {
> -        qemu_coroutine_yield();
> +        if (qemu_in_coroutine()) {
> +            qemu_coroutine_yield();
> +        } else {
> +            error_report("Non-coroutine event %s cannot suspend\n",
> +                         BlkdebugEvent_lookup.array[event]);

error_report() already adds a newline, so we shouldn't have an "\n" here.

> +        }
>          actions_count[ACTION_SUSPEND]--;
>      }
>  }

Thanks, fixed this up and applied to the block branch.

Kevin
Re: [PATCH 0/2] Make coroutine annotations ready for static analysis
Am 16.12.2022 um 12:07 hat Paolo Bonzini geschrieben:
> Clang has a generic __annotate__ attribute that can be used by
> static analyzers to understand properties of functions and
> analyze the control flow.
>
> Unlike TSA annotations, the __annotate__ attribute applies to function
> pointers as well, which is very fortunate because many BlockDriver
> function pointers run in coroutines.
>
> Paolo
>
> v1->v2: improved comments for patch 2

Thanks, applied to the block branch.

Kevin
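For readers unfamiliar with the attribute: clang's generic annotation attribute attaches a string to a declaration that static analyzers can inspect, and it sticks to function pointers too. A minimal standalone sketch (the macro name mirrors QEMU's coroutine_fn marker, but this is an illustration, not QEMU code; gcc gets an empty fallback so the file still compiles there):

```c
#include <assert.h>

/* Clang-only: emit an "annotate" attribute for static analyzers; it
 * changes nothing at runtime. Other compilers see an empty macro. */
#ifdef __clang__
#define coroutine_fn __attribute__((__annotate__("coroutine_fn")))
#else
#define coroutine_fn
#endif

/* Unlike thread-safety annotations, this also applies to function
 * pointers, e.g. members of an ops table like BlockDriver. */
typedef struct ops {
    int coroutine_fn (*preadv)(int fd, long offset);
} ops;

/* Dummy implementation carrying the annotation. */
static int coroutine_fn my_preadv(int fd, long offset)
{
    return (fd >= 0 && offset >= 0) ? 0 : -1;
}

/* Calling through the annotated pointer behaves like a plain call. */
static int call_through_pointer(void)
{
    ops o = { my_preadv };
    return o.preadv(1, 4096);
}
```

An analyzer that understands the annotation can then check control-flow properties (e.g. "only annotated functions may yield") across indirect calls, which TSA-style attributes cannot express.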
Re: [PATCH 0/4] qemu-img: Fix exit code for errors closing the image
On 12.01.23 20:14, Kevin Wolf wrote:
> This series addresses the problem described in these bug reports:
> https://gitlab.com/qemu-project/qemu/-/issues/1330
> https://bugzilla.redhat.com/show_bug.cgi?id=2147617
>
> qcow2 can fail when writing back dirty bitmaps in qcow2_inactivate().
> However, when the function is called through blk_unref(), in the case of
> such errors, while an error message is written to stderr, the callers
> never see an error return. Specifically, 'qemu-img bitmap/commit' are
> reported to exit with an exit code 0 despite the errors.
>
> The solution taken here is inactivating the images first, which can
> still return errors, but already performs all of the write operations.
> Only then the images are actually blk_unref()-ed.

Reviewed-by: Hanna Czenczek
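The shape of the fix can be sketched abstractly (hypothetical helper names, not the actual qemu-img code): do the failure-prone writeback while a return value can still be observed, and only then drop the last reference, whose cleanup path returns void and so cannot report anything.

```c
#include <assert.h>

/* Hypothetical model of the qemu-img shutdown path. */
static int inactivate(int bitmap_write_fails)
{
    /* Writes back dirty bitmaps etc.; this step can fail. */
    return bitmap_write_fails ? -1 : 0;
}

static void unref(void)
{
    /* Dropping the last reference returns void: an error raised in
     * here would be printed but invisible to the caller. */
}

/* Fixed ordering: inactivate first so the error reaches the exit
 * code, then unref (which no longer has writeback work left to do). */
static int img_exit_code(int bitmap_write_fails)
{
    int ret = inactivate(bitmap_write_fails);
    unref();
    return ret < 0 ? 1 : 0;
}
```

With the old ordering, the failing writeback happened inside the void cleanup and the tool exited 0; with the fix, a bitmap write failure turns into a nonzero exit code.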