Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Keith Busch
On Thu, Jan 12, 2023 at 02:10:51PM +0100, Klaus Jensen wrote:
> Hi all (linux-nvme, qemu-devel, maintainers),
> 
> On QEMU riscv64, which does not use MSI/MSI-X and thus relies on
> pin-based interrupts, I'm seeing occasional completion timeouts, i.e.
> 
>   nvme nvme0: I/O 333 QID 1 timeout, completion polled

I finally got a riscv setup recreating this, and I am not sure how
you're getting a "completion polled" here. I find that the polling from
here progresses the request state to IDLE, so the timeout handler moves
on to aborting because it's expecting COMPLETED. The driver should not
be doing that, so I'll fix that up while also investigating why the
interrupt isn't retriggering as expected.



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Guenter Roeck
On Tue, Jan 17, 2023 at 04:18:14PM +, Peter Maydell wrote:
> On Tue, 17 Jan 2023 at 16:10, Guenter Roeck  wrote:
> >
> > On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > > to mask interrupts on the controller while processing CQEs. While not
> > > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > > reduce the risk of spurious and/or missed interrupts.
> > >
> > > That's assuming completions are deferred to a bottom half. We don't do
> > > that by default in Linux nvme, though you can ask the driver to do that
> > > if you want.
> > >
> > > > With the patch below, running 100 boot iterations, no timeouts were
> > > > observed on QEMU emulated riscv64 or mips64.
> > > >
> > > > No changes are required in the QEMU hw/nvme interrupt logic.
> > >
> > > Yeah, I can see why: it forces the irq line to deassert then assert,
> > > just like we had forced to happen within the device side patches. Still,
> > > none of that is supposed to be necessary, but this idea of using these
> > > registers is probably fine.
> >
> > There is still no answer why this would be necessary in the first place,
> > on either side. In my opinion, unless someone can confirm that the problem
> > is seen with real hardware, we should assume that it happens on the qemu
> > side and address it there.
> 
> Sure, but that means identifying what the divergence
> between QEMU's implementation and the hardware is first. I don't
> want a fudged fix in QEMU's code any more than you want one in
> the kernel's driver code :-)
> 

I actually prefer it in qemu because that means I can test nvme support
on all active LTS releases of the Linux kernel, but that is my point of
view, and secondary. This has been broken ever since I started testing
nvme support with qemu, so it doesn't make much of a difference if fixing
the problem for good takes a bit longer. Plus, I run my own version
of qemu anyway, so carrying the fix (hack) there doesn't make much
of a difference for me.

Anyway - any idea what to do to help figure out what is happening?
Add tracing support to pci interrupt handling, maybe?

Guenter



Re: [PATCH] blkdebug: ignore invalid rules in non-coroutine context

2023-01-17 Thread Kevin Wolf
Am 17.01.2023 um 16:44 hat Kevin Wolf geschrieben:
> Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> > blkdebug events can be called from either non-coroutine or coroutine
> > contexts.  However, suspend actions only make sense from within
> > a coroutine.  Currently, using those actions would lead to an abort() in
> > qemu_coroutine_yield() ("Co-routine is yielding to no one").  Catch them
> > and print an error instead.
> > 
> > Signed-off-by: Paolo Bonzini 
> > ---
> >  block.c  |  2 +-
> >  block/blkdebug.c | 10 --
> >  include/block/block-io.h |  2 +-
> >  include/block/block_int-common.h |  3 ++-
> >  4 files changed, 12 insertions(+), 5 deletions(-)
> > 
> > diff --git a/block.c b/block.c
> > index 3f2bd128570e..49c66475c73e 100644
> > --- a/block.c
> > +++ b/block.c
> > @@ -6334,7 +6334,7 @@ BlockStatsSpecific 
> > *bdrv_get_specific_stats(BlockDriverState *bs)
> >  return drv->bdrv_get_specific_stats(bs);
> >  }
> >  
> > -void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> > +void coroutine_mixed_fn bdrv_debug_event(BlockDriverState *bs, 
> > BlkdebugEvent event)
> >  {
> >  IO_CODE();
> >  if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
> > diff --git a/block/blkdebug.c b/block/blkdebug.c
> > index 4265ca125e25..ce297961b7db 100644
> > --- a/block/blkdebug.c
> > +++ b/block/blkdebug.c
> > @@ -31,6 +31,7 @@
> >  #include "block/qdict.h"
> >  #include "qemu/module.h"
> >  #include "qemu/option.h"
> > +#include "qemu/error-report.h"
> >  #include "qapi/qapi-visit-block-core.h"
> >  #include "qapi/qmp/qdict.h"
> >  #include "qapi/qmp/qlist.h"
> > @@ -837,7 +838,7 @@ static void process_rule(BlockDriverState *bs, struct 
> > BlkdebugRule *rule,
> >  }
> >  }
> >  
> > -static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> > +static void coroutine_mixed_fn blkdebug_debug_event(BlockDriverState *bs, 
> > BlkdebugEvent event)
> >  {
> >  BDRVBlkdebugState *s = bs->opaque;
> >  struct BlkdebugRule *rule, *next;
> > @@ -855,7 +856,12 @@ static void blkdebug_debug_event(BlockDriverState *bs, 
> > BlkdebugEvent event)
> >  }
> >  
> >  while (actions_count[ACTION_SUSPEND] > 0) {
> > -qemu_coroutine_yield();
> > +if (qemu_in_coroutine()) {
> > +qemu_coroutine_yield();
> > +} else {
> > +error_report("Non-coroutine event %s cannot suspend\n",
> > + BlkdebugEvent_lookup.array[event]);
> 
> error_report() already adds a newline, so we shouldn't have an "\n"
> here.
> 
> > +}
> >  actions_count[ACTION_SUSPEND]--;
> >  }
> >  }
> 
> Thanks, fixed this up and applied to the block branch.

In fact, this conflicts with a patch in my series:

[PATCH v2 13/14] block: Convert bdrv_debug_event() to co_wrapper_mixed

Resolving the conflict essentially reverts this one, because after that
patch it actually is a coroutine_fn. So I may just drop this one again.

Kevin




Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Peter Maydell
On Tue, 17 Jan 2023 at 16:10, Guenter Roeck  wrote:
>
> On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> > On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > > to mask interrupts on the controller while processing CQEs. While not
> > > required by the spec, it is *recommended* in setups not using MSI-X to
> > > reduce the risk of spurious and/or missed interrupts.
> >
> > That's assuming completions are deferred to a bottom half. We don't do
> > that by default in Linux nvme, though you can ask the driver to do that
> > if you want.
> >
> > > With the patch below, running 100 boot iterations, no timeouts were
> > > observed on QEMU emulated riscv64 or mips64.
> > >
> > > No changes are required in the QEMU hw/nvme interrupt logic.
> >
> > Yeah, I can see why: it forces the irq line to deassert then assert,
> > just like we had forced to happen within the device side patches. Still,
> > none of that is supposed to be necessary, but this idea of using these
> > registers is probably fine.
>
> There is still no answer why this would be necessary in the first place,
> on either side. In my opinion, unless someone can confirm that the problem
> is seen with real hardware, we should assume that it happens on the qemu
> side and address it there.

Sure, but that means identifying what the divergence
between QEMU's implementation and the hardware is first. I don't
want a fudged fix in QEMU's code any more than you want one in
the kernel's driver code :-)

thanks
-- PMM



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Guenter Roeck
On Mon, Jan 16, 2023 at 09:58:13PM -0700, Keith Busch wrote:
> On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
> > I noticed that the Linux driver does not use the INTMS/INTMC registers
> > to mask interrupts on the controller while processing CQEs. While not
> > required by the spec, it is *recommended* in setups not using MSI-X to
> > reduce the risk of spurious and/or missed interrupts.
> 
> That's assuming completions are deferred to a bottom half. We don't do
> that by default in Linux nvme, though you can ask the driver to do that
> if you want.
>  
> > With the patch below, running 100 boot iterations, no timeouts were
> > observed on QEMU emulated riscv64 or mips64.
> >
> > No changes are required in the QEMU hw/nvme interrupt logic.
> 
> Yeah, I can see why: it forces the irq line to deassert then assert,
> just like we had forced to happen within the device side patches. Still,
> none of that is supposed to be necessary, but this idea of using these
> registers is probably fine.

There is still no answer why this would be necessary in the first place,
on either side. In my opinion, unless someone can confirm that the problem
is seen with real hardware, we should assume that it happens on the qemu
side and address it there.

Guenter

> 
> >  static irqreturn_t nvme_irq(int irq, void *data)
> > +{
> > +   struct nvme_queue *nvmeq = data;
> > +   struct nvme_dev *dev = nvmeq->dev;
> > +   u32 mask = 1 << nvmeq->cq_vector;
> > +   irqreturn_t ret = IRQ_NONE;
> > +   DEFINE_IO_COMP_BATCH(iob);
> > +
> > +   writel(mask, dev->bar + NVME_REG_INTMS);
> > +
> > +   if (nvme_poll_cq(nvmeq, &iob)) {
> > +   if (!rq_list_empty(iob.req_list))
> > +   nvme_pci_complete_batch(&iob);
> > +   ret = IRQ_HANDLED;
> > +   }
> > +
> > +   writel(mask, dev->bar + NVME_REG_INTMC);
> > +
> > +   return ret;
> > +}
> 
> If threaded interrupts are used, you'll want to do the masking in
> nvme_irq_check(), then clear it in the threaded handler instead of doing
> both in the same callback.
> 
> > +static irqreturn_t nvme_irq_msix(int irq, void *data)
> >  {
> > struct nvme_queue *nvmeq = data;
> > DEFINE_IO_COMP_BATCH(iob);
> > @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue 
> > *nvmeq)
> >  {
> > struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> > int nr = nvmeq->dev->ctrl.instance;
> > +   irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : 
> > nvme_irq;
> > 
> > if (use_threaded_interrupts) {
> > return pci_request_irq(pdev, nvmeq->cq_vector, 
> > nvme_irq_check,
> > -   nvme_irq, nvmeq, "nvme%dq%d", nr, 
> > nvmeq->qid);
> > +   handler, nvmeq, "nvme%dq%d", nr, 
> > nvmeq->qid);
> > } else {
> > -   return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> > +   return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> > NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> > }
> >  }
> > 
> > 
> 
> 



Re: completion timeouts with pin-based interrupts in QEMU hw/nvme

2023-01-17 Thread Guenter Roeck
On Mon, Jan 16, 2023 at 10:14:07PM +0100, Klaus Jensen wrote:
[ ... ]
> 
> I noticed that the Linux driver does not use the INTMS/INTMC registers
> to mask interrupts on the controller while processing CQEs. While not
> required by the spec, it is *recommended* in setups not using MSI-X to
> reduce the risk of spurious and/or missed interrupts.
> 
> With the patch below, running 100 boot iterations, no timeouts were
> observed on QEMU emulated riscv64 or mips64.
> 
> No changes are required in the QEMU hw/nvme interrupt logic.
> 

Yes, but isn't that technically similar to dropping the
interrupt request and raising it again, or in other words
pulsing the interrupt?

I still don't understand why the (level triggered) interrupt
pin would require pulsing in the first place.

Thanks,
Guenter

> 
> diff --git i/drivers/nvme/host/pci.c w/drivers/nvme/host/pci.c
> index b13baccedb4a..75f6b87c4c3f 100644
> --- i/drivers/nvme/host/pci.c
> +++ w/drivers/nvme/host/pci.c
> @@ -1128,6 +1128,27 @@ static inline int nvme_poll_cq(struct nvme_queue 
> *nvmeq,
>  }
> 
>  static irqreturn_t nvme_irq(int irq, void *data)
> +{
> +   struct nvme_queue *nvmeq = data;
> +   struct nvme_dev *dev = nvmeq->dev;
> +   u32 mask = 1 << nvmeq->cq_vector;
> +   irqreturn_t ret = IRQ_NONE;
> +   DEFINE_IO_COMP_BATCH(iob);
> +
> +   writel(mask, dev->bar + NVME_REG_INTMS);
> +
> +   if (nvme_poll_cq(nvmeq, &iob)) {
> +   if (!rq_list_empty(iob.req_list))
> +   nvme_pci_complete_batch(&iob);
> +   ret = IRQ_HANDLED;
> +   }
> +
> +   writel(mask, dev->bar + NVME_REG_INTMC);
> +
> +   return ret;
> +}
> +
> +static irqreturn_t nvme_irq_msix(int irq, void *data)
>  {
> struct nvme_queue *nvmeq = data;
> DEFINE_IO_COMP_BATCH(iob);
> @@ -1602,12 +1623,13 @@ static int queue_request_irq(struct nvme_queue *nvmeq)
>  {
> struct pci_dev *pdev = to_pci_dev(nvmeq->dev->dev);
> int nr = nvmeq->dev->ctrl.instance;
> +   irq_handler_t handler = pdev->msix_enabled ? nvme_irq_msix : nvme_irq;
> 
> if (use_threaded_interrupts) {
> return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq_check,
> -   nvme_irq, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> +   handler, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> } else {
> -   return pci_request_irq(pdev, nvmeq->cq_vector, nvme_irq,
> +   return pci_request_irq(pdev, nvmeq->cq_vector, handler,
> NULL, nvmeq, "nvme%dq%d", nr, nvmeq->qid);
> }
>  }
> 
> 





Re: [PATCH] block: remove bdrv_coroutine_enter

2023-01-17 Thread Kevin Wolf
Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> It has only one caller---inline it and remove the function.
> 
> Signed-off-by: Paolo Bonzini 

Thanks, applied to the block branch.

Kevin




Re: [PATCH] qemu-io: do not reinvent the blk_pwrite_zeroes wheel

2023-01-17 Thread Kevin Wolf
Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> qemu-io's do_co_pwrite_zeroes is reinventing the coroutine wrapper
> blk_pwrite_zeroes.  Just use the real thing directly.
> 
> Signed-off-by: Paolo Bonzini 

Thanks, applied to the block branch.

Kevin




Re: [PATCH] blkdebug: ignore invalid rules in non-coroutine context

2023-01-17 Thread Kevin Wolf
Am 15.12.2022 um 14:02 hat Paolo Bonzini geschrieben:
> blkdebug events can be called from either non-coroutine or coroutine
> contexts.  However, suspend actions only make sense from within
> a coroutine.  Currently, using those actions would lead to an abort() in
> qemu_coroutine_yield() ("Co-routine is yielding to no one").  Catch them
> and print an error instead.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  block.c  |  2 +-
>  block/blkdebug.c | 10 --
>  include/block/block-io.h |  2 +-
>  include/block/block_int-common.h |  3 ++-
>  4 files changed, 12 insertions(+), 5 deletions(-)
> 
> diff --git a/block.c b/block.c
> index 3f2bd128570e..49c66475c73e 100644
> --- a/block.c
> +++ b/block.c
> @@ -6334,7 +6334,7 @@ BlockStatsSpecific 
> *bdrv_get_specific_stats(BlockDriverState *bs)
>  return drv->bdrv_get_specific_stats(bs);
>  }
>  
> -void bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> +void coroutine_mixed_fn bdrv_debug_event(BlockDriverState *bs, BlkdebugEvent 
> event)
>  {
>  IO_CODE();
>  if (!bs || !bs->drv || !bs->drv->bdrv_debug_event) {
> diff --git a/block/blkdebug.c b/block/blkdebug.c
> index 4265ca125e25..ce297961b7db 100644
> --- a/block/blkdebug.c
> +++ b/block/blkdebug.c
> @@ -31,6 +31,7 @@
>  #include "block/qdict.h"
>  #include "qemu/module.h"
>  #include "qemu/option.h"
> +#include "qemu/error-report.h"
>  #include "qapi/qapi-visit-block-core.h"
>  #include "qapi/qmp/qdict.h"
>  #include "qapi/qmp/qlist.h"
> @@ -837,7 +838,7 @@ static void process_rule(BlockDriverState *bs, struct 
> BlkdebugRule *rule,
>  }
>  }
>  
> -static void blkdebug_debug_event(BlockDriverState *bs, BlkdebugEvent event)
> +static void coroutine_mixed_fn blkdebug_debug_event(BlockDriverState *bs, 
> BlkdebugEvent event)
>  {
>  BDRVBlkdebugState *s = bs->opaque;
>  struct BlkdebugRule *rule, *next;
> @@ -855,7 +856,12 @@ static void blkdebug_debug_event(BlockDriverState *bs, 
> BlkdebugEvent event)
>  }
>  
>  while (actions_count[ACTION_SUSPEND] > 0) {
> -qemu_coroutine_yield();
> +if (qemu_in_coroutine()) {
> +qemu_coroutine_yield();
> +} else {
> +error_report("Non-coroutine event %s cannot suspend\n",
> + BlkdebugEvent_lookup.array[event]);

error_report() already adds a newline, so we shouldn't have an "\n"
here.

> +}
>  actions_count[ACTION_SUSPEND]--;
>  }
>  }

Thanks, fixed this up and applied to the block branch.

Kevin




Re: [PATCH 0/2] Make coroutine annotations ready for static analysis

2023-01-17 Thread Kevin Wolf
Am 16.12.2022 um 12:07 hat Paolo Bonzini geschrieben:
> Clang has a generic __annotate__ attribute that can be used by
> static analyzers to understand properties of functions and
> analyze the control flow.
> 
> Unlike TSA annotations, the __annotate__ attribute applies to function
> pointers as well, which is very fortunate because many BlockDriver
> function driver run in coroutines.
> 
> Paolo
> 
> v1->v2: improved comments for patch 2

Thanks, applied to the block branch.

Kevin




Re: [PATCH 0/4] qemu-img: Fix exit code for errors closing the image

2023-01-17 Thread Hanna Czenczek

On 12.01.23 20:14, Kevin Wolf wrote:

This series addresses the problem described in these bug reports:
https://gitlab.com/qemu-project/qemu/-/issues/1330
https://bugzilla.redhat.com/show_bug.cgi?id=2147617

qcow2 can fail when writing back dirty bitmaps in qcow2_inactivate().
However, when the function is called through blk_unref() and such an
error occurs, an error message is written to stderr, but the callers
never see an error return. Specifically, 'qemu-img bitmap/commit' are
reported to exit with exit code 0 despite the errors.

The solution taken here is to inactivate the images first, which can
still return errors, but already performs all of the write operations.
Only then are the images actually blk_unref()-ed.


Reviewed-by: Hanna Czenczek