Re: [PATCH 3/6] scsi-disk: add MODE_PAGE_APPLE quirk for Macintosh

2022-04-21 Thread Fam Zheng
On 2022-04-21 07:51, Mark Cave-Ayland wrote:
> One of the mechanisms MacOS uses to identify drives compatible with MacOS is 
> to
> send a custom MODE SELECT command for page 0x30 to the drive. The response to
> this is a hard-coded manufacturer string which must match in order for the
> drive to be usable within MacOS.
> 
> Add an implementation of the MODE SELECT page 0x30 response guarded by a newly
> defined SCSI_DISK_QUIRK_MODE_PAGE_APPLE quirk bit so that drives attached
> to non-Apple machines function exactly as before.
> 
> Signed-off-by: Mark Cave-Ayland 
> ---
>  hw/scsi/scsi-disk.c  | 19 +++
>  include/hw/scsi/scsi.h   |  3 +++
>  include/scsi/constants.h |  1 +
>  3 files changed, 23 insertions(+)
> 
> diff --git a/hw/scsi/scsi-disk.c b/hw/scsi/scsi-disk.c
> index d89cdd4e4a..37013756d5 100644
> --- a/hw/scsi/scsi-disk.c
> +++ b/hw/scsi/scsi-disk.c
> @@ -1085,6 +1085,7 @@ static int mode_sense_page(SCSIDiskState *s, int page, 
> uint8_t **p_outbuf,
>  [MODE_PAGE_R_W_ERROR]  = (1 << TYPE_DISK) | (1 << 
> TYPE_ROM),
>  [MODE_PAGE_AUDIO_CTL]  = (1 << TYPE_ROM),
>  [MODE_PAGE_CAPABILITIES]   = (1 << TYPE_ROM),
> +[MODE_PAGE_APPLE]  = (1 << TYPE_ROM),
>  };
>  
>  uint8_t *p = *p_outbuf + 2;
> @@ -1229,6 +1230,22 @@ static int mode_sense_page(SCSIDiskState *s, int page, 
> uint8_t **p_outbuf,
>  p[19] = (16 * 176) & 0xff;
>  break;
>  
> + case MODE_PAGE_APPLE:
> +if (s->qdev.type == TYPE_DISK &&
> +(s->quirks & SCSI_DISK_QUIRK_MODE_PAGE_APPLE)) {

This is always false: SCSI_DISK_QUIRK_MODE_PAGE_APPLE is defined as 0.

You need (1 << SCSI_DISK_QUIRK_MODE_PAGE_APPLE) instead.
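
To make the failure mode concrete, here is a minimal standalone sketch
(the quirk_enabled() helper is illustrative, not part of the patch):

#include <stdbool.h>
#include <stdint.h>

#define SCSI_DISK_QUIRK_MODE_PAGE_APPLE 0   /* a bit *index*, per DEFINE_PROP_BIT */

static inline bool quirk_enabled(uint32_t quirks)
{
    /* (quirks & SCSI_DISK_QUIRK_MODE_PAGE_APPLE) is always 0, because the
     * macro is a bit index, not a mask; shift first to build the mask. */
    return quirks & (1u << SCSI_DISK_QUIRK_MODE_PAGE_APPLE);
}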

> +
> +length = 0x24;
> +if (page_control == 1) { /* Changeable Values */
> +break;
> +}
> +
> +memset(p, 0, length);
> +strcpy((char *)p + 8, "APPLE COMPUTER, INC   ");
> +break;
> +} else {
> +return -1;
> +}
> +
>  default:
>  return -1;
>  }
> @@ -3042,6 +3059,8 @@ static Property scsi_hd_properties[] = {
>  DEFINE_PROP_UINT16("rotation_rate", SCSIDiskState, rotation_rate, 0),
>  DEFINE_PROP_INT32("scsi_version", SCSIDiskState, 
> qdev.default_scsi_version,
>5),
> +DEFINE_PROP_BIT("quirk_mode_page_apple", SCSIDiskState, quirks,
> +SCSI_DISK_QUIRK_MODE_PAGE_APPLE, 0),
>  DEFINE_BLOCK_CHS_PROPERTIES(SCSIDiskState, qdev.conf),
>  DEFINE_PROP_END_OF_LIST(),
>  };
> diff --git a/include/hw/scsi/scsi.h b/include/hw/scsi/scsi.h
> index 1ffb367f94..f629706250 100644
> --- a/include/hw/scsi/scsi.h
> +++ b/include/hw/scsi/scsi.h
> @@ -226,4 +226,7 @@ SCSIDevice *scsi_device_get(SCSIBus *bus, int channel, 
> int target, int lun);
>  /* scsi-generic.c. */
>  extern const SCSIReqOps scsi_generic_req_ops;
>  
> +/* scsi-disk.c */
> +#define SCSI_DISK_QUIRK_MODE_PAGE_APPLE 0
> +
>  #endif
> diff --git a/include/scsi/constants.h b/include/scsi/constants.h
> index 2a32c08b5e..21ca7b50cd 100644
> --- a/include/scsi/constants.h
> +++ b/include/scsi/constants.h
> @@ -234,6 +234,7 @@
>  #define MODE_PAGE_FAULT_FAIL  0x1c
>  #define MODE_PAGE_TO_PROTECT  0x1d
>  #define MODE_PAGE_CAPABILITIES0x2a
> +#define MODE_PAGE_APPLE   0x30
>  #define MODE_PAGE_ALLS0x3f
>  /* Not in Mt. Fuji, but in ATAPI 2.6 -- deprecated now in favor
>   * of MODE_PAGE_SENSE_POWER */
> -- 
> 2.20.1
> 
> 

Fam



Re: [PATCH v2] block/nvme: Fix VFIO_MAP_DMA failed: No space left on device

2021-06-22 Thread Fam Zheng



> On 22 Jun 2021, at 13:42, Philippe Mathieu-Daudé  wrote:
> 
> On 6/22/21 10:06 AM, Philippe Mathieu-Daudé wrote:
>> On 6/22/21 9:29 AM, Philippe Mathieu-Daudé wrote:
>>> On 6/21/21 5:36 PM, Fam Zheng wrote:
>>>>> On 21 Jun 2021, at 16:13, Philippe Mathieu-Daudé  
>>>>> wrote:
>>>>> On 6/21/21 3:18 PM, Fam Zheng wrote:
>>>>>>> On 21 Jun 2021, at 10:32, Philippe Mathieu-Daudé  
>>>>>>> wrote:
>>>>>>> 
>>>>>>> When the NVMe block driver was introduced (see commit bdd6a90a9e5,
>>>>>>> January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning
>>>>>>> -ENOMEM in case of error. The driver was correctly handling the
>>>>>>> error path to recycle its volatile IOVA mappings.
>>>>>>> 
>>>>>>> To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit
>>>>>>> DMA mappings per container", April 2019) added the -ENOSPC error to
>>>>>>> signal the user exhausted the DMA mappings available for a container.
>>>>>>> 
>>>>>>> The block driver started to mis-behave:
>>>>>>> 
>>>>>>> qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>>>>>>> (qemu)
>>>>>>> (qemu) info status
>>>>>>> VM status: paused (io-error)
>>>>>>> (qemu) c
>>>>>>> VFIO_MAP_DMA failed: No space left on device
>>>>>>> qemu-system-x86_64: block/block-backend.c:1968: blk_get_aio_context: 
>>>>>>> Assertion `ctx == blk->ctx' failed.
>>>>>> 
>>>>>> Hi Phil,
>>>>>> 
>>>>>> 
>>>>>> The diff looks good to me, but I’m not sure what exactly caused the 
>>>>>> assertion failure. There is `if (r) { goto fail; }` that handles -ENOSPC 
>>>>>> before, so it should be treated as a general case. What am I missing?
>>>>> 
>>>>> Good catch, ENOSPC ends setting BLOCK_DEVICE_IO_STATUS_NOSPACE
>>>>> -> BLOCK_ERROR_ACTION_STOP, so the VM is paused with DMA mapping
>>>>> exhausted. I don't understand the full "VM resume" path, but this
>>>>> is not what we want (IO_NOSPACE is to warn the operator to add
>>>>> more storage and resume, which is pointless in our case, resuming
>>>>> won't help until we flush the mappings).
>>>>> 
>>>>> IIUC what we want is return ENOMEM to set BLOCK_DEVICE_IO_STATUS_FAILED.
>>>> 
>>>> I agree with that. It just makes me feel there’s another bug in the 
>>>> resuming code path. Can you get a backtrace?
>>> 
>>> It seems the resuming code path bug has been fixed elsewhere:
>>> 
>>> (qemu) info status
>>> info status
>>> VM status: paused (io-error)
>>> (qemu) c
>>> c
>>> 2021-06-22T07:27:00.745466Z qemu-system-x86_64: VFIO_MAP_DMA failed: No
>>> space left on device
>>> (qemu) info status
>>> info status
>>> VM status: paused (io-error)
>>> (qemu) c
>>> c
>>> 2021-06-22T07:27:12.458137Z qemu-system-x86_64: VFIO_MAP_DMA failed: No
>>> space left on device
>>> (qemu) c
>>> c
>>> 2021-06-22T07:27:13.439167Z qemu-system-x86_64: VFIO_MAP_DMA failed: No
>>> space left on device
>>> (qemu) c
>>> c
>>> 2021-06-22T07:27:14.272071Z qemu-system-x86_64: VFIO_MAP_DMA failed: No
>>> space left on device
>>> (qemu)
>>> 
>> 
>> I tested all releases up to v4.1.0 and could not trigger the
>> blk_get_aio_context() assertion. Building using --enable-debug.
>> IIRC Gentoo is more aggressive, so I'll restart using -O2.
> 
> Took 4h30 to test all releases with -O3, couldn't reproduce :(
> 
> I wish I hadn't postponed writing an Ansible test script...
> 
> On v1 Michal said he doesn't have access to the machine anymore,
> so I'll assume the other issue got fixed elsewhere.
> 
> 

Cool, then I think it’s just the commit message that should be updated.

Fam




Re: [PATCH v2] block/nvme: Fix VFIO_MAP_DMA failed: No space left on device

2021-06-21 Thread Fam Zheng



> On 21 Jun 2021, at 16:13, Philippe Mathieu-Daudé  wrote:
> 
> On 6/21/21 3:18 PM, Fam Zheng wrote:
>> 
>> 
>>> On 21 Jun 2021, at 10:32, Philippe Mathieu-Daudé  wrote:
>>> 
>>> When the NVMe block driver was introduced (see commit bdd6a90a9e5,
>>> January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning
>>> -ENOMEM in case of error. The driver was correctly handling the
>>> error path to recycle its volatile IOVA mappings.
>>> 
>>> To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit
>>> DMA mappings per container", April 2019) added the -ENOSPC error to
>>> signal the user exhausted the DMA mappings available for a container.
>>> 
>>> The block driver started to mis-behave:
>>> 
>>> qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>>> (qemu)
>>> (qemu) info status
>>> VM status: paused (io-error)
>>> (qemu) c
>>> VFIO_MAP_DMA failed: No space left on device
>>> qemu-system-x86_64: block/block-backend.c:1968: blk_get_aio_context: 
>>> Assertion `ctx == blk->ctx' failed.
>> 
>> Hi Phil,
>> 
>> 
>> The diff looks good to me, but I’m not sure what exactly caused the 
>> assertion failure. There is `if (r) { goto fail; }` that handles -ENOSPC 
>> before, so it should be treated as a general case. What am I missing?
> 
> Good catch, ENOSPC ends setting BLOCK_DEVICE_IO_STATUS_NOSPACE
> -> BLOCK_ERROR_ACTION_STOP, so the VM is paused with DMA mapping
> exhausted. I don't understand the full "VM resume" path, but this
> is not what we want (IO_NOSPACE is to warn the operator to add
> more storage and resume, which is pointless in our case, resuming
> won't help until we flush the mappings).
> 
> IIUC what we want is return ENOMEM to set BLOCK_DEVICE_IO_STATUS_FAILED.

I agree with that. It just makes me feel there’s another bug in the resuming 
code path. Can you get a backtrace?

Fam
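
For readers following along: the reason ENOSPC pauses the VM is the default
werror=enospc policy. A minimal sketch of that policy (simplified and
illustrative, not QEMU's actual code):

#include <errno.h>

typedef enum {
    BLOCK_ERROR_ACTION_REPORT,  /* fail the request; the guest sees the error */
    BLOCK_ERROR_ACTION_STOP,    /* pause the VM so the operator can add storage */
} BlockErrorAction;

/* Under werror=enospc, only ENOSPC pauses the VM; everything else is
 * reported. Returning -ENOMEM from the VFIO mapping path therefore avoids
 * the pointless pause/resume loop shown above. */
static BlockErrorAction error_action_for(int error)
{
    return error == ENOSPC ? BLOCK_ERROR_ACTION_STOP
                           : BLOCK_ERROR_ACTION_REPORT;
}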


Re: [PATCH v2] block/nvme: Fix VFIO_MAP_DMA failed: No space left on device

2021-06-21 Thread Fam Zheng



> On 21 Jun 2021, at 10:32, Philippe Mathieu-Daudé  wrote:
> 
> When the NVMe block driver was introduced (see commit bdd6a90a9e5,
> January 2018), Linux VFIO_IOMMU_MAP_DMA ioctl was only returning
> -ENOMEM in case of error. The driver was correctly handling the
> error path to recycle its volatile IOVA mappings.
> 
> To fix CVE-2019-3882, Linux commit 492855939bdb ("vfio/type1: Limit
> DMA mappings per container", April 2019) added the -ENOSPC error to
> signal the user exhausted the DMA mappings available for a container.
> 
> The block driver started to mis-behave:
> 
>  qemu-system-x86_64: VFIO_MAP_DMA failed: No space left on device
>  (qemu)
>  (qemu) info status
>  VM status: paused (io-error)
>  (qemu) c
>  VFIO_MAP_DMA failed: No space left on device
>  qemu-system-x86_64: block/block-backend.c:1968: blk_get_aio_context: 
> Assertion `ctx == blk->ctx' failed.

Hi Phil,
 

The diff looks good to me, but I’m not sure what exactly caused the assertion 
failure. There is `if (r) { goto fail; }` that handles -ENOSPC before, so it 
should be treated as a general case. What am I missing?

Reviewed-by: Fam Zheng 

> 
> Fix by handling the new -ENOSPC error (when DMA mappings are
> exhausted) without any distinction to the current -ENOMEM error,
> so we don't change the behavior on old kernels where the CVE-2019-3882
> fix is not present.
> 
> An easy way to reproduce this bug is to restrict the DMA mapping
> limit (65535 by default) when loading the VFIO IOMMU module:
> 
>  # modprobe vfio_iommu_type1 dma_entry_limit=666
> 
> Cc: qemu-sta...@nongnu.org
> Reported-by: Michal Prívozník 
> Fixes: bdd6a90a9e5 ("block: Add VFIO based NVMe driver")
> Buglink: https://bugs.launchpad.net/qemu/+bug/186
> Resolves: https://gitlab.com/qemu-project/qemu/-/issues/65
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
> v2: KISS checking both errors undistinguishedly (Maxim)
> ---
> block/nvme.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 2b5421e7aa6..c3d2a49866c 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -1030,7 +1030,7 @@ try_map:
> r = qemu_vfio_dma_map(s->vfio,
>   qiov->iov[i].iov_base,
>   len, true, &iova);
> -if (r == -ENOMEM && retry) {
> +if ((r == -ENOMEM || r == -ENOSPC) && retry) {
> retry = false;
> trace_nvme_dma_flush_queue_wait(s);
> if (s->dma_map_count) {
> -- 
> 2.31.1
> 
> 




Re: [PATCH 0/2] util/async: print leaked BH name when AioContext finalizes

2021-04-15 Thread Fam Zheng
On 2021-04-14 21:02, Stefan Hajnoczi wrote:
> Eric Ernst and I debugged a BH leak and it was more involved than it should 
> be.
> The problem is that BHs don't have a human-readable identifier, so low-level
> debugging techniques and inferences about the code are required to figure out
> which BH was leaked in production environments without easy debug access.
> 
> The leak ended up already being fixed upstream but let's improve diagnostics
> for leaked BHs so that this becomes quick and easy in the future.
> 
> Stefan Hajnoczi (2):
>   util/async: add a human-readable name to BHs for debugging
>   util/async: print leaked BH name when AioContext finalizes
> 
>  include/block/aio.h| 31 ---
>  include/qemu/main-loop.h   |  4 +++-
>  tests/unit/ptimer-test-stubs.c |  2 +-
>  util/async.c   | 25 +
>  util/main-loop.c   |  4 ++--
>  5 files changed, 55 insertions(+), 11 deletions(-)
> 
> -- 
> 2.30.2
> 

Reviewed-by: Fam Zheng 
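
The idea of the series, reduced to a sketch (illustrative types and names,
not the series' actual implementation):

#include <stdio.h>

/* A bottom half that carries a human-readable name, so that the leak
 * report at AioContext finalize can say *which* BH is still pending. */
typedef struct QEMUBH {
    void (*cb)(void *opaque);
    void *opaque;
    const char *name;           /* e.g. the callback's symbol name */
    struct QEMUBH *next;
} QEMUBH;

static void report_leaked_bhs(QEMUBH *list)
{
    for (QEMUBH *bh = list; bh; bh = bh->next) {
        fprintf(stderr, "warning: leaked BH '%s'\n", bh->name);
    }
}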



Re: [PATCH] block: Introduce zero-co:// and zero-aio://

2021-03-10 Thread Fam Zheng
On Wed, 10 Mar 2021 at 15:02, Max Reitz  wrote:

> On 10.03.21 15:17, f...@euphon.net wrote:
> > From: Fam Zheng 
> >
> > null-co:// has a read-zeroes=off default; when used in security
> > analysis, this can cause false positives because the driver doesn't
> > write to the read buffer.
> >
> > null-co:// has the highest possible performance as a block driver, so
> > let's keep it that way. This patch introduces zero-co:// and
> > zero-aio://, largely similar to null-*://, but with read-zeroes=on by
> > default, so they are more suitable than null-co:// in such cases.
> >
> > Signed-off-by: Fam Zheng 
> > ---
> >   block/null.c | 91 
> >   1 file changed, 91 insertions(+)
>
> You’ll also need to make all tests that currently use null-{co,aio} use
> zero-{co,aio}, because none of those are performance tests (as far as
> I’m aware), so they all want a block driver that memset()s.
>
> (And that’s basically my problem with this approach; nearly everyone who
> uses null-* right now probably wants read-zeroes=on, so keeping null-*
> as it is means all of those users should be changed.  Sure, they were
> all wrong to not specify read-zeroes=on, but that’s how it is.  So while
> technically this patch is a compatible change, in contrast to the one
> making read-zeroes=on the default, in practice it absolutely isn’t.)
>
> Another problem arising from that is I can imagine that some
> distributions might have whitelisted null-co because many iotests
> implicitly depend on it, so the iotests will fail if they aren’t
> whitelisted.  Now they’d need to whitelist zero-co, too.  Not
> impossible, sure, but it’s work that would need to be done.
>
>
> My problem is this: We have a not-really problem, namely “valgrind and
> other tools complain”.  Philippe (and I guess me on IRC) proposed a
> simple solution whose only drawbacks (AFAIU) are:
>
> (1) When writing performance tests, you’ll then need to explicitly
> specify read-zeroes=off.  Existing performance tests using null-co will
> probably wrongly show degradation.  (Are there such tests, though?)
>
> (2) null will not quite conform to its name, strictly speaking.
> Frankly, I don’t know who’d care other than people who write those
> performance tests mentioned in (1).  I know I don’t care.
>
> (Technically: (3) We should look into all qemu tests that use null-co to
> see whether they test performance.  In practice, they don’t, so we don’t
> need to.)
>
> So AFAIU change the read-zeroes default would affect very few people, if
> any.  I see you care about (2), and you’re the maintainer, so I can’t
> say that there is no problem.  But it isn’t a practical one.
>
> So on the practical front, we get a small benefit (tools won’t complain)
> for a small drawback (having to remember read-zeroes=off for performance
> tests).
>
>
> Now you propose a change that has the following drawbacks, as I see it:
>
> (1) All non-performance tests using null-* should be changed to zero-*.
>   And those are quite a few tests, so this is some work.
>
> (2) Distributions that have whitelisted null-co now should whitelist
> zero-co, too.
>
> Not impossible, but I consider these much bigger practical drawbacks
> than making read-zeroes=on the default.  It’s actual work that must be
> done.  To me personally, these drawbacks far outweigh the benefit of
> having valgrind and other tools not complain.
>
>
> I can’t stop you from updating this patch to do (1), but it doesn’t make
> sense to me from a practical perspective.  It just doesn’t seem worth it
> to me.
>

But using null-co:// in tests is not wrong; the problem is that the caller
failed to initialize its buffer. IMO the valgrind issue should really be
fixed like this:

diff --git a/hw/block/hd-geometry.c b/hw/block/hd-geometry.c
index dcbccee294..9078718445 100644
--- a/hw/block/hd-geometry.c
+++ b/hw/block/hd-geometry.c
@@ -55,7 +55,7 @@ struct partition {
static int guess_disk_lchs(BlockBackend *blk,
   int *pcylinders, int *pheads, int *psectors)
{
-uint8_t buf[BDRV_SECTOR_SIZE];
+uint8_t buf[BDRV_SECTOR_SIZE] = {};
int i, heads, sectors, cylinders;
struct partition *p;
uint32_t nr_sects;


Re: [PATCH] block: Introduce zero-co:// and zero-aio://

2021-03-10 Thread Fam Zheng
On Wed, 10 Mar 2021 at 14:51, Philippe Mathieu-Daudé 
wrote:

> On 3/10/21 3:28 PM, Fam Zheng wrote:
> > On Wed, 10 Mar 2021 at 14:24, Philippe Mathieu-Daudé <phi...@redhat.com> wrote:
> >
> > On 3/10/21 3:17 PM, f...@euphon.net wrote:
> > > From: Fam Zheng <famzh...@amazon.com>
> > >
> > > null-co:// has a read-zeroes=off default; when used in security
> > > analysis, this can cause false positives because the driver doesn't
> > > write to the read buffer.
> > >
> > > null-co:// has the highest possible performance as a block driver, so
> > > let's keep it that way. This patch introduces zero-co:// and
> > > zero-aio://, largely similar to null-*://, but with read-zeroes=on by
> > > default, so they are more suitable than null-co:// in such cases.
> >
> > Thanks!
> >
> > >
> > > Signed-off-by: Fam Zheng <f...@euphon.net>
> > > ---
> > >  block/null.c | 91
> > 
> > >  1 file changed, 91 insertions(+)
> >
> > > +static int zero_file_open(BlockDriverState *bs, QDict *options,
> > int flags,
> > > +  Error **errp)
> > > +{
> > > +QemuOpts *opts;
> > > +BDRVNullState *s = bs->opaque;
> > > +int ret = 0;
> > > +
> > > +opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> > > +qemu_opts_absorb_qdict(opts, options, &error_abort);
> > > +s->length =
> > > +qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 1 << 30);
> >
> > Please do not use this magic default value, let's always explicit the
> > size.
> >
> > s->length = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
> > if (s->length < 0) {
> > error_setg(errp, "%s is invalid", BLOCK_OPT_SIZE);
> > ret = -EINVAL;
> > }
> >
> >
> > Doesn't that mean that a bare
> >
> >   -blockdev zero-co://
> >
> > would have a size of 0 bytes?
>
> Yes, this will display an error.
>
> Maybe better something like:
>
> if (s->length == 0) {
> error_append_hint(errp, "Usage: zero-co://,size=NUM");
> error_setg(errp, "size must be specified");
> ret = -EINVAL;
> } else if (s->length < 0) {
> error_setg(errp, "%s is invalid", BLOCK_OPT_SIZE);
> ret = -EINVAL;
> }
>
> >
> > I'm copying what null-co:// does today, and it's convenient for tests.
> > Why is a default size bad?
>
> We learned defaults are almost always bad because you cannot get
> rid of them. Also because we have reports of errors when using
> floppy (and another device I can't remember) with null-co because of
> an invalid size, and we have to explain "the default is 1GB, you have
> to make your size explicit".
>

Yeah I think that makes sense. I'll revise.

Thanks,
Fam
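
For reference, a sketch of how the revised zero_file_open() could look with
the mandatory size folded in (a hypothetical v2, with the hint appended
after error_setg() as the Error API expects):

static int zero_file_open(BlockDriverState *bs, QDict *options, int flags,
                          Error **errp)
{
    QemuOpts *opts;
    BDRVNullState *s = bs->opaque;
    int ret = 0;

    opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
    qemu_opts_absorb_qdict(opts, options, &error_abort);
    /* No default size: the user must pass size=NUM explicitly. */
    s->length = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
    if (s->length == 0) {
        error_setg(errp, "size must be specified");
        error_append_hint(errp, "Usage: zero-co://,size=NUM\n");
        ret = -EINVAL;
    }
    s->read_zeroes = true;
    qemu_opts_del(opts);
    bs->supported_write_flags = BDRV_REQ_FUA;
    return ret;
}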


Re: [PATCH] block: Introduce zero-co:// and zero-aio://

2021-03-10 Thread Fam Zheng
On Wed, 10 Mar 2021 at 14:41, Vladimir Sementsov-Ogievskiy <
vsement...@virtuozzo.com> wrote:

> 10.03.2021 17:17, f...@euphon.net wrote:
> > From: Fam Zheng 
> >
> > null-co:// has a read-zeroes=off default; when used in security
> > analysis, this can cause false positives because the driver doesn't
> > write to the read buffer.
> >
> > null-co:// has the highest possible performance as a block driver, so
> > let's keep it that way. This patch introduces zero-co:// and
> > zero-aio://, largely similar to null-*://, but with read-zeroes=on by
> > default, so they are more suitable than null-co:// in such cases.
> >
> > Signed-off-by: Fam Zheng 
> > ---
> >   block/null.c | 91 
> >   1 file changed, 91 insertions(+)
> >
> > diff --git a/block/null.c b/block/null.c
> > index cc9b1d4ea7..5de97e8fda 100644
> > --- a/block/null.c
> > +++ b/block/null.c
> > @@ -76,6 +76,30 @@ static void null_aio_parse_filename(const char
> *filename, QDict *options,
> >   }
> >   }
> >
> > +static void zero_co_parse_filename(const char *filename, QDict *options,
> > +   Error **errp)
> > +{
> > > +/* This function only exists so that a zero-co:// filename is
> > > + * accepted with the zero-co driver. */
> > +if (strcmp(filename, "zero-co://")) {
> > +error_setg(errp, "The only allowed filename for this driver is "
> > + "'zero-co://'");
> > +return;
> > +}
> > +}
> > +
> > +static void zero_aio_parse_filename(const char *filename, QDict
> *options,
> > +Error **errp)
> > +{
> > > +/* This function only exists so that a zero-aio:// filename is
> > > + * accepted with the zero-aio driver. */
> > +if (strcmp(filename, "zero-aio://")) {
> > +error_setg(errp, "The only allowed filename for this driver is "
> > + "'zero-aio://'");
> > +return;
> > +}
> > +}
> > +
> >   static int null_file_open(BlockDriverState *bs, QDict *options, int
> flags,
> > Error **errp)
> >   {
> > @@ -99,6 +123,29 @@ static int null_file_open(BlockDriverState *bs,
> QDict *options, int flags,
> >   return ret;
> >   }
> >
> > +static int zero_file_open(BlockDriverState *bs, QDict *options, int
> flags,
> > +  Error **errp)
> > +{
> > +QemuOpts *opts;
> > +BDRVNullState *s = bs->opaque;
> > +int ret = 0;
> > +
> > > +opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> > > +qemu_opts_absorb_qdict(opts, options, &error_abort);
> > +s->length =
> > +qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 1 << 30);
> > +s->latency_ns =
> > +qemu_opt_get_number(opts, NULL_OPT_LATENCY, 0);
> > +if (s->latency_ns < 0) {
> > +error_setg(errp, "latency-ns is invalid");
> > +ret = -EINVAL;
> > +}
> > +s->read_zeroes = true;
> > +qemu_opts_del(opts);
> > +bs->supported_write_flags = BDRV_REQ_FUA;
> > +return ret;
> > +}
> > +
> >   static int64_t null_getlength(BlockDriverState *bs)
> >   {
> >   BDRVNullState *s = bs->opaque;
> > @@ -316,10 +363,54 @@ static BlockDriver bdrv_null_aio = {
> >   .strong_runtime_opts= null_strong_runtime_opts,
> >   };
> >
> > +static BlockDriver bdrv_zero_co = {
> > +.format_name= "zero-co",
> > +.protocol_name  = "zero-co",
> > +.instance_size  = sizeof(BDRVNullState),
> > +
> > +.bdrv_file_open = zero_file_open,
> > +.bdrv_parse_filename= zero_co_parse_filename,
> > +.bdrv_getlength = null_getlength,
> > +.bdrv_get_allocated_file_size = null_allocated_file_size,
> > +
> > +.bdrv_co_preadv = null_co_preadv,
> > +.bdrv_co_pwritev= null_co_pwritev,
> > +.bdrv_co_flush_to_disk  = null_co_flush,
> > +.bdrv_reopen_prepare= null_reopen_prepare,
> > +
> > +.bdrv_co_block_status   = null_co_block_status,
> > +
> > +.bdrv_refresh_filename  = null_refresh_filename,
> > +.strong_runtime_opts= null_strong_runtime_opts,
> > +};
> > +
> > +static BlockDriver bdrv_zero_aio = {
> > +.form

Re: [PATCH] block: Introduce zero-co:// and zero-aio://

2021-03-10 Thread Fam Zheng
On Wed, 10 Mar 2021 at 14:24, Philippe Mathieu-Daudé 
wrote:

> On 3/10/21 3:17 PM, f...@euphon.net wrote:
> > From: Fam Zheng 
> >
> > null-co:// has a read-zeroes=off default; when used in security
> > analysis, this can cause false positives because the driver doesn't
> > write to the read buffer.
> >
> > null-co:// has the highest possible performance as a block driver, so
> > let's keep it that way. This patch introduces zero-co:// and
> > zero-aio://, largely similar to null-*://, but with read-zeroes=on by
> > default, so they are more suitable than null-co:// in such cases.
>
> Thanks!
>
> >
> > Signed-off-by: Fam Zheng 
> > ---
> >  block/null.c | 91 
> >  1 file changed, 91 insertions(+)
>
> > +static int zero_file_open(BlockDriverState *bs, QDict *options, int
> flags,
> > +  Error **errp)
> > +{
> > +QemuOpts *opts;
> > +BDRVNullState *s = bs->opaque;
> > +int ret = 0;
> > +
> > > +opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
> > > +qemu_opts_absorb_qdict(opts, options, &error_abort);
> > +s->length =
> > +qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 1 << 30);
>
> Please do not use this magic default value, let's always explicit the
> size.
>
> s->length = qemu_opt_get_size(opts, BLOCK_OPT_SIZE, 0);
> if (s->length < 0) {
> error_setg(errp, "%s is invalid", BLOCK_OPT_SIZE);
> ret = -EINVAL;
> }


Doesn't that mean that a bare

  -blockdev zero-co://

would have a size of 0 bytes?

I'm copying what null-co:// does today, and it's convenient for tests. Why
is a default size bad?

Fam



>
>
> +s->latency_ns =
> > +qemu_opt_get_number(opts, NULL_OPT_LATENCY, 0);
> > +if (s->latency_ns < 0) {
> > +error_setg(errp, "latency-ns is invalid");
> > +ret = -EINVAL;
> > +}
> > +s->read_zeroes = true;
> > +qemu_opts_del(opts);
> > +bs->supported_write_flags = BDRV_REQ_FUA;
> > +return ret;
> > +}
>
> Otherwise LGTM :)
>
>
>


Re: [PATCH 0/3] block: Introduce the 'zeroes-co' driver to help security reports

2021-03-10 Thread Fam Zheng
On Wed, 10 Mar 2021 at 12:38, Philippe Mathieu-Daudé 
wrote:

> On 3/10/21 1:32 PM, Fam Zheng wrote:
> > On Wed, 10 Mar 2021 at 11:44, Philippe Mathieu-Daudé <phi...@redhat.com> wrote:
> >
> > Hi,
> >
> > This is an alternative approach to changing null-co driver
> > default 'read-zeroes' option to true:
> > https://www.mail-archive.com/qemu-block@nongnu.org/msg80873.html
> >
> > Instead we introduce yet another block driver with an explicit
> > name: 'zeroes-co'. We then clarify in secure-coding-practices.rst
> > that security reports have to be sent using this new driver.
> >
> > The 2nd patch is RFC because I won't spend time converting the
> > tests until the first patch is discussed, as I already spent enough
> > time doing that in the previous mentioned series.
> >
> > Regards,
> >
> > Phil.
> >
> > Philippe Mathieu-Daudé (3):
> >   block: Introduce the 'zeroes-co' driver
> >   tests/test-blockjob: Use zeroes-co instead of
> null-co,read-zeroes=on
> >   docs/secure-coding-practices: Describe null-co/zeroes-co block
> drivers
> >
> >  docs/devel/secure-coding-practices.rst |   7 +
> >  block/zeroes.c | 306 +
> >
> > Why not add another BlockDriver struct to block/null.c and set the
> > read_zeroes field in the .bdrv_file_open callback? It would make the
> > patch much simpler.
>
> This is the first patch I wrote but then realized you are listed as
> null-co maintainer, so you might be uninterested in having this new
> driver in the same file. And declaring the prototypes public to
> reuse them seemed overkill.
>
>
I posted a patch for block/null.c to add the same interface, just as an
alternative to the first patch.

I think we all agree that both zeroed and non-zeroed read mode will be
supported going forward, the divergence is on approach:

a) -blockdev null-co:// | -blockdev null-co://,read-zeroes=off,

if we make read-zeroes=on the default.

b) -blockdev null-co:// | -blockdev zero-co://,

if we keep the current default of null-co://, but introduce zero-co://
which has a clearer interface.

Fam


Re: [PATCH 0/3] block: Introduce the 'zeroes-co' driver to help security reports

2021-03-10 Thread Fam Zheng
On Wed, 10 Mar 2021 at 11:44, Philippe Mathieu-Daudé 
wrote:

> Hi,
>
> This is an alternative approach to changing null-co driver
> default 'read-zeroes' option to true:
> https://www.mail-archive.com/qemu-block@nongnu.org/msg80873.html
>
> Instead we introduce yet another block driver with an explicit
> name: 'zeroes-co'. We then clarify in secure-coding-practices.rst
> that security reports have to be sent using this new driver.
>
> The 2nd patch is RFC because I won't spend time converting the
> tests until the first patch is discussed, as I already spent enough
> time doing that in the previous mentioned series.
>
> Regards,
>
> Phil.
>
> Philippe Mathieu-Daudé (3):
>   block: Introduce the 'zeroes-co' driver
>   tests/test-blockjob: Use zeroes-co instead of null-co,read-zeroes=on
>   docs/secure-coding-practices: Describe null-co/zeroes-co block drivers
>
>  docs/devel/secure-coding-practices.rst |   7 +
>  block/zeroes.c | 306 +
>


Why not add another BlockDriver struct to block/null.c and set the
read_zeroes field in the .bdrv_file_open callback? It would make the patch
much simpler.

Fam


>  tests/test-blockjob.c  |   4 +-
>  block/meson.build  |   1 +
>  4 files changed, 315 insertions(+), 3 deletions(-)
>  create mode 100644 block/zeroes.c
>
> --
> 2.26.2
>
>
>
>


Re: [PATCH v2 0/2] block: Use 'read-zeroes=true' mode by default with 'null-co' driver

2021-02-23 Thread Fam Zheng
On 2021-02-23 17:01, Max Reitz wrote:
> On 23.02.21 10:21, Fam Zheng wrote:
> > On 2021-02-22 18:55, Philippe Mathieu-Daudé wrote:
> > > On 2/22/21 6:35 PM, Fam Zheng wrote:
> > > > On 2021-02-19 15:09, Philippe Mathieu-Daudé wrote:
> > > > > On 2/19/21 12:07 PM, Max Reitz wrote:
> > > > > > On 13.02.21 22:54, Fam Zheng wrote:
> > > > > > > On 2021-02-11 15:26, Philippe Mathieu-Daudé wrote:
> > > > > > > > The null-co driver doesn't zeroize buffer in its default config,
> > > > > > > > because it is designed for testing and tests want to run fast.
> > > > > > > > However this confuses security researchers (access to uninit
> > > > > > > > buffers).
> > > > > > > 
> > > > > > > I'm a little surprised.
> > > > > > > 
> > > > > > > Is changing default the only way to fix this? I'm not opposed to
> > > > > > > changing the default but I'm not convinced this is the easiest 
> > > > > > > way.
> > > > > > > block/nvme.c also doesn't touch the memory, but defers to the 
> > > > > > > device
> > > > > > > DMA, why doesn't that confuse the security checker?
> > > > > 
> > > > > Generally speaking, there is a balance between security and 
> > > > > performance.
> > > > > We try to provide both, but when we can't, my understanding is 
> > > > > security
> > > > > is more important.
> > > > 
> > > > Why is hiding the code path behind a non-default more secure? What is
> > > > not secure now?
> > > 
> > > So we are back to the problem of having default values.
> > > 
> > > I'd like to remove the default and have the option explicit,
> > > but qemu_opt_get_bool() expects a 'default' value.
> > > 
> > > Should we rename qemu_opt_get_bool() -> qemu_opt_get_bool_with_default()
> > > and add a simpler qemu_opt_get_bool()?
> > 
> > My point is I still don't get the full context of the problem this
> > series is trying to solve. If the problem is tools are confused, I would
> > like to understand why. What is the thing that matters here, exactly?
> 
> My personal opinion is that it isn’t even about tools, it’s just about the
> fact that operating on uninitialized data is something that should generally
> be avoided.  For the null drivers, there is a reason to allow it
> (performance testing), but that should be a special case, not the default.

How do we define uninitialized data? Are buffers passed to a successful
read(2) syscall initialized? We cannot know, because it's up to the
fs/driver/storage. The same goes for bdrv_pread().

In fact block/null.c doesn't operate on uninitialized data; we should
really fix guess_disk_lchs() and similar.

> 
> > But there's always been nullblk.ko in the kernel, which doesn't lie in
> > its name. If we change the default we are not being very honest any more
> > about what the driver actually does.
> 
> Then we’re already lying, because if we model it after /dev/null, we should
> probably return -EIO on every read.

nullblk.ko is not /dev/null, it's /dev/nullb*:

https://www.kernel.org/doc/Documentation/block/null_blk.txt

Fam




Re: [PATCH v2 0/2] block: Use 'read-zeroes=true' mode by default with 'null-co' driver

2021-02-23 Thread Fam Zheng
On 2021-02-22 18:55, Philippe Mathieu-Daudé wrote:
> On 2/22/21 6:35 PM, Fam Zheng wrote:
> > On 2021-02-19 15:09, Philippe Mathieu-Daudé wrote:
> >> On 2/19/21 12:07 PM, Max Reitz wrote:
> >>> On 13.02.21 22:54, Fam Zheng wrote:
> >>>> On 2021-02-11 15:26, Philippe Mathieu-Daudé wrote:
> >>>>> The null-co driver doesn't zeroize buffer in its default config,
> >>>>> because it is designed for testing and tests want to run fast.
> >>>>> However this confuses security researchers (access to uninit
> >>>>> buffers).
> >>>>
> >>>> I'm a little surprised.
> >>>>
> >>>> Is changing default the only way to fix this? I'm not opposed to
> >>>> changing the default but I'm not convinced this is the easiest way.
> >>>> block/nvme.c also doesn't touch the memory, but defers to the device
> >>>> DMA, why doesn't that confuse the security checker?
> >>
> >> Generally speaking, there is a balance between security and performance.
> >> We try to provide both, but when we can't, my understanding is security
> >> is more important.
> > 
> > Why is hiding the code path behind a non-default more secure? What is
> > not secure now?
> 
> So we are back to the problem of having default values.
> 
> I'd like to remove the default and have the option explicit,
> but qemu_opt_get_bool() expects a 'default' value.
> 
> Should we rename qemu_opt_get_bool() -> qemu_opt_get_bool_with_default()
> and add a simpler qemu_opt_get_bool()?

My point is I still don't get the full context of the problem this
series is trying to solve. If the problem is tools are confused, I would
like to understand why. What is the thing that matters here, exactly?

But there's always been nullblk.ko in the kernel, which doesn't lie in
its name. If we change the default we are not being very honest any more
about what the driver actually does.

Even if null-co:// and null-aio:// are a bad idea, let's add
zero-co:// and zero-aio://, and deprecate null-*://.

Fam




Re: [PATCH v2 0/2] block: Use 'read-zeroes=true' mode by default with 'null-co' driver

2021-02-22 Thread Fam Zheng
On 2021-02-19 15:09, Philippe Mathieu-Daudé wrote:
> On 2/19/21 12:07 PM, Max Reitz wrote:
> > On 13.02.21 22:54, Fam Zheng wrote:
> >> On 2021-02-11 15:26, Philippe Mathieu-Daudé wrote:
> >>> The null-co driver doesn't zeroize buffer in its default config,
> >>> because it is designed for testing and tests want to run fast.
> >>> However this confuses security researchers (access to uninit
> >>> buffers).
> >>
> >> I'm a little surprised.
> >>
> >> Is changing default the only way to fix this? I'm not opposed to
> >> changing the default but I'm not convinced this is the easiest way.
> >> block/nvme.c also doesn't touch the memory, but defers to the device
> >> DMA, why doesn't that confuse the security checker?
> 
> Generally speaking, there is a balance between security and performance.
> We try to provide both, but when we can't, my understanding is security
> is more important.

Why is hiding the code path behind a non-default more secure? What is
not secure now?

Fam




Re: [PATCH v2 0/2] block: Use 'read-zeroes=true' mode by default with 'null-co' driver

2021-02-13 Thread Fam Zheng
On 2021-02-11 15:26, Philippe Mathieu-Daudé wrote:
> The null-co driver doesn't zeroize buffer in its default config,
> because it is designed for testing and tests want to run fast.
> However this confuses security researchers (access to uninit
> buffers).

I'm a little surprised.

Is changing default the only way to fix this? I'm not opposed to
changing the default but I'm not convinced this is the easiest way.
block/nvme.c also doesn't touch the memory, but defers to the device
DMA, why doesn't that confuse the security checker?

Cannot we just somehow annotate it in a way that the checker can
understand (akin to how we provide coverity models) and be done?
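
A sketch of what such an annotation could look like, in the spirit of
QEMU's Coverity model file (the model function below is hypothetical;
__coverity_writeall__() is Coverity's modeling primitive for "this buffer
is fully written"):

/* Model: tell the checker that a null-co read fills the whole buffer. */
void __coverity_writeall__(void *buf);

int null_co_preadv_model(void *buf, unsigned long bytes)
{
    __coverity_writeall__(buf);
    return 0;
}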

Thanks,
Fam

> 
> A supposedly one-line patch which became a painful one, because
> there are so many different syntaxes to express the same usage:
> 
>  opt = qdict_new();
>  qdict_put_str(opt, "read-zeroes", "off");
>  null_bs = bdrv_open("null-co://", NULL, opt, BDRV_O_RDWR | BDRV_O_PROTOCOL,
>  &error_abort);
> 
> vm.qmp('blockdev-add', driver='null-co', read_zeroes=False, ...)
> 
> vm.add_drive_raw("id=drive0,driver=null-co,read-zeroes=off,if=none")
> 
> blk0 = { 'node-name': 'src',
> 'driver': 'null-co',
> 'read-zeroes': 'off' }
> 
> 'file': {
> 'driver': 'null-co',
> 'read-zeroes': False,
> }
> 
> "file": {
> "driver": "null-co",
> "read-zeroes": "off"
> }
> 
> { "execute": "blockdev-add",
>   "arguments": {
>   "driver": "null-co",
>   "read-zeroes": false,
>   "node-name": "disk0"
> }
> }
> 
> opts = {'driver': 'null-co,read-zeroes=off', 'node-name': 'root', 'size': 
> 1024}
> 
> qemu -drive driver=null-co,read-zeroes=off
> 
> qemu-io ... "json:{'driver': 'null-co', 'read-zeroes': false, 'size': 65536}"
> 
> qemu-img null-co://,read-zeroes=off
> 
> qemu-img ... -o 
> data_file="json:{'driver':'null-co',,'read-zeroes':false,,'size':'4294967296'}"
> 
> There are probably more.
> 
> Anyhow, the iotests I am not sure about, and which should be audited,
> are 056, 155 (I don't understand the syntax) and 162.
> 
> Please review,
> 
> Phil.
> 
> Philippe Mathieu-Daudé (2):
>   block: Explicit null-co uses 'read-zeroes=false'
>   block/null: Enable 'read-zeroes' mode by default
> 
>  docs/devel/testing.rst | 14 +++---
>  tests/qtest/fuzz/generic_fuzz_configs.h| 11 ++-
>  block/null.c   |  2 +-
>  tests/test-bdrv-drain.c| 10 --
>  tests/acceptance/virtio_check_params.py|  2 +-
>  tests/perf/block/qcow2/convert-blockstatus |  6 +++---
>  tests/qemu-iotests/040 |  2 +-
>  tests/qemu-iotests/041 | 12 
>  tests/qemu-iotests/051 |  2 +-
>  tests/qemu-iotests/051.out |  2 +-
>  tests/qemu-iotests/051.pc.out  |  4 ++--
>  tests/qemu-iotests/087 |  6 --
>  tests/qemu-iotests/118 |  2 +-
>  tests/qemu-iotests/133 |  2 +-
>  tests/qemu-iotests/153 |  8 
>  tests/qemu-iotests/184 |  2 ++
>  tests/qemu-iotests/184.out | 10 +-
>  tests/qemu-iotests/218 |  3 +++
>  tests/qemu-iotests/224 |  3 ++-
>  tests/qemu-iotests/224.out |  8 
>  tests/qemu-iotests/225 |  2 +-
>  tests/qemu-iotests/227 |  4 ++--
>  tests/qemu-iotests/227.out |  4 ++--
>  tests/qemu-iotests/228 |  2 +-
>  tests/qemu-iotests/235 |  1 +
>  tests/qemu-iotests/245 |  2 +-
>  tests/qemu-iotests/270 |  2 +-
>  tests/qemu-iotests/283 |  3 ++-
>  tests/qemu-iotests/283.out |  4 ++--
>  tests/qemu-iotests/299 |  1 +
>  tests/qemu-iotests/299.out |  2 +-
>  tests/qemu-iotests/300 |  4 ++--
>  32 files changed, 82 insertions(+), 60 deletions(-)
> 
> --
> 2.26.2
> 
> 
> 




Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-10-22 Thread Fam Zheng
On Tue, 2020-10-20 at 09:34 +0800, Zhenyu Ye wrote:
> On 2020/10/19 21:25, Paolo Bonzini wrote:
> > On 19/10/20 14:40, Zhenyu Ye wrote:
> > > The kernel backtrace for io_submit in GUEST is:
> > > 
> > >   guest# ./offcputime -K -p `pgrep -nx fio`
> > >   b'finish_task_switch'
> > >   b'__schedule'
> > >   b'schedule'
> > >   b'io_schedule'
> > >   b'blk_mq_get_tag'
> > >   b'blk_mq_get_request'
> > >   b'blk_mq_make_request'
> > >   b'generic_make_request'
> > >   b'submit_bio'
> > >   b'blkdev_direct_IO'
> > >   b'generic_file_read_iter'
> > >   b'aio_read'
> > >   b'io_submit_one'
> > >   b'__x64_sys_io_submit'
> > >   b'do_syscall_64'
> > >   b'entry_SYSCALL_64_after_hwframe'
> > >   -fio (1464)
> > >   40031912
> > > 
> > > And Linux io_uring can avoid the latency problem.

Thanks for the info. What this tells us is basically that the number of
inflight requests is high. It's sad that linux-aio is in practice
implemented as a blocking API.

Host side backtrace will be of more help. Can you get that too?

Fam
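
For reference, the host-side counterpart of the guest command quoted above
would be along these lines (the process name is an assumption; offcputime
is from the bcc tools, as used in the guest):

  host# ./offcputime -K -p `pgrep -nx qemu-system-x86_64`

which should show where the QEMU process itself spends its blocked time.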

> > 
> > What filesystem are you using?
> > 
> 
> On host, the VM image and disk images are based on ext4 filesystem.
> In guest, the '/' uses xfs filesystem, and the disks are raw devices.
> 
> guest# df -hT
> Filesystem  Type  Size  Used Avail Use% Mounted on
> devtmpfsdevtmpfs   16G 0   16G   0% /dev
> tmpfs   tmpfs  16G 0   16G   0% /dev/shm
> tmpfs   tmpfs  16G  976K   16G   1% /run
> /dev/mapper/fedora-root xfs   8.0G  3.2G  4.9G  40% /
> tmpfs   tmpfs  16G 0   16G   0% /tmp
> /dev/sda1   xfs  1014M  181M  834M  18% /boot
> tmpfs   tmpfs 3.2G 0  3.2G   0% /run/user/0
> 
> guest# lsblk
> NAMEMAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> sda   8:00  10G  0 disk
> ├─sda18:10   1G  0 part /boot
> └─sda28:20   9G  0 part
>   ├─fedora-root 253:00   8G  0 lvm  /
>   └─fedora-swap 253:10   1G  0 lvm  [SWAP]
> vda 252:00  10G  0 disk
> vdb 252:16   0  10G  0 disk
> vdc 252:32   0  10G  0 disk
> vdd 252:48   0  10G  0 disk
> 
> Thanks,
> Zhenyu
> 




Re: [PATCH 0/9] util/vfio-helpers: Improve debugging experience

2020-10-14 Thread Fam Zheng
On Wed, 2020-10-14 at 14:42 +0200, Philippe Mathieu-Daudé wrote:
> On 10/14/20 2:34 PM, Fam Zheng wrote:
> > On Wed, 2020-10-14 at 13:52 +0200, Philippe Mathieu-Daudé wrote:
> > > A bunch of boring patches that have been proven helpful
> > > while debugging.
> > > 
> > > Philippe Mathieu-Daudé (9):
> > >util/vfio-helpers: Improve reporting unsupported IOMMU type
> > >util/vfio-helpers: Trace PCI I/O config accesses
> > >util/vfio-helpers: Trace PCI BAR region info
> > >util/vfio-helpers: Trace where BARs are mapped
> > >util/vfio-helpers: Improve DMA trace events
> > >util/vfio-helpers: Convert vfio_dump_mapping to trace events
> > >util/vfio-helpers: Let qemu_vfio_dma_map() propagate Error
> > >util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error
> > >util/vfio-helpers: Let qemu_vfio_verify_mappings() use
> > > error_report()
> > > 
> > >   include/qemu/vfio-helpers.h |  2 +-
> > >   block/nvme.c| 14 
> > >   util/vfio-helpers.c | 66 +----
> > >   util/trace-events   | 10 ++++--
> > >   4 files changed, 54 insertions(+), 38 deletions(-)
> > > 
> > > -- 
> > > 2.26.2
> > > 
> > > 
> > > 
> > 
> > Modulo the g_strdup_printf() memleak I pointed out:
> > 
> > Reviewed-by: Fam Zheng 

Overlooked the auto free qualifier, so that one is okay too!

Fam
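
For readers unfamiliar with the qualifier: g_autofree makes GLib call
g_free() automatically when the variable goes out of scope, on every
return path. A minimal sketch:

#include <glib.h>

static void print_bar_name(int index)
{
    /* freed automatically when barname leaves scope; no explicit g_free() */
    g_autofree char *barname = g_strdup_printf("bar[%d]", index);
    g_print("%s\n", barname);
}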

> 
> Thanks!
> 
> 




Re: [PATCH 0/9] util/vfio-helpers: Improve debugging experience

2020-10-14 Thread Fam Zheng
On Wed, 2020-10-14 at 13:52 +0200, Philippe Mathieu-Daudé wrote:
> A bunch of boring patches that have been proven helpful
> while debugging.
> 
> Philippe Mathieu-Daudé (9):
>   util/vfio-helpers: Improve reporting unsupported IOMMU type
>   util/vfio-helpers: Trace PCI I/O config accesses
>   util/vfio-helpers: Trace PCI BAR region info
>   util/vfio-helpers: Trace where BARs are mapped
>   util/vfio-helpers: Improve DMA trace events
>   util/vfio-helpers: Convert vfio_dump_mapping to trace events
>   util/vfio-helpers: Let qemu_vfio_dma_map() propagate Error
>   util/vfio-helpers: Let qemu_vfio_do_mapping() propagate Error
>   util/vfio-helpers: Let qemu_vfio_verify_mappings() use
> error_report()
> 
>  include/qemu/vfio-helpers.h |  2 +-
>  block/nvme.c| 14 
>  util/vfio-helpers.c | 66 +
>  util/trace-events   | 10 --
>  4 files changed, 54 insertions(+), 38 deletions(-)
> 
> -- 
> 2.26.2
> 
> 
> 

Modulo the g_strdup_printf() memleak I pointed out:

Reviewed-by: Fam Zheng 




Re: [PATCH 3/9] util/vfio-helpers: Trace PCI BAR region info

2020-10-14 Thread Fam Zheng
On Wed, 2020-10-14 at 13:52 +0200, Philippe Mathieu-Daudé wrote:
> For debug purpose, trace BAR regions info.
> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  util/vfio-helpers.c | 8 
>  util/trace-events   | 1 +
>  2 files changed, 9 insertions(+)
> 
> diff --git a/util/vfio-helpers.c b/util/vfio-helpers.c
> index 1d4efafcaa4..cd6287c3a98 100644
> --- a/util/vfio-helpers.c
> +++ b/util/vfio-helpers.c
> @@ -136,6 +136,7 @@ static inline void
> assert_bar_index_valid(QEMUVFIOState *s, int index)
>  
>  static int qemu_vfio_pci_init_bar(QEMUVFIOState *s, int index, Error
> **errp)
>  {
> +g_autofree char *barname = NULL;
>  assert_bar_index_valid(s, index);
>  s->bar_region_info[index] = (struct vfio_region_info) {
>  .index = VFIO_PCI_BAR0_REGION_INDEX + index,
> @@ -145,6 +146,10 @@ static int qemu_vfio_pci_init_bar(QEMUVFIOState
> *s, int index, Error **errp)
>  error_setg_errno(errp, errno, "Failed to get BAR region
> info");
>  return -errno;
>  }
> +barname = g_strdup_printf("bar[%d]", index);

Where is barname freed?

Fam

> +trace_qemu_vfio_region_info(barname, s->bar_region_info[index].offset,
> +s->bar_region_info[index].size,
> +s->bar_region_info[index].cap_offset);
>  
>  return 0;
>  }
> @@ -416,6 +421,9 @@ static int qemu_vfio_init_pci(QEMUVFIOState *s,
> const char *device,
>  ret = -errno;
>  goto fail;
>  }
> +trace_qemu_vfio_region_info("config", s-
> >config_region_info.offset,
> +s->config_region_info.size,
> +s->config_region_info.cap_offset);
>  
>  for (i = 0; i < ARRAY_SIZE(s->bar_region_info); i++) {
>  ret = qemu_vfio_pci_init_bar(s, i, errp);
> diff --git a/util/trace-events b/util/trace-events
> index c048f85f828..4d40c74a21f 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -87,3 +87,4 @@ qemu_vfio_dma_map(void *s, void *host, size_t size,
> bool temporary, uint64_t *io
>  qemu_vfio_dma_unmap(void *s, void *host) "s %p host %p"
>  qemu_vfio_pci_read_config(void *buf, int ofs, int size, uint64_t
> region_ofs, uint64_t region_size) "read cfg ptr %p ofs 0x%x size %d
> (region ofs 0x%"PRIx64" size %"PRId64")"
>  qemu_vfio_pci_write_config(void *buf, int ofs, int size, uint64_t
> region_ofs, uint64_t region_size) "write cfg ptr %p ofs 0x%x size %d
> (region ofs 0x%"PRIx64" size %"PRId64")"
> +qemu_vfio_region_info(const char *desc, uint64_t offset, uint64_t
> size, uint32_t cap_offset) "region '%s' ofs 0x%"PRIx64" size
> %"PRId64" cap_ofs %"PRId32




Re: [PATCH 1/4] vmdk: fix maybe uninitialized warnings

2020-09-30 Thread Fam Zheng
On Wed, 2020-09-30 at 17:58 +0200, Christian Borntraeger wrote:
> Fedora 32 gcc 10 seems to give false positives:
> 
> Compiling C object libblock.fa.p/block_vmdk.c.o
> ../block/vmdk.c: In function ‘vmdk_parse_extents’:
> ../block/vmdk.c:587:5: error: ‘extent’ may be used uninitialized in
> this function [-Werror=maybe-uninitialized]
>   587 | g_free(extent->l1_table);
>   | ^~~~
> ../block/vmdk.c:754:17: note: ‘extent’ was declared here
>   754 | VmdkExtent *extent;
>   | ^~
> ../block/vmdk.c:620:11: error: ‘extent’ may be used uninitialized in
> this function [-Werror=maybe-uninitialized]
>   620 | ret = vmdk_init_tables(bs, extent, errp);
>   |   ^~
> ../block/vmdk.c:598:17: note: ‘extent’ was declared here
>   598 | VmdkExtent *extent;
>   | ^~
> ../block/vmdk.c:1178:39: error: ‘extent’ may be used uninitialized in
> this function [-Werror=maybe-uninitialized]
>  1178 | extent->flat_start_offset = flat_offset << 9;
>   | ~~^~
> ../block/vmdk.c: In function ‘vmdk_open_vmdk4’:
> ../block/vmdk.c:581:22: error: ‘extent’ may be used uninitialized in
> this function [-Werror=maybe-uninitialized]
>   581 | extent->l2_cache =
>   | ~^
>   582 | g_malloc(extent->entry_size * extent->l2_size *
> L2_CACHE_SIZE);
>   | ~
> ~
> ../block/vmdk.c:872:17: note: ‘extent’ was declared here
>   872 | VmdkExtent *extent;
>   | ^~
> ../block/vmdk.c: In function ‘vmdk_open’:
> ../block/vmdk.c:620:11: error: ‘extent’ may be used uninitialized in
> this function [-Werror=maybe-uninitialized]
>   620 | ret = vmdk_init_tables(bs, extent, errp);
>   |   ^~
> ../block/vmdk.c:598:17: note: ‘extent’ was declared here
>   598 | VmdkExtent *extent;
>   | ^~
> cc1: all warnings being treated as errors
> make: *** [Makefile.ninja:884: libblock.fa.p/block_vmdk.c.o] Error 1
> 
> fix them by assigning a default value.
> 
> Signed-off-by: Christian Borntraeger 
> ---
>  block/vmdk.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 8ec62c7ab798..a00dc00eb47a 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -595,7 +595,7 @@ static int vmdk_open_vmfs_sparse(BlockDriverState
> *bs,
>  int ret;
>  uint32_t magic;
>  VMDK3Header header;
> -VmdkExtent *extent;
> +VmdkExtent *extent = NULL;
>  
>  ret = bdrv_pread(file, sizeof(magic), &header, sizeof(header));
>  if (ret < 0) {
> @@ -751,7 +751,7 @@ static int vmdk_open_se_sparse(BlockDriverState
> *bs,
>  int ret;
>  VMDKSESparseConstHeader const_header;
>  VMDKSESparseVolatileHeader volatile_header;
> -VmdkExtent *extent;
> +VmdkExtent *extent = NULL;
>  
>  ret = bdrv_apply_auto_read_only(bs,
>  "No write support for seSparse images available", errp);
> @@ -869,7 +869,7 @@ static int vmdk_open_vmdk4(BlockDriverState *bs,
>  uint32_t magic;
>  uint32_t l1_size, l1_entry_sectors;
>  VMDK4Header header;
> -VmdkExtent *extent;
> +VmdkExtent *extent = NULL;
>  BDRVVmdkState *s = bs->opaque;
>  int64_t l1_backup_offset = 0;
>  bool compressed;
> @@ -1088,7 +1088,7 @@ static int vmdk_parse_extents(const char *desc,
> BlockDriverState *bs,
>  BdrvChild *extent_file;
>  BdrvChildRole extent_role;
>  BDRVVmdkState *s = bs->opaque;
> -VmdkExtent *extent;
> +VmdkExtent *extent = NULL;
>  char extent_opt_prefix[32];
>  Error *local_err = NULL;
>  

Looks trivial, and correct.

Reviewed-by: Fam Zheng 





Re: [PATCH v2 0/6] block/nvme: Map doorbells pages write-only, remove magic from nvme_init

2020-09-22 Thread Fam Zheng
On Tue, 2020-09-22 at 10:38 +0200, Philippe Mathieu-Daudé wrote:
> Instead of mapping 8K of I/O + doorbells as RW during the whole
> execution, map the I/O registers temporarily at init, and map the
> doorbells WO.
> 
> Replace various magic values by slightly more explicit macros from
> "block/nvme.h".
> 
> Since v1: Fix uninitialized regs* (patchew)
> 
> Philippe Mathieu-Daudé (6):
>   util/vfio-helpers: Pass page protections to qemu_vfio_pci_map_bar()
>   block/nvme: Map doorbells pages write-only
>   block/nvme: Reduce I/O registers scope
>   block/nvme: Drop NVMeRegs structure, directly use NvmeBar
>   block/nvme: Use register definitions from 'block/nvme.h'
>   block/nvme: Replace magic value by SCALE_MS definition
> 
>  include/qemu/vfio-helpers.h |  2 +-
>  block/nvme.c| 73 +
>  util/vfio-helpers.c |  4 +-
>  3 files changed, 44 insertions(+), 35 deletions(-)
> 

Reviewed-by: Fam Zheng 




Re: [PATCH 2/6] block/nvme: Map doorbells pages write-only

2020-09-22 Thread Fam Zheng
On Tue, 2020-09-22 at 10:41 +0200, Philippe Mathieu-Daudé wrote:
> Hi Fam,
> 
> +Paolo?
> 
> On 9/22/20 10:18 AM, Fam Zheng wrote:
> > On Mon, 2020-09-21 at 18:29 +0200, Philippe Mathieu-Daudé wrote:
> > > Per the datasheet sections 3.1.13/3.1.14:
> > >   "The host should not read the doorbell registers."
> > > 
> > > As we don't need read access, map the doorbells with write-only
> > > permission. We keep a reference to this mapped address in the
> > > BDRVNVMeState structure.
> > 
> > Besides looking more correct in access mode, is there any side
> > effect
> > of WO mapping?
> 
> TBH I don't have enough knowledge to answer this question.
> I tested successfully on X86. I'm writing more tests.

The reason I'm asking is that x86 pages are either RO or RW, so I'd like
to see whether there's a practical reason behind this patch (I have no
idea about the effects on MTRR and/or IOMMU).

Fam
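
For context, the difference between the two mappings boils down to the prot
argument passed to mmap(2). A sketch (offsets assume the doorbells start at
4 KiB into BAR0, as in the patch; the fd is a VFIO region file descriptor):

#include <sys/mman.h>

static void *map_doorbells_wo(int vfio_fd, off_t bar0_offset)
{
    /* the registers are mapped PROT_READ | PROT_WRITE elsewhere; the
     * doorbells are mapped write-only, per NVMe spec 3.1.13/3.1.14 */
    return mmap(NULL, 4096, PROT_WRITE, MAP_SHARED,
                vfio_fd, bar0_offset + 4096);
}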

> 
> > 
> > Fam
> > 
> > > 
> > > Signed-off-by: Philippe Mathieu-Daudé 
> > > ---
> > >  block/nvme.c | 29 +++--
> > >  1 file changed, 19 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index 5a4dc6a722a..3c834da8fec 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c
> > > @@ -31,7 +31,7 @@
> > >  #define NVME_SQ_ENTRY_BYTES 64
> > >  #define NVME_CQ_ENTRY_BYTES 16
> > >  #define NVME_QUEUE_SIZE 128
> > > -#define NVME_BAR_SIZE 8192
> > > +#define NVME_DOORBELL_SIZE 4096
> > >  
> > >  /*
> > >   * We have to leave one slot empty as that is the full queue
> > > case
> > > where
> > > @@ -84,10 +84,6 @@ typedef struct {
> > >  /* Memory mapped registers */
> > >  typedef volatile struct {
> > >  NvmeBar ctrl;
> > > -struct {
> > > -uint32_t sq_tail;
> > > -uint32_t cq_head;
> > > -} doorbells[];
> > >  } NVMeRegs;
> > >  
> > >  #define INDEX_ADMIN 0
> > > @@ -103,6 +99,11 @@ struct BDRVNVMeState {
> > >  AioContext *aio_context;
> > >  QEMUVFIOState *vfio;
> > >  NVMeRegs *regs;
> > > +/* Memory mapped registers */
> > > +volatile struct {
> > > +uint32_t sq_tail;
> > > +uint32_t cq_head;
> > > +} *doorbells;
> > >  /* The submission/completion queue pairs.
> > >   * [0]: admin queue.
> > >   * [1..]: io queues.
> > > @@ -247,14 +248,14 @@ static NVMeQueuePair
> > > *nvme_create_queue_pair(BDRVNVMeState *s,
> > >  error_propagate(errp, local_err);
> > >  goto fail;
> > >  }
> > > -q->sq.doorbell = &s->regs->doorbells[idx * s->doorbell_scale].sq_tail;
> > > +q->sq.doorbell = &s->doorbells[idx * s->doorbell_scale].sq_tail;
> > >  
> > >  nvme_init_queue(s, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
> > >  if (local_err) {
> > >  error_propagate(errp, local_err);
> > >  goto fail;
> > >  }
> > > -q->cq.doorbell = &s->regs->doorbells[idx * s->doorbell_scale].cq_head;
> > > +q->cq.doorbell = &s->doorbells[idx * s->doorbell_scale].cq_head;
> > >  
> > >  return q;
> > >  fail:
> > > @@ -712,13 +713,12 @@ static int nvme_init(BlockDriverState *bs,
> > > const char *device, int namespace,
> > >  goto out;
> > >  }
> > >  
> > > -s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, NVME_BAR_SIZE,
> > > +s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, sizeof(NvmeBar),
> > >  PROT_READ | PROT_WRITE,
> > > errp);
> > >  if (!s->regs) {
> > >  ret = -EINVAL;
> > >  goto out;
> > >  }
> > > -
> > >  /* Perform initialize sequence as described in NVMe spec
> > > "7.6.1
> > >   * Initialization". */
> > >  
> > > @@ -748,6 +748,13 @@ static int nvme_init(BlockDriverState *bs,
> > > const
> > > char *device, int namespace,
> > >  }
> > >  }
> > >  
> > > +s->doorbells = qemu_vfio_pci_map_bar(s->vfio, 0, sizeof(NvmeBar),
> > > + NVME_DOORBELL_SIZE, PROT_WRITE, errp);
> > > +if (!s->doorbells) {
> > > +ret = -EINVAL;
> > > +goto out;
> > > +}
> > > +
> > >  /* Set up admin queue. */
> > >  s->queues = g_new(NVMeQueuePair *, 1);
> > >  s->queues[INDEX_ADMIN] = nvme_create_queue_pair(s,
> > > aio_context,
> > > 0,
> > > @@ -873,7 +880,9 @@ static void nvme_close(BlockDriverState *bs)
> > >  &s->irq_notifier[MSIX_SHARED_IRQ_IDX],
> > >  false, NULL, NULL);
> > >  event_notifier_cleanup(&s->irq_notifier[MSIX_SHARED_IRQ_IDX]);
> > > -qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, NVME_BAR_SIZE);
> > > +qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->doorbells,
> > > +sizeof(NvmeBar), NVME_DOORBELL_SIZE);
> > > +qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0, sizeof(NvmeBar));
> > >  qemu_vfio_close(s->vfio);
> > >  
> > >  g_free(s->device);
> 
> 




Re: [PATCH 2/6] block/nvme: Map doorbells pages write-only

2020-09-22 Thread Fam Zheng
On Mon, 2020-09-21 at 18:29 +0200, Philippe Mathieu-Daudé wrote:
> Per the datasheet sections 3.1.13/3.1.14:
>   "The host should not read the doorbell registers."
> 
> As we don't need read access, map the doorbells with write-only
> permission. We keep a reference to this mapped address in the
> BDRVNVMeState structure.

Besides looking more correct in access mode, is there any side effect
of WO mapping?

Fam

> 
> Signed-off-by: Philippe Mathieu-Daudé 
> ---
>  block/nvme.c | 29 +++--
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 5a4dc6a722a..3c834da8fec 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -31,7 +31,7 @@
>  #define NVME_SQ_ENTRY_BYTES 64
>  #define NVME_CQ_ENTRY_BYTES 16
>  #define NVME_QUEUE_SIZE 128
> -#define NVME_BAR_SIZE 8192
> +#define NVME_DOORBELL_SIZE 4096
>  
>  /*
>   * We have to leave one slot empty as that is the full queue case
> where
> @@ -84,10 +84,6 @@ typedef struct {
>  /* Memory mapped registers */
>  typedef volatile struct {
>  NvmeBar ctrl;
> -struct {
> -uint32_t sq_tail;
> -uint32_t cq_head;
> -} doorbells[];
>  } NVMeRegs;
>  
>  #define INDEX_ADMIN 0
> @@ -103,6 +99,11 @@ struct BDRVNVMeState {
>  AioContext *aio_context;
>  QEMUVFIOState *vfio;
>  NVMeRegs *regs;
> +/* Memory mapped registers */
> +volatile struct {
> +uint32_t sq_tail;
> +uint32_t cq_head;
> +} *doorbells;
>  /* The submission/completion queue pairs.
>   * [0]: admin queue.
>   * [1..]: io queues.
> @@ -247,14 +248,14 @@ static NVMeQueuePair
> *nvme_create_queue_pair(BDRVNVMeState *s,
>  error_propagate(errp, local_err);
>  goto fail;
>  }
> -q->sq.doorbell = &s->regs->doorbells[idx * s->doorbell_scale].sq_tail;
> +q->sq.doorbell = &s->doorbells[idx * s->doorbell_scale].sq_tail;
>  
>  nvme_init_queue(s, &q->cq, size, NVME_CQ_ENTRY_BYTES, &local_err);
>  if (local_err) {
>  error_propagate(errp, local_err);
>  goto fail;
>  }
> -q->cq.doorbell = &s->regs->doorbells[idx * s->doorbell_scale].cq_head;
> +q->cq.doorbell = &s->doorbells[idx * s->doorbell_scale].cq_head;
>  
>  return q;
>  fail:
> @@ -712,13 +713,12 @@ static int nvme_init(BlockDriverState *bs,
> const char *device, int namespace,
>  goto out;
>  }
>  
> -s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, NVME_BAR_SIZE,
> +s->regs = qemu_vfio_pci_map_bar(s->vfio, 0, 0, sizeof(NvmeBar),
>  PROT_READ | PROT_WRITE, errp);
>  if (!s->regs) {
>  ret = -EINVAL;
>  goto out;
>  }
> -
>  /* Perform initialize sequence as described in NVMe spec "7.6.1
>   * Initialization". */
>  
> @@ -748,6 +748,13 @@ static int nvme_init(BlockDriverState *bs, const
> char *device, int namespace,
>  }
>  }
>  
> +s->doorbells = qemu_vfio_pci_map_bar(s->vfio, 0,
> sizeof(NvmeBar),
> + NVME_DOORBELL_SIZE,
> PROT_WRITE, errp);
> +if (!s->doorbells) {
> +ret = -EINVAL;
> +goto out;
> +}
> +
>  /* Set up admin queue. */
>  s->queues = g_new(NVMeQueuePair *, 1);
>  s->queues[INDEX_ADMIN] = nvme_create_queue_pair(s, aio_context,
> 0,
> @@ -873,7 +880,9 @@ static void nvme_close(BlockDriverState *bs)
>  aio_set_event_notifier(bdrv_get_aio_context(bs),
> &s->irq_notifier[MSIX_SHARED_IRQ_IDX],
> false, NULL, NULL);
>  event_notifier_cleanup(&s->irq_notifier[MSIX_SHARED_IRQ_IDX]);
> -qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0,
> NVME_BAR_SIZE);
> +qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->doorbells,
> +sizeof(NvmeBar), NVME_DOORBELL_SIZE);
> +qemu_vfio_pci_unmap_bar(s->vfio, 0, (void *)s->regs, 0,
> sizeof(NvmeBar));
>  qemu_vfio_close(s->vfio);
>  
>  g_free(s->device);




Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-09-21 Thread Fam Zheng
On 2020-09-19 10:22, Zhenyu Ye wrote:
> On 2020/9/18 22:06, Fam Zheng wrote:
> > 
> > I can see how blocking in a slow io_submit can cause trouble for main
> > thread. I think one way to fix it (until it's made truly async in new
> > kernels) is moving the io_submit call to thread pool, and wrapped in a
> > coroutine, perhaps.
> >
> 
> I'm not sure if any other operation will block the main thread, other
> than io_submit().

Then that's a problem with io_submit which should be fixed. Or more
precisely, that is a long-held lock that we should avoid in QEMU's event
loops.

> 
> > I'm not sure qmp timeout is a complete solution because we would still
> > suffer from a blocked state for a period, in this exact situation before
> > the timeout.
> 
> Anyway, the qmp timeout may be the last measure to prevent the VM
> soft lockup. 

Maybe, but I don't think baking such a workaround into the QMP API is a
good idea. No QMP command should be synchronously long running, so
having a timeout parameter is just a wrong design.

Thanks,

Fam



Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-09-18 Thread Fam Zheng
On 2020-09-18 19:23, Zhenyu Ye wrote:
>   Thread 5 (LWP 4802):
>   #0  0x83086b54 in syscall () at /lib64/libc.so.6
>   #1  0x834598b8 in io_submit () at /lib64/libaio.so.1
>   #2  0xe851e89c in ioq_submit (s=0xfffd3c001bb0) at 
> ../block/linux-aio.c:299
>   #3  0xe851eb50 in laio_io_unplug (bs=0xef0f2340, 
> s=0xfffd3c001bb0)
>   at ../block/linux-aio.c:344
>   #4  0xe8559f1c in raw_aio_unplug (bs=0xef0f2340) at 
> ../block/file-posix.c:2063
>   #5  0xe8538344 in bdrv_io_unplug (bs=0xef0f2340) at 
> ../block/io.c:3135
>   #6  0xe8538360 in bdrv_io_unplug (bs=0xef0eb020) at 
> ../block/io.c:3140
>   #7  0xe8496104 in blk_io_unplug (blk=0xef0e8f20)
>   at ../block/block-backend.c:2147
>   #8  0xe830e1a4 in virtio_blk_handle_vq (s=0xf0374280, 
> vq=0x700fc1d8)
>   at ../hw/block/virtio-blk.c:796
>   #9  0xe82e6b68 in virtio_blk_data_plane_handle_output
>   (vdev=0xf0374280, vq=0x700fc1d8) at 
> ../hw/block/dataplane/virtio-blk.c:165
>   #10 0xe83878fc in virtio_queue_notify_aio_vq (vq=0x700fc1d8)
>   at ../hw/virtio/virtio.c:2325
>   #11 0xe838ab50 in virtio_queue_host_notifier_aio_poll 
> (opaque=0x700fc250)
>   at ../hw/virtio/virtio.c:3545
>   #12 0xe85fab3c in run_poll_handlers_once
>   (ctx=0xef0a87b0, now=77604310618960, timeout=0x73ffdf78)
>   at ../util/aio-posix.c:398
>   #13 0xe85fae5c in run_poll_handlers
>   (ctx=0xef0a87b0, max_ns=4000, timeout=0x73ffdf78) at 
> ../util/aio-posix.c:492
>   #14 0xe85fb078 in try_poll_mode (ctx=0xef0a87b0, 
> timeout=0x73ffdf78)
>   at ../util/aio-posix.c:535
>   #15 0xe85fb180 in aio_poll (ctx=0xef0a87b0, blocking=true)
>   at ../util/aio-posix.c:571
>   #16 0xe8027004 in iothread_run (opaque=0xeee79a00) at 
> ../iothread.c:73
>   #17 0xe85f269c in qemu_thread_start (args=0xef0a8d10)
>   at ../util/qemu-thread-posix.c:521
>   #18 0x831428bc in  () at /lib64/libpthread.so.0
>   #19 0x8308aa1c in  () at /lib64/libc.so.6

I can see how blocking in a slow io_submit can cause trouble for main
thread. I think one way to fix it (until it's made truly async in new
kernels) is moving the io_submit call to thread pool, and wrapped in a
coroutine, perhaps.
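
To make that concrete, here is a rough sketch of what I mean. It is
untested, "SubmitData", "laio_submit_worker" and "ioq_submit_co" are
made-up names, and error handling plus partial-batch resubmission are
left out:

    typedef struct {
        LinuxAioState *s;
        struct iocb **iocbs;
        int nr;
    } SubmitData;

    /* Runs in a worker thread: it may sleep inside io_submit() without
     * stalling the event loop or holding up the AioContext lock. */
    static int laio_submit_worker(void *opaque)
    {
        SubmitData *data = opaque;

        return io_submit(data->s->ctx, data->nr, data->iocbs);
    }

    static int coroutine_fn ioq_submit_co(LinuxAioState *s,
                                          struct iocb **iocbs, int nr)
    {
        SubmitData data = { .s = s, .iocbs = iocbs, .nr = nr };
        ThreadPool *pool = aio_get_thread_pool(s->aio_context);

        /* Yields; the coroutine is re-entered when the worker finishes,
         * so the iothread stays responsive in the meantime. */
        return thread_pool_submit_co(pool, laio_submit_worker, &data);
    }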

I'm not sure qmp timeout is a complete solution because we would still
suffer from a blocked state for a period, in this exact situation before
the timeout.

Fam



Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-09-17 Thread Fam Zheng
On 2020-09-17 16:44, Stefan Hajnoczi wrote:
> On Thu, Sep 17, 2020 at 03:36:57PM +0800, Zhenyu Ye wrote:
> > When the hang occurs, the QEMU is blocked at:
> > 
> > #0  0x95762b64 in ?? () from target:/usr/lib64/libpthread.so.0
> > #1  0x9575bd88 in pthread_mutex_lock () from 
> > target:/usr/lib64/libpthread.so.0
> > #2  0xbb1f5948 in qemu_mutex_lock_impl (mutex=0xcc8e1860,
> > file=0xbb4e1bd0 
> > "/Images/eillon/CODE/5-opensource/qemu/util/async.c", line=605)
> > #3  0xbb20acd4 in aio_context_acquire (ctx=0xcc8e1800)
> > #4  0xbb105e90 in bdrv_query_image_info (bs=0xcc934620,
> > p_info=0xccc41e18, errp=0xca669118)
> > #5  0xbb105968 in bdrv_block_device_info (blk=0xcdca19f0, 
> > bs=0xcc934620,
> > flat=false, errp=0xca6692b8)
> > #6  0xbb1063dc in bdrv_query_info (blk=0xcdca19f0, 
> > p_info=0xcd29c9a8,
> > errp=0xca6692b8)
> > #7  0xbb106c14 in qmp_query_block (errp=0x0)
> > #8  0xbacb8e6c in hmp_info_block (mon=0xca6693d0, 
> > qdict=0xcd089790)
> 
> Great, this shows that the main loop thread is stuck waiting for the
> AioContext lock.
> 
> Please post backtraces from all QEMU threads ((gdb) thread apply all bt)
> so we can figure out which thread is holding up the main loop.

I think that is reflected in the perf backtrace posted by Zhenyu already:

And in the host, the information of sys_enter_io_submit() is:

Samples: 3K of event 'syscalls:sys_enter_io_submit', Event count
(approx.): 3150
   Children  Self  Trace output
   -   66.70%    66.70%  ctx_id: 0x9c044000,
   nr: 0x0001, iocbpp: 0x9f7fad28
   0xae7f871c
   0xae8a27c4
   qemu_thread_start
   iothread_run
   aio_poll
   aio_dispatch_ready_handlers
   aio_dispatch_handler
   virtio_queue_host_notifier_aio_read
   virtio_queue_notify_aio_vq
   virtio_blk_data_plane_handle_output
   virtio_blk_handle_vq
   blk_io_unplug
   bdrv_io_unplug
   bdrv_io_unplug
   raw_aio_unplug
   laio_io_unplug
   syscall

So the iothread is blocked by a slow io_submit holding the AioContext
lock.

It would be interesting to know what in kernel is blocking io_submit
from returning.

Fam



Re: [PATCH v1 0/2] Add timeout mechanism to qmp actions

2020-09-17 Thread Fam Zheng
On 2020-09-17 15:36, Zhenyu Ye wrote:
> Hi Stefan,
> 
> On 2020/9/14 21:27, Stefan Hajnoczi wrote:
> >>
> >> Theoretically, everything running in an iothread is asynchronous. However,
> >> some 'asynchronous' actions are not non-blocking entirely, such as
> >> io_submit().  This will block while the iodepth is too big and I/O pressure
> >> is too high.  If we do some qmp actions, such as 'info block', at this 
> >> time,
> >> may cause vm soft lockup.  This series can make these qmp actions safer.
> >>
> >> I constructed the scene as follow:
> >> 1. create a vm with 4 disks, using iothread.
> >> 2. add press to the CPU on the host.  In my scene, the CPU usage exceeds 
> >> 95%.
> >> 3. add press to the 4 disks in the vm at the same time.  I used the fio and
> >> some parameters are:
> >>
> >> fio -rw=randrw -bs=1M -size=1G -iodepth=512 -ioengine=libaio -numjobs=4
> >>
> >> 4. do block query actions, for example, by virsh:
> >>
> >>virsh qemu-monitor-command [vm name] --hmp info block
> >>
> >> Then the vm will soft lockup, the calltrace is:
> >>
> >> [  192.311393] watchdog: BUG: soft lockup - CPU#1 stuck for 42s! 
> >> [kworker/1:1:33]
> > 
> > Hi,
> > Sorry I haven't had time to investigate this myself yet.
> > 
> > Do you also have a QEMU backtrace when the hang occurs?
> > 
> > Let's find out if QEMU is stuck in the io_submit(2) syscall or whether
> > there's an issue in QEMU itself that causes the softlockup (for example,
> > aio_poll() with the global mutex held).
> 
> Sorry for the delay.
> 
> I traced the io_submit() within the guest. The maximum execution time exceeds 
> 16s:
> 
>   guest# perf trace -p `pidof -s fio` --duration 1
>   14.808 (3843.082 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
> 3861.336 (11470.608 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>15341.998 (479.283 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>15831.380 (3704.997 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>19547.869 (3412.839 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>22966.953 (1080.854 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>24062.689 (6939.813 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>31019.219 (532.147 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>31556.610 (3469.920 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>35038.885 (9007.175 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>44053.578 (16006.405 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>60068.944 (3068.748 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>63138.474 (13012.499 ms): io_getevents(ctx_id: 281472924459008, 
> min_nr: 511, nr: 511, events: 0x19a83150) = 511
>76191.073 (2872.657 ms): io_submit(ctx_id: 281472924459008, nr: 1, 
> iocbpp: 0x19a87160  ) = 1
>   ...
> 
> And in the host, the information of sys_enter_io_submit() is:
> 
>   Samples: 3K of event 'syscalls:sys_enter_io_submit', Event count 
> (approx.): 3150
>   Children  Self  Trace output
>   -   66.70%    66.70%  ctx_id: 0x9c044000, nr: 0x0001, iocbpp: 
> 0x9f7fad28
>   0xae7f871c
>   0xae8a27c4
>   qemu_thread_start
>   iothread_run
>   aio_poll
>   aio_dispatch_ready_handlers
>   aio_dispatch_handler
>   virtio_queue_host_notifier_aio_read
>   virtio_queue_notify_aio_vq
>   virtio_blk_data_plane_handle_output
>   virtio_blk_handle_vq
>   blk_io_unplug
>   bdrv_io_unplug
>   bdrv_io_unplug
>   raw_aio_unplug
>   laio_io_unplug
>   syscall
> 
> 
> When the hang occurs, the QEMU is blocked at:
> 
>   #0  0x95762b64 in ?? () from target:/usr/lib64/libpthread.so.0
>   #1  0x9575bd88 in pthread_mutex_lock () from 
> target:/usr/lib64/libpthread.so.0
>   #2  0xbb1f5948 in qemu_mutex_lock_impl (mutex=0xcc8e1860,
>   file=0xbb4e1bd0 
> "/Images/eillon/CODE/5-opensource/qemu/util/async.c", line=605)
>   #3  0xbb20acd4 in aio_context_acquire (ctx=0xcc8e1800)
>   #4  0xbb105e90 in bdrv_query_image_info (bs=0xcc934620,
>   p_info=0xccc41e18, errp=0xca669118)
>   #5  0xbb105968 in bdrv_block_device_info (blk=0xcdca19f0, 
> bs=0xcc934620,
>   flat=false, errp=0xca6692b8)
>   #6  0xbb1063dc in bdrv_query_info (blk=0xcdca19f0, 
> p_info=0xcd29c9a8,
>   errp=0xca6692b8)
>   #7  0xbb106c14 in qmp_query_block (errp=0x0)
>   

Re: [RFC PATCH v5 0/4] util/vfio-helpers: Add support for multiple IRQs

2020-09-09 Thread Fam Zheng
On 2020-09-08 20:03, Philippe Mathieu-Daudé wrote:
> This series intends to setup the VFIO helper to allow
> binding notifiers on different IRQs.
> 
> For the NVMe use case, we only care about MSIX interrupts.
> To not disrupt other users, introduce the qemu_vfio_pci_init_msix_irqs
> function to initialize multiple MSIX IRQs and attach eventfd to
> them.
> 
> Since RFC v4:
> - addressed Alex review comment:
>   check ioctl(VFIO_DEVICE_SET_IRQS) return value

Reviewed-by: Fam Zheng 




Re: [RFC PATCH v5 4/4] block/nvme: Use qemu_vfio_pci_init_msix_irqs() to initialize our IRQ

2020-09-09 Thread Fam Zheng
On 2020-09-08 20:03, Philippe Mathieu-Daudé wrote:
> Instead of initializing one MSIX IRQ with the generic
> qemu_vfio_pci_init_irq() function, use the MSIX specific one which
> ill allow us to use multiple IRQs. For now we provide an array of

s/ill/will/

> a single IRQ.
> 
> Signed-off-by: Philippe Mathieu-Daudé 

Fam




Re: [PATCH] MAINTAINERS: add Stefan Hajnoczi as block/nvme.c maintainer

2020-09-08 Thread Fam Zheng
On 2020-09-07 12:16, Stefan Hajnoczi wrote:
> Development of the userspace NVMe block driver picked up again recently.
> After talking with Fam I am stepping up as block/nvme.c maintainer.
> Patches will be merged through my 'block' tree.
> 
> Cc: Kevin Wolf 
> Cc: Klaus Jensen 
> Cc: Fam Zheng 
> Signed-off-by: Stefan Hajnoczi 

Acked-by: Fam Zheng 



Re: [PATCH 0/3] block/nvme: Use NvmeBar structure from "block/nvme.h"

2020-09-04 Thread Fam Zheng
On 2020-09-04 14:41, Philippe Mathieu-Daudé wrote:
> Cleanups in the NVMeRegs structure:
> - Use the already existing NvmeBar structure from "block/nvme.h"
> - Pair doorbell registers
> 
> Based-on: <20200903122803.405265-1-phi...@redhat.com>
> 
> Philippe Mathieu-Daudé (3):
>   block/nvme: Group controller registers in NVMeRegs structure
>   block/nvme: Use generic NvmeBar structure
>   block/nvme: Pair doorbell registers
> 
>  block/nvme.c | 43 +++
>  1 file changed, 15 insertions(+), 28 deletions(-)
> 
> -- 
> 2.26.2
> 
> 

Reviewed-by: Fam Zheng 




Re: [PATCH v2 00/20] nvme: support NVMe v1.3d, SGLs and multiple namespaces

2019-10-16 Thread Fam Zheng
On Tue, 10/15 12:38, Klaus Jensen wrote:
> Hi,
> 
> (Quick note to Fam): most of this series is irrelevant to you as the
> maintainer of the nvme block driver, but patch "nvme: add support for
> scatter gather lists" touches block/nvme.c due to changes in the shared
> NvmeCmd struct.

Yeah, that part looks sane to me. For the block/nvme.c bit:

Acked-by: Fam Zheng 




Re: [Qemu-block] [PATCH v5 11/12] qemu-io: adds support for io_uring

2019-06-11 Thread Fam Zheng
On Mon, 06/10 19:19, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta 
> ---
>  qemu-io.c | 13 +
>  1 file changed, 13 insertions(+)
> 
> diff --git a/qemu-io.c b/qemu-io.c
> index 8d5d5911cb..54b82151c4 100644
> --- a/qemu-io.c
> +++ b/qemu-io.c
> @@ -129,6 +129,7 @@ static void open_help(void)
>  " -n, -- disable host cache, short for -t none\n"
>  " -U, -- force shared permissions\n"
>  " -k, -- use kernel AIO implementation (on Linux only)\n"
> +" -i  -- use kernel io_uring (Linux 5.1+)\n"
>  " -t, -- use the given cache mode for the image\n"
>  " -d, -- use the given discard mode for the image\n"
>  " -o, -- options to be given to the block driver"
> @@ -188,6 +189,11 @@ static int open_f(BlockBackend *blk, int argc, char 
> **argv)
>  case 'k':
>  flags |= BDRV_O_NATIVE_AIO;
>  break;
> +#ifdef CONFIG_LINUX_IO_URING
> +case 'i':

Maybe printing an error message saying the feature is not compiled in is
slightly better than just saying the argument is unknown?

Fam

> +flags |= BDRV_O_IO_URING;
> +break;
> +#endif
>  case 't':
>  if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) < 0) {
>  error_report("Invalid cache option: %s", optarg);
> @@ -290,6 +296,7 @@ static void usage(const char *name)
>  "  -C, --copy-on-read   enable copy-on-read\n"
>  "  -m, --misalign   misalign allocations for O_DIRECT\n"
>  "  -k, --native-aio use kernel AIO implementation (on Linux only)\n"
> +"  -i  --io_uring   use kernel io_uring (Linux 5.1+)\n"
>  "  -t, --cache=MODE use the given cache mode for the image\n"
>  "  -d, --discard=MODE   use the given discard mode for the image\n"
>  "  -T, --trace [[enable=]][,events=][,file=]\n"
> @@ -499,6 +506,7 @@ int main(int argc, char **argv)
>  { "copy-on-read", no_argument, NULL, 'C' },
>  { "misalign", no_argument, NULL, 'm' },
>  { "native-aio", no_argument, NULL, 'k' },
> +{ "io_uring", no_argument, NULL, 'i' },
>  { "discard", required_argument, NULL, 'd' },
>  { "cache", required_argument, NULL, 't' },
>  { "trace", required_argument, NULL, 'T' },
> @@ -566,6 +574,11 @@ int main(int argc, char **argv)
>  case 'k':
>  flags |= BDRV_O_NATIVE_AIO;
>  break;
> +#ifdef CONFIG_LINUX_IO_URING
> +case 'i':
> +flags |= BDRV_O_IO_URING;
> +break;
> +#endif
>  case 't':
>  if (bdrv_parse_cache_mode(optarg, &flags, &writethrough) < 0) {
>  error_report("Invalid cache option: %s", optarg);
> -- 
> 2.17.1
> 
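
Something along these lines is what I have in mind for the
non-CONFIG_LINUX_IO_URING build (an illustrative sketch only; exact
wording and error path are up to you):

    #ifdef CONFIG_LINUX_IO_URING
            case 'i':
                flags |= BDRV_O_IO_URING;
                break;
    #else
            case 'i':
                /* Explain why, instead of rejecting 'i' as unknown. */
                error_report("io_uring support is not compiled in");
                return -EINVAL;
    #endif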




Re: [Qemu-block] [PATCH v5 09/12] block: add trace events for io_uring

2019-06-11 Thread Fam Zheng
On Mon, 06/10 19:19, Aarushi Mehta wrote:
> Signed-off-by: Aarushi Mehta 
> ---
>  block/io_uring.c   | 14 --
>  block/trace-events |  8 
>  2 files changed, 20 insertions(+), 2 deletions(-)
> 
> diff --git a/block/io_uring.c b/block/io_uring.c
> index f327c7ef96..47e027364a 100644
> --- a/block/io_uring.c
> +++ b/block/io_uring.c
> @@ -17,6 +17,7 @@
>  #include "block/raw-aio.h"
>  #include "qemu/coroutine.h"
>  #include "qapi/error.h"
> +#include "trace.h"
>  
>  #define MAX_EVENTS 128
>  
> @@ -191,12 +192,15 @@ static int ioq_submit(LuringState *s)
>  
>  void luring_io_plug(BlockDriverState *bs, LuringState *s)
>  {
> +trace_luring_io_plug();
>  s->io_q.plugged++;
>  }
>  
>  void luring_io_unplug(BlockDriverState *bs, LuringState *s)
>  {
>  assert(s->io_q.plugged);
> +trace_luring_io_unplug(s->io_q.blocked, s->io_q.plugged,
> + s->io_q.in_queue, s->io_q.in_flight);
>  if (--s->io_q.plugged == 0 &&
>  !s->io_q.blocked && s->io_q.in_queue > 0) {
>  ioq_submit(s);
> @@ -217,6 +221,7 @@ void luring_io_unplug(BlockDriverState *bs, LuringState 
> *s)
>  static int luring_do_submit(int fd, LuringAIOCB *luringcb, LuringState *s,
>  uint64_t offset, int type)
>  {
> +int ret;
> +struct io_uring_sqe *sqes = io_uring_get_sqe(&s->ring);
>  if (!sqes) {
> +sqes = &luringcb->sqeq;
> @@ -242,11 +247,14 @@ static int luring_do_submit(int fd, LuringAIOCB 
> *luringcb, LuringState *s,
>  }
>  io_uring_sqe_set_data(sqes, luringcb);
>  s->io_q.in_queue++;
> -
> +trace_luring_do_submit(s->io_q.blocked, s->io_q.plugged,
> +   s->io_q.in_queue, s->io_q.in_flight);
>  if (!s->io_q.blocked &&
>  (!s->io_q.plugged ||
>   s->io_q.in_flight + s->io_q.in_queue >= MAX_EVENTS)) {
> -return ioq_submit(s);
> +ret = ioq_submit(s);
> +trace_luring_do_submit_done(ret);
> +return ret;
>  }
>  return 0;
>  }
> @@ -294,6 +302,7 @@ LuringState *luring_init(Error **errp)
>  int rc;
>  LuringState *s;
>  s = g_malloc0(sizeof(*s));
> +trace_luring_init_state((void *)s, sizeof(*s));

The pointer type cast is unnecessary.

> +struct io_uring *ring = &s->ring;
>  rc =  io_uring_queue_init(MAX_EVENTS, ring, 0);
>  if (rc < 0) {
> @@ -311,4 +320,5 @@ void luring_cleanup(LuringState *s)
>  {
> +io_uring_queue_exit(&s->ring);
>  g_free(s);
> +trace_luring_cleanup_state();

Pass @s?

>  }
> diff --git a/block/trace-events b/block/trace-events
> index eab51497fc..c4564dcd96 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -60,6 +60,14 @@ qmp_block_stream(void *bs, void *job) "bs %p job %p"
>  file_paio_submit(void *acb, void *opaque, int64_t offset, int count, int 
> type) "acb %p opaque %p offset %"PRId64" count %d type %d"
>  file_copy_file_range(void *bs, int src, int64_t src_off, int dst, int64_t 
> dst_off, int64_t bytes, int flags, int64_t ret) "bs %p src_fd %d offset 
> %"PRIu64" dst_fd %d offset %"PRIu64" bytes %"PRIu64" flags %d ret %"PRId64
>  
> +#io_uring.c
> +luring_init_state(void *s, size_t size) "s %p size %zu"
> +luring_cleanup_state(void) "s freed"
> +disable luring_io_plug(void) "plug"
> +disable luring_io_unplug(int blocked, int plugged, int queued, int inflight) 
> "blocked %d plugged %d queued %d inflight %d"
> +disable luring_do_submit(int blocked, int plugged, int queued, int inflight) 
> "blocked %d plugged %d queued %d inflight %d"
> +disable luring_do_submit_done(int ret) "submitted to kernel %d"
> +
>  # qcow2.c
>  qcow2_writev_start_req(void *co, int64_t offset, int bytes) "co %p offset 
> 0x%" PRIx64 " bytes %d"
>  qcow2_writev_done_req(void *co, int ret) "co %p ret %d"
> -- 
> 2.17.1
> 

Fam




Re: [Qemu-block] [PATCH v5 04/12] block/io_uring: implements interfaces for io_uring

2019-06-11 Thread Fam Zheng
On Mon, 06/10 19:18, Aarushi Mehta wrote:
> Aborts when sqe fails to be set as sqes cannot be returned to the ring.
> 
> Signed-off-by: Aarushi Mehta 
> ---
>  MAINTAINERS |   7 +
>  block/Makefile.objs |   3 +
>  block/io_uring.c| 314 
>  include/block/aio.h |  16 +-
>  include/block/raw-aio.h |  12 ++
>  5 files changed, 351 insertions(+), 1 deletion(-)
>  create mode 100644 block/io_uring.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 7be1225415..49f896796e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -2516,6 +2516,13 @@ F: block/file-posix.c
>  F: block/file-win32.c
>  F: block/win32-aio.c
>  
> +Linux io_uring
> +M: Aarushi Mehta 
> +R: Stefan Hajnoczi 
> +L: qemu-block@nongnu.org
> +S: Maintained
> +F: block/io_uring.c
> +
>  qcow2
>  M: Kevin Wolf 
>  M: Max Reitz 
> diff --git a/block/Makefile.objs b/block/Makefile.objs
> index ae11605c9f..8fde7a23a5 100644
> --- a/block/Makefile.objs
> +++ b/block/Makefile.objs
> @@ -18,6 +18,7 @@ block-obj-y += block-backend.o snapshot.o qapi.o
>  block-obj-$(CONFIG_WIN32) += file-win32.o win32-aio.o
>  block-obj-$(CONFIG_POSIX) += file-posix.o
>  block-obj-$(CONFIG_LINUX_AIO) += linux-aio.o
> +block-obj-$(CONFIG_LINUX_IO_URING) += io_uring.o
>  block-obj-y += null.o mirror.o commit.o io.o create.o
>  block-obj-y += throttle-groups.o
>  block-obj-$(CONFIG_LINUX) += nvme.o
> @@ -61,5 +62,7 @@ block-obj-$(if $(CONFIG_LZFSE),m,n) += dmg-lzfse.o
>  dmg-lzfse.o-libs   := $(LZFSE_LIBS)
>  qcow.o-libs:= -lz
>  linux-aio.o-libs   := -laio
> +io_uring.o-cflags  := $(LINUX_IO_URING_CFLAGS)
> +io_uring.o-libs:= $(LINUX_IO_URING_LIBS)
>  parallels.o-cflags := $(LIBXML2_CFLAGS)
>  parallels.o-libs   := $(LIBXML2_LIBS)
> diff --git a/block/io_uring.c b/block/io_uring.c
> new file mode 100644
> index 00..f327c7ef96
> --- /dev/null
> +++ b/block/io_uring.c
> @@ -0,0 +1,314 @@
> +/*
> + * Linux io_uring support.
> + *
> + * Copyright (C) 2009 IBM, Corp.
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Copyright (C) 2019 Aarushi Mehta
> + *
> + * This work is licensed under the terms of the GNU GPL, version 2 or later.
> + * See the COPYING file in the top-level directory.
> + */
> +#include "qemu/osdep.h"
> +#include 
> +#include "qemu-common.h"
> +#include "block/aio.h"
> +#include "qemu/queue.h"
> +#include "block/block.h"
> +#include "block/raw-aio.h"
> +#include "qemu/coroutine.h"
> +#include "qapi/error.h"
> +
> +#define MAX_EVENTS 128
> +
> +typedef struct LuringAIOCB {

I have to say it is a good name.

> +Coroutine *co;
> +struct io_uring_sqe sqeq;
> +ssize_t ret;
> +QEMUIOVector *qiov;
> +bool is_read;
> +QSIMPLEQ_ENTRY(LuringAIOCB) next;
> +} LuringAIOCB;
> +
> +typedef struct LuringQueue {
> +int plugged;
> +unsigned int in_queue;
> +unsigned int in_flight;
> +bool blocked;
> +QSIMPLEQ_HEAD(, LuringAIOCB) sq_overflow;
> +} LuringQueue;
> +
> +typedef struct LuringState {
> +AioContext *aio_context;
> +
> +struct io_uring ring;
> +
> +/* io queue for submit at batch.  Protected by AioContext lock. */
> +LuringQueue io_q;
> +
> +/* I/O completion processing.  Only runs in I/O thread.  */
> +QEMUBH *completion_bh;
> +} LuringState;
> +
> +/**
> + * ioq_submit:
> + * @s: AIO state
> + *
> + * Queues pending sqes and submits them
> + *
> + */
> +static int ioq_submit(LuringState *s);
> +
> +/**
> + * qemu_luring_process_completions:
> + * @s: AIO state
> + *
> + * Fetches completed I/O requests, consumes cqes and invokes their callbacks.
> + *
> + */
> +static void qemu_luring_process_completions(LuringState *s)
> +{
> +struct io_uring_cqe *cqes;
> +int ret;
> +
> +/*
> + * Request completion callbacks can run the nested event loop.
> + * Schedule ourselves so the nested event loop will "see" remaining
> + * completed requests and process them.  Without this, completion
> + * callbacks that wait for other requests using a nested event loop
> + * would hang forever.
> + */
> +qemu_bh_schedule(s->completion_bh);
> +
> +while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
> +if (!cqes) {
> +break;
> +}
> +LuringAIOCB *luringcb = io_uring_cqe_get_data(cqes);
> +ret = cqes->res;

Declarations should be in the beginning of the code block.

> +
> +if (ret == luringcb->qiov->size) {
> +ret = 0;
> +} else if (ret >= 0) {
> +/* Short Read/Write */
> +if (luringcb->is_read) {
> +/* Read, pad with zeroes */
> +qemu_iovec_memset(luringcb->qiov, ret, 0,
> +luringcb->qiov->size - ret);

Should you check that (ret < luringcb->qiov->size) since ret is from external?

Either way, ret should be assigned 0, I think.

> +} else {
> +ret = -ENOSPC;;

s/;;/;/

> +}
> +}
> +luringcb->ret = ret;
> +
> +
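
For what it's worth, folding the comments above together, the
completion loop could be shaped like this (a sketch, not a replacement
hunk; marking the cqe seen and waking the coroutine are elided):

    static void qemu_luring_process_completions(LuringState *s)
    {
        struct io_uring_cqe *cqes;
        LuringAIOCB *luringcb;
        ssize_t ret;

        qemu_bh_schedule(s->completion_bh);

        while (io_uring_peek_cqe(&s->ring, &cqes) == 0) {
            if (!cqes) {
                break;
            }
            luringcb = io_uring_cqe_get_data(cqes);
            ret = cqes->res;

            if (ret == luringcb->qiov->size) {
                ret = 0;
            } else if (ret >= 0) {
                if (luringcb->is_read) {
                    /* Bound-check the short read before padding. */
                    if (ret < luringcb->qiov->size) {
                        qemu_iovec_memset(luringcb->qiov, ret, 0,
                                          luringcb->qiov->size - ret);
                    }
                    ret = 0;
                } else {
                    ret = -ENOSPC;  /* short write */
                }
            }
            luringcb->ret = ret;
            /* ... mark the cqe seen and wake the waiting coroutine ... */
        }
    }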

Re: [Qemu-block] [PATCH v5 02/12] qapi/block-core: add option for io_uring

2019-06-11 Thread Fam Zheng
On Mon, 06/10 19:18, Aarushi Mehta wrote:
> Option only enumerates for hosts that support it.
> 
> Signed-off-by: Aarushi Mehta 
> Reviewed-by: Stefan Hajnoczi 
> ---
>  qapi/block-core.json | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index 1defcde048..db7eedd058 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -2792,11 +2792,13 @@
>  #
>  # @threads: Use qemu's thread pool
>  # @native:  Use native AIO backend (only Linux and Windows)
> +# @io_uring: Use linux io_uring (since 4.1)
>  #
>  # Since: 2.9
>  ##
>  { 'enum': 'BlockdevAioOptions',
> -  'data': [ 'threads', 'native' ] }
> +  'data': [ 'threads', 'native',
> +{ 'name': 'io_uring', 'if': 'defined(CONFIG_LINUX_IO_URING)' } ] 
> }

Question: 'native' has a dependency on libaio but it doesn't have the
condition.  Is the inconsistency intended?

>  
>  ##
>  # @BlockdevCacheOptions:
> -- 
> 2.17.1
> 




Re: [Qemu-block] [PATCH v2 5/5] block/nvme: add support for discard

2019-06-05 Thread Fam Zheng
On Wed, 04/17 22:53, Maxim Levitsky wrote:
> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c   | 80 ++
>  block/trace-events |  2 ++
>  2 files changed, 82 insertions(+)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 35b925899f..b83912c627 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -110,6 +110,7 @@ typedef struct {
>  bool plugged;
>  
>  bool supports_write_zeros;
> +bool supports_discard;
>  
>  CoMutex dma_map_lock;
>  CoQueue dma_flush_queue;
> @@ -462,6 +463,7 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  
>  
>  s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
> +s->supports_discard = (idctrl->oncs & NVME_ONCS_DSM) != 0;
>  
>  memset(resp, 0, 4096);
>  
> @@ -1144,6 +1146,83 @@ static coroutine_fn int 
> nvme_co_pwrite_zeroes(BlockDriverState *bs,
>  }
>  
>  
> +static int coroutine_fn nvme_co_pdiscard(BlockDriverState *bs,
> +int64_t offset, int bytes)

While you respin, you can align the parameters.

> +{
> +BDRVNVMeState *s = bs->opaque;
> +NVMeQueuePair *ioq = s->queues[1];
> +NVMeRequest *req;
> +NvmeDsmRange *buf;
> +QEMUIOVector local_qiov;
> +int r;
> +
> +NvmeCmd cmd = {
> +.opcode = NVME_CMD_DSM,
> +.nsid = cpu_to_le32(s->nsid),
> +.cdw10 = 0, /*number of ranges - 0 based*/
> +.cdw11 = cpu_to_le32(1 << 2), /*deallocate bit*/
> +};
> +
> +NVMeCoData data = {
> +.ctx = bdrv_get_aio_context(bs),
> +.ret = -EINPROGRESS,
> +};
> +
> +if (!s->supports_discard) {
> +return -ENOTSUP;
> +}
> +
> +assert(s->nr_queues > 1);
> +
> +buf = qemu_try_blockalign0(bs, 4096);
> +if (!buf) {
> +return -ENOMEM;
> +}
> +
> +buf->nlb = bytes >> s->blkshift;
> +buf->slba = offset >> s->blkshift;

This buffer is for the device, do we need to do anything about the endianness?

> +buf->cattr = 0;
> +
> +qemu_iovec_init(&local_qiov, 1);
> +qemu_iovec_add(&local_qiov, buf, 4096);
> +
> +req = nvme_get_free_req(ioq);
> +assert(req);
> +
> +qemu_co_mutex_lock(&s->dma_map_lock);
> +r = nvme_cmd_map_qiov(bs, &cmd, req, &local_qiov);
> +qemu_co_mutex_unlock(&s->dma_map_lock);
> +
> +if (r) {
> +req->busy = false;
> +return r;
> +}
> +
> +trace_nvme_dsm(s, offset, bytes);
> +
> +nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> +data.co = qemu_coroutine_self();
> +while (data.ret == -EINPROGRESS) {
> +qemu_coroutine_yield();
> +}
> +
> +qemu_co_mutex_lock(&s->dma_map_lock);
> +r = nvme_cmd_unmap_qiov(bs, &local_qiov);
> +qemu_co_mutex_unlock(&s->dma_map_lock);
> +if (r) {
> +return r;
> +}
> +
> +trace_nvme_dsm_done(s, offset, bytes, data.ret);
> +
> +qemu_iovec_destroy(&local_qiov);
> +qemu_vfree(buf);
> +return data.ret;
> +
> +}
> +
> +
>  static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> BlockReopenQueue *queue, Error **errp)
>  {
> @@ -1250,6 +1329,7 @@ static BlockDriver bdrv_nvme = {
>  .bdrv_co_pwritev  = nvme_co_pwritev,
>  
>  .bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
> +.bdrv_co_pdiscard = nvme_co_pdiscard,
>  
>  .bdrv_co_flush_to_disk= nvme_co_flush,
>  .bdrv_reopen_prepare  = nvme_reopen_prepare,
> diff --git a/block/trace-events b/block/trace-events
> index 943a58569f..e55ac5c40b 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -148,6 +148,8 @@ nvme_write_zeros(void *s, uint64_t offset, uint64_t 
> bytes, int flags) "s %p offs
>  nvme_qiov_unaligned(const void *qiov, int n, void *base, size_t size, int 
> align) "qiov %p n %d base %p size 0x%zx align 0x%x"
>  nvme_prw_buffered(void *s, uint64_t offset, uint64_t bytes, int niov, int 
> is_write) "s %p offset %"PRId64" bytes %"PRId64" niov %d is_write %d"
>  nvme_rw_done(void *s, int is_write, uint64_t offset, uint64_t bytes, int 
> ret) "s %p is_write %d offset %"PRId64" bytes %"PRId64" ret %d"
> +nvme_dsm(void *s, uint64_t offset, uint64_t bytes) "s %p offset %"PRId64" 
> bytes %"PRId64""
> +nvme_dsm_done(void *s, uint64_t offset, uint64_t bytes, int ret) "s %p 
> offset %"PRId64" bytes %"PRId64" ret %d"
>  nvme_dma_map_flush(void *s) "s %p"
>  nvme_free_req_queue_wait(void *q) "q %p"
>  nvme_cmd_map_qiov(void *s, void *cmd, void *req, void *qiov, int entries) "s 
> %p cmd %p req %p qiov %p entries %d"
> -- 
> 2.17.2
> 




Re: [Qemu-block] [PATCH v2 4/5] block/nvme: add support for write zeros

2019-06-05 Thread Fam Zheng
On Wed, 04/17 22:53, Maxim Levitsky wrote:
> Signed-off-by: Maxim Levitsky 
> ---
>  block/nvme.c | 69 +++-
>  block/trace-events   |  1 +
>  include/block/nvme.h | 19 +++-
>  3 files changed, 87 insertions(+), 2 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 0b1da54574..35b925899f 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -109,6 +109,8 @@ typedef struct {
>  uint64_t max_transfer;
>  bool plugged;
>  
> +bool supports_write_zeros;
> +
>  CoMutex dma_map_lock;
>  CoQueue dma_flush_queue;
>  
> @@ -457,6 +459,10 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  s->max_transfer = MIN_NON_ZERO(s->max_transfer,
>s->page_size / sizeof(uint64_t) * s->page_size);
>  
> +
> +

Too many blank lines here.

> +s->supports_write_zeros = (idctrl->oncs & NVME_ONCS_WRITE_ZEROS) != 0;
> +
>  memset(resp, 0, 4096);
>  
>  cmd.cdw10 = 0;
> @@ -469,6 +475,11 @@ static void nvme_identify(BlockDriverState *bs, int 
> namespace, Error **errp)
>  s->nsze = le64_to_cpu(idns->nsze);
>  lbaf = &idns->lbaf[NVME_ID_NS_FLBAS_INDEX(idns->flbas)];
>  
> +if (NVME_ID_NS_DLFEAT_WRITE_ZEROS(idns->dlfeat) &&
> +NVME_ID_NS_DLFEAT_READ_BEHAVIOR(idns->dlfeat) ==
> +NVME_ID_NS_DLFEAT_READ_BEHAVIOR_ZEROS)
> +bs->supported_write_flags |= BDRV_REQ_MAY_UNMAP;
> +
>  if (lbaf->ms) {
>  error_setg(errp, "Namespaces with metadata are not yet supported");
>  goto out;
> @@ -763,6 +774,8 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  int ret;
>  BDRVNVMeState *s = bs->opaque;
>  
> +bs->supported_write_flags = BDRV_REQ_FUA;
> +
>  opts = qemu_opts_create(&runtime_opts, NULL, 0, &error_abort);
>  qemu_opts_absorb_qdict(opts, options, &error_abort);
>  device = qemu_opt_get(opts, NVME_BLOCK_OPT_DEVICE);
> @@ -791,7 +804,6 @@ static int nvme_file_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto fail;
>  }
>  }
> -bs->supported_write_flags = BDRV_REQ_FUA;
>  return 0;
>  fail:
>  nvme_close(bs);
> @@ -1080,6 +1092,58 @@ static coroutine_fn int nvme_co_flush(BlockDriverState 
> *bs)
>  }
>  
>  
> +static coroutine_fn int nvme_co_pwrite_zeroes(BlockDriverState *bs,
> +int64_t offset, int bytes, BdrvRequestFlags flags)
> +{
> +BDRVNVMeState *s = bs->opaque;
> +NVMeQueuePair *ioq = s->queues[1];
> +NVMeRequest *req;
> +
> +if (!s->supports_write_zeros) {
> +return -ENOTSUP;
> +}

Local variables declaration below statements is not allowed as per coding style.

> +
> +uint32_t cdw12 = ((bytes >> s->blkshift) - 1) & 0xFFFF;
> +
> +NvmeCmd cmd = {
> +.opcode = NVME_CMD_WRITE_ZEROS,
> +.nsid = cpu_to_le32(s->nsid),
> +.cdw10 = cpu_to_le32((offset >> s->blkshift) & 0xFFFFFFFF),
> +.cdw11 = cpu_to_le32(((offset >> s->blkshift) >> 32) & 0xFFFFFFFF),
> +};
> +
> +NVMeCoData data = {
> +.ctx = bdrv_get_aio_context(bs),
> +.ret = -EINPROGRESS,
> +};
> +
> +if (flags & BDRV_REQ_MAY_UNMAP) {
> +cdw12 |= (1 << 25);
> +}
> +
> +if (flags & BDRV_REQ_FUA) {
> +cdw12 |= (1 << 30);
> +}
> +
> +cmd.cdw12 = cpu_to_le32(cdw12);
> +
> +trace_nvme_write_zeros(s, offset, bytes, flags);
> +assert(s->nr_queues > 1);
> +req = nvme_get_free_req(ioq);
> +assert(req);
> +
> +nvme_submit_command(s, ioq, req, &cmd, nvme_rw_cb, &data);
> +
> +data.co = qemu_coroutine_self();
> +while (data.ret == -EINPROGRESS) {
> +qemu_coroutine_yield();
> +}
> +
> +trace_nvme_rw_done(s, true, offset, bytes, data.ret);
> +return data.ret;
> +}
> +
> +
>  static int nvme_reopen_prepare(BDRVReopenState *reopen_state,
> BlockReopenQueue *queue, Error **errp)
>  {
> @@ -1184,6 +1248,9 @@ static BlockDriver bdrv_nvme = {
>  
>  .bdrv_co_preadv   = nvme_co_preadv,
>  .bdrv_co_pwritev  = nvme_co_pwritev,
> +
> +.bdrv_co_pwrite_zeroes= nvme_co_pwrite_zeroes,
> +
>  .bdrv_co_flush_to_disk= nvme_co_flush,
>  .bdrv_reopen_prepare  = nvme_reopen_prepare,
>  
> diff --git a/block/trace-events b/block/trace-events
> index 7335a42540..943a58569f 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -144,6 +144,7 @@ nvme_submit_command_raw(int c0, int c1, int c2, int c3, 
> int c4, int c5, int c6,
>  nvme_handle_event(void *s) "s %p"
>  nvme_poll_cb(void *s) "s %p"
>  nvme_prw_aligned(void *s, int is_write, uint64_t offset, uint64_t bytes, int 
> flags, int niov) "s %p is_write %d offset %"PRId64" bytes %"PRId64" flags %d 
> niov %d"
> +nvme_write_zeros(void *s, uint64_t offset, uint64_t bytes, int flags) "s %p 
> offset %"PRId64" bytes %"PRId64" flags %d"
>  nvme_qiov_unaligned(const void *qiov, int n, void *base, 

Re: [Qemu-block] [Qemu-devel] [PATCH] block/file-posix: ignore fail on unlock bytes

2019-03-28 Thread Fam Zheng
On Wed, 03/27 15:49, Vladimir Sementsov-Ogievskiy wrote:
> bdrv_replace_child() calls bdrv_check_perm() with error_abort on
> loosening permissions. However file-locking operations may fail even
> in this case, for example on NFS. And this leads to Qemu crash.
> 
> Let's ignore such errors, as we do already on permission update commit
> and abort.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy 
> ---
>  block/file-posix.c | 14 ++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/block/file-posix.c b/block/file-posix.c
> index db4cccbe51..403e67fe90 100644
> --- a/block/file-posix.c
> +++ b/block/file-posix.c
> @@ -815,6 +815,20 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
>  
>  switch (op) {
>  case RAW_PL_PREPARE:
> +if ((s->perm | new_perm) == s->perm &&
> +(~s->shared_perm | ~new_perm) == ~s->shared_perm)
> +{
> +/*
> + * We are going to unlock bytes, it should not fail. If fail,
> + * just report it and ignore, like we do for ABORT and COMMIT
> + * anyway.
> + */
> +ret = raw_check_lock_bytes(s->fd, new_perm, new_shared, &local_err);
> +if (local_err) {
> +error_report_err(local_err);
> +}
> +return 0;
> +}
>  ret = raw_apply_lock_bytes(s, s->fd, s->perm | new_perm,
>         ~s->shared_perm | ~new_shared,
> false, errp);
> -- 
> 2.18.0
> 
> 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH] vmdk: Set vmdk parent backing_format to vmdk

2019-03-28 Thread Fam Zheng
On Tue, 03/26 21:58, Sam Eiderman wrote:
> Commit fb2105b ("vmdk: Support version=3 in VMDK descriptor files") fixed
> the probe function to correctly guess vmdk descriptors with version=3.
> 
> This solves the issue where vmdk snapshot with parent vmdk descriptor
> containing "version=3" would be treated as raw instead vmdk.
> 
> In the future case where a new vmdk version is introduced, we will again
> experience this issue, even if the user will provide "-f vmdk" it will
> only apply to the tip image and not to the underlying "misprobed" parent
> image.
> 
> The code in vmdk.c already assumes that the backing file of vmdk must be
> vmdk (see vmdk_is_cid_valid which returns 0 if backing file is not
> vmdk).
> 
> So let's make it official by supplying the backing_format as vmdk.
> 
> Reviewed-by: Mark Kanda 
> Reviewed-By: Liran Alon 
> Reviewed-by: Arbel Moshe 
> Signed-off-by: Shmuel Eiderman 
> ---
>  block/vmdk.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index 8dec6ef767..de8cb859f8 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -397,6 +397,8 @@ static int vmdk_parent_open(BlockDriverState *bs)
>  pstrcpy(bs->auto_backing_file, end_name - p_name + 1, p_name);
>  pstrcpy(bs->backing_file, sizeof(bs->backing_file),
>  bs->auto_backing_file);
> +pstrcpy(bs->backing_format, sizeof(bs->backing_format),
> +"vmdk");
>  }
>  
>  out:
> -- 
> 2.13.3
> 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [Qemu-devel] Combining synchronous and asynchronous IO

2019-03-17 Thread Fam Zheng



> On Mar 15, 2019, at 01:31, Sergio Lopez  wrote:
> 
> Hi,
> 
> Our current AIO path does a great job at unloading the work from the VM,
> and combined with IOThreads provides a good performance in most
> scenarios. But it also comes with its costs, in both a longer execution
> path and the need of the intervention of the scheduler at various
> points.
> 
> There's one particular workload that suffers from this cost, and that's
> when you have just 1 or 2 cores on the Guest issuing synchronous
> requests. This happens to be a pretty common workload for some DBs and,
> in a general sense, on small VMs.
> 
> I did a quick'n'dirty implementation on top of virtio-blk to get some
> numbers. This comes from a VM with 4 CPUs running on an idle server,
> with a secondary virtio-blk disk backed by a null_blk device with a
> simulated latency of 30us.
> 
> - Average latency (us)
> 
> 
> || AIO+iothread | SIO+iothread |
> | 1 job  |  70  |  55  |
> | 2 jobs |  83  |  82  |
> | 4 jobs |  90  | 159  |
> 
> 
> In this case the intuition matches the reality, and synchronous IO wins
> when there's just 1 job issuing the requests, while it loses hard when
> the are 4.
> 
> While my first thought was implementing this as a tunable, turns out we
> have a hint about the nature of the workload in the number of the
> requests in the VQ. So I updated the code to use SIO if there's just 1
> request and AIO otherwise, with these results:
> 
> ---
> || AIO+iothread | SIO+iothread | AIO+SIO+iothread |
> | 1 job  |  70  |  55  |55|
> | 2 jobs |  83  |  82  |78|
> | 4 jobs |  90  | 159  |90|
> ---
> 
> This data makes me think this is something worth pursuing, but I'd like
> to hear your opinion on it.

Nice. In many cases coroutines just forward the raw read/write to the raw file
(no qcow2 LBA translation, backup, throttling, etc. in the data path); being
able to transparently (and dynamically, since the said condition can change at
any time for any request) bypass the block layer would be a very interesting
idea to explore. The challenge is how not to totally break existing features
(e.g. live snapshot and everything).
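
To spell the heuristic out (my paraphrase with hypothetical helper
names, not code from Sergio's series), the dispatch point would pick a
path per request:

    /* One queued request and nothing in flight: the guest is doing
     * synchronous I/O, so a direct preadv()/pwritev() beats the AIO
     * round trip.  Otherwise keep the asynchronous path. */
    if (in_flight == 0 && nr_queued(vq) == 1) {
        submit_sync(req);   /* hypothetical: blocking syscall in place */
    } else {
        submit_aio(req);    /* hypothetical: existing async submission */
    }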

> 
> Thanks,
> Sergio.
> 





Re: [Qemu-block] [Qemu-devel] [PATCH] tcmu: Introduce qemu-tcmu utility

2019-03-12 Thread Fam Zheng



> On Mar 12, 2019, at 18:03, Kevin Wolf  wrote:
> 
> Am 12.03.2019 um 03:18 hat Fam Zheng geschrieben:
>>> On Mar 11, 2019, at 19:06, Kevin Wolf  wrote:
>>> Am 09.03.2019 um 02:46 hat Yaowei Bai geschrieben:
>>>> Thanks for explaining the background. It comes to my mind that actually we
>>>> talked about these two cases with Fam a bit long time ago and decided to
>>>> support both these two cases. The reason why we implement case2 first is
>>>> currently we care more about exporting new opened images and it's a bit
>>>> more convenient, exporting from a VM or QMP can be added in the later
>>>> release. Do you think it's reasonable/acceptable that we support both
>>>> cases and use case2 for normal new opened images and case1 for the
>>>> circumstances you mention above?
>>> 
>>> I would like to avoid a second code path because it comes with a
>>> maintenance cost.
>>> 
>>> Experience also tells that adding a new way to parse option strings will
>>> come back at us later because we must always maintain compatibility
>>> with yet another format.
>> 
>> If the rule is that cfgstr strictly follows -blockdev syntax and the
>> parsing code is mostly shared, it shouldn’t be a problem, right? I
>> suppose this will limit the expressiveness of cfgstr but might be a
>> reasonable tradeoff between implementation complexity, user
>> friendliness and interface consistency.
> 
> Yes, if we could pass the cfgstr directly to an existing parser, that
> would make it less bad at least.
> 
> Of course, if we directly pass cfgstr to the -blockdev parser, it also
> means that we can't have additional options for configuring the
> BlockBackend (which could be part of the -export command line option for
> qemu-tcmu).
> 
> That could be addressed by wrapping another QAPI object around it that
> only contain BlockdevOptions as one field, but then you have to prefix
> every block node option if you use keyval visitor syntax.
> 
>> Another possibility is to use json: format in cfgstr for anything more
>> complex than a single -blockdev.
> 
> Parsing like -blockdev already includes JSON syntax (which is necessary
> to get the full expressiveness).
> 
> If you want to support the equivalent of multiple -blockdevs, you'd need
> to wrap the BlockdevOptions in a QAPI list.
> 
>>> So I would prefer not doing this and just passing command line options
>>> to qemu-tcmu, which can reuse the already existing code paths.
>> 
>> I think the effect of not supporting adding new blockdev from cfgstr
>> is that we have to resort to QMP to allow hot-adding targets. It’s not
>> necessarily bad, though I’m not sure hwo that aligns with Yaowei’s
>> development plan.
> 
> This is true, but we need a way to do this from QMP anyway. So the
> question is whether we want to introduce a second way to achieve the
> same thing a bit more conveniently. But I expect that hot-adding is
> something that is usually done with management software anyway, so does
> the convenience actually buy us much?

The real difference is whether the existing management software that would
adopt qemu-tcmu is built around the targetcli family or around QMP. I think the
management software must understand the targetcli interface to work; talking
QMP is an additional burden.

So IMO cfgstr can ideally be the only channel that the management interacts 
with, if we could reuse existing QMP/QAPI code well enough.

Fam

> 
> Kevin





Re: [Qemu-block] [Qemu-devel] [PATCH] tcmu: Introduce qemu-tcmu utility

2019-03-12 Thread Fam Zheng



> On Mar 11, 2019, at 19:06, Kevin Wolf  wrote:
> 
> Am 09.03.2019 um 02:46 hat Yaowei Bai geschrieben:
>> Thanks for explaining the background. It comes to my mind that actually we
>> talked about these two cases with Fam a bit long time ago and decided to
>> support both these two cases. The reason why we implement case2 first is
>> currently we care more about exporting new opened images and it's a bit
>> more convenient, exporting from a VM or QMP can be added in the later
>> release. Do you think it's reasonable/acceptable that we support both
>> cases and use case2 for normal new opened images and case1 for the
>> circumstances you mention above?
> 
> I would like to avoid a second code path because it comes with a
> maintenance cost.
> 
> Experience also tells that adding a new way to parse option strings will
> come back at us later because we must always maintain compatibility
> with yet another format.

If the rule is that cfgstr strictly follows -blockdev syntax and the parsing 
code is mostly shared, it shouldn’t be a problem, right? I suppose this will 
limit the expressiveness of cfgstr but might be a reasonable tradeoff between 
implementation complexity, user friendliness and interface consistency.

Another possibility is to use json: format in cfgstr for anything more complex 
than a single -blockdev.

> 
> So I would prefer not doing this and just passing command line options
> to qemu-tcmu, which can reuse the already existing code paths.

I think the effect of not supporting adding new blockdev from cfgstr is that we 
have to resort to QMP to allow hot-adding targets. It’s not necessarily bad, 
though I’m not sure hwo that aligns with Yaowei’s development plan.

Fam


> 
> Kevin
> 





Re: [Qemu-block] When AioHandler->is_external=true?

2018-11-20 Thread Fam Zheng
On Tue, 11/20 20:34, Dongli Zhang wrote:
> Hi,
> 
> Would you please help explain in which case AioHandler->is_external is true, 
> and
> when it is false?
> 
> I read about iothread and mainloop and I am a little bit confused about it.

VirtIO's ioeventfd is an example of is_external == true. It means the event
handler on this fd may initiate more I/O, such as reads/writes on a virtual
storage backend, so it is specially taken care of at certain points where we
don't want more I/O requests to be processed, such as when a block job is
completing, or in the middle of a QMP transaction.
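
As a concrete illustration (a sketch; ctx, vq_notifier and
handle_vq_notify are made-up names): such handlers are registered with
is_external=true, and the quiesce points suspend them in bulk:

    /* A virtio ioeventfd can trigger more guest I/O: mark it external. */
    aio_set_event_notifier(ctx, &vq_notifier, true /* is_external */,
                           handle_vq_notify, NULL);

    /* At a quiesce point (block job completion, QMP transaction, ...)
     * external handlers are simply not dispatched until re-enabled. */
    aio_disable_external(ctx);
    /* ... critical section ... */
    aio_enable_external(ctx);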

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH v5 0/3] file-posix: Simplifications on image locking

2018-11-07 Thread Fam Zheng
On Thu, 10/11 15:21, Fam Zheng wrote:
> v5: Address Max's comments (Thanks for reviewing):
> - Clean up after test done.
> - Add rev-by to patch 1 and 2.

Ping?

Fam



Re: [Qemu-block] [PATCH] block/nvme: optimize the performance of nvme driver based on vfio-pci

2018-11-02 Thread Fam Zheng
On Thu, 11/01 18:38, Li Feng wrote:
> When the IO size is larger than 2 pages, we move the the pointer one by
> one in the pagelist, this is inefficient.
> 
> This is a simple benchmark result:
> 
> Before:
> $ qemu-io -c 'write 0 1G' nvme://:00:04.0/1
> 
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1 ops; 0:00:02.41 (424.504 MiB/sec and 0.4146 ops/sec)
> 
>  $ qemu-io -c 'read 0 1G' nvme://:00:04.0/1
> 
> read 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1 ops; 0:00:02.03 (503.055 MiB/sec and 0.4913 ops/sec)
> 
> After:
> $ qemu-io -c 'write 0 1G' nvme://:00:04.0/1
> 
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1 ops; 0:00:02.17 (471.517 MiB/sec and 0.4605 ops/sec)
> 
>  $ qemu-io -c 'read 0 1G' nvme://:00:04.0/1   
>   
>   
>1 ↵
> 
> read 1073741824/1073741824 bytes at offset 0
> 1 GiB, 1 ops; 0:00:01.94 (526.770 MiB/sec and 0.5144 ops/sec)
> 
> Signed-off-by: Li Feng 
> ---
>  block/nvme.c | 16 ++--
>  1 file changed, 6 insertions(+), 10 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index 29294038fc..982097b5b1 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -837,7 +837,7 @@ try_map:
>  }
>  
>  for (j = 0; j < qiov->iov[i].iov_len / s->page_size; j++) {
> -pagelist[entries++] = iova + j * s->page_size;
> +pagelist[entries++] = cpu_to_le64(iova + j * s->page_size);
>  }
>  trace_nvme_cmd_map_qiov_iov(s, i, qiov->iov[i].iov_base,
>  qiov->iov[i].iov_len / s->page_size);
> @@ -850,20 +850,16 @@ try_map:
>  case 0:
>  abort();
>  case 1:
> -cmd->prp1 = cpu_to_le64(pagelist[0]);
> +cmd->prp1 = pagelist[0];
>  cmd->prp2 = 0;
>  break;
>  case 2:
> -cmd->prp1 = cpu_to_le64(pagelist[0]);
> -cmd->prp2 = cpu_to_le64(pagelist[1]);;
> +cmd->prp1 = pagelist[0];
> +cmd->prp2 = pagelist[1];
>  break;
>  default:
> -cmd->prp1 = cpu_to_le64(pagelist[0]);
> -cmd->prp2 = cpu_to_le64(req->prp_list_iova);
> -for (i = 0; i < entries - 1; ++i) {
> -pagelist[i] = cpu_to_le64(pagelist[i + 1]);
> -}
> -pagelist[entries - 1] = 0;
> +cmd->prp1 = pagelist[0];
> +cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
>  break;
>  }
>  trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
> -- 
> 2.11.0
> 

Nice! Thanks. I've queued the patch.
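
The trick, spelled out (my reading of the patch): pagelist[] lives in a
DMA-mapped buffer at req->prp_list_iova and entry 0 doubles as PRP1, so
PRP2 can point one entry into the same buffer instead of shifting the
whole list:

    /*   pagelist[0]            -> PRP1 (first data page)
     *   pagelist[1..entries-1] -> PRP list entries, referenced by PRP2
     * Offsetting the IOVA by one entry skips pagelist[0] without the
     * old (entries - 1)-element copy loop. */
    cmd->prp1 = pagelist[0];
    cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));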

Fam



[Qemu-block] [PATCH v3] file-posix: Use error API properly

2018-11-01 Thread Fam Zheng
Use error_report for situations that affect user operation (i.e.  we're
actually returning error), and warn_report/warn_report_err when some
less critical error happened but the user operation can still carry on.

For raw_normalize_devicepath, add Error parameter to propagate to
its callers.

Suggested-by: Markus Armbruster 
Signed-off-by: Fam Zheng 

---

v3: Use error_setg_errno. [Eric]

v2: Add Error ** to raw_normalize_devicepath. [Markus]
Use error_printf for splitting multi-sentence message. [Markus]
---
 block/file-posix.c | 39 ---
 1 file changed, 16 insertions(+), 23 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 2da3a76355..e5606655b6 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -205,7 +205,7 @@ static int cdrom_reopen(BlockDriverState *bs);
 #endif
 
 #if defined(__NetBSD__)
-static int raw_normalize_devicepath(const char **filename)
+static int raw_normalize_devicepath(const char **filename, Error **errp)
 {
 static char namebuf[PATH_MAX];
 const char *dp, *fname;
@@ -214,8 +214,7 @@ static int raw_normalize_devicepath(const char **filename)
 fname = *filename;
 dp = strrchr(fname, '/');
 if (lstat(fname, &sb) < 0) {
-fprintf(stderr, "%s: stat failed: %s\n",
-fname, strerror(errno));
+error_setg_errno(errp, errno, "%s: stat failed", fname);
 return -errno;
 }
 
@@ -229,14 +228,13 @@ static int raw_normalize_devicepath(const char **filename)
 snprintf(namebuf, PATH_MAX, "%.*s/r%s",
 (int)(dp - fname), fname, dp + 1);
 }
-fprintf(stderr, "%s is a block device", fname);
 *filename = namebuf;
-fprintf(stderr, ", using %s\n", *filename);
+warn_report("%s is a block device, using %s", fname, *filename);
 
 return 0;
 }
 #else
-static int raw_normalize_devicepath(const char **filename)
+static int raw_normalize_devicepath(const char **filename, Error **errp)
 {
 return 0;
 }
@@ -461,9 +459,8 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 
 filename = qemu_opt_get(opts, "filename");
 
-ret = raw_normalize_devicepath(&filename);
+ret = raw_normalize_devicepath(&filename, errp);
 if (ret != 0) {
-error_setg_errno(errp, -ret, "Could not normalize device path");
 goto fail;
 }
 
@@ -492,11 +489,10 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 case ON_OFF_AUTO_ON:
 s->use_lock = true;
 if (!qemu_has_ofd_lock()) {
-fprintf(stderr,
-"File lock requested but OFD locking syscall is "
-"unavailable, falling back to POSIX file locks.\n"
-"Due to the implementation, locks can be lost "
-"unexpectedly.\n");
+warn_report("File lock requested but OFD locking syscall is "
+"unavailable, falling back to POSIX file locks");
+error_printf("Due to the implementation, locks can be lost "
+ "unexpectedly.\n");
 }
 break;
 case ON_OFF_AUTO_OFF:
@@ -805,7 +801,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 /* Theoretically the above call only unlocks bytes and it cannot
  * fail. Something weird happened, report it.
  */
-error_report_err(local_err);
+warn_report_err(local_err);
 }
 break;
 case RAW_PL_COMMIT:
@@ -815,7 +811,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 /* Theoretically the above call only unlocks bytes and it cannot
  * fail. Something weird happened, report it.
  */
-error_report_err(local_err);
+warn_report_err(local_err);
 }
 break;
 }
@@ -892,10 +888,8 @@ static int raw_reopen_prepare(BDRVReopenState *state,
 /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
 if (rs->fd == -1) {
 const char *normalized_filename = state->bs->filename;
-ret = raw_normalize_devicepath(&normalized_filename);
-if (ret < 0) {
-error_setg_errno(errp, -ret, "Could not normalize device path");
-} else {
+ret = raw_normalize_devicepath(&normalized_filename, errp);
+if (ret >= 0) {
 assert(!(rs->open_flags & O_CREAT));
 rs->fd = qemu_open(normalized_filename, rs->open_flags);
 if (rs->fd == -1) {
@@ -1775,7 +1769,7 @@ static int aio_worker(void *arg)
 ret = handle_aiocb_truncate(aiocb);
 break;
 default:
-fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type);
+error_report("invalid aio request (0x%x)", aiocb-

Re: [Qemu-block] [Qemu-devel] [PATCH v2] file-posix: Use error API properly

2018-11-01 Thread Fam Zheng
On Wed, 10/31 15:51, Markus Armbruster wrote:
> Fam Zheng  writes:
> 
> > Use error_report for situations that affect user operation (i.e.  we're
> > actually returning error), and warn_report/warn_report_err when some
> > less critical error happened but the user operation can still carry on.
> >
> > For raw_normalize_devicepath, add Error parameter to propagate to
> > its callers.
> >
> > Suggested-by: Markus Armbruster 
> > Signed-off-by: Fam Zheng 
> 
> Reviewed-by: Markus Armbruster 
> 
> Kevin, if you'd prefer this to go through my tree, let me know.

Thanks for the review. I'll respin and use error_setg_errno as suggested by
Eric, if it's okay.

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH] iotests: 082: Update output

2018-10-31 Thread Fam Zheng
On Wed, 10/31 16:04, Fam Zheng wrote:
> Commit 9cbef9d68e (qemu-option: improve qemu_opts_print_help() output)
> affected qemu-img help output, and broke this test case.
> 
> Update the output reference to fix it.
> 
> Signed-off-by: Fam Zheng 
> 
> ---
> 
> I'm once again looking at enabling iotests on patchew (via vm based
> tests), but immediately got blocked by this. :(

Please ignore this one, there's already another patch and comments buried deep
in my maildir. I'll peacefully wait for a fix then.

Fam



[Qemu-block] [PATCH] iotests: 082: Update output

2018-10-31 Thread Fam Zheng
Commit 9cbef9d68e (qemu-option: improve qemu_opts_print_help() output)
affected qemu-img help output, and broke this test case.

Update the output reference to fix it.

Signed-off-by: Fam Zheng 

---

I'm once again looking at enabling iotests on patchew (via vm based
tests), but immediately got blocked by this. :(
---
 tests/qemu-iotests/082.out | 956 ++---
 1 file changed, 478 insertions(+), 478 deletions(-)

diff --git a/tests/qemu-iotests/082.out b/tests/qemu-iotests/082.out
index 19e9fb13ff..2672349e1d 100644
--- a/tests/qemu-iotests/082.out
+++ b/tests/qemu-iotests/082.out
@@ -44,171 +44,171 @@ cluster_size: 8192
 
 Testing: create -f qcow2 -o help TEST_DIR/t.qcow2 128M
 Supported options:
-size Virtual disk size
-compat   Compatibility level (0.10 or 1.1)
-backing_file File name of a base image
-backing_fmt  Image format of the base image
-encryption   Encrypt the image with format 'aes'. (Deprecated in favor of 
encrypt.format=aes)
-encrypt.format   Encrypt the image, format choices: 'aes', 'luks'
-encrypt.key-secret ID of secret providing qcow AES key or LUKS passphrase
-encrypt.cipher-alg Name of encryption cipher algorithm
-encrypt.cipher-mode Name of encryption cipher mode
-encrypt.ivgen-alg Name of IV generator algorithm
-encrypt.ivgen-hash-alg Name of IV generator hash algorithm
-encrypt.hash-alg Name of encryption hash algorithm
-encrypt.iter-time Time to spend in PBKDF in milliseconds
-cluster_size qcow2 cluster size
-preallocationPreallocation mode (allowed values: off, metadata, falloc, 
full)
-lazy_refcounts   Postpone refcount updates
-refcount_bitsWidth of a reference count entry in bits
-nocowTurn off copy-on-write (valid only on btrfs)
+backing_file=str - File name of a base image
+backing_fmt=str - Image format of the base image
+cluster_size=size - qcow2 cluster size
+compat=str - Compatibility level (0.10 or 1.1)
+encrypt.cipher-alg=str - Name of encryption cipher algorithm
+encrypt.cipher-mode=str - Name of encryption cipher mode
+encrypt.format=str - Encrypt the image, format choices: 'aes', 'luks'
+encrypt.hash-alg=str - Name of encryption hash algorithm
+encrypt.iter-time=num - Time to spend in PBKDF in milliseconds
+encrypt.ivgen-alg=str - Name of IV generator algorithm
+encrypt.ivgen-hash-alg=str - Name of IV generator hash algorithm
+encrypt.key-secret=str - ID of secret providing qcow AES key or LUKS passphrase
+encryption=bool (on/off) - Encrypt the image with format 'aes'. (Deprecated in 
favor of encrypt.format=aes)
+lazy_refcounts=bool (on/off) - Postpone refcount updates
+nocow=bool (on/off) - Turn off copy-on-write (valid only on btrfs)
+preallocation=str - Preallocation mode (allowed values: off, metadata, falloc, 
full)
+refcount_bits=num - Width of a reference count entry in bits
+size=size - Virtual disk size
 
 Testing: create -f qcow2 -o ? TEST_DIR/t.qcow2 128M
 Supported options:
-size Virtual disk size
-compat   Compatibility level (0.10 or 1.1)
-backing_file File name of a base image
-backing_fmt  Image format of the base image
-encryption   Encrypt the image with format 'aes'. (Deprecated in favor of 
encrypt.format=aes)
-encrypt.format   Encrypt the image, format choices: 'aes', 'luks'
-encrypt.key-secret ID of secret providing qcow AES key or LUKS passphrase
-encrypt.cipher-alg Name of encryption cipher algorithm
-encrypt.cipher-mode Name of encryption cipher mode
-encrypt.ivgen-alg Name of IV generator algorithm
-encrypt.ivgen-hash-alg Name of IV generator hash algorithm
-encrypt.hash-alg Name of encryption hash algorithm
-encrypt.iter-time Time to spend in PBKDF in milliseconds
-cluster_size qcow2 cluster size
-preallocationPreallocation mode (allowed values: off, metadata, falloc, 
full)
-lazy_refcounts   Postpone refcount updates
-refcount_bitsWidth of a reference count entry in bits
-nocowTurn off copy-on-write (valid only on btrfs)
+backing_file=str - File name of a base image
+backing_fmt=str - Image format of the base image
+cluster_size=size - qcow2 cluster size
+compat=str - Compatibility level (0.10 or 1.1)
+encrypt.cipher-alg=str - Name of encryption cipher algorithm
+encrypt.cipher-mode=str - Name of encryption cipher mode
+encrypt.format=str - Encrypt the image, format choices: 'aes', 'luks'
+encrypt.hash-alg=str - Name of encryption hash algorithm
+encrypt.iter-time=num - Time to spend in PBKDF in milliseconds
+encrypt.ivgen-alg=str - Name of IV generator algorithm
+encrypt.ivgen-hash-alg=str - Name of IV generator hash algorithm
+encrypt.key-secret=str - ID of secret providing qcow AES key or LUKS passphrase
+encryption=bool (on/off) - Encrypt the image with format 'aes'. (Deprecated in 
favor of encrypt.format=aes)
+lazy_refcounts=bool (on/off) - Postpone refcount updates
+nocow=bool (on/off) - Turn off copy-on-write (valid only on btrfs)
+preallocation=str - Preallocation mode (allowed values

[Qemu-block] [PATCH v2] file-posix: Use error API properly

2018-10-30 Thread Fam Zheng
Use error_report for situations that affect user operation (i.e. we're
actually returning an error), and warn_report/warn_report_err when some
less critical error happened but the user operation can still carry on.

For raw_normalize_devicepath, add an Error parameter to propagate errors
to its callers.

Suggested-by: Markus Armbruster 
Signed-off-by: Fam Zheng 

---

v2: Add Error ** to raw_normalize_devicepath. [Markus]
Use error_printf for splitting multi-sentence message. [Markus]
---
 block/file-posix.c | 39 ---
 1 file changed, 16 insertions(+), 23 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 2da3a76355..b7f0316005 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -205,7 +205,7 @@ static int cdrom_reopen(BlockDriverState *bs);
 #endif
 
 #if defined(__NetBSD__)
-static int raw_normalize_devicepath(const char **filename)
+static int raw_normalize_devicepath(const char **filename, Error **errp)
 {
 static char namebuf[PATH_MAX];
 const char *dp, *fname;
@@ -214,8 +214,7 @@ static int raw_normalize_devicepath(const char **filename)
 fname = *filename;
 dp = strrchr(fname, '/');
 if (lstat(fname, &sb) < 0) {
-fprintf(stderr, "%s: stat failed: %s\n",
-fname, strerror(errno));
+error_setg(errp, "%s: stat failed: %s", fname, strerror(errno));
 return -errno;
 }
 
@@ -229,14 +228,13 @@ static int raw_normalize_devicepath(const char **filename)
 snprintf(namebuf, PATH_MAX, "%.*s/r%s",
 (int)(dp - fname), fname, dp + 1);
 }
-fprintf(stderr, "%s is a block device", fname);
 *filename = namebuf;
-fprintf(stderr, ", using %s\n", *filename);
+warn_report("%s is a block device, using %s", fname, *filename);
 
 return 0;
 }
 #else
-static int raw_normalize_devicepath(const char **filename)
+static int raw_normalize_devicepath(const char **filename, Error **errp)
 {
 return 0;
 }
@@ -461,9 +459,8 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 
 filename = qemu_opt_get(opts, "filename");
 
-ret = raw_normalize_devicepath(&filename);
+ret = raw_normalize_devicepath(&filename, errp);
 if (ret != 0) {
-error_setg_errno(errp, -ret, "Could not normalize device path");
 goto fail;
 }
 
@@ -492,11 +489,10 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 case ON_OFF_AUTO_ON:
 s->use_lock = true;
 if (!qemu_has_ofd_lock()) {
-fprintf(stderr,
-"File lock requested but OFD locking syscall is "
-"unavailable, falling back to POSIX file locks.\n"
-"Due to the implementation, locks can be lost "
-"unexpectedly.\n");
+warn_report("File lock requested but OFD locking syscall is "
+"unavailable, falling back to POSIX file locks");
+error_printf("Due to the implementation, locks can be lost "
+ "unexpectedly.\n");
 }
 break;
 case ON_OFF_AUTO_OFF:
@@ -805,7 +801,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 /* Theoretically the above call only unlocks bytes and it cannot
  * fail. Something weird happened, report it.
  */
-error_report_err(local_err);
+warn_report_err(local_err);
 }
 break;
 case RAW_PL_COMMIT:
@@ -815,7 +811,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 /* Theoretically the above call only unlocks bytes and it cannot
  * fail. Something weird happened, report it.
  */
-error_report_err(local_err);
+warn_report_err(local_err);
 }
 break;
 }
@@ -892,10 +888,8 @@ static int raw_reopen_prepare(BDRVReopenState *state,
 /* If we cannot use fcntl, or fcntl failed, fall back to qemu_open() */
 if (rs->fd == -1) {
 const char *normalized_filename = state->bs->filename;
-ret = raw_normalize_devicepath(&normalized_filename);
-if (ret < 0) {
-error_setg_errno(errp, -ret, "Could not normalize device path");
-} else {
+ret = raw_normalize_devicepath(&normalized_filename, errp);
+if (ret >= 0) {
 assert(!(rs->open_flags & O_CREAT));
 rs->fd = qemu_open(normalized_filename, rs->open_flags);
 if (rs->fd == -1) {
@@ -1775,7 +1769,7 @@ static int aio_worker(void *arg)
 ret = handle_aiocb_truncate(aiocb);
 break;
 default:
-fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type);
+error_report("invalid aio request (0x%x)", aiocb->aio_type);
 ret =
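
To make the convention concrete, here is a minimal sketch of the reporting
levels the commit message distinguishes. This is not part of the patch:
frob_open() and frob_try_lock() are invented names standing in for any
operation with a fatal and a non-fatal failure path.

    #include "qemu/osdep.h"
    #include "qemu/error-report.h"
    #include "qapi/error.h"

    static int frob_open(const char *fname, Error **errp)
    {
        Error *local_err = NULL;

        if (access(fname, R_OK) < 0) {
            /* Hard failure: feed it back to the caller through errp. */
            error_setg_errno(errp, errno, "%s: cannot access", fname);
            return -errno;
        }
        /* frob_try_lock() (hypothetical) may fail without dooming the
         * whole operation. */
        frob_try_lock(fname, &local_err);
        if (local_err) {
            /* Non-fatal: print it as a warning and carry on. */
            warn_report_err(local_err);
        }
        return 0;
    }

warn_report_err() frees the Error after printing it, which is why the
non-fatal branch does not need to propagate or free local_err itself.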

[Qemu-block] [PATCH] file-posix: Use error API properly

2018-10-22 Thread Fam Zheng
Use error_report for situations that affect user operation (i.e. we're
actually returning an error), and warn_report/warn_report_err when some
less critical error happened but the user operation can still carry on.

Suggested-by: Markus Armbruster 
Signed-off-by: Fam Zheng 
---
 block/file-posix.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 2da3a76355..2a46899313 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -214,8 +214,7 @@ static int raw_normalize_devicepath(const char **filename)
 fname = *filename;
 dp = strrchr(fname, '/');
 if (lstat(fname, &sb) < 0) {
-fprintf(stderr, "%s: stat failed: %s\n",
-fname, strerror(errno));
+error_report("%s: stat failed: %s", fname, strerror(errno));
 return -errno;
 }
 
@@ -229,9 +228,8 @@ static int raw_normalize_devicepath(const char **filename)
 snprintf(namebuf, PATH_MAX, "%.*s/r%s",
 (int)(dp - fname), fname, dp + 1);
 }
-fprintf(stderr, "%s is a block device", fname);
 *filename = namebuf;
-fprintf(stderr, ", using %s\n", *filename);
+warn_report("%s is a block device, using %s", fname, *filename);
 
 return 0;
 }
@@ -492,11 +490,11 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 case ON_OFF_AUTO_ON:
 s->use_lock = true;
 if (!qemu_has_ofd_lock()) {
-fprintf(stderr,
+warn_report(
 "File lock requested but OFD locking syscall is "
-"unavailable, falling back to POSIX file locks.\n"
+"unavailable, falling back to POSIX file locks. "
 "Due to the implementation, locks can be lost "
-"unexpectedly.\n");
+"unexpectedly.");
 }
 break;
 case ON_OFF_AUTO_OFF:
@@ -805,7 +803,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 /* Theoretically the above call only unlocks bytes and it cannot
  * fail. Something weird happened, report it.
  */
-error_report_err(local_err);
+warn_report_err(local_err);
 }
 break;
 case RAW_PL_COMMIT:
@@ -815,7 +813,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 /* Theoretically the above call only unlocks bytes and it cannot
  * fail. Something weird happened, report it.
  */
-error_report_err(local_err);
+warn_report_err(local_err);
 }
 break;
 }
@@ -1775,7 +1773,7 @@ static int aio_worker(void *arg)
 ret = handle_aiocb_truncate(aiocb);
 break;
 default:
-fprintf(stderr, "invalid aio request (0x%x)\n", aiocb->aio_type);
+error_report("invalid aio request (0x%x)", aiocb->aio_type);
 ret = -EINVAL;
 break;
 }
@@ -2263,7 +2261,7 @@ out_unlock:
  * not mean the whole creation operation has failed.  So
  * report it the user for their convenience, but do not report
  * it to the caller. */
-error_report_err(local_err);
+warn_report_err(local_err);
 }
 
 out_close:
-- 
2.17.1




[Qemu-block] [PATCH v5 3/3] tests: Add unit tests for image locking

2018-10-11 Thread Fam Zheng
Signed-off-by: Fam Zheng 
---
 tests/Makefile.include |   2 +
 tests/test-image-locking.c | 157 +
 2 files changed, 159 insertions(+)
 create mode 100644 tests/test-image-locking.c

diff --git a/tests/Makefile.include b/tests/Makefile.include
index 7a3059bf6c..995c84f38a 100644
--- a/tests/Makefile.include
+++ b/tests/Makefile.include
@@ -95,6 +95,7 @@ check-unit-y += tests/test-bdrv-drain$(EXESUF)
 check-unit-y += tests/test-blockjob$(EXESUF)
 check-unit-y += tests/test-blockjob-txn$(EXESUF)
 check-unit-y += tests/test-block-backend$(EXESUF)
+check-unit-y += tests/test-image-locking$(EXESUF)
 check-unit-y += tests/test-x86-cpuid$(EXESUF)
 # all code tested by test-x86-cpuid is inside topology.h
 gcov-files-test-x86-cpuid-y =
@@ -651,6 +652,7 @@ tests/test-bdrv-drain$(EXESUF): tests/test-bdrv-drain.o 
$(test-block-obj-y) $(te
 tests/test-blockjob$(EXESUF): tests/test-blockjob.o $(test-block-obj-y) 
$(test-util-obj-y)
 tests/test-blockjob-txn$(EXESUF): tests/test-blockjob-txn.o 
$(test-block-obj-y) $(test-util-obj-y)
 tests/test-block-backend$(EXESUF): tests/test-block-backend.o 
$(test-block-obj-y) $(test-util-obj-y)
+tests/test-image-locking$(EXESUF): tests/test-image-locking.o 
$(test-block-obj-y) $(test-util-obj-y)
 tests/test-thread-pool$(EXESUF): tests/test-thread-pool.o $(test-block-obj-y)
 tests/test-iov$(EXESUF): tests/test-iov.o $(test-util-obj-y)
 tests/test-hbitmap$(EXESUF): tests/test-hbitmap.o $(test-util-obj-y) 
$(test-crypto-obj-y)
diff --git a/tests/test-image-locking.c b/tests/test-image-locking.c
new file mode 100644
index 00..7614cbf90c
--- /dev/null
+++ b/tests/test-image-locking.c
@@ -0,0 +1,157 @@
+/*
+ * Image locking tests
+ *
+ * Copyright (c) 2018 Red Hat Inc.
+ *
+ * Author: Fam Zheng 
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to 
deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING 
FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "qemu/osdep.h"
+#include "block/block.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+#include "qapi/qmp/qdict.h"
+
+static BlockBackend *open_image(const char *path,
+uint64_t perm, uint64_t shared_perm,
+Error **errp)
+{
+Error *local_err = NULL;
+BlockBackend *blk;
+QDict *options = qdict_new();
+
+qdict_put_str(options, "driver", "raw");
+blk = blk_new_open(path, NULL, options, BDRV_O_RDWR, &local_err);
+if (blk) {
+g_assert_null(local_err);
+if (blk_set_perm(blk, perm, shared_perm, errp)) {
+blk_unref(blk);
+blk = NULL;
+}
+} else {
+error_propagate(errp, local_err);
+}
+return blk;
+}
+
+static void check_locked_bytes(int fd, uint64_t perm_locks,
+   uint64_t shared_perm_locks)
+{
+int i;
+
+if (!perm_locks && !shared_perm_locks) {
+g_assert(!qemu_lock_fd_test(fd, 0, 0, true));
+return;
+}
+for (i = 0; (1ULL << i) <= BLK_PERM_ALL; i++) {
+uint64_t bit = (1ULL << i);
+bool perm_expected = !!(bit & perm_locks);
+bool shared_perm_expected = !!(bit & shared_perm_locks);
+g_assert_cmpint(perm_expected, ==,
+!!qemu_lock_fd_test(fd, 100 + i, 1, true));
+g_assert_cmpint(shared_perm_expected, ==,
+!!qemu_lock_fd_test(fd, 200 + i, 1, true));
+}
+}
+
+static void test_image_locking_basic(void)
+{
+BlockBackend *blk1, *blk2, *blk3;
+char img_path[] = "/tmp/qtest.XXXXXX";
+uint64_t perm, shared_perm;
+
+int fd = mkstemp(img_path);
+assert(fd >= 0);
+
+perm = BLK_PERM_WRITE | BLK_PERM_CONSISTENT_READ;
+shared_perm = BLK_PERM_ALL;
+blk1 = open_image(img_path, perm, shared_perm, &error_abort);
+g_assert(blk1);
+
+check_locked_bytes(fd, perm, ~shared_perm);
+
+/* co
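
A note on running it: the Makefile hunk above adds the binary to the
check-unit-y group, so after a build the test runs either with the rest of
the unit tests or standalone (paths assume an in-tree build):

    make check-unit             # runs every check-unit-y binary
    ./tests/test-image-locking  # or invoke just this GTest binary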

[Qemu-block] [PATCH v5 2/3] file-posix: Drop s->lock_fd

2018-10-11 Thread Fam Zheng
The lock_fd field is not strictly necessary because transferring locked
bytes from the old fd to the new one shouldn't fail anyway. This spares the
user one fd per image.

Signed-off-by: Fam Zheng 
Reviewed-by: Max Reitz 
---
 block/file-posix.c | 37 +
 1 file changed, 13 insertions(+), 24 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index cf5eb98caa..d493357b89 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -142,7 +142,6 @@ do { \
 
 typedef struct BDRVRawState {
 int fd;
-int lock_fd;
 bool use_lock;
 int type;
 int open_flags;
@@ -153,7 +152,7 @@ typedef struct BDRVRawState {
 uint64_t shared_perm;
 
 /* The perms bits whose corresponding bytes are already locked in
- * s->lock_fd. */
+ * s->fd. */
 uint64_t locked_perm;
 uint64_t locked_shared_perm;
 
@@ -542,18 +541,6 @@ static int raw_open_common(BlockDriverState *bs, QDict 
*options,
 }
 s->fd = fd;
 
-s->lock_fd = -1;
-if (s->use_lock) {
-fd = qemu_open(filename, s->open_flags);
-if (fd < 0) {
-ret = -errno;
-error_setg_errno(errp, errno, "Could not open '%s' for locking",
- filename);
-qemu_close(s->fd);
-goto fail;
-}
-s->lock_fd = fd;
-}
 s->perm = 0;
 s->shared_perm = BLK_PERM_ALL;
 
@@ -814,15 +801,13 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 return 0;
 }
 
-assert(s->lock_fd > 0);
-
 switch (op) {
 case RAW_PL_PREPARE:
-ret = raw_apply_lock_bytes(s, s->lock_fd, s->perm | new_perm,
+ret = raw_apply_lock_bytes(s, s->fd, s->perm | new_perm,
~s->shared_perm | ~new_shared,
false, errp);
 if (!ret) {
-ret = raw_check_lock_bytes(s->lock_fd, new_perm, new_shared, errp);
+ret = raw_check_lock_bytes(s->fd, new_perm, new_shared, errp);
 if (!ret) {
 return 0;
 }
@@ -833,7 +818,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 op = RAW_PL_ABORT;
 /* fall through to unlock bytes. */
 case RAW_PL_ABORT:
-raw_apply_lock_bytes(s, s->lock_fd, s->perm, ~s->shared_perm,
+raw_apply_lock_bytes(s, s->fd, s->perm, ~s->shared_perm,
  true, &local_err);
 if (local_err) {
 /* Theoretically the above call only unlocks bytes and it cannot
@@ -843,7 +828,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 }
 break;
 case RAW_PL_COMMIT:
-raw_apply_lock_bytes(s, s->lock_fd, new_perm, ~new_shared,
+raw_apply_lock_bytes(s, s->fd, new_perm, ~new_shared,
  true, &local_err);
 if (local_err) {
 /* Theoretically the above call only unlocks bytes and it cannot
@@ -960,10 +945,18 @@ static void raw_reopen_commit(BDRVReopenState *state)
 {
 BDRVRawReopenState *rs = state->opaque;
 BDRVRawState *s = state->bs->opaque;
+Error *local_err = NULL;
 
 s->check_cache_dropped = rs->check_cache_dropped;
 s->open_flags = rs->open_flags;
 
+/* Copy locks to the new fd before closing the old one. */
+raw_apply_lock_bytes(NULL, rs->fd, s->locked_perm,
+ ~s->locked_shared_perm, false, &local_err);
+if (local_err) {
+/* shouldn't fail in a sane host, but report it just in case. */
+error_report_err(local_err);
+}
 qemu_close(s->fd);
 s->fd = rs->fd;
 
@@ -1956,10 +1949,6 @@ static void raw_close(BlockDriverState *bs)
 qemu_close(s->fd);
 s->fd = -1;
 }
-if (s->lock_fd >= 0) {
-qemu_close(s->lock_fd);
-s->lock_fd = -1;
-}
 }
 
 /**
-- 
2.17.1




[Qemu-block] [PATCH v5 1/3] file-posix: Skip effectiveless OFD lock operations

2018-10-11 Thread Fam Zheng
If we know we've already locked the bytes, don't do it again; similarly
don't unlock a byte if we haven't locked it. This doesn't change the
behavior, but fixes a corner case explained below.

Libvirt had an error handling bug where an image can get its (ownership,
file mode, SELinux) permissions changed (RHBZ 1584982) by mistake behind
QEMU's back. Specifically, an image in use by a Libvirt VM has:

$ ls -lhZ b.img
-rw-r--r--. qemu qemu system_u:object_r:svirt_image_t:s0:c600,c690 b.img

Trying to attach it a second time won't work because of image locking.
And after the error, it becomes:

$ ls -lhZ b.img
-rw-r--r--. root root system_u:object_r:virt_image_t:s0 b.img

Then, we won't be able to do OFD lock operations with the existing fd.
In other words, the code such as in blk_detach_dev:

blk_set_perm(blk, 0, BLK_PERM_ALL, &error_abort);

can abort() QEMU because of environmental changes.

This patch is an easy fix for this, and the change is reasonable
regardless, so do it.

Signed-off-by: Fam Zheng 
Reviewed-by: Max Reitz 
---
 block/file-posix.c | 54 +-
 1 file changed, 44 insertions(+), 10 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index 2da3a76355..cf5eb98caa 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -152,6 +152,11 @@ typedef struct BDRVRawState {
 uint64_t perm;
 uint64_t shared_perm;
 
+/* The perms bits whose corresponding bytes are already locked in
+ * s->lock_fd. */
+uint64_t locked_perm;
+uint64_t locked_shared_perm;
+
 #ifdef CONFIG_XFS
 bool is_xfs:1;
 #endif
@@ -680,43 +685,72 @@ typedef enum {
  * file; if @unlock == true, also unlock the unneeded bytes.
  * @shared_perm_lock_bits is the mask of all permissions that are NOT shared.
  */
-static int raw_apply_lock_bytes(int fd,
+static int raw_apply_lock_bytes(BDRVRawState *s, int fd,
 uint64_t perm_lock_bits,
 uint64_t shared_perm_lock_bits,
 bool unlock, Error **errp)
 {
 int ret;
 int i;
+uint64_t locked_perm, locked_shared_perm;
+
+if (s) {
+locked_perm = s->locked_perm;
+locked_shared_perm = s->locked_shared_perm;
+} else {
+/*
+ * We don't have the previous bits, just lock/unlock for each of the
+ * requested bits.
+ */
+if (unlock) {
+locked_perm = BLK_PERM_ALL;
+locked_shared_perm = BLK_PERM_ALL;
+} else {
+locked_perm = 0;
+locked_shared_perm = 0;
+}
+}
 
 PERM_FOREACH(i) {
 int off = RAW_LOCK_PERM_BASE + i;
-if (perm_lock_bits & (1ULL << i)) {
+uint64_t bit = (1ULL << i);
+if ((perm_lock_bits & bit) && !(locked_perm & bit)) {
 ret = qemu_lock_fd(fd, off, 1, false);
 if (ret) {
 error_setg(errp, "Failed to lock byte %d", off);
 return ret;
+} else if (s) {
+s->locked_perm |= bit;
 }
-} else if (unlock) {
+} else if (unlock && (locked_perm & bit) && !(perm_lock_bits & bit)) {
 ret = qemu_unlock_fd(fd, off, 1);
 if (ret) {
 error_setg(errp, "Failed to unlock byte %d", off);
 return ret;
+} else if (s) {
+s->locked_perm &= ~bit;
 }
 }
 }
 PERM_FOREACH(i) {
 int off = RAW_LOCK_SHARED_BASE + i;
-if (shared_perm_lock_bits & (1ULL << i)) {
+uint64_t bit = (1ULL << i);
+if ((shared_perm_lock_bits & bit) && !(locked_shared_perm & bit)) {
 ret = qemu_lock_fd(fd, off, 1, false);
 if (ret) {
 error_setg(errp, "Failed to lock byte %d", off);
 return ret;
+} else if (s) {
+s->locked_shared_perm |= bit;
 }
-} else if (unlock) {
+} else if (unlock && (locked_shared_perm & bit) &&
+   !(shared_perm_lock_bits & bit)) {
 ret = qemu_unlock_fd(fd, off, 1);
 if (ret) {
 error_setg(errp, "Failed to unlock byte %d", off);
 return ret;
+} else if (s) {
+s->locked_shared_perm &= ~bit;
 }
 }
 }
@@ -784,7 +818,7 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 
 switch (op) {
 case RAW_PL_PREPARE:
-ret = raw_apply_lock_bytes(s->lock_fd, s->perm | new_perm,
+ret = raw_apply_lock_bytes(s, s->lock_fd, s->perm | new_perm,
~s->shared_perm | ~new_shared,
false, errp);
 if (!ret) {
@@ -79
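
Boiled down, the bookkeeping this patch introduces follows the sketch
below. lock_perm_bit() is an invented helper, not the literal code;
RAW_LOCK_PERM_BASE (100) is the existing byte-offset convention in
block/file-posix.c, and locked_perm is the new BDRVRawState field.

    /* Lock the byte for permission bit i only if we don't hold it yet,
     * so a repeated request becomes a no-op instead of another fcntl()
     * round trip. */
    static int lock_perm_bit(BDRVRawState *s, int fd, int i, Error **errp)
    {
        uint64_t bit = 1ULL << i;

        if (s->locked_perm & bit) {
            return 0;                       /* already held: no syscall */
        }
        if (qemu_lock_fd(fd, RAW_LOCK_PERM_BASE + i, 1, false)) {
            error_setg(errp, "Failed to lock byte %d", RAW_LOCK_PERM_BASE + i);
            return -1;
        }
        s->locked_perm |= bit;              /* remember it for next time */
        return 0;
    }

The unlock path clears the bit symmetrically, which is what makes the
libvirt scenario survivable: a lock operation we never re-issue cannot
fail on an fd whose permissions changed underneath us.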

[Qemu-block] [PATCH v5 0/3] file-posix: Simplifications on image locking

2018-10-11 Thread Fam Zheng
v5: Address Max's comments (Thanks for reviewing):
- Clean up after test done.
- Add rev-by to patch 1 and 2.

v4: Fix test on systems without OFD. [Patchew]

The first patch reduces chances of QEMU crash in unusual (but not unlikely)
cases especially when used by Libvirt (see commit message).

The second patch halves fd for images.

The third adds some more test for patch one (would have caught the regression
caused by v2).

Fam Zheng (3):
  file-posix: Skip effectiveless OFD lock operations
  file-posix: Drop s->lock_fd
  tests: Add unit tests for image locking

 block/file-posix.c |  83 +---
 tests/Makefile.include |   2 +
 tests/test-image-locking.c | 157 +
 3 files changed, 212 insertions(+), 30 deletions(-)
 create mode 100644 tests/test-image-locking.c

-- 
2.17.1




Re: [Qemu-block] [PATCH v2] nvme: correct locking around completion

2018-10-10 Thread Fam Zheng
On Wed, 10/10 13:19, Paolo Bonzini wrote:
> On 09/10/2018 21:37, John Snow wrote:
> > 
> > 
> > On 08/14/2018 02:27 AM, Paolo Bonzini wrote:
> >> nvme_poll_queues is already protected by q->lock, and
> >> AIO callbacks are invoked outside the AioContext lock.
> >> So remove the acquire/release pair in nvme_handle_event.
> >>
> >> Signed-off-by: Paolo Bonzini 
> >> ---
> >>  block/nvme.c | 2 --
> >>  1 file changed, 2 deletions(-)
> >>
> >> diff --git a/block/nvme.c b/block/nvme.c
> >> index 6f71122bf5..42116907ed 100644
> >> --- a/block/nvme.c
> >> +++ b/block/nvme.c
> >> @@ -489,10 +489,8 @@ static void nvme_handle_event(EventNotifier *n)
> >>  BDRVNVMeState *s = container_of(n, BDRVNVMeState, irq_notifier);
> >>  
> >>  trace_nvme_handle_event(s);
> >> -aio_context_acquire(s->aio_context);
> >>  event_notifier_test_and_clear(n);
> >>  nvme_poll_queues(s);
> >> -aio_context_release(s->aio_context);
> >>  }
> >>  
> >>  static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> >>
> > 
> > This is over a month old (and seemingly didn't land); do we still want it?
> > 
> 
> Yes, we do.
> 

I'll send a pull request today. Thanks!

Fam



Re: [Qemu-block] [PATCH v2] nvme: correct locking around completion

2018-10-10 Thread Fam Zheng
On Wed, 10/10 13:19, Paolo Bonzini wrote:
> On 09/10/2018 21:37, John Snow wrote:
> > 
> > 
> > On 08/14/2018 02:27 AM, Paolo Bonzini wrote:
> >> nvme_poll_queues is already protected by q->lock, and
> >> AIO callbacks are invoked outside the AioContext lock.
> >> So remove the acquire/release pair in nvme_handle_event.
> >>
> >> Signed-off-by: Paolo Bonzini 
> >> ---
> >>  block/nvme.c | 2 --
> >>  1 file changed, 2 deletions(-)
> >>
> >> diff --git a/block/nvme.c b/block/nvme.c
> >> index 6f71122bf5..42116907ed 100644
> >> --- a/block/nvme.c
> >> +++ b/block/nvme.c
> >> @@ -489,10 +489,8 @@ static void nvme_handle_event(EventNotifier *n)
> >>  BDRVNVMeState *s = container_of(n, BDRVNVMeState, irq_notifier);
> >>  
> >>  trace_nvme_handle_event(s);
> >> -aio_context_acquire(s->aio_context);
> >>  event_notifier_test_and_clear(n);
> >>  nvme_poll_queues(s);
> >> -aio_context_release(s->aio_context);
> >>  }
> >>  
> >>  static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> >>
> > 
> > This is over a month old (and seemingly didn't land); do we still want it?
> > 
> 
> Yes, we do.

Queued, thanks!

Fam



Re: [Qemu-block] [PATCH v3] vmdk: align end of file to a sector boundary

2018-10-07 Thread Fam Zheng
On Fri, 10/05 10:00, yuchenlin wrote:
> Ping?

Hi,

This was merged as 51b3c6b73acae1e3fd3c7d441fc86dd17356695f.

Fam

> 
> On 2018-09-13 16:34, Fam Zheng wrote:
> > On Thu, 09/13 16:29, yuchen...@synology.com wrote:
> > > From: yuchenlin 
> > > 
> > > There is a rare case in which the size of the last compressed cluster
> > > is larger than the cluster size, which causes the file not to be
> > > aligned at the sector boundary.
> > > 
> > > There are three reasons to do this. First, if the vmdk is not aligned
> > > at the sector boundary, there may be many undefined behaviors; for
> > > example, vbox will show VMDK: Compressed image is corrupted
> > > 'syno-vm-disk1.vmdk' (VERR_ZIP_CORRUPTED) when we try to import an
> > > ova with an unaligned vmdk. Second, every cluster_sector is aligned
> > > to a sector, so the last one should be, too. Third, it eases reading
> > > with sector-based I/Os.
> > > 
> > > Signed-off-by: yuchenlin 
> > 
> > Reviewed-by: Fam Zheng 
> 



Re: [Qemu-block] [Qemu-stable] [PATCH] block-backend: Set werror/rerror defaults in blk_new()

2018-09-28 Thread Fam Zheng
On Fri, Sep 28, 2018 at 5:32 PM Kevin Wolf  wrote:
>
> Currently, the default values for werror and rerror have to be set
> explicitly with blk_set_on_error() by the callers of blk_new(). The only
> caller actually doing this is blockdev_init(), which is called for
> BlockBackends created using -drive.
>
> In particular, anonymous BlockBackends created with
> -device ...,drive= didn't get the correct default set and
> instead defaulted to the integer value 0 (= BLOCKDEV_ON_ERROR_REPORT).
> This is the intended default for rerror anyway, but the default for
> werror should be BLOCKDEV_ON_ERROR_ENOSPC.
>
> Set the defaults in blk_new() instead so that they apply no matter what
> way the BlockBackend was created.
>
> Cc: qemu-sta...@nongnu.org
> Signed-off-by: Kevin Wolf 
> ---
>  block/block-backend.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/block/block-backend.c b/block/block-backend.c
> index 7b1ec5071b..dc0cd57724 100644
> --- a/block/block-backend.c
> +++ b/block/block-backend.c
> @@ -325,6 +325,9 @@ BlockBackend *blk_new(uint64_t perm, uint64_t shared_perm)
>  blk->shared_perm = shared_perm;
>  blk_set_enable_write_cache(blk, true);
>
> +blk->on_read_error = BLOCKDEV_ON_ERROR_REPORT;
> +blk->on_write_error = BLOCKDEV_ON_ERROR_ENOSPC;
> +
>  block_acct_init(&blk->stats);
>
>  notifier_list_init(&blk->remove_bs_notifiers);
> --
> 2.13.6
>
>

Reviewed-by: Fam Zheng 



Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] MAINTAINERS: Remove myself as block maintainer

2018-09-25 Thread Fam Zheng
On Tue, 09/25 09:37, Markus Armbruster wrote:
> Do we want to have a dedicated VHDX driver submaintainer again?  Fam,
> you're maintaining VMDK, could you cover VHDX as well?

I don't know a lot about VHDX internals. Considering my capacity at the moment I'd
rather not take this one.

Fam



Re: [Qemu-block] [Qemu-devel] [PATCH 2/2] MAINTAINERS: Remove myself as block maintainer

2018-09-24 Thread Fam Zheng
On Tue, 09/25 07:00, Markus Armbruster wrote:
> Jeff Cody  writes:
> 
> > I'll not be involved in day-to-day qemu development.  Remove
> > myself as maintainer from the remainder of the network block drivers
> > (and vhdx), and revert them to the general block layer maintainership.
> >
> > Signed-off-by: Jeff Cody 
> > ---
> >  MAINTAINERS | 14 --
> >  1 file changed, 14 deletions(-)
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index e93f79672f..6ef6932628 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -1982,28 +1982,22 @@ F: block/vmdk.c
> >  
> >  RBD
> >  M: Josh Durgin 
> > -M: Jeff Cody 
> >  L: qemu-block@nongnu.org
> >  S: Supported
> >  F: block/rbd.c
> > -T: git git://github.com/codyprime/qemu-kvm-jtc.git block
> >  
> >  Sheepdog
> >  M: Hitoshi Mitake 
> >  M: Liu Yuan 
> > -M: Jeff Cody 
> >  L: qemu-block@nongnu.org
> >  L: sheep...@lists.wpkg.org
> >  S: Supported
> >  F: block/sheepdog.c
> > -T: git git://github.com/codyprime/qemu-kvm-jtc.git block
> >  
> >  VHDX
> > -M: Jeff Cody 
> >  L: qemu-block@nongnu.org
> >  S: Supported
> >  F: block/vhdx*
> > -T: git git://github.com/codyprime/qemu-kvm-jtc.git block
> 
> Does "S: Supported" make sense without an M:?
> 
> >  
> >  VDI
> >  M: Stefan Weil 
> > @@ -2034,34 +2028,26 @@ F: docs/interop/nbd.txt
> >  T: git git://repo.or.cz/qemu/ericb.git nbd
> >  
> >  NFS
> > -M: Jeff Cody 
> >  M: Peter Lieven 
> >  L: qemu-block@nongnu.org
> >  S: Maintained
> >  F: block/nfs.c
> > -T: git git://github.com/codyprime/qemu-kvm-jtc.git block
> >  
> >  SSH
> >  M: Richard W.M. Jones 
> > -M: Jeff Cody 
> >  L: qemu-block@nongnu.org
> >  S: Supported
> >  F: block/ssh.c
> > -T: git git://github.com/codyprime/qemu-kvm-jtc.git block
> >  
> >  CURL
> > -M: Jeff Cody 
> >  L: qemu-block@nongnu.org
> >  S: Supported
> >  F: block/curl.c
> > -T: git git://github.com/codyprime/qemu-kvm-jtc.git block
> 
> Likewise.
> 
> >  GLUSTER
> > -M: Jeff Cody 
> >  L: qemu-block@nongnu.org
> >  S: Supported
> >  F: block/gluster.c
> > -T: git git://github.com/codyprime/qemu-kvm-jtc.git block
> 
> Likewise.
> 
> >  Null Block Driver
> >  M: Fam Zheng 
> 

Block drivers without an M: naturally fall under the overall maintainership of
the block layer (Kevin), so IMO keeping the statuses is fine. Maybe CURL can be
degraded to Maintained, though.

Fam



[Qemu-block] [PATCH] file-posix: Include filename in locking error message

2018-09-24 Thread Fam Zheng
Image locking errors happening at device initialization time don't say
which file cannot be locked; for instance,

-device scsi-disk,drive=drive-1: Failed to get shared "write" lock
Is another process using the image?

could refer to either the overlay image or its backing image.

Hoist the error_append_hint call to the caller of raw_check_lock_bytes,
where the file name is known, and include the file name in the hint.

Signed-off-by: Fam Zheng 
---
 block/file-posix.c | 10 +++--
 tests/qemu-iotests/153.out | 76 +++---
 tests/qemu-iotests/182.out |  2 +-
 3 files changed, 45 insertions(+), 43 deletions(-)

diff --git a/block/file-posix.c b/block/file-posix.c
index fe83cbf0eb..327f39ca45 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -741,8 +741,6 @@ static int raw_check_lock_bytes(int fd, uint64_t perm, 
uint64_t shared_perm,
"Failed to get \"%s\" lock",
perm_name);
 g_free(perm_name);
-error_append_hint(errp,
-  "Is another process using the image?\n");
 return ret;
 }
 }
@@ -758,8 +756,6 @@ static int raw_check_lock_bytes(int fd, uint64_t perm, 
uint64_t shared_perm,
"Failed to get shared \"%s\" lock",
perm_name);
 g_free(perm_name);
-error_append_hint(errp,
-  "Is another process using the image?\n");
 return ret;
 }
 }
@@ -796,6 +792,9 @@ static int raw_handle_perm_lock(BlockDriverState *bs,
 if (!ret) {
 return 0;
 }
+error_append_hint(errp,
+  "Is another process using the image [%s]?\n",
+  bs->filename);
 }
 op = RAW_PL_ABORT;
 /* fall through to unlock bytes. */
@@ -2217,6 +2216,9 @@ raw_co_create(BlockdevCreateOptions *options, Error 
**errp)
 /* Step two: Check that nobody else has taken conflicting locks */
 result = raw_check_lock_bytes(fd, perm, shared, errp);
 if (result < 0) {
+error_append_hint(errp,
+  "Is another process using the image [%s]?\n",
+  file_opts->filename);
 goto out_unlock;
 }
 
diff --git a/tests/qemu-iotests/153.out b/tests/qemu-iotests/153.out
index 93eaf10486..884254868c 100644
--- a/tests/qemu-iotests/153.out
+++ b/tests/qemu-iotests/153.out
@@ -12,11 +12,11 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=33554432 
backing_file=TEST_DIR/t
 
 == Launching another QEMU, opts: '' ==
 QEMU_PROG: -drive file=TEST_DIR/t.qcow2,if=none,: Failed to get "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 
 == Launching another QEMU, opts: 'read-only=on' ==
 QEMU_PROG: -drive file=TEST_DIR/t.qcow2,if=none,read-only=on: Failed to get 
shared "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 
 == Launching another QEMU, opts: 'read-only=on,force-share=on' ==
 
@@ -24,77 +24,77 @@ Is another process using the image?
 
 _qemu_io_wrapper -c read 0 512 TEST_DIR/t.qcow2
 can't open device TEST_DIR/t.qcow2: Failed to get "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 
 _qemu_io_wrapper -r -c read 0 512 TEST_DIR/t.qcow2
 can't open device TEST_DIR/t.qcow2: Failed to get shared "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 
 _qemu_io_wrapper -c open  TEST_DIR/t.qcow2 -c read 0 512
 can't open device TEST_DIR/t.qcow2: Failed to get "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 no file open, try 'help open'
 
 _qemu_io_wrapper -c open -r  TEST_DIR/t.qcow2 -c read 0 512
 can't open device TEST_DIR/t.qcow2: Failed to get shared "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 no file open, try 'help open'
 
 _qemu_img_wrapper info TEST_DIR/t.qcow2
 qemu-img: Could not open 'TEST_DIR/t.qcow2': Failed to get shared "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 
 _qemu_img_wrapper check TEST_DIR/t.qcow2
 qemu-img: Could not open 'TEST_DIR/t.qcow2': Failed to get shared "write" lock
-Is another process using the image?
+Is another process using the image [TEST_DIR/t.qcow2]?
 
 _qemu_img_wrapper compare TEST_DIR/t.qcow2 TEST_DIR/t.qcow2
 qemu-img: Could not open 'TEST_DIR/t.qcow2': Failed to get shared "write" lock
-Is another process using
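
For reference, the hint machinery being hoisted here works as in the
following sketch (the image path is made up):

    Error *err = NULL;

    error_setg(&err, "Failed to get \"write\" lock");
    error_append_hint(&err, "Is another process using the image [%s]?\n",
                      "/tmp/disk.img");
    error_report_err(err);   /* prints the message, then the hint, then frees */

Because the hint is attached to the Error object itself, appending it in
raw_handle_perm_lock() and raw_co_create() instead of raw_check_lock_bytes()
changes where the text is composed, not how it reaches the user.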

Re: [Qemu-block] [Qemu-devel] [PATCH v4 0/3] file-posix: Simplifications on image locking

2018-09-18 Thread Fam Zheng
On Tue, 08/21 08:58, Fam Zheng wrote:
> v4: Fix test on systems without OFD. [Patchew]

Ping?

> 
> The first patch reduces chances of QEMU crash in unusual (but not unlikely)
> cases especially when used by Libvirt (see commit message).
> 
> The second patch halves fd for images.
> 
> The third adds some more test for patch one (would have caught the regression
> caused by v2).
> 
> Fam Zheng (3):
>   file-posix: Skip effectiveless OFD lock operations
>   file-posix: Drop s->lock_fd
>   tests: Add unit tests for image locking
> 
>  block/file-posix.c |  83 
>  tests/Makefile.include |   2 +
>  tests/test-image-locking.c | 154 +
>  3 files changed, 209 insertions(+), 30 deletions(-)
>  create mode 100644 tests/test-image-locking.c
> 
> -- 
> 2.17.1
> 
> 



Re: [Qemu-block] [PATCH 0/3] aio-posix: polling mode bug fixes

2018-09-18 Thread Fam Zheng
On Wed, 09/12 19:10, Paolo Bonzini wrote:
> Patch 1 fixes a too-strict assertion that could fire when aio_poll
> is called in parallel with aio_set_fd_handler.
> 
> Patch 2 and 3 reinstate the performance benefits of polling, which were
> essentially disabled by commit 70232b5253 ("aio-posix: Don't count
> ctx->notifier as progress when polling", 2018-08-15).
> 
> Please review carefully! :)

Queued, thanks!

Fam



Re: [Qemu-block] [Qemu-devel] [RFC PATCH] scsi-block: Deprecate rotation_rate

2018-09-17 Thread Fam Zheng
On Mon, 09/17 10:55, Thomas Huth wrote:
> On 2018-09-17 10:31, Fam Zheng wrote:
> > This option is added together with scsi-disk but is never honoured,
> > because we don't emulate the VPD page for scsi-block. We could intercept
> > and inject the user-specified value like for max xfer len, but it's
> > probably not helpful since the intent of 070f80095ad was for random
> > entropy aspects, not for performance. If emulated rotation rate is
> > desired, scsi-hd is more suitable.
> > 
> > Signed-off-by: Fam Zheng 
> > 
> > ---
> > 
> > RFC to discuss about if we want to keep the option. Another possibility
> > is, naturally, to actually write code to make use of the option.
> 
> As far as I've understood, scsi-disk is considered to be legacy anyway
> and scsi-cd or scsi-hd should rather always be used instead. Thus
> another idea that was mentioned in the past already is: Should we maybe
> rather deprecate the whole "scsi-disk" device? Or deprecate
> "media=cdrom" and make this an alias to "scsi-hd" instead?
> 
> Independent of that discussion, I think your patch is fine, it does
> certainly not make sense to implement this for a legacy device anymore.
> But please also add a deprecation note to qemu-deprecated.texi to mark
> it as deprecated "officially".

Ah, that is not what I meant. Sorry for the confusion. I should have typed
scsi-hd in the beginning of the commit message.  Scsi-block, which this patch
applies to, is for scsi command passthrough and is different from both scsi-hd
and scsi-disk.

Fam



Re: [Qemu-block] [PATCH v2 11/17] block-backend: Decrease in_flight only after callback

2018-09-14 Thread Fam Zheng
On Thu, 09/13 18:59, Kevin Wolf wrote:
> On 13.09.2018 at 17:10, Paolo Bonzini wrote:
> > On 13/09/2018 14:52, Kevin Wolf wrote:
> > > + if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> > > + /* If we are in the main thread, the callback is allowed to unref
> > > + * the BlockBackend, so we have to hold an additional reference */
> > > + blk_ref(acb->rwco.blk);
> > > + }
> > > acb->common.cb(acb->common.opaque, acb->rwco.ret);
> > > + blk_dec_in_flight(acb->rwco.blk);
> > > + if (qemu_get_current_aio_context() == qemu_get_aio_context()) {
> > > + blk_unref(acb->rwco.blk);
> > > + }
> > 
> > Is this something that happens only for some specific callers?  That is,
> > which callers are sure that the callback is invoked from the main thread?
> 
> I can't seem to reproduce the problem I saw any more even when reverting
> the bdrv_ref/unref pair. If I remember correctly it was actually a
> nested aio_poll() that was running a block job completion or something
> like that - which would obviously only happen on the main thread because
> the job intentionally defers to the main thread.
> 
> The only reason I made this conditional is that I think bdrv_unref()
> still isn't safe outside the main thread, is it?

I think that is correct.

Fam



Re: [Qemu-block] [PATCH v3] vmdk: align end of file to a sector boundary

2018-09-13 Thread Fam Zheng
On Thu, 09/13 16:29, yuchen...@synology.com wrote:
> From: yuchenlin 
> 
> There is a rare case in which the size of the last compressed cluster
> is larger than the cluster size, which causes the file not to be
> aligned at the sector boundary.
> 
> There are three reasons to do this. First, if the vmdk is not aligned
> at the sector boundary, there may be many undefined behaviors; for
> example, vbox will show VMDK: Compressed image is corrupted
> 'syno-vm-disk1.vmdk' (VERR_ZIP_CORRUPTED) when we try to import an
> ova with an unaligned vmdk. Second, every cluster_sector is aligned
> to a sector, so the last one should be, too. Third, it eases reading
> with sector-based I/Os.
> 
> Signed-off-by: yuchenlin 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 1/3] aio-posix: fix concurrent access to poll_disable_cnt

2018-09-13 Thread Fam Zheng
On Thu, 09/13 10:29, Paolo Bonzini wrote:
> On 13/09/2018 08:56, Fam Zheng wrote:
> >> +/* No need to order poll_disable_cnt writes against other updates;
> >> + * the counter is only used to avoid wasting time and latency on
> >> + * iterated polling when the system call will be ultimately necessary.
> >> + * Changing handlers is a rare event, and a little wasted polling 
> >> until
> >> + * the aio_notify below is not an issue.
> >> + */
> >> +atomic_set(&ctx->poll_disable_cnt,
> >> +   atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);
> >
> > Why not atomic_add?
> 
> This is not lockless, it's protected by list_lock, so there's no race
> condition involved.  I'm just mimicking what is done for other similar
> cases, for example involving seqlocks.

Yeah, no doubt about the correctness. Just curious about the preference since
atomic_add is less typing and one less atomic_* op.

Fam

> 
> The alternative would be to add a full set of
> atomic_{add,sub,...}_relaxed atomics.
> 
> Paolo
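
For readers following along: in qemu/atomic.h, atomic_add() expands to a
sequentially consistent fetch-and-add, while the atomic_read()/atomic_set()
pair is plain relaxed loads and stores, which is all this counter needs once
list_lock serializes the writers. The two spellings side by side:

    /* One locked read-modify-write; stronger ordering than required: */
    atomic_add(&ctx->poll_disable_cnt, poll_disable_change);

    /* Relaxed load plus relaxed store; safe because only list_lock
     * holders write the counter and readers tolerate staleness: */
    atomic_set(&ctx->poll_disable_cnt,
               atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);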



Re: [Qemu-block] [PATCH v2] vmdk: align end of file to a sector boundary

2018-09-13 Thread Fam Zheng
On Thu, 09/13 15:47, yuchenlin wrote:
> On 2018-09-13 10:54, Fam Zheng wrote:
> > On Thu, 09/13 10:31, yuchen...@synology.com wrote:
> > > From: yuchenlin 
> > > 
> > > There is a rare case in which the size of the last compressed cluster
> > > is larger than the cluster size, which causes the file not to be
> > > aligned at the sector boundary.
> > 
> > The code looks good to me. Can you also explain why it is important to
> > align
> > file size to sector boundary in the comment?
> > 
> > Fam
> > 
> 
> In my opinion, there are three reasons to do it. First, if the vmdk is not
> aligned at the sector boundary, there may be many undefined behaviors; for
> example, vbox will show VMDK: Compressed image is corrupted '88-disk1.vmdk'
> (VERR_ZIP_CORRUPTED) when we try to import an ova with an unaligned vmdk.
> Second, every cluster_sector is aligned to a sector, so the last one should
> be, too. Third, it eases reading with sector-based I/Os.
> 
> What do you think?

Yes, it would be nice if you could add this information to the commit message or
the code comment.

Fam

> 
> yuchenlin
> 
> > > 
> > > Signed-off-by: yuchenlin 
> > > ---
> > > v1 -> v2:
> > > * Add more detail comment.
> > > * Add QEMU_ALIGN_UP to show the intention more clearly.
> > > * Add newline in the end of bdrv_truncate.
> > > 
> > > thanks
> > > 
> > >  block/vmdk.c | 21 +
> > >  1 file changed, 21 insertions(+)
> > > 
> > > diff --git a/block/vmdk.c b/block/vmdk.c
> > > index a9d0084e36..2c9e86d98f 100644
> > > --- a/block/vmdk.c
> > > +++ b/block/vmdk.c
> > > @@ -1698,6 +1698,27 @@ static int coroutine_fn
> > >  vmdk_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
> > > uint64_t bytes, QEMUIOVector *qiov)
> > >  {
> > > +if (bytes == 0) {
> > > +/* The caller will write bytes 0 to signal EOF.
> > > + * When receive it, we align EOF to a sector boundary. */
> > > +BDRVVmdkState *s = bs->opaque;
> > > +int i, ret;
> > > +int64_t length;
> > > +
> > > +for (i = 0; i < s->num_extents; i++) {
> > > +length = bdrv_getlength(s->extents[i].file->bs);
> > > +if (length < 0) {
> > > +return length;
> > > +}
> > > +length = QEMU_ALIGN_UP(length, BDRV_SECTOR_SIZE);
> > > +ret = bdrv_truncate(s->extents[i].file, length,
> > > +PREALLOC_MODE_OFF, NULL);
> > > +if (ret < 0) {
> > > +return ret;
> > > +}
> > > +}
> > > +return 0;
> > > +}
> > >  return vmdk_co_pwritev(bs, offset, bytes, qiov, 0);
> > >  }
> > > 
> > > --
> > > 2.18.0
> > > 
> 
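
As a sanity check of the arithmetic: QEMU_ALIGN_UP rounds its first argument
up to the next multiple of the second, so the truncate above can only grow
the file by up to 511 zero bytes and never drops cluster data. For example
(the length is hypothetical):

    int64_t length  = 1000000001;    /* unaligned EOF after the last cluster */
    int64_t aligned = QEMU_ALIGN_UP(length, BDRV_SECTOR_SIZE);
    /* BDRV_SECTOR_SIZE is 512, so aligned == 1000000512 here, and
     * bdrv_truncate() pads the tail with zeroes up to that offset. */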



Re: [Qemu-block] [PATCH 3/3] aio-posix: do skip system call if ctx->notifier polling succeeds

2018-09-13 Thread Fam Zheng
On Wed, 09/12 19:10, Paolo Bonzini wrote:
> Commit 70232b5253 ("aio-posix: Don't count ctx->notifier as progress when
> 2018-08-15), by not reporting progress, causes aio_poll to execute the
> system call when polling succeeds because of ctx->notifier.  This introduces
> latency before the call to aio_bh_poll() and negates the advantages of
> polling, unfortunately.
> 
> The fix builds on the previous patch, separating the effect of polling on
> the timeout from the progress reported to aio_poll().  ctx->notifier
> does zero the timeout, causing the caller to skip the system call,
> but it does not report progress, so that the bug fix of commit 70232b5253
> still stands.
> 
> Fixes: 70232b5253a3c4e03ed1ac47ef9246a8ac66c6fa
> Signed-off-by: Paolo Bonzini 
> ---
>  util/aio-posix.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 21d8987179..efa79b9e62 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -498,10 +498,11 @@ static bool run_poll_handlers_once(AioContext *ctx, 
> int64_t *timeout)
>  QLIST_FOREACH_RCU(node, &ctx->aio_handlers, node) {
>  if (!node->deleted && node->io_poll &&
>  aio_node_check(ctx, node->is_external) &&
> -node->io_poll(node->opaque) &&
> -node->opaque != >notifier) {
> +node->io_poll(node->opaque)) {
>  *timeout = 0;
> -progress = true;
> +if (node->opaque != >notifier) {
> +progress = true;
> +}
>  }
>  
>  /* Caller handles freeing deleted nodes.  Don't do it here. */
> -- 
> 2.17.1
> 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 2/3] aio-posix: compute timeout before polling

2018-09-13 Thread Fam Zheng
>  
>  bool aio_poll(AioContext *ctx, bool blocking)
> @@ -605,8 +610,14 @@ bool aio_poll(AioContext *ctx, bool blocking)
>  start = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>  }
>  
> -progress = try_poll_mode(ctx, blocking);
> -if (!progress) {
> +timeout = blocking ? aio_compute_timeout(ctx) : 0;
> +progress = try_poll_mode(ctx, &timeout);
> +assert(!(timeout && progress));
> +
> +/* If polling is allowed, non-blocking aio_poll does not need the
> + * system call---a single round of run_poll_handlers_once suffices.
> + */
> +if (timeout || atomic_read(&ctx->poll_disable_cnt)) {
>  assert(npfd == 0);
>  
>  /* fill pollfds */
> @@ -620,8 +631,6 @@ bool aio_poll(AioContext *ctx, bool blocking)
>  }
>  }
>  
> -timeout = blocking ? aio_compute_timeout(ctx) : 0;
> -
>  /* wait until next event */
>  if (aio_epoll_check_poll(ctx, pollfds, npfd, timeout)) {
>  AioHandler epoll_handler;
> diff --git a/util/trace-events b/util/trace-events
> index 4822434c89..79569b7fdf 100644
> --- a/util/trace-events
> +++ b/util/trace-events
> @@ -1,8 +1,8 @@
>  # See docs/devel/tracing.txt for syntax documentation.
>  
>  # util/aio-posix.c
> -run_poll_handlers_begin(void *ctx, int64_t max_ns) "ctx %p max_ns %"PRId64
> -run_poll_handlers_end(void *ctx, bool progress) "ctx %p progress %d"
> +run_poll_handlers_begin(void *ctx, int64_t max_ns, int64_t timeout) "ctx %p 
> max_ns %"PRId64 " timeout %"PRId64
> +run_poll_handlers_end(void *ctx, bool progress, int64_t timeout) "ctx %p 
> progress %d new timeout %"PRId64
>  poll_shrink(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new 
> %"PRId64
>  poll_grow(void *ctx, int64_t old, int64_t new) "ctx %p old %"PRId64" new 
> %"PRId64
>  
> -- 
> 2.17.1
> 
> 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 1/3] aio-posix: fix concurrent access to poll_disable_cnt

2018-09-13 Thread Fam Zheng
On Wed, 09/12 19:10, Paolo Bonzini wrote:
> It is valid for an aio_set_fd_handler to happen concurrently with
> aio_poll.  In that case, poll_disable_cnt can change under the heels
> of aio_poll, and the assertion on poll_disable_cnt can fail in
> run_poll_handlers.
> 
> Therefore, this patch simply checks the counter on every polling
> iteration.  There are no particular needs for ordering, since the
> polling loop is terminated anyway by aio_notify at the end of
> aio_set_fd_handler.
> 
> Signed-off-by: Paolo Bonzini 
> ---
>  util/aio-posix.c | 26 +++---
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 
> diff --git a/util/aio-posix.c b/util/aio-posix.c
> index 131ba6b4a8..5c29380575 100644
> --- a/util/aio-posix.c
> +++ b/util/aio-posix.c
> @@ -211,6 +211,7 @@ void aio_set_fd_handler(AioContext *ctx,
>  AioHandler *node;
>  bool is_new = false;
>  bool deleted = false;
> +int poll_disable_change;
>  
>  qemu_lockcnt_lock(&ctx->list_lock);
>  
> @@ -244,11 +245,9 @@ void aio_set_fd_handler(AioContext *ctx,
>  QLIST_REMOVE(node, node);
>  deleted = true;
>  }
> -
> -if (!node->io_poll) {
> -ctx->poll_disable_cnt--;
> -}
> +poll_disable_change = -!node->io_poll;
>  } else {
> +poll_disable_change = !io_poll - (node && !node->io_poll);
>  if (node == NULL) {
>  /* Alloc and insert if it's not already there */
>  node = g_new0(AioHandler, 1);
> @@ -257,10 +256,6 @@ void aio_set_fd_handler(AioContext *ctx,
>  
>  g_source_add_poll(&ctx->source, &node->pfd);
>  is_new = true;
> -
> -ctx->poll_disable_cnt += !io_poll;
> -} else {
> -ctx->poll_disable_cnt += !io_poll - !node->io_poll;
>  }
>  
>  /* Update handler with latest information */
> @@ -274,6 +269,15 @@ void aio_set_fd_handler(AioContext *ctx,
>  node->pfd.events |= (io_write ? G_IO_OUT | G_IO_ERR : 0);
>  }
>  
> +/* No need to order poll_disable_cnt writes against other updates;
> + * the counter is only used to avoid wasting time and latency on
> + * iterated polling when the system call will be ultimately necessary.
> + * Changing handlers is a rare event, and a little wasted polling until
> + * the aio_notify below is not an issue.
> + */
> +atomic_set(&ctx->poll_disable_cnt,
> +   atomic_read(&ctx->poll_disable_cnt) + poll_disable_change);

Why not atomic_add?

> +
>  aio_epoll_update(ctx, node, is_new);
>  qemu_lockcnt_unlock(&ctx->list_lock);
>  aio_notify(ctx);
> @@ -525,7 +529,6 @@ static bool run_poll_handlers(AioContext *ctx, int64_t 
> max_ns)
>  
>  assert(ctx->notify_me);
> -assert(qemu_lockcnt_count(&ctx->list_lock) > 0);
> -assert(ctx->poll_disable_cnt == 0);
>  
>  trace_run_poll_handlers_begin(ctx, max_ns);
>  
> @@ -533,7 +536,8 @@ static bool run_poll_handlers(AioContext *ctx, int64_t 
> max_ns)
>  
>  do {
>  progress = run_poll_handlers_once(ctx);
> -} while (!progress && qemu_clock_get_ns(QEMU_CLOCK_REALTIME) < end_time);
> +} while (!progress && qemu_clock_get_ns(QEMU_CLOCK_REALTIME) < end_time
> + && !atomic_read(&ctx->poll_disable_cnt));
>  
>  trace_run_poll_handlers_end(ctx, progress);
>  
> @@ -552,7 +556,7 @@ static bool run_poll_handlers(AioContext *ctx, int64_t 
> max_ns)
>   */
>  static bool try_poll_mode(AioContext *ctx, bool blocking)
>  {
> -    if (blocking && ctx->poll_max_ns && ctx->poll_disable_cnt == 0) {
> +if (blocking && ctx->poll_max_ns && 
> !atomic_read(&ctx->poll_disable_cnt)) {
>  /* See qemu_soonest_timeout() uint64_t hack */
>  int64_t max_ns = MIN((uint64_t)aio_compute_timeout(ctx),
>   (uint64_t)ctx->poll_ns);
> -- 
> 2.17.1
> 
> 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH v2] vmdk: align end of file to a sector boundary

2018-09-12 Thread Fam Zheng
On Thu, 09/13 10:31, yuchen...@synology.com wrote:
> From: yuchenlin 
> 
> There is a rare case in which the size of the last compressed cluster
> is larger than the cluster size, which causes the file not to be
> aligned at the sector boundary.

The code looks good to me. Can you also explain why it is important to align
file size to sector boundary in the comment?

Fam

> 
> Signed-off-by: yuchenlin 
> ---
> v1 -> v2:
> * Add more detail comment.
> * Add QEMU_ALIGN_UP to show the intention more clearly.
> * Add newline in the end of bdrv_truncate.
> 
> thanks
> 
>  block/vmdk.c | 21 +
>  1 file changed, 21 insertions(+)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index a9d0084e36..2c9e86d98f 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -1698,6 +1698,27 @@ static int coroutine_fn
>  vmdk_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
> uint64_t bytes, QEMUIOVector *qiov)
>  {
> +if (bytes == 0) {
> +/* The caller will write bytes 0 to signal EOF.
> + * When receive it, we align EOF to a sector boundary. */
> +BDRVVmdkState *s = bs->opaque;
> +int i, ret;
> +int64_t length;
> +
> +for (i = 0; i < s->num_extents; i++) {
> +length = bdrv_getlength(s->extents[i].file->bs);
> +if (length < 0) {
> +return length;
> +}
> +length = QEMU_ALIGN_UP(length, BDRV_SECTOR_SIZE);
> +ret = bdrv_truncate(s->extents[i].file, length,
> +PREALLOC_MODE_OFF, NULL);
> +if (ret < 0) {
> +return ret;
> +}
> +}
> +return 0;
> +}
>  return vmdk_co_pwritev(bs, offset, bytes, qiov, 0);
>  }
>  
> -- 
> 2.18.0
> 




Re: [Qemu-block] [PATCH] vmdk: align end of file to a sector boundary

2018-09-12 Thread Fam Zheng
On Wed, 09/12 17:52, yuchenlin wrote:
> 
> Fam Zheng wrote on 2018-09-12 17:34:
> > On Tue, 08/28 11:17, yuchen...@synology.com wrote:
> > > From: yuchenlin 
> > > 
> > > There is a rare case in which the size of the last compressed cluster
> > > is larger than the cluster size, which causes the file not to be
> > > aligned at the sector boundary.
> > 
> > I don't understand. Doesn't it mean that if you force the alignment by
> > truncating out the extra bytes, some data is lost?
> 
> You can take qcow2_co_pwritev_compressed in block/qcow2.c as an example.
> bdrv_getlength will return the length in bytes, which is always a multiple
> of BDRV_SECTOR_SIZE.
> After truncating to this size, the vmdk is extended to align with the
> sector boundary.
> > > Signed-off-by: yuchenlin 
> > > ---
> > >  block/vmdk.c | 18 ++
> > >  1 file changed, 18 insertions(+)
> > > 
> > > diff --git a/block/vmdk.c b/block/vmdk.c
> > > index a9d0084e36..a8ae7c65d2 100644
> > > --- a/block/vmdk.c
> > > +++ b/block/vmdk.c
> > > @@ -1698,6 +1698,24 @@ static int coroutine_fn
> > >  vmdk_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
> > >  uint64_t bytes, QEMUIOVector *qiov)
> > >  {
> > > +if (bytes == 0) {
> > 
> > Where is this bytes == 0 condition from?
> 
> From the end of convert_do_copy in qemu-img.c.
> 
> if (s->compressed && !s->ret) {
> /* signal EOF to align */
> ret = blk_pwrite_compressed(s->target, 0, NULL, 0);
> if (ret < 0) {
> return ret;
> }
> }

I see. Please mention this in the comment of the if check.

Fam

> 
> It signals the EOF to the block driver.
> > > +/* align end of file to a sector boundary. */
> > > +BDRVVmdkState *s = bs->opaque;
> > > +int i, ret;
> > > +int64_t length;
> > > +
> > > +for (i = 0; i < s->num_extents; i++) {
> > > +length = bdrv_getlength(s->extents[i].file->bs);
> > > +if (length < 0) {
> > > +return length;
> > > +}
> > > +ret = bdrv_truncate(s->extents[i].file, length, PREALLOC_MODE_OFF, NULL);
> > > +if (ret < 0) {
> > > +return ret;
> > > +}
> > > +}
> > > +return 0;
> > > +}
> > >  return vmdk_co_pwritev(bs, offset, bytes, qiov, 0);
> > > }
> > > 
> > > --
> > > 2.17.0
> > 
> > Fam
> 
> 
> yuchenlin
> 



Re: [Qemu-block] [PATCH 2/2] virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

2018-09-12 Thread Fam Zheng
On Wed, 09/12 13:11, Paolo Bonzini wrote:
> On 12/09/2018 03:31, Fam Zheng wrote:
> >>>
> >>> ctx is qemu_aio_context here, so there's no interaction with IOThread.
> >> In this case, it should be okay to have the reentrancy, what is the bug
> >> that this patch is fixing?
> > The same symptom as in the previous patch: virtio_scsi_handle_cmd_vq
> > hangs. The reason it hangs is fixed by the previous patch, but I don't
> > think it should be invoked as we're in the middle of
> > virtio_scsi_dataplane_stop(). Applying either one of the two patches
> > avoids the problem, but this one is more superficial.
> > What do you think?
> 
> I think it's okay if it is invoked.  The sequence is first you stop the
> vq, then you drain the BlockBackends, then you switch AioContext.  All
> that matters is the outcome when virtio_scsi_dataplane_stop returns.

Yes, but together with vIOMMU, it also effectively leads to a virtio_error(),
which is not clean. QEMU stderr when this call happens (with patch 1 but not
this patch):

2018-09-12T11:48:10.193023Z qemu-system-x86_64: vtd_iommu_translate: detected 
translation failure (dev=02:00:00, iova=0x0)
2018-09-12T11:48:10.193044Z qemu-system-x86_64: New fault is not recorded due 
to compression of faults
2018-09-12T11:48:10.193061Z qemu-system-x86_64: virtio: zero sized buffers are 
not allowed

Fam



Re: [Qemu-block] [PATCH] vmdk: align end of file to a sector boundary

2018-09-12 Thread Fam Zheng
On Tue, 08/28 11:17, yuchen...@synology.com wrote:
> From: yuchenlin 
> 
> There is a rare case in which the size of the last compressed cluster
> is larger than the cluster size, which causes the file not to be
> aligned at the sector boundary.

I don't understand. Doesn't it mean that if you force the alignment by
truncating out the extra bytes, some data is lost?

> 
> Signed-off-by: yuchenlin 
> ---
>  block/vmdk.c | 18 ++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/block/vmdk.c b/block/vmdk.c
> index a9d0084e36..a8ae7c65d2 100644
> --- a/block/vmdk.c
> +++ b/block/vmdk.c
> @@ -1698,6 +1698,24 @@ static int coroutine_fn
>  vmdk_co_pwritev_compressed(BlockDriverState *bs, uint64_t offset,
> uint64_t bytes, QEMUIOVector *qiov)
>  {
> +if (bytes == 0) {

Where is this bytes == 0 condition from?

> +/* align end of file to a sector boundary. */
> +BDRVVmdkState *s = bs->opaque;
> +int i, ret;
> +int64_t length;
> +
> +for (i = 0; i < s->num_extents; i++) {
> +length = bdrv_getlength(s->extents[i].file->bs);
> +if (length < 0) {
> +return length;
> +}
> +ret = bdrv_truncate(s->extents[i].file, length, 
> PREALLOC_MODE_OFF, NULL);
> +if (ret < 0) {
> +return ret;
> +}
> +}
> +return 0;
> +}
>  return vmdk_co_pwritev(bs, offset, bytes, qiov, 0);
>  }
>  
> -- 
> 2.17.0
> 

Fam



Re: [Qemu-block] [PATCH] util/async: use qemu_aio_coroutine_enter in co_schedule_bh_cb

2018-09-12 Thread Fam Zheng
On Wed, 09/05 11:33, Sergio Lopez wrote:
> AIO Coroutines shouldn't be managed by an AioContext different than the
> one assigned when they are created. aio_co_enter avoids entering a
> coroutine from a different AioContext, calling aio_co_schedule instead.
> 
> Scheduled coroutines are then entered by co_schedule_bh_cb using
> qemu_coroutine_enter, which just calls qemu_aio_coroutine_enter with the
> current AioContext obtained with qemu_get_current_aio_context.
> Eventually, co->ctx will be set to the AioContext passed as an argument
> to qemu_aio_coroutine_enter.
> 
> This means that, if an IO Thread's AioContext is being processed by the
> Main Thread (due to aio_poll being called with a BDS AioContext, as it
> happens in AIO_WAIT_WHILE among other places), the AioContext from some
> coroutines may be wrongly replaced with the one from the Main Thread.
> 
> This is the root cause behind some crashes, mainly triggered by the
> drain code at block/io.c. The most common are these abort and failed
> assertion:
> 
> util/async.c:aio_co_schedule
> 456 if (scheduled) {
> 457 fprintf(stderr,
> 458 "%s: Co-routine was already scheduled in '%s'\n",
> 459 __func__, scheduled);
> 460 abort();
> 461 }
> 
> util/qemu-coroutine-lock.c:
> 286 assert(mutex->holder == self);
> 
> But it's also known to cause random errors at different locations, and
> even SIGSEGV with broken coroutine backtraces.
> 
> By using qemu_aio_coroutine_enter directly in co_schedule_bh_cb, we can
> pass the correct AioContext as an argument, making sure co->ctx is not
> wrongly altered.
> 
> Signed-off-by: Sergio Lopez 
> ---
>  util/async.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 05979f8014..c10642a385 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -400,7 +400,7 @@ static void co_schedule_bh_cb(void *opaque)
>  
>  /* Protected by write barrier in qemu_aio_coroutine_enter */
>  atomic_set(&co->scheduled, NULL);
> -qemu_coroutine_enter(co);
> +qemu_aio_coroutine_enter(ctx, co);
>  aio_context_release(ctx);
>  }
>  }

Kevin, could you test this patch together with your next version of the drain
fix series? Since they are related, it's better if you could include it in your
series or even apply it yourself. Peter is not processing pull requests, so
scattering fixes in various trees will do no good.
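
For reference, the relationship between the two entry points (abridged from
util/qemu-coroutine.c) shows why the generic helper picks up the *calling*
thread's context:

    void qemu_coroutine_enter(Coroutine *co)
    {
        qemu_aio_coroutine_enter(qemu_get_current_aio_context(), co);
    }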

Fam



Re: [Qemu-block] [PATCH] util/async: use qemu_aio_coroutine_enter in co_schedule_bh_cb

2018-09-12 Thread Fam Zheng
On Wed, 09/05 11:33, Sergio Lopez wrote:
> AIO Coroutines shouldn't be managed by an AioContext different than the
> one assigned when they are created. aio_co_enter avoids entering a
> coroutine from a different AioContext, calling aio_co_schedule instead.
> 
> Scheduled coroutines are then entered by co_schedule_bh_cb using
> qemu_coroutine_enter, which just calls qemu_aio_coroutine_enter with the
> current AioContext obtained with qemu_get_current_aio_context.
> Eventually, co->ctx will be set to the AioContext passed as an argument
> to qemu_aio_coroutine_enter.
> 
> This means that, if an IO Thread's AioContext is being processed by the
> Main Thread (due to aio_poll being called with a BDS AioContext, as it
> happens in AIO_WAIT_WHILE among other places), the AioContext from some
> coroutines may be wrongly replaced with the one from the Main Thread.
> 
> This is the root cause behind some crashes, mainly triggered by the
> drain code at block/io.c. The most common are these abort and failed
> assertion:
> 
> util/async.c:aio_co_schedule
> 456 if (scheduled) {
> 457 fprintf(stderr,
> 458 "%s: Co-routine was already scheduled in '%s'\n",
> 459 __func__, scheduled);
> 460 abort();
> 461 }
> 
> util/qemu-coroutine-lock.c:
> 286 assert(mutex->holder == self);
> 
> But it's also known to cause random errors at different locations, and
> even SIGSEGV with broken coroutine backtraces.
> 
> By using qemu_aio_coroutine_enter directly in co_schedule_bh_cb, we can
> pass the correct AioContext as an argument, making sure co->ctx is not
> wrongly altered.
> 
> Signed-off-by: Sergio Lopez 
> ---
>  util/async.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/util/async.c b/util/async.c
> index 05979f8014..c10642a385 100644
> --- a/util/async.c
> +++ b/util/async.c
> @@ -400,7 +400,7 @@ static void co_schedule_bh_cb(void *opaque)
>  
>  /* Protected by write barrier in qemu_aio_coroutine_enter */
>  atomic_set(&co->scheduled, NULL);
> -qemu_coroutine_enter(co);
> +qemu_aio_coroutine_enter(ctx, co);
>  aio_context_release(ctx);
>  }
>  }
> -- 
> 2.17.0
> 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 2/2] virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

2018-09-11 Thread Fam Zheng
On Tue, 09/11 17:30, Paolo Bonzini wrote:
> On 11/09/2018 16:12, Fam Zheng wrote:
> > On Tue, 09/11 13:32, Paolo Bonzini wrote:
> >> On 10/09/2018 16:56, Fam Zheng wrote:
> >>> We have this unwanted call stack:
> >>>
> >>>   > ...
> >>>   > #13 0x5586602b7793 in virtio_scsi_handle_cmd_vq
> >>>   > #14 0x5586602b8d66 in virtio_scsi_data_plane_handle_cmd
> >>>   > #15 0x5586602ddab7 in virtio_queue_notify_aio_vq
> >>>   > #16 0x5586602dfc9f in virtio_queue_host_notifier_aio_poll
> >>>   > #17 0x5586607885da in run_poll_handlers_once
> >>>   > #18 0x55866078880e in try_poll_mode
> >>>   > #19 0x5586607888eb in aio_poll
> >>>   > #20 0x558660784561 in aio_wait_bh_oneshot
> >>>   > #21 0x5586602b9582 in virtio_scsi_dataplane_stop
> >>>   > #22 0x5586605a7110 in virtio_bus_stop_ioeventfd
> >>>   > #23 0x5586605a9426 in virtio_pci_stop_ioeventfd
> >>>   > #24 0x5586605ab808 in virtio_pci_common_write
> >>>   > #25 0x558660242396 in memory_region_write_accessor
> >>>   > #26 0x5586602425ab in access_with_adjusted_size
> >>>   > #27 0x558660245281 in memory_region_dispatch_write
> >>>   > #28 0x5586601e008e in flatview_write_continue
> >>>   > #29 0x5586601e01d8 in flatview_write
> >>>   > #30 0x5586601e04de in address_space_write
> >>>   > #31 0x5586601e052f in address_space_rw
> >>>   > #32 0x5586602607f2 in kvm_cpu_exec
> >>>   > #33 0x558660227148 in qemu_kvm_cpu_thread_fn
> >>>   > #34 0x55866078bde7 in qemu_thread_start
> >>>   > #35 0x7f5784906594 in start_thread
> >>>   > #36 0x7f5784639e6f in clone
> >>>
> >>> Avoid it with the aio_disable_external/aio_enable_external pair, so that
> >>> no vq poll handlers can be called in aio_wait_bh_oneshot.
> >>
> >> I don't understand.  We are in the vCPU thread, so not in the
> >> AioContext's home thread.  Why is aio_wait_bh_oneshot polling rather
> >> than going through the aio_wait_bh path?
> > 
> > What do you mean by 'aio_wait_bh path'? Here is aio_wait_bh_oneshot:
> 
> Sorry, I meant the "atomic_inc(&wait_->num_waiters);" path.  But if this
> backtrace is obtained without dataplane, that's the answer I was seeking.
> 
> > void aio_wait_bh_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
> > {
> > AioWaitBHData data = {
> > .cb = cb,
> > .opaque = opaque,
> > };
> > 
> > assert(qemu_get_current_aio_context() == qemu_get_aio_context());
> > 
> > aio_bh_schedule_oneshot(ctx, aio_wait_bh, &data);
> > AIO_WAIT_WHILE(&data.wait, ctx, !data.done);
> > }
> > 
> > ctx is qemu_aio_context here, so there's no interaction with IOThread.
> 
> In this case, it should be okay to have the reentrancy, what is the bug
> that this patch is fixing?

The same symptom as in the previous patch: virtio_scsi_handle_cmd_vq hangs. The
reason it hangs is fixed by the previous patch, but I don't think it should be
invoked as we're in the middle of virtio_scsi_dataplane_stop(). Applying either
one of the two patches avoids the problem, but this one is more superficial.
What do you think?

Fam



Re: [Qemu-block] [PATCH 2/2] virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

2018-09-11 Thread Fam Zheng
On Tue, 09/11 13:32, Paolo Bonzini wrote:
> On 10/09/2018 16:56, Fam Zheng wrote:
> > We have this unwanted call stack:
> > 
> >   > ...
> >   > #13 0x5586602b7793 in virtio_scsi_handle_cmd_vq
> >   > #14 0x5586602b8d66 in virtio_scsi_data_plane_handle_cmd
> >   > #15 0x5586602ddab7 in virtio_queue_notify_aio_vq
> >   > #16 0x5586602dfc9f in virtio_queue_host_notifier_aio_poll
> >   > #17 0x5586607885da in run_poll_handlers_once
> >   > #18 0x55866078880e in try_poll_mode
> >   > #19 0x5586607888eb in aio_poll
> >   > #20 0x558660784561 in aio_wait_bh_oneshot
> >   > #21 0x5586602b9582 in virtio_scsi_dataplane_stop
> >   > #22 0x5586605a7110 in virtio_bus_stop_ioeventfd
> >   > #23 0x5586605a9426 in virtio_pci_stop_ioeventfd
> >   > #24 0x5586605ab808 in virtio_pci_common_write
> >   > #25 0x558660242396 in memory_region_write_accessor
> >   > #26 0x5586602425ab in access_with_adjusted_size
> >   > #27 0x558660245281 in memory_region_dispatch_write
> >   > #28 0x5586601e008e in flatview_write_continue
> >   > #29 0x5586601e01d8 in flatview_write
> >   > #30 0x5586601e04de in address_space_write
> >   > #31 0x5586601e052f in address_space_rw
> >   > #32 0x5586602607f2 in kvm_cpu_exec
> >   > #33 0x558660227148 in qemu_kvm_cpu_thread_fn
> >   > #34 0x55866078bde7 in qemu_thread_start
> >   > #35 0x7f5784906594 in start_thread
> >   > #36 0x7f5784639e6f in clone
> > 
> > Avoid it with the aio_disable_external/aio_enable_external pair, so that
> > no vq poll handlers can be called in aio_wait_bh_oneshot.
> 
> I don't understand.  We are in the vCPU thread, so not in the
> AioContext's home thread.  Why is aio_wait_bh_oneshot polling rather
> than going through the aio_wait_bh path?

What do you mean by 'aio_wait_bh path'? Here is aio_wait_bh_oneshot:

void aio_wait_bh_oneshot(AioContext *ctx, QEMUBHFunc *cb, void *opaque)
{
AioWaitBHData data = {
.cb = cb,
.opaque = opaque,
};

assert(qemu_get_current_aio_context() == qemu_get_aio_context());

aio_bh_schedule_oneshot(ctx, aio_wait_bh, &data);
AIO_WAIT_WHILE(&data.wait, ctx, !data.done);
}

ctx is qemu_aio_context here, so there's no interaction with IOThread. This is
the full backtrace:

https://paste.ubuntu.com/p/Dm7zGzmmRG/

Fam



Re: [Qemu-block] [PATCH 14/14] test-bdrv-drain: Test nested poll in bdrv_drain_poll_top_level()

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> This is a regression test for a deadlock that could occur in callbacks
> called from the aio_poll() in bdrv_drain_poll_top_level(). The
> AioContext lock wasn't released and therefore would be taken a second
> time in the callback. This would cause a possible AIO_WAIT_WHILE() in
> the callback to hang.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 13/14] block: Remove aio_poll() in bdrv_drain_poll variants

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> bdrv_drain_poll_top_level() was buggy because it didn't release the
> AioContext lock of the node to be drained before calling aio_poll().
> This way, callbacks called by aio_poll() would possibly take the lock a
> second time and run into a deadlock with a nested AIO_WAIT_WHILE() call.
> 
> However, it turns out that the aio_poll() call isn't actually needed any
> more. It was introduced in commit 91af091f923, which is effectively
> reverted by this patch. The cases it was supposed to fix are now covered
> by bdrv_drain_poll(), which waits for block jobs to reach a quiescent
> state.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 
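
To spell out the deadlock being removed (my reading of the commit message,
as a call sketch rather than real code):

    aio_context_acquire(ctx);               /* caller holds the lock, count 1 */
    bdrv_drain_poll_top_level(...);
        aio_poll(ctx, false);               /* old code: polls with lock held */
            some_callback();
                aio_context_acquire(ctx);   /* count 2 */
                AIO_WAIT_WHILE(...);        /* can only drop it to 1: hang */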




Re: [Qemu-block] [PATCH 12/14] blockjob: Lie better in child_job_drained_poll()

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> Block jobs claim in .drained_poll() that they are in a quiescent state
> as soon as job->deferred_to_main_loop is true. This is obviously wrong,
> they still have a completion BH to run. We only get away with this
> because commit 91af091f923 added an unconditional aio_poll(false) to the
> drain functions, but this is bypassing the regular drain mechanisms.
> 
> However, just removing this and telling that the job is still active
> doesn't work either: The completion callbacks themselves call drain
> functions (directly, or indirectly with bdrv_reopen), so they would
> deadlock then.
> 
> As a better lie, tell that the job is active as long as the BH is
> pending, but falsely call it quiescent from the point in the BH when the
> completion callback is called. At this point, nested drain calls won't
> deadlock because they ignore the job, and outer drains will wait for the
> job to really reach a quiescent state because the callback is already
> running.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 11/14] mirror: Fix potential use-after-free in active commit

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> When starting an active commit job, other callbacks can run before
> mirror_start_job() calls bdrv_ref() where needed and cause the nodes to
> go away. Add another pair of bdrv_ref/unref() around it to protect
> against this case.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  block/mirror.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/block/mirror.c b/block/mirror.c
> index 6cc10df5c9..c42999eadf 100644
> --- a/block/mirror.c
> +++ b/block/mirror.c
> @@ -1679,6 +1679,11 @@ void commit_active_start(const char *job_id, 
> BlockDriverState *bs,
>  
>  orig_base_flags = bdrv_get_flags(base);
>  
> +/* bdrv_reopen() drains, which might make the BDSes go away before a
> + * reference is taken in mirror_start_job(). */
> +bdrv_ref(bs);
> +bdrv_ref(base);
> +
>  if (bdrv_reopen(base, bs->open_flags, errp)) {

Doesn't it need bdrv_unref's in this branch? (A sketch follows below the patch.)

>  return;
>  }
> @@ -1689,6 +1694,10 @@ void commit_active_start(const char *job_id, 
> BlockDriverState *bs,
>   &commit_active_job_driver, false, base, auto_complete,
>   filter_node_name, false, MIRROR_COPY_MODE_BACKGROUND,
>   &local_err);
> +
> +bdrv_unref(bs);
> +bdrv_unref(base);
> +
>  if (local_err) {
>  error_propagate(errp, local_err);
>  goto error_restore_flags;
> -- 
> 2.13.6
> 
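
A minimal sketch of the early-return handling asked about above
(hypothetical; whether a later revision did it exactly this way is not
shown here):

    if (bdrv_reopen(base, bs->open_flags, errp)) {
        bdrv_unref(bs);
        bdrv_unref(base);
        return;
    }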



Re: [Qemu-block] [PATCH 10/14] block-backend: Decrease in_flight only after callback

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> Request callbacks can do pretty much anything, including operations that
> will yield from the coroutine (such as draining the backend). In that
> case, a decreased in_flight would be visible to other code and could
> lead to a drain completing while the callback hasn't actually completed
> yet.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 09/14] block-backend: Fix potential double blk_delete()

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> blk_unref() first decreases the refcount of the BlockBackend and calls
> blk_delete() if the refcount reaches zero. Requests can still be in
> flight at this point, they are only drained during blk_delete():
> 
> At this point, arbitrary callbacks can run. If any callback takes a
> temporary BlockBackend reference, it will first increase the refcount to
> 1 and then decrease it to 0 again, triggering another blk_delete(). This
> will cause a use-after-free crash in the outer blk_delete().
> 
> Fix it by draining the BlockBackend before decreasing to refcount to 0.
> Assert in blk_ref() that it never takes the first refcount (which would
> mean that the BlockBackend is already being deleted).
> 
> Signed-off-by: Kevin Wolf 

Good one!
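
Spelling the sequence out (my paraphrase of the commit message, not code
from the tree):

    blk_unref(blk);            /* refcount 1 -> 0, enters blk_delete() */
        blk_drain(blk);        /* a completion callback runs here...   */
            blk_ref(blk);      /* ...takes a temporary ref: 0 -> 1     */
            blk_unref(blk);    /* 1 -> 0 again: second blk_delete()!   */
        /* the outer blk_delete() then continues on freed memory */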

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 08/14] block-backend: Add .drained_poll callback

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> A bdrv_drain operation must ensure that all parents are quiesced, this
> includes BlockBackends. Otherwise, callbacks called by requests that are
> completed on the BDS layer, but not quite yet on the BlockBackend layer
> could still create new requests.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 
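
For reference, the new callback boils down to reporting the BlockBackend's
in-flight counter (sketch, abridged from the patch):

    static bool blk_root_drained_poll(BdrvChild *child)
    {
        BlockBackend *blk = child->opaque;
        assert(blk->quiesce_counter);
        return !!blk->in_flight;
    }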




Re: [Qemu-block] [PATCH 07/14] aio-wait: Increase num_waiters even in home thread

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> Even if AIO_WAIT_WHILE() is called in the home context of the
> AioContext, we still want to allow the condition to change depending on
> other threads as long as they kick the AioWait. Specifically, block jobs
> can be running in an I/O thread and should then be able to kick a drain
> in the main loop context.

We can now move the atomic_inc/atomic_dec pair outside the if/else block,
but that's cosmetic.
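
Something like this, with the increment hoisted around both branches
(untested sketch of the cosmetic rearrangement, not part of this series):

    atomic_inc(&wait_->num_waiters);
    if (ctx_ && in_aio_context_home_thread(ctx_)) {
        while ((cond)) {
            aio_poll(ctx_, true);
            waited_ = true;
        }
    } else {
        /* ... existing non-home-thread branch ... */
    }
    atomic_dec(&wait_->num_waiters);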

Reviewed-by: Fam Zheng 


> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/block/aio-wait.h | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/include/block/aio-wait.h b/include/block/aio-wait.h
> index c85a62f798..46ba7f9111 100644
> --- a/include/block/aio-wait.h
> +++ b/include/block/aio-wait.h
> @@ -77,10 +77,12 @@ typedef struct {
>  AioWait *wait_ = (wait);   \
>  AioContext *ctx_ = (ctx);  \
>  if (ctx_ && in_aio_context_home_thread(ctx_)) {\
> +atomic_inc(&wait_->num_waiters);   \
>  while ((cond)) {   \
>  aio_poll(ctx_, true);  \
>  waited_ = true;\
>  }  \
> +atomic_dec(&wait_->num_waiters);   \
>  } else {   \
>  assert(qemu_get_current_aio_context() ==   \
> qemu_get_aio_context());\
> -- 
> 2.13.6
> 



Re: [Qemu-block] [PATCH 06/14] block: Add missing locking in bdrv_co_drain_bh_cb()

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> bdrv_do_drained_begin/end() assume that they are called with the
> AioContext lock of bs held. If we call drain functions from a coroutine
> with the AioContext lock held, we yield and schedule a BH to move out of
> coroutine context. This means that the lock for the home context of the
> coroutine is released and must be re-acquired in the bottom half.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/qemu/coroutine.h |  5 +
>  block/io.c   | 15 +++
>  util/qemu-coroutine.c|  5 +
>  3 files changed, 25 insertions(+)
> 
> diff --git a/include/qemu/coroutine.h b/include/qemu/coroutine.h
> index 6f8a487041..9801e7f5a4 100644
> --- a/include/qemu/coroutine.h
> +++ b/include/qemu/coroutine.h
> @@ -90,6 +90,11 @@ void qemu_aio_coroutine_enter(AioContext *ctx, Coroutine 
> *co);
>  void coroutine_fn qemu_coroutine_yield(void);
>  
>  /**
> + * Get the AioContext of the given coroutine
> + */
> +AioContext *coroutine_fn qemu_coroutine_get_aio_context(Coroutine *co);
> +
> +/**
>   * Get the currently executing coroutine
>   */
>  Coroutine *coroutine_fn qemu_coroutine_self(void);
> diff --git a/block/io.c b/block/io.c
> index 7100344c7b..914ba78f1a 100644
> --- a/block/io.c
> +++ b/block/io.c
> @@ -288,6 +288,18 @@ static void bdrv_co_drain_bh_cb(void *opaque)
>  BlockDriverState *bs = data->bs;
>  
>  if (bs) {
> +AioContext *ctx = bdrv_get_aio_context(bs);
> +AioContext *co_ctx = qemu_coroutine_get_aio_context(co);
> +
> +/*
> + * When the coroutine yielded, the lock for its home context was
> + * released, so we need to re-acquire it here. If it explicitly
> + * acquired a different context, the lock is still held and we don't
> + * want to lock it a second time (or AIO_WAIT_WHILE() would hang).
> + */

This condition is rather obscure. When is ctx not equal to co_ctx?

> +if (ctx == co_ctx) {
> +aio_context_acquire(ctx);
> +}
>  bdrv_dec_in_flight(bs);
>  if (data->begin) {
>  bdrv_do_drained_begin(bs, data->recursive, data->parent,
> @@ -296,6 +308,9 @@ static void bdrv_co_drain_bh_cb(void *opaque)
>  bdrv_do_drained_end(bs, data->recursive, data->parent,
>  data->ignore_bds_parents);
>  }
> +if (ctx == co_ctx) {
> +aio_context_release(ctx);
> +}
>  } else {
>  assert(data->begin);
>  bdrv_drain_all_begin();
> diff --git a/util/qemu-coroutine.c b/util/qemu-coroutine.c
> index 1ba4191b84..2295928d33 100644
> --- a/util/qemu-coroutine.c
> +++ b/util/qemu-coroutine.c
> @@ -198,3 +198,8 @@ bool qemu_coroutine_entered(Coroutine *co)
>  {
>  return co->caller;
>  }
> +
> +AioContext *coroutine_fn qemu_coroutine_get_aio_context(Coroutine *co)
> +{
> +return co->ctx;
> +}
> -- 
> 2.13.6
> 



Re: [Qemu-block] [PATCH 05/14] test-bdrv-drain: Test AIO_WAIT_WHILE() in completion callback

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> This is a regression test for a deadlock that occurred in block job
> completion callbacks (via job_defer_to_main_loop) because the AioContext
> lock was taken twice: once in job_finish_sync() and then again in
> job_defer_to_main_loop_bh(). This would cause AIO_WAIT_WHILE() to hang.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 04/14] job: Use AIO_WAIT_WHILE() in job_finish_sync()

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> job_finish_sync() needs to release the AioContext lock of the job before
> calling aio_poll(). Otherwise, callbacks called by aio_poll() would
> possibly take the lock a second time and run into a deadlock with a
> nested AIO_WAIT_WHILE() call.
> 
> Also, job_drain() without aio_poll() isn't necessarily enough to make
> progress on a job, it could depend on bottom halves to be executed.
> 
> Combine both open-coded while loops into a single AIO_WAIT_WHILE() call
> that solves both of these problems.
> 
> Signed-off-by: Kevin Wolf 
> ---
>  job.c | 14 ++
>  1 file changed, 6 insertions(+), 8 deletions(-)
> 
> diff --git a/job.c b/job.c
> index 9ad0b7476a..8480eda188 100644
> --- a/job.c
> +++ b/job.c
> @@ -29,6 +29,7 @@
>  #include "qemu/job.h"
>  #include "qemu/id.h"
>  #include "qemu/main-loop.h"
> +#include "block/aio-wait.h"
>  #include "trace-root.h"
>  #include "qapi/qapi-events-job.h"
>  
> @@ -998,6 +999,7 @@ void job_defer_to_main_loop(Job *job, 
> JobDeferToMainLoopFn *fn, void *opaque)
>  int job_finish_sync(Job *job, void (*finish)(Job *, Error **errp), Error 
> **errp)
>  {
>  Error *local_err = NULL;
> +AioWait dummy_wait = {};
>  int ret;
>  
>  job_ref(job);
> @@ -1010,14 +1012,10 @@ int job_finish_sync(Job *job, void (*finish)(Job *, 
> Error **errp), Error **errp)
>  job_unref(job);
>  return -EBUSY;
>  }
> -/* job_drain calls job_enter, and it should be enough to induce progress
> - * until the job completes or moves to the main thread. */
> -while (!job->deferred_to_main_loop && !job_is_completed(job)) {
> -job_drain(job);
> -}
> -while (!job_is_completed(job)) {
> -aio_poll(qemu_get_aio_context(), true);
> -}
> +
> +AIO_WAIT_WHILE(&dummy_wait, job->aio_context,
> +   (job_drain(job), !job_is_completed(job)));

The condition expression would read more elegant if job_drain() returns
progress.
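
Hypothetically (job_drain() returns void in this series, so this is only
what such a change would enable):

    AIO_WAIT_WHILE(&dummy_wait, job->aio_context,
                   job_drain(job) || !job_is_completed(job));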

Reviewed-by: Fam Zheng 

> +
>  ret = (job_is_cancelled(job) && job->ret == 0) ? -ECANCELED : job->ret;
>  job_unref(job);
>  return ret;
> -- 
> 2.13.6
> 



Re: [Qemu-block] [PATCH 03/14] test-blockjob: Acquire AioContext around job_finish_sync()

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> All callers in QEMU proper hold the AioContext lock when calling
> job_finish_sync(). test-blockjob should do the same.

I think s/job_finish_sync/job_cancel_sync/ in the subject is more accurate?

Reviewed-by: Fam Zheng 

> 
> Signed-off-by: Kevin Wolf 
> ---
>  include/qemu/job.h| 6 ++
>  tests/test-blockjob.c | 6 ++
>  2 files changed, 12 insertions(+)
> 
> diff --git a/include/qemu/job.h b/include/qemu/job.h
> index 0dae5b8481..8ac48dbd28 100644
> --- a/include/qemu/job.h
> +++ b/include/qemu/job.h
> @@ -520,6 +520,8 @@ void job_user_cancel(Job *job, bool force, Error **errp);
>   *
>   * Returns the return value from the job if the job actually completed
>   * during the call, or -ECANCELED if it was canceled.
> + *
> + * Callers must hold the AioContext lock of job->aio_context.
>   */
>  int job_cancel_sync(Job *job);
>  
> @@ -537,6 +539,8 @@ void job_cancel_sync_all(void);
>   * function).
>   *
>   * Returns the return value from the job.
> + *
> + * Callers must hold the AioContext lock of job->aio_context.
>   */
>  int job_complete_sync(Job *job, Error **errp);
>  
> @@ -579,6 +583,8 @@ void job_defer_to_main_loop(Job *job, 
> JobDeferToMainLoopFn *fn, void *opaque);
>   *
>   * Returns 0 if the job is successfully completed, -ECANCELED if the job was
>   * cancelled before completing, and -errno in other error cases.
> + *
> + * Callers must hold the AioContext lock of job->aio_context.
>   */
>  int job_finish_sync(Job *job, void (*finish)(Job *, Error **errp), Error 
> **errp);
>  
> diff --git a/tests/test-blockjob.c b/tests/test-blockjob.c
> index cb42f06e61..8c2babbe35 100644
> --- a/tests/test-blockjob.c
> +++ b/tests/test-blockjob.c
> @@ -230,6 +230,10 @@ static void cancel_common(CancelJob *s)
>  BlockJob *job = &s->common;
>  BlockBackend *blk = s->blk;
>  JobStatus sts = job->job.status;
> +AioContext *ctx;
> +
> +ctx = job->job.aio_context;
> +aio_context_acquire(ctx);
>  
>  job_cancel_sync(&job->job);
>  if (sts != JOB_STATUS_CREATED && sts != JOB_STATUS_CONCLUDED) {
> @@ -239,6 +243,8 @@ static void cancel_common(CancelJob *s)
>  assert(job->job.status == JOB_STATUS_NULL);
>  job_unref(>job);
>  destroy_blk(blk);
> +
> +aio_context_release(ctx);
>  }
>  
>  static void test_cancel_created(void)
> -- 
> 2.13.6
> 



Re: [Qemu-block] [PATCH 02/14] test-bdrv-drain: Drain with block jobs in an I/O thread

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> This extends the existing drain test with a block job to include
> variants where the block job runs in a different AioContext.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 




Re: [Qemu-block] [PATCH 01/14] blockjob: Wake up BDS when job becomes idle

2018-09-11 Thread Fam Zheng
On Fri, 09/07 18:15, Kevin Wolf wrote:
> In the context of draining a BDS, the .drained_poll callback of block
> jobs is called. If this returns true (i.e. there is still some activity
> pending), the drain operation may call aio_poll() with blocking=true to
> wait for completion.
> 
> As soon as the pending activity is completed and the job finally arrives
> in a quiescent state (i.e. its coroutine either yields with busy=false
> or terminates), the block job must notify the aio_poll() loop to wake
> up, otherwise we get a deadlock if both are running in different
> threads.
> 
> Signed-off-by: Kevin Wolf 

Reviewed-by: Fam Zheng 




[Qemu-block] [PATCH 1/2] virtio: Return true from virtio_queue_empty if broken

2018-09-10 Thread Fam Zheng
Both virtio-blk and virtio-scsi use virtio_queue_empty() as the
loop condition in VQ handlers (virtio_blk_handle_vq,
virtio_scsi_handle_cmd_vq). When a device is marked broken in
virtqueue_pop, for example if a vIOMMU address translation failed, we
want to break out of the loop.
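
The loop shape in question looks roughly like this (simplified from
virtio_blk_handle_vq(); virtio-scsi's handler is analogous):

    do {
        virtio_queue_set_notification(vq, 0);
        while ((req = virtio_blk_get_request(s, vq))) {
            virtio_blk_handle_request(req, &mrb);
        }
        virtio_queue_set_notification(vq, 1);
    } while (!virtio_queue_empty(vq));    /* spins once the device is broken */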

This fixes a hanging problem when booting a CentOS 3.10.0-862.el7.x86_64
kernel with ATS enabled:

  $ qemu-system-x86_64 \
... \
-device intel-iommu,intremap=on,caching-mode=on,eim=on,device-iotlb=on \
-device virtio-scsi-pci,iommu_platform=on,ats=on,id=scsi0,bus=pci.4,addr=0x0

The dead loop happens immediately when the kernel boots and initializes
the device, where virtio_scsi_data_plane_handle_cmd will not return:

> ...
> #13 0x5586602b7793 in virtio_scsi_handle_cmd_vq
> #14 0x5586602b8d66 in virtio_scsi_data_plane_handle_cmd
> #15 0x5586602ddab7 in virtio_queue_notify_aio_vq
> #16 0x5586602dfc9f in virtio_queue_host_notifier_aio_poll
> #17 0x5586607885da in run_poll_handlers_once
> #18 0x55866078880e in try_poll_mode
> #19 0x5586607888eb in aio_poll
> #20 0x558660784561 in aio_wait_bh_oneshot
> #21 0x5586602b9582 in virtio_scsi_dataplane_stop
> #22 0x5586605a7110 in virtio_bus_stop_ioeventfd
> #23 0x5586605a9426 in virtio_pci_stop_ioeventfd
> #24 0x5586605ab808 in virtio_pci_common_write
> #25 0x558660242396 in memory_region_write_accessor
> #26 0x5586602425ab in access_with_adjusted_size
> #27 0x558660245281 in memory_region_dispatch_write
> #28 0x5586601e008e in flatview_write_continue
> #29 0x5586601e01d8 in flatview_write
> #30 0x5586601e04de in address_space_write
> #31 0x5586601e052f in address_space_rw
> #32 0x5586602607f2 in kvm_cpu_exec
> #33 0x558660227148 in qemu_kvm_cpu_thread_fn
> #34 0x55866078bde7 in qemu_thread_start
> #35 0x7f5784906594 in start_thread
> #36 0x7f5784639e6f in clone

With this patch, virtio_queue_empty will now return 1 as soon as the
vdev is marked as broken, after a "virtio: zero sized buffers are not
allowed" error.

To be consistent, update virtio_queue_empty_rcu as well.

Signed-off-by: Fam Zheng 
---
 hw/virtio/virtio.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index d4e4d98b59..7a05c9e52c 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -358,6 +358,10 @@ int virtio_queue_ready(VirtQueue *vq)
  * Called within rcu_read_lock().  */
 static int virtio_queue_empty_rcu(VirtQueue *vq)
 {
+if (unlikely(vq->vdev->broken)) {
+return 1;
+}
+
 if (unlikely(!vq->vring.avail)) {
 return 1;
 }
@@ -373,6 +377,10 @@ int virtio_queue_empty(VirtQueue *vq)
 {
 bool empty;
 
+if (unlikely(vq->vdev->broken)) {
+return 1;
+}
+
 if (unlikely(!vq->vring.avail)) {
 return 1;
 }
-- 
2.17.1




[Qemu-block] [PATCH 2/2] virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

2018-09-10 Thread Fam Zheng
We have this unwanted call stack:

  > ...
  > #13 0x5586602b7793 in virtio_scsi_handle_cmd_vq
  > #14 0x5586602b8d66 in virtio_scsi_data_plane_handle_cmd
  > #15 0x5586602ddab7 in virtio_queue_notify_aio_vq
  > #16 0x5586602dfc9f in virtio_queue_host_notifier_aio_poll
  > #17 0x5586607885da in run_poll_handlers_once
  > #18 0x55866078880e in try_poll_mode
  > #19 0x5586607888eb in aio_poll
  > #20 0x558660784561 in aio_wait_bh_oneshot
  > #21 0x5586602b9582 in virtio_scsi_dataplane_stop
  > #22 0x5586605a7110 in virtio_bus_stop_ioeventfd
  > #23 0x5586605a9426 in virtio_pci_stop_ioeventfd
  > #24 0x5586605ab808 in virtio_pci_common_write
  > #25 0x558660242396 in memory_region_write_accessor
  > #26 0x5586602425ab in access_with_adjusted_size
  > #27 0x558660245281 in memory_region_dispatch_write
  > #28 0x5586601e008e in flatview_write_continue
  > #29 0x5586601e01d8 in flatview_write
  > #30 0x5586601e04de in address_space_write
  > #31 0x5586601e052f in address_space_rw
  > #32 0x5586602607f2 in kvm_cpu_exec
  > #33 0x558660227148 in qemu_kvm_cpu_thread_fn
  > #34 0x55866078bde7 in qemu_thread_start
  > #35 0x7f5784906594 in start_thread
  > #36 0x7f5784639e6f in clone

Avoid it with the aio_disable_external/aio_enable_external pair, so that
no vq poll handlers can be called in aio_wait_bh_oneshot.
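
For context, the disable/enable pair is just a counter that the event loop
consults before dispatching handlers registered as external (abridged from
include/block/aio.h):

    static inline void aio_disable_external(AioContext *ctx)
    {
        atomic_inc(&ctx->external_disable_cnt);
    }

    static inline void aio_enable_external(AioContext *ctx)
    {
        int old = atomic_fetch_dec(&ctx->external_disable_cnt);
        assert(old > 0);
        if (old == 1) {
            aio_notify(ctx);    /* kick the loop so it re-arms handlers */
        }
    }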

Signed-off-by: Fam Zheng 
---
 hw/block/dataplane/virtio-blk.c | 2 ++
 hw/scsi/virtio-scsi-dataplane.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/hw/block/dataplane/virtio-blk.c b/hw/block/dataplane/virtio-blk.c
index 8c37bd314a..8ab54218c1 100644
--- a/hw/block/dataplane/virtio-blk.c
+++ b/hw/block/dataplane/virtio-blk.c
@@ -279,7 +279,9 @@ void virtio_blk_data_plane_stop(VirtIODevice *vdev)
 trace_virtio_blk_data_plane_stop(s);
 
 aio_context_acquire(s->ctx);
+aio_disable_external(s->ctx);
 aio_wait_bh_oneshot(s->ctx, virtio_blk_data_plane_stop_bh, s);
+aio_enable_external(s->ctx);
 
 /* Drain and switch bs back to the QEMU main loop */
 blk_set_aio_context(s->conf->conf.blk, qemu_get_aio_context());
diff --git a/hw/scsi/virtio-scsi-dataplane.c b/hw/scsi/virtio-scsi-dataplane.c
index b995bab3a2..1452398491 100644
--- a/hw/scsi/virtio-scsi-dataplane.c
+++ b/hw/scsi/virtio-scsi-dataplane.c
@@ -208,7 +208,9 @@ void virtio_scsi_dataplane_stop(VirtIODevice *vdev)
 s->dataplane_stopping = true;
 
 aio_context_acquire(s->ctx);
+aio_disable_external(s->ctx);
 aio_wait_bh_oneshot(s->ctx, virtio_scsi_dataplane_stop_bh, s);
+aio_enable_external(s->ctx);
 aio_context_release(s->ctx);
 
 blk_drain_all(); /* ensure there are no in-flight requests */
-- 
2.17.1




[Qemu-block] [PATCH 0/2] virtio-scsi: Fix QEMU hang with vIOMMU and ATS

2018-09-10 Thread Fam Zheng
The first patch describes the bug and the reproducer.

The VQ poll handler that is called by mistake within virtio_scsi_dataplane_stop
enters a dead loop because it fails to detect an error state. Fix both sides of
the problem: the handler should break out from the loop if no progress can be
made due to virtio_error; the handler shouldn't be called in that situation in
the first place.

Fam Zheng (2):
  virtio: Return true from virtio_queue_empty if broken
  virtio-scsi/virtio-blk: Disable poll handlers when stopping vq handler

 hw/block/dataplane/virtio-blk.c | 2 ++
 hw/scsi/virtio-scsi-dataplane.c | 2 ++
 hw/virtio/virtio.c  | 8 
 3 files changed, 12 insertions(+)

-- 
2.17.1




Re: [Qemu-block] [PATCH v3 2/2] aio: Do aio_notify_accept only during blocking aio_poll

2018-09-09 Thread Fam Zheng
On Fri, 09/07 17:51, Kevin Wolf wrote:
> Am 09.08.2018 um 15:22 hat Fam Zheng geschrieben:
> > Furthermore, blocking aio_poll is only allowed on home thread
> > (in_aio_context_home_thread), because otherwise two blocking
> > aio_poll()'s can steal each other's ctx->notifier event and cause
> > hanging just like described above.
> 
> It's good to have this assertion now at least, but after digging into
> some bugs, I think in fact that any aio_poll() (even non-blocking) is
> only allowed in the home thread: At least one reason is that if you run
> it from a different thread, qemu_get_current_aio_context() returns the
> wrong AioContext in any callbacks called by aio_poll(). Anything else
> using TLS can have similar problems.
> 
> One instance where this matters is fixed/worked around by Sergio's
> "util/async: use qemu_aio_coroutine_enter in co_schedule_bh_cb". We
> wouldn't even need that patch if we could make sure that aio_poll() is
> never called from the wrong thread. This would feel more robust.
> 
> I'll fix the aio_poll() calls in drain (the AIO_WAIT_WHILE() ones are
> already fine, the rest by removing them). After that,
> bdrv_set_aio_context() is still problematic, but the rest should be
> okay. Hopefully we can use the tighter assertion then.

Fully agree with you.
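
The tighter assertion would presumably amount to no more than this at the
top of aio_poll() (hypothetical; not what the tree asserts at this point):

    assert(in_aio_context_home_thread(ctx));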

Fam



Re: [Qemu-block] [Qemu-stable] [PATCH v2] job: Fix nested aio_poll() hanging in job_txn_apply

2018-09-03 Thread Fam Zheng
On Fri, 08/24 10:43, Fam Zheng wrote:
> All callers have acquired ctx already. Doing that again results in an
> aio_poll() hang. This fixes the problem that a BDRV_POLL_WHILE() in the
> callback cannot make progress because ctx is recursively locked, for
> example, when drive-backup finishes.
> 
> There are two callers of job_finalize():
> 
> fam@lemon:~/work/qemu [master]$ git grep -w -A1 '^\s*job_finalize'
> blockdev.c:job_finalize(&job->job, errp);
> blockdev.c-aio_context_release(aio_context);
> --
> job-qmp.c:job_finalize(job, errp);
> job-qmp.c-aio_context_release(aio_context);
> --
> tests/test-blockjob.c:job_finalize(&job->job, &error_abort);
> tests/test-blockjob.c-assert(job->job.status == JOB_STATUS_CONCLUDED);
> 
> Ignoring the test, it's easy to see both callers to job_finalize (and
> job_do_finalize) have acquired the context.
> 
> Cc: qemu-sta...@nongnu.org
> Reported-by: Gu Nini 
> Reviewed-by: Eric Blake 
> Signed-off-by: Fam Zheng 

Ping?


