Re: QEMU RBD is slow with QCOW2 images

2021-03-03 Thread Jason Dillaman
On Wed, Mar 3, 2021 at 12:41 PM Stefano Garzarella  wrote:
>
> Hi Jason,
> as reported in this BZ [1], when qemu-img creates a QCOW2 image on RBD,
> writing data is very slow compared to a raw file.
>
> Comparing raw vs QCOW2 image creation with RBD, I found that we use a
> different object size: for the raw file I see '4 MiB objects', while for
> QCOW2 I see '64 KiB objects', as reported in comment 14 [2].
> This seems to be the main cause of the slowness; indeed, forcing a 4
> MiB object size in the code for QCOW2 as well increased the speed a lot.
>
> Looking closer, I discovered that for raw files we call rbd_create()
> with obj_order = 0 (if the 'cluster_size' option is not defined), so the
> default object size is used.
> For QCOW2 instead, we use obj_order = 16, since the default
> 'cluster_size' defined for QCOW2 is 64 KiB.
>
> Using '-o cluster_size=2M' with qemu-img changed only the qcow2 cluster
> size, since in qcow2_co_create_opts() we remove 'cluster_size' from
> QemuOpts by calling qemu_opts_to_qdict_filtered().
> For some reason that I have yet to understand, after this deletion the
> default qcow2 'cluster_size' (64 KiB) still remains in QemuOpts, and it
> is then used in qemu_rbd_co_create_opts().
>
> At this point my doubts are:
> Does it make sense to use the qcow2 cluster_size as the object_size
> in RBD?

No, not really. But it also doesn't really make any sense to put a
QCOW2 image within an RBD image. To clarify from the BZ, OpenStack
does not put QCOW2 images on RBD; it converts QCOW2 images into raw
images before storing them in RBD.
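
For example, that conversion is typically a single qemu-img invocation
(the pool and image names below are just placeholders):

    qemu-img convert -f qcow2 -O raw disk.qcow2 rbd:mypool/myimage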

> If we want to keep the two options separate, how can it be done? Should
> we rename the option in block/rbd.c?

You can already pass overrides to the RBD block driver by just
appending them after the
"rbd:<pool>/<image>[:option1=value1[:option2=value2]]" portion, perhaps
that could be re-used.
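
For example, something along these lines (the option names are only
illustrative -- "id" and "conf" are the usual ones, and the pool/image
names are made up):

    qemu-img info "rbd:mypool/myimage:id=admin:conf=/etc/ceph/ceph.conf"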

> Thanks,
> Stefano
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1744525
> [2] https://bugzilla.redhat.com/show_bug.cgi?id=1744525#c14
>


-- 
Jason




Re: [PATCH V2 1/7] block/rbd: bump librbd requirement to luminous release

2021-02-22 Thread Jason Dillaman
On Mon, Feb 15, 2021 at 8:29 AM Peter Lieven  wrote:
>
> Am 15.02.21 um 13:13 schrieb Kevin Wolf:
> > Am 15.02.2021 um 12:45 hat Peter Lieven geschrieben:
> >> Am 15.02.21 um 12:41 schrieb Daniel P. Berrangé:
> >>> On Mon, Feb 15, 2021 at 12:32:24PM +0100, Peter Lieven wrote:
>  Am 15.02.21 um 11:24 schrieb Daniel P. Berrangé:
> > On Tue, Jan 26, 2021 at 12:25:34PM +0100, Peter Lieven wrote:
> >> even luminous (version 12.2) is unmaintained for over 3 years now.
> >> Bump the requirement to get rid of the ifdef'ry in the code.
> > We have clear rules on when we bump minimum versions, determined by
> > the OS platforms we target:
> >
> >  https://qemu.readthedocs.io/en/latest/system/build-platforms.html
> >
> > At this time RHEL-7 is usually the oldest platform, and it
> > builds with RBD 10.2.5, so we can't bump the version to 12.2.
> >
> > I'm afraid this patch has to be dropped.
>  I have asked exactly this question before I started work on this series 
>  and got reply
> 
>  from Jason that he sees no problem in bumping to a release which is 
>  already unmaintained
> 
>  for 3 years.
> >>> I'm afraid Jason is wrong here.  It doesn't matter what the upstream
> >>> consider the support status to be. QEMU targets what the OS vendors
> >>> ship, and they still consider this to be a supported version.
> >>
> >> Okay, but the whole coroutine stuff would get a total mess with all
> >> the ifdef'ry.
> > Hm, but how are these ifdefs even related to the coroutine conversation?
> > It's a bit more code that you're moving around, but shouldn't it be
> > unchanged from the old code, just moving from an AIO callback to a
> > coroutine? Or am I missing some complications?
>
>
> No, the ifdef's only come back in for the write zeroes part.
>
>
> >
> >> Would it be an option to make a big ifdef in the rbd driver? One with
> >> old code for < 12.0.0 and one
> >>
> >> with new code for >= 12.0.0?
> > I don't think this is a good idea, this would be a huge mess to
> > maintain.
> >
> > The conversion is probably a good idea in general, simply because it's
> > more in line with the rest of the block layer, but I don't think it adds
> > anything per se, so it's hard to justify such duplication with the
> > benefits it brings.
>
>
> I would wait for Jason's comment on the rbd part of the series and then spin a 
> V3
>
> with a for-6.1 tag.

Sorry for the long delay -- I was out of town. I've reviewed and
play-tested the patches and they look good to me. I'll wait for V3
before adding my official review.

>
>
> Peter
>


-- 
Jason




Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-21 Thread Jason Dillaman
On Thu, Jan 21, 2021 at 3:29 PM Peter Lieven  wrote:
>
> Am 21.01.21 um 20:42 schrieb Jason Dillaman:
> > On Wed, Jan 20, 2021 at 6:01 PM Peter Lieven  wrote:
> >>
> >>> Am 19.01.2021 um 15:20 schrieb Jason Dillaman :
> >>>
> >>> On Tue, Jan 19, 2021 at 4:36 AM Peter Lieven  wrote:
> >>>>> Am 18.01.21 um 23:33 schrieb Jason Dillaman:
> >>>>> On Fri, Jan 15, 2021 at 10:39 AM Peter Lieven  wrote:
> >>>>>> Am 15.01.21 um 16:27 schrieb Jason Dillaman:
> >>>>>>> On Thu, Jan 14, 2021 at 2:59 PM Peter Lieven  wrote:
> >>>>>>>> Am 14.01.21 um 20:19 schrieb Jason Dillaman:
> >>>>>>>>> On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
> >>>>>>>>>> since we implement byte interfaces and librbd supports aio on byte 
> >>>>>>>>>> granularity we can lift
> >>>>>>>>>> the 512 byte alignment.
> >>>>>>>>>> Signed-off-by: Peter Lieven 
> >>>>>>>>>> ---
> >>>>>>>>>> block/rbd.c | 2 --
> >>>>>>>>>> 1 file changed, 2 deletions(-)
> >>>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>>>>>> index 27b4404adf..8673e8f553 100644
> >>>>>>>>>> --- a/block/rbd.c
> >>>>>>>>>> +++ b/block/rbd.c
> >>>>>>>>>> @@ -223,8 +223,6 @@ done:
> >>>>>>>>>> static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error 
> >>>>>>>>>> **errp)
> >>>>>>>>>> {
> >>>>>>>>>>BDRVRBDState *s = bs->opaque;
> >>>>>>>>>> -/* XXX Does RBD support AIO on less than 512-byte alignment? 
> >>>>>>>>>> */
> >>>>>>>>>> -bs->bl.request_alignment = 512;
> >>>>>>>>> Just a suggestion, but perhaps improve discard alignment, max 
> >>>>>>>>> discard,
> >>>>>>>>> optimal alignment (if that's something QEMU handles internally) if 
> >>>>>>>>> not
> >>>>>>>>> overridden by the user.
> >>>>>>>> Qemu supports max_discard and discard_alignment. Is there a call to 
> >>>>>>>> get these limits
> >>>>>>>> from librbd?
> >>>>>>>> What do you mean by optimal_alignment? The object size?
> >>>>>>> krbd does a good job of initializing defaults [1] where optimal and
> >>>>>>> discard alignment is 64KiB (can actually be 4KiB now), max IO size for
> >>>>>>> writes, discards, and write-zeroes is the object size * the stripe
> >>>>>>> count.
> >>>>>> Okay, I will have a look at it. If qemu issues a write, discard, 
> >>>>>> write_zero greater than
> >>>>>> obj_size  * stripe count will librbd split it internally or will the 
> >>>>>> request fail?
> >>>>> librbd will handle it as needed. My goal is really just to get the
> >>>>> hints down to the guest OS.
> >>>>>> Regarding the alignment it seems that rbd_dev->opts->alloc_size is 
> >>>>>> something that comes from the device
> >>>>>> configuration and not from rbd? I don't have that information inside 
> >>>>>> the Qemu RBD driver.
> >>>>> librbd doesn't really have the information either. The 64KiB guess
> >>>>> that krbd uses was a compromise since that was the default OSD
> >>>>> allocation size for HDDs since Luminous. Starting with Pacific that
> >>>>> default is going down to 4KiB.
> >>>> I will try to adjust these values as far as it is possible and makes 
> >>>> sense.
> >>>> Is there a way to check the minimum supported OSD release in the backend 
> >>>> from librbd / librados?
> >>> It's not a minimum -- RADOS will gladly accept 1 byte writes as well.
> >>> It's really just the optimal (performance and space-wise). Sadly,
> >>> there is no realistic way to query this data from the backend.
> >> So you would suggest to advertise an optimal transfer length of 64k and 
>

Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-21 Thread Jason Dillaman
On Wed, Jan 20, 2021 at 6:01 PM Peter Lieven  wrote:
>
>
> > Am 19.01.2021 um 15:20 schrieb Jason Dillaman :
> >
> > On Tue, Jan 19, 2021 at 4:36 AM Peter Lieven  wrote:
> >>> Am 18.01.21 um 23:33 schrieb Jason Dillaman:
> >>> On Fri, Jan 15, 2021 at 10:39 AM Peter Lieven  wrote:
> >>>> Am 15.01.21 um 16:27 schrieb Jason Dillaman:
> >>>>> On Thu, Jan 14, 2021 at 2:59 PM Peter Lieven  wrote:
> >>>>>> Am 14.01.21 um 20:19 schrieb Jason Dillaman:
> >>>>>>> On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
> >>>>>>>> since we implement byte interfaces and librbd supports aio on byte 
> >>>>>>>> granularity we can lift
> >>>>>>>> the 512 byte alignment.
> >>>>>>>> Signed-off-by: Peter Lieven 
> >>>>>>>> ---
> >>>>>>>> block/rbd.c | 2 --
> >>>>>>>> 1 file changed, 2 deletions(-)
> >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>>>> index 27b4404adf..8673e8f553 100644
> >>>>>>>> --- a/block/rbd.c
> >>>>>>>> +++ b/block/rbd.c
> >>>>>>>> @@ -223,8 +223,6 @@ done:
> >>>>>>>> static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error 
> >>>>>>>> **errp)
> >>>>>>>> {
> >>>>>>>>BDRVRBDState *s = bs->opaque;
> >>>>>>>> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
> >>>>>>>> -bs->bl.request_alignment = 512;
> >>>>>>> Just a suggestion, but perhaps improve discard alignment, max discard,
> >>>>>>> optimal alignment (if that's something QEMU handles internally) if not
> >>>>>>> overridden by the user.
> >>>>>> Qemu supports max_discard and discard_alignment. Is there a call to 
> >>>>>> get these limits
> >>>>>> from librbd?
> >>>>>> What do you mean by optimal_alignment? The object size?
> >>>>> krbd does a good job of initializing defaults [1] where optimal and
> >>>>> discard alignment is 64KiB (can actually be 4KiB now), max IO size for
> >>>>> writes, discards, and write-zeroes is the object size * the stripe
> >>>>> count.
> >>>> Okay, I will have a look at it. If qemu issues a write, discard, 
> >>>> write_zero greater than
> >>>> obj_size  * stripe count will librbd split it internally or will the 
> >>>> request fail?
> >>> librbd will handle it as needed. My goal is really just to get the
> >>> hints down to the guest OS.
> >>>> Regarding the alignment it seems that rbd_dev->opts->alloc_size is 
> >>>> something that comes from the device
> >>>> configuration and not from rbd? I don't have that information inside the 
> >>>> Qemu RBD driver.
> >>> librbd doesn't really have the information either. The 64KiB guess
> >>> that krbd uses was a compromise since that was the default OSD
> >>> allocation size for HDDs since Luminous. Starting with Pacific that
> >>> default is going down to 4KiB.
> >> I will try to adjust these values as far as it is possible and makes sense.
> >> Is there a way to check the minimum supported OSD release in the backend 
> >> from librbd / librados?
> >
> > It's not a minimum -- RADOS will gladly accept 1 byte writes as well.
> > It's really just the optimal (performance and space-wise). Sadly,
> > there is no realistic way to query this data from the backend.
>
> So you would suggest to advertise an optimal transfer length of 64k and max 
> transfer length of obj size * stripe count to the guest unless we have an API 
> in the future to query these limits from the backend?

I'll open a Ceph tracker ticket to expose these via the API in a future release.

> I would leave request alignment at 1 byte as otherwise Qemu will issue RMWs 
> for all write requests that do not align. Everything that comes from a guest 
> OS is very likely 4k aligned anyway.

My goal is definitely not to have QEMU do any extra work for splitting
or aligning IOs. I am really only trying to get hints passed down to
the guest via the virtio drivers. If there isn't the plumbing in QEMU for
that yet, disregard.

> Peter
>
>


-- 
Jason




Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-19 Thread Jason Dillaman
On Tue, Jan 19, 2021 at 4:36 AM Peter Lieven  wrote:
>
> Am 18.01.21 um 23:33 schrieb Jason Dillaman:
> > On Fri, Jan 15, 2021 at 10:39 AM Peter Lieven  wrote:
> >> Am 15.01.21 um 16:27 schrieb Jason Dillaman:
> >>> On Thu, Jan 14, 2021 at 2:59 PM Peter Lieven  wrote:
> >>>> Am 14.01.21 um 20:19 schrieb Jason Dillaman:
> >>>>> On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
> >>>>>> since we implement byte interfaces and librbd supports aio on byte 
> >>>>>> granularity we can lift
> >>>>>> the 512 byte alignment.
> >>>>>>
> >>>>>> Signed-off-by: Peter Lieven 
> >>>>>> ---
> >>>>>>  block/rbd.c | 2 --
> >>>>>>  1 file changed, 2 deletions(-)
> >>>>>>
> >>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>> index 27b4404adf..8673e8f553 100644
> >>>>>> --- a/block/rbd.c
> >>>>>> +++ b/block/rbd.c
> >>>>>> @@ -223,8 +223,6 @@ done:
> >>>>>>  static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error 
> >>>>>> **errp)
> >>>>>>  {
> >>>>>>  BDRVRBDState *s = bs->opaque;
> >>>>>> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
> >>>>>> -bs->bl.request_alignment = 512;
> >>>>> Just a suggestion, but perhaps improve discard alignment, max discard,
> >>>>> optimal alignment (if that's something QEMU handles internally) if not
> >>>>> overridden by the user.
> >>>> Qemu supports max_discard and discard_alignment. Is there a call to get 
> >>>> these limits
> >>>>
> >>>> from librbd?
> >>>>
> >>>>
> >>>> What do you mean by optimal_alignment? The object size?
> >>> krbd does a good job of initializing defaults [1] where optimal and
> >>> discard alignment is 64KiB (can actually be 4KiB now), max IO size for
> >>> writes, discards, and write-zeroes is the object size * the stripe
> >>> count.
> >>
> >> Okay, I will have a look at it. If qemu issues a write, discard, 
> >> write_zero greater than
> >>
> >> obj_size  * stripe count will librbd split it internally or will the 
> >> request fail?
> > librbd will handle it as needed. My goal is really just to get the
> > hints down to the guest OS.
> >
> >> Regarding the alignment it seems that rbd_dev->opts->alloc_size is 
> >> something that comes from the device
> >>
> >> configuration and not from rbd? I don't have that information inside the 
> >> Qemu RBD driver.
> > librbd doesn't really have the information either. The 64KiB guess
> > that krbd uses was a compromise since that was the default OSD
> > allocation size for HDDs since Luminous. Starting with Pacific that
> > default is going down to 4KiB.
>
>
> I will try to adjust these values as far as it is possible and makes sense.
>
>
> Is there a way to check the minimum supported OSD release in the backend from 
> librbd / librados?

It's not a minimum -- RADOS will gladly accept 1 byte writes as well.
It's really just the optimal (performance and space-wise). Sadly,
there is no realistic way to query this data from the backend.

>
> Anyway, I want to sent a V2 by the end of this week.
>
>
> Peter
>
>


-- 
Jason




Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-18 Thread Jason Dillaman
On Fri, Jan 15, 2021 at 10:39 AM Peter Lieven  wrote:
>
> Am 15.01.21 um 16:27 schrieb Jason Dillaman:
> > On Thu, Jan 14, 2021 at 2:59 PM Peter Lieven  wrote:
> >> Am 14.01.21 um 20:19 schrieb Jason Dillaman:
> >>> On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
> >>>> since we implement byte interfaces and librbd supports aio on byte 
> >>>> granularity we can lift
> >>>> the 512 byte alignment.
> >>>>
> >>>> Signed-off-by: Peter Lieven 
> >>>> ---
> >>>>  block/rbd.c | 2 --
> >>>>  1 file changed, 2 deletions(-)
> >>>>
> >>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>> index 27b4404adf..8673e8f553 100644
> >>>> --- a/block/rbd.c
> >>>> +++ b/block/rbd.c
> >>>> @@ -223,8 +223,6 @@ done:
> >>>>  static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
> >>>>  {
> >>>>  BDRVRBDState *s = bs->opaque;
> >>>> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
> >>>> -bs->bl.request_alignment = 512;
> >>> Just a suggestion, but perhaps improve discard alignment, max discard,
> >>> optimal alignment (if that's something QEMU handles internally) if not
> >>> overridden by the user.
> >>
> >> Qemu supports max_discard and discard_alignment. Is there a call to get 
> >> these limits
> >>
> >> from librbd?
> >>
> >>
> >> What do you mean by optimal_alignment? The object size?
> > krbd does a good job of initializing defaults [1] where optimal and
> > discard alignment is 64KiB (can actually be 4KiB now), max IO size for
> > writes, discards, and write-zeroes is the object size * the stripe
> > count.
>
>
> Okay, I will have a look at it. If qemu issues a write, discard, write_zero 
> greater than
>
> obj_size  * stripe count will librbd split it internally or will the request 
> fail?

librbd will handle it as needed. My goal is really just to get the
hints down to the guest OS.

> Regarding the alignment it seems that rbd_dev->opts->alloc_size is something 
> that comes from the device
>
> configuration and not from rbd? I don't have that information inside the Qemu 
> RBD driver.

librbd doesn't really have the information either. The 64KiB guess
that krbd uses was a compromise since that was the default OSD
allocation size for HDDs since Luminous. Starting with Pacific that
default is going down to 4KiB.

>
> Peter
>
>


-- 
Jason




Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-15 Thread Jason Dillaman
On Thu, Jan 14, 2021 at 2:59 PM Peter Lieven  wrote:
>
> Am 14.01.21 um 20:19 schrieb Jason Dillaman:
> > On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
> >> since we implement byte interfaces and librbd supports aio on byte 
> >> granularity we can lift
> >> the 512 byte alignment.
> >>
> >> Signed-off-by: Peter Lieven 
> >> ---
> >>  block/rbd.c | 2 --
> >>  1 file changed, 2 deletions(-)
> >>
> >> diff --git a/block/rbd.c b/block/rbd.c
> >> index 27b4404adf..8673e8f553 100644
> >> --- a/block/rbd.c
> >> +++ b/block/rbd.c
> >> @@ -223,8 +223,6 @@ done:
> >>  static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
> >>  {
> >>  BDRVRBDState *s = bs->opaque;
> >> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
> >> -bs->bl.request_alignment = 512;
> > Just a suggestion, but perhaps improve discard alignment, max discard,
> > optimal alignment (if that's something QEMU handles internally) if not
> > overridden by the user.
>
>
> Qemu supports max_discard and discard_alignment. Is there a call to get these 
> limits
>
> from librbd?
>
>
> What do you mean by optimal_alignment? The object size?

krbd does a good job of initializing defaults [1] where optimal and
discard alignment is 64KiB (can actually be 4KiB now), max IO size for
writes, discards, and write-zeroes is the object size * the stripe
count.

> Peter
>
>
>

[1] https://github.com/torvalds/linux/blob/master/drivers/block/rbd.c#L4981
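
As a rough illustration (not the actual patch -- the 64 KiB constant and
the object-size-based maximum simply mirror the krbd defaults above, and
the 1-byte request alignment assumes patch 7/7 of this series), the
limits could be advertised roughly like this:

    static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
    {
        BDRVRBDState *s = bs->opaque;

        /* librbd accepts byte-granular I/O (see patch 7/7) */
        bs->bl.request_alignment = 1;
        /* krbd's 64 KiB guess for optimal transfer and discard granularity */
        bs->bl.opt_transfer = 64 * 1024;
        bs->bl.pdiscard_alignment = 64 * 1024;
        /* ideally object size * stripe count; only the object size is cached */
        bs->bl.max_transfer = s->object_size;
        bs->bl.max_pdiscard = s->object_size;
    }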

-- 
Jason




Re: [PATCH 6/7] block/rbd: add write zeroes support

2021-01-15 Thread Jason Dillaman
On Thu, Jan 14, 2021 at 2:41 PM Peter Lieven  wrote:
>
> Am 14.01.21 um 20:19 schrieb Jason Dillaman:
> > On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
> >> Signed-off-by: Peter Lieven 
> >> ---
> >>  block/rbd.c | 31 ++-
> >>  1 file changed, 30 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/block/rbd.c b/block/rbd.c
> >> index 2d77d0007f..27b4404adf 100644
> >> --- a/block/rbd.c
> >> +++ b/block/rbd.c
> >> @@ -63,7 +63,8 @@ typedef enum {
> >>  RBD_AIO_READ,
> >>  RBD_AIO_WRITE,
> >>  RBD_AIO_DISCARD,
> >> -RBD_AIO_FLUSH
> >> +RBD_AIO_FLUSH,
> >> +RBD_AIO_WRITE_ZEROES
> >>  } RBDAIOCmd;
> >>
> >>  typedef struct BDRVRBDState {
> >> @@ -221,8 +222,12 @@ done:
> >>
> >>  static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
> >>  {
> >> +BDRVRBDState *s = bs->opaque;
> >>  /* XXX Does RBD support AIO on less than 512-byte alignment? */
> >>  bs->bl.request_alignment = 512;
> >> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> >> +bs->bl.pwrite_zeroes_alignment = s->object_size;
> > The optimal alignment is 512 bytes -- but it can safely work just fine
> > down to 1 byte alignment since it will pad the request with additional
> > writes if needed.
>
>
> Okay and this will likely be faster than having Qemu doing that request 
> split, right?
>
> If we drop the alignment hint Qemu will pass the original request.
>
>
> >
> >> +#endif
> >>  }
> >>
> >>
> >> @@ -695,6 +700,9 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> >> *options, int flags,
> >>  }
> >>
> >>  s->aio_context = bdrv_get_aio_context(bs);
> >> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> >> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> >> +#endif
> >>
> >>  /* When extending regular files, we get zeros from the OS */
> >>  bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
> >> @@ -808,6 +816,11 @@ static int coroutine_fn 
> >> qemu_rbd_start_co(BlockDriverState *bs,
> >>  case RBD_AIO_FLUSH:
> >>  r = rbd_aio_flush(s->image, c);
> >>  break;
> >> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> >> +case RBD_AIO_WRITE_ZEROES:
> >> +r = rbd_aio_write_zeroes(s->image, offset, bytes, c, 0, 0);
> >> +break;
> >> +#endif
> >>  default:
> >>  r = -EINVAL;
> >>  }
> >> @@ -880,6 +893,19 @@ static int coroutine_fn 
> >> qemu_rbd_co_pdiscard(BlockDriverState *bs,
> >>  return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
> >>  }
> >>
> >> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> >> +static int
> >> +coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t 
> >> offset,
> >> +  int count, BdrvRequestFlags flags)
> >> +{
> >> +if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> >> +return -ENOTSUP;
> >> +}
> > There is a "RBD_WRITE_ZEROES_FLAG_THICK_PROVISION" flag that you can
> > use to optionally disable unmap.
>
>
> I have seen that. If you want I can add support for this, too. But afaik this
>
> is only available since Octopus release?

True -- I didn't backport that support to Nautilus since it was a new
feature vs the bug-fix that the write-zeroes API was introduced to
fix.
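
If you do add it, a minimal sketch (assuming an Octopus-era librbd where
RBD_WRITE_ZEROES_FLAG_THICK_PROVISION is available) could translate the
missing BDRV_REQ_MAY_UNMAP flag instead of returning -ENOTSUP:

    int zero_flags = 0;

    /* guest asked for allocated (thick) zeroes, so disable unmapping */
    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
        zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
    }
    r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);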

>
> Peter
>
>


-- 
Jason




Re: [PATCH 6/7] block/rbd: add write zeroes support

2021-01-14 Thread Jason Dillaman
On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 31 ++-
>  1 file changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 2d77d0007f..27b4404adf 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -63,7 +63,8 @@ typedef enum {
>  RBD_AIO_READ,
>  RBD_AIO_WRITE,
>  RBD_AIO_DISCARD,
> -RBD_AIO_FLUSH
> +RBD_AIO_FLUSH,
> +RBD_AIO_WRITE_ZEROES
>  } RBDAIOCmd;
>
>  typedef struct BDRVRBDState {
> @@ -221,8 +222,12 @@ done:
>
>  static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
> +BDRVRBDState *s = bs->opaque;
>  /* XXX Does RBD support AIO on less than 512-byte alignment? */
>  bs->bl.request_alignment = 512;
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +bs->bl.pwrite_zeroes_alignment = s->object_size;

The optimal alignment is 512 bytes -- but it can safely work just fine
down to 1 byte alignment since it will pad the request with additional
writes if needed.

> +#endif
>  }
>
>
> @@ -695,6 +700,9 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>
>  s->aio_context = bdrv_get_aio_context(bs);
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> +#endif
>
>  /* When extending regular files, we get zeros from the OS */
>  bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
> @@ -808,6 +816,11 @@ static int coroutine_fn 
> qemu_rbd_start_co(BlockDriverState *bs,
>  case RBD_AIO_FLUSH:
>  r = rbd_aio_flush(s->image, c);
>  break;
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +case RBD_AIO_WRITE_ZEROES:
> +r = rbd_aio_write_zeroes(s->image, offset, bytes, c, 0, 0);
> +break;
> +#endif
>  default:
>  r = -EINVAL;
>  }
> @@ -880,6 +893,19 @@ static int coroutine_fn 
> qemu_rbd_co_pdiscard(BlockDriverState *bs,
>  return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
>  }
>
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +static int
> +coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
> +  int count, BdrvRequestFlags flags)
> +{
> +if (!(flags & BDRV_REQ_MAY_UNMAP)) {
> +return -ENOTSUP;
> +}

There is a "RBD_WRITE_ZEROES_FLAG_THICK_PROVISION" flag that you can
use to optionally disable unmap.


> +return qemu_rbd_start_co(bs, offset, count, NULL, flags,
> + RBD_AIO_WRITE_ZEROES);
> +}
> +#endif
> +
>  static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
>  {
>  BDRVRBDState *s = bs->opaque;
> @@ -1108,6 +1134,9 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_co_pwritev= qemu_rbd_co_pwritev,
>  .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
>  .bdrv_co_pdiscard   = qemu_rbd_co_pdiscard,
> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
> +.bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
> +#endif
>
>  .bdrv_snapshot_create   = qemu_rbd_snap_create,
>  .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
> --
> 2.17.1
>
>


--
Jason




Re: [PATCH 4/7] block/rbd: add bdrv_{attach,detach}_aio_context

2021-01-14 Thread Jason Dillaman
On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 21 +++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index a2da70e37f..27b232f4d8 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -91,6 +91,7 @@ typedef struct BDRVRBDState {
>  char *namespace;
>  uint64_t image_size;
>  uint64_t object_size;
> +AioContext *aio_context;
>  } BDRVRBDState;
>
>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> @@ -749,6 +750,8 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  }
>
> +s->aio_context = bdrv_get_aio_context(bs);
> +
>  /* When extending regular files, we get zeros from the OS */
>  bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
>
> @@ -839,8 +842,7 @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB 
> *rcb)
>  rcb->ret = rbd_aio_get_return_value(c);
>  rbd_aio_release(c);
>
> -replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
> - rbd_finish_bh, rcb);
> +replay_bh_schedule_oneshot_event(acb->s->aio_context, rbd_finish_bh, 
> rcb);
>  }
>
>  static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
> @@ -1151,6 +1153,18 @@ static const char *const 
> qemu_rbd_strong_runtime_opts[] = {
>  NULL
>  };
>
> +static void qemu_rbd_attach_aio_context(BlockDriverState *bs,
> +   AioContext *new_context)
> +{
> +BDRVRBDState *s = bs->opaque;
> +s->aio_context = new_context;
> +}
> +
> +static void qemu_rbd_detach_aio_context(BlockDriverState *bs)
> +{

I don't know enough about the internals of QEMU, but it seems
suspicious that this is a no-op.


> +}
> +
>  static BlockDriver bdrv_rbd = {
>  .format_name= "rbd",
>  .instance_size  = sizeof(BDRVRBDState),
> @@ -1180,6 +1194,9 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_snapshot_goto = qemu_rbd_snap_rollback,
>  .bdrv_co_invalidate_cache = qemu_rbd_co_invalidate_cache,
>
> +.bdrv_attach_aio_context  = qemu_rbd_attach_aio_context,
> +.bdrv_detach_aio_context  = qemu_rbd_detach_aio_context,
> +
>  .strong_runtime_opts= qemu_rbd_strong_runtime_opts,
>  };
>
> --
> 2.17.1
>
>


--
Jason




Re: [PATCH 3/7] block/rbd: use stored image_size in qemu_rbd_getlength

2021-01-14 Thread Jason Dillaman
On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 10 +-
>  1 file changed, 1 insertion(+), 9 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index bc8cf8af9b..a2da70e37f 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -956,15 +956,7 @@ static int qemu_rbd_getinfo(BlockDriverState *bs, 
> BlockDriverInfo *bdi)
>  static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>  {
>  BDRVRBDState *s = bs->opaque;
> -rbd_image_info_t info;
> -int r;
> -
> -r = rbd_stat(s->image, &info, sizeof(info));
> -if (r < 0) {
> -return r;
> -}
> -
> -return info.size;
> +return s->image_size;
>  }
>
>  static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
> --
> 2.17.1

An RBD image can technically change size dynamically while in-use. The
original code would provide the most up-to-date length but this
version will always return the size of the image when it was opened.
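
A possible middle ground (just a sketch, not something from this series)
would be to keep the cached value but refresh it on each query, so
dynamic resizes are still observed:

    static int64_t qemu_rbd_getlength(BlockDriverState *bs)
    {
        BDRVRBDState *s = bs->opaque;
        uint64_t size;
        int r;

        /* re-query librbd so a concurrent resize is picked up */
        r = rbd_get_size(s->image, &size);
        if (r < 0) {
            return r;
        }
        s->image_size = size;
        return size;
    }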


--
Jason




Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-14 Thread Jason Dillaman
On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
>
> since we implement byte interfaces and librbd supports aio on byte 
> granularity we can lift
> the 512 byte alignment.
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 27b4404adf..8673e8f553 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -223,8 +223,6 @@ done:
>  static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
>  {
>  BDRVRBDState *s = bs->opaque;
> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
> -bs->bl.request_alignment = 512;

Just a suggestion, but perhaps improve discard alignment, max discard,
optimal alignment (if that's something QEMU handles internally) if not
overridden by the user.


>  #ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>  bs->bl.pwrite_zeroes_alignment = s->object_size;
>  #endif
> --
> 2.17.1
>
>


--
Jason




Re: [PATCH 5/7] block/rbd: migrate from aio to coroutines

2021-01-14 Thread Jason Dillaman
On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
>
> Signed-off-by: Peter Lieven 
> ---
>  block/rbd.c | 247 ++--
>  1 file changed, 84 insertions(+), 163 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 27b232f4d8..2d77d0007f 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -66,22 +66,6 @@ typedef enum {
>  RBD_AIO_FLUSH
>  } RBDAIOCmd;
>
> -typedef struct RBDAIOCB {
> -BlockAIOCB common;
> -int64_t ret;
> -QEMUIOVector *qiov;
> -RBDAIOCmd cmd;
> -int error;
> -struct BDRVRBDState *s;
> -} RBDAIOCB;
> -
> -typedef struct RADOSCB {
> -RBDAIOCB *acb;
> -struct BDRVRBDState *s;
> -int64_t size;
> -int64_t ret;
> -} RADOSCB;
> -
>  typedef struct BDRVRBDState {
>  rados_t cluster;
>  rados_ioctx_t io_ctx;
> @@ -94,6 +78,13 @@ typedef struct BDRVRBDState {
>  AioContext *aio_context;
>  } BDRVRBDState;
>
> +typedef struct RBDTask {
> +BDRVRBDState *s;
> +Coroutine *co;
> +bool complete;
> +int64_t ret;
> +} RBDTask;
> +
>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
>  BlockdevOptionsRbd *opts, bool cache,
>  const char *keypairs, const char *secretid,
> @@ -316,13 +307,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
> char *keypairs_json,
>  return ret;
>  }
>
> -static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
> -{
> -RBDAIOCB *acb = rcb->acb;
> -iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
> -   acb->qiov->size - offs);
> -}
> -
>  /* FIXME Deprecate and remove keypairs or make it available in QMP. */
>  static int qemu_rbd_do_create(BlockdevCreateOptions *options,
>const char *keypairs, const char 
> *password_secret,
> @@ -440,46 +424,6 @@ exit:
>  return ret;
>  }
>
> -/*
> - * This aio completion is being called from rbd_finish_bh() and runs in qemu
> - * BH context.
> - */
> -static void qemu_rbd_complete_aio(RADOSCB *rcb)
> -{
> -RBDAIOCB *acb = rcb->acb;
> -int64_t r;
> -
> -r = rcb->ret;
> -
> -if (acb->cmd != RBD_AIO_READ) {
> -if (r < 0) {
> -acb->ret = r;
> -acb->error = 1;
> -} else if (!acb->error) {
> -acb->ret = rcb->size;
> -}
> -} else {
> -if (r < 0) {
> -qemu_rbd_memset(rcb, 0);
> -acb->ret = r;
> -acb->error = 1;
> -} else if (r < rcb->size) {
> -qemu_rbd_memset(rcb, r);
> -if (!acb->error) {
> -acb->ret = rcb->size;
> -}
> -} else if (!acb->error) {
> -acb->ret = r;
> -}
> -}
> -
> -g_free(rcb);
> -
> -acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
> -
> -qemu_aio_unref(acb);
> -}
> -
>  static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
>  {
>  const char **vals;
> @@ -817,88 +761,49 @@ static int qemu_rbd_resize(BlockDriverState *bs, 
> uint64_t size)
>  return 0;
>  }
>
> -static const AIOCBInfo rbd_aiocb_info = {
> -.aiocb_size = sizeof(RBDAIOCB),
> -};
> -
> -static void rbd_finish_bh(void *opaque)
> +static void qemu_rbd_finish_bh(void *opaque)
>  {
> -RADOSCB *rcb = opaque;
> -qemu_rbd_complete_aio(rcb);
> +RBDTask *task = opaque;
> +task->complete = 1;
> +aio_co_wake(task->co);
>  }
>
> -/*
> - * This is the callback function for rbd_aio_read and _write
> - *
> - * Note: this function is being called from a non qemu thread so
> - * we need to be careful about what we do here. Generally we only
> - * schedule a BH, and do the rest of the io completion handling
> - * from rbd_finish_bh() which runs in a qemu context.
> - */
> -static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
> +static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
>  {
> -RBDAIOCB *acb = rcb->acb;
> -
> -rcb->ret = rbd_aio_get_return_value(c);
> +task->ret = rbd_aio_get_return_value(c);
>  rbd_aio_release(c);
> -
> -replay_bh_schedule_oneshot_event(acb->s->aio_context, rbd_finish_bh, 
> rcb);
> +aio_bh_schedule_oneshot(task->s->aio_context, qemu_rbd_finish_bh, task);
>  }
>
> -static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
> - int64_t off,
> - QEMUIOVector *qiov,
> - int64_t size,
> - BlockCompletionFunc *cb,
> - void *opaque,
> - RBDAIOCmd cmd)
> +static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
> +  uint64_t offset,
> +  uint64_t bytes,
> +  QEMUIOVector *qiov,
> +  int flags,
> +  

Re: qemu 6.0 rbd driver rewrite

2020-12-09 Thread Jason Dillaman
On Wed, Dec 9, 2020 at 7:19 AM Peter Lieven  wrote:
>
> Am 01.12.20 um 13:40 schrieb Peter Lieven:
> > Hi,
> >
> >
> > i would like to submit a series for 6.0 which will convert the aio hooks to 
> > native coroutine hooks and add write zeroes support.
> >
> > The aio routines are nowadays just an emulation on top of coroutines which 
> > add additional overhead.
> >
> > For this I would like to lift the minimum librbd requirement to luminous 
> > release to get rid of the ifdef'ry in the code.
> >
> >
> > Any objections?

None from me (speaking in my role under Ceph) -- even Luminous is EoL
for us upstream. Hopefully no one would attempt to install QEMU 6 but
expect to keep librbd frozen at a >3 year old Kraken or earlier
release.

> > Best,
> >
> > Peter
> >
>
> Kindly pinging as the 6.0 dev tree is now open. Also cc'ing qemu-devel which 
> I accidently forgot.
>
>
> Peter
>
>

-- 
Jason




Re: [PATCH] block/rbd: add 'namespace' to qemu_rbd_strong_runtime_opts[]

2020-09-14 Thread Jason Dillaman
On Mon, Sep 14, 2020 at 3:06 PM Stefano Garzarella  wrote:
>
> Commit 19ae9ae014 ("block/rbd: Add support for ceph namespaces")
> introduced namespace support for RBD, but we forgot to add the
> new 'namespace' option to qemu_rbd_strong_runtime_opts[].
>
> The 'namespace' is used to identify the image, so it is a strong
> option since it can change the data of a BDS.
>
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1821528
> Fixes: 19ae9ae014 ("block/rbd: Add support for ceph namespaces")
> Cc: Florian Florensa 
> Signed-off-by: Stefano Garzarella 
> ---
>  block/rbd.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 688074c64b..5356753fbe 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1289,6 +1289,7 @@ static QemuOptsList qemu_rbd_create_opts = {
>
>  static const char *const qemu_rbd_strong_runtime_opts[] = {
>  "pool",
> +    "namespace",
>  "image",
>  "conf",
>  "snapshot",
> --
> 2.26.2
>

lgtm

Reviewed-by: Jason Dillaman 

-- 
Jason




Re: [PATCH] rbd: Use RBD fast-diff for querying actual allocation

2020-06-09 Thread Jason Dillaman
On Tue, Jun 9, 2020 at 3:31 AM Yi Li  wrote:
>
> Since Ceph version Infernalis (9.2.0) the new fast-diff mechanism
> of RBD allows for querying actual rbd image usage.
>
> Prior to this version there was no easy and fast way to query how
> much allocation a RBD image had inside a Ceph cluster.
>
> To use the fast-diff feature it needs to be enabled per RBD image
> and is only supported by Ceph cluster running version Infernalis
> (9.2.0) or newer.
>
> Without the fast-diff feature enabled qemu-img will report an allocation
> identical to the image capacity.
>
> 'qemu-img info rbd:cepharm/liyi-rbd' might output for example:
>
>   image: json:{"driver": "raw", "file": {"pool": "cepharm",
>   "image": "liyi-rbd", "driver": "rbd"}}
>   file format: raw
>   virtual size: 20 GiB (21474836480 bytes)
>   disk size: 0 B
>   cluster_size: 4194304
>
> Newly created rbds will have the fast-diff feature enabled.
>
> Signed-off-by: Yi Li 
> ---
>  block/rbd.c | 60 +
>  1 file changed, 60 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 617553b022..f231653f7b 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1107,6 +1107,65 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>  return info.size;
>  }
>
> +#if LIBRBD_VERSION_CODE > 265
> +static int disk_usage_callback(uint64_t offset, size_t len, int exists,
> +   void *arg)
> +{
> +  uint64_t *used_size = (uint64_t *)(arg);
> +  if (exists) {
> +(*used_size) += len;
> +  }
> +  return 0;
> +}
> +#endif
> +
> +static int64_t qemu_rbd_allocated_file_size(BlockDriverState *bs)
> +{
> +BDRVRBDState *s = bs->opaque;
> +rbd_image_info_t info;
> +int r;
> +uint64_t used_size = 0;
> +uint64_t features = 0;
> +
> +r = rbd_stat(s->image, &info, sizeof(info));
> +if (r < 0) {
> +return r;
> +}
> +
> +r = rbd_get_features(s->image, &features);
> +if (r < 0) {
> +return r;
> +}

You should probably test the flags to ensure that the
RBD_FLAG_FAST_DIFF_INVALID flag is not set [1]. It's potentially very
slow and expensive to calculate the disk usage w/o a fast-diff (on
large images) since it requires iterating over every possible 4MiB
backing data object (by default) to query its actual usage.
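
A sketch of that check (error handling trimmed; it assumes librbd's
rbd_get_flags() and the RBD_FLAG_FAST_DIFF_INVALID bit from librbd.h):

    uint64_t flags = 0;

    r = rbd_get_flags(s->image, &flags);
    if (r < 0) {
        return r;
    }
    if ((features & RBD_FEATURE_FAST_DIFF) &&
        !(flags & RBD_FLAG_FAST_DIFF_INVALID)) {
        /* fast-diff is valid: rbd_diff_iterate2() will be cheap */
    } else {
        used_size = info.size;  /* fall back to the provisioned size */
    }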

> +   /*
> +* rbd_diff_iterate2() is available in versions above Ceph 0.94 (Hammer)
> +* It uses a object map inside Ceph which is faster than 
> rbd_diff_iterate()
> +* which iterates all objects.
> +* LIBRBD_VERSION_CODE for Ceph 0.94 is 265. In 266 and upwards 
> diff_iterate2
> +* is available
> +*/
> +#if LIBRBD_VERSION_CODE > 265
> +if (features & RBD_FEATURE_FAST_DIFF) {
> +
> +/*
> + * RBD image fast-diff feature enabled
> + * Querying for actual allocation.
> + */
> +r = rbd_diff_iterate2(s->image, NULL, 0, info.size, 0, 1,
> +  &disk_usage_callback,
> +  &used_size);
> +if (r < 0) {
> +return r;
> +}
> +} else {
> +used_size = info.size;
> +}
> +#else
> +used_size = info.size;
> +#endif
> +return used_size;
> +}
> +
>  static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
>   int64_t offset,
>   bool exact,
> @@ -1316,6 +1375,7 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_get_info  = qemu_rbd_getinfo,
> .create_opts= &qemu_rbd_create_opts,
>  .bdrv_getlength = qemu_rbd_getlength,
> +.bdrv_get_allocated_file_size = qemu_rbd_allocated_file_size,
>  .bdrv_co_truncate   = qemu_rbd_co_truncate,
>  .protocol_name  = "rbd",
>
> --
> 2.25.3
>
>
>
>

[1] 
https://github.com/libvirt/libvirt/commit/21deeaf02fdf216b08210fc899579736973ca81d#diff-107c5451015e5980c90048ff615becb8

-- 
Jason




Re: [PATCH v3 1/1] block/rbd: Add support for ceph namespaces

2020-01-10 Thread Jason Dillaman
ser"));
> +loc->has_user= !!loc->user;
> +loc->q_namespace = g_strdup(qdict_get_try_str(options, "namespace"));
> +loc->image   = g_strdup(qdict_get_try_str(options, "image"));
> +keypairs = qdict_get_try_str(options, "=keyvalue-pairs");
>
>  ret = qemu_rbd_do_create(create_options, keypairs, password_secret, 
> errp);
>  if (ret < 0) {
> @@ -648,6 +661,11 @@ static int qemu_rbd_connect(rados_t *cluster, 
> rados_ioctx_t *io_ctx,
>  error_setg_errno(errp, -r, "error opening pool %s", opts->pool);
>  goto failed_shutdown;
>  }
> +/*
> + * Set the namespace after opening the io context on the pool,
> + * if nspace == NULL or if nspace == "", it is just as we did nothing
> + */
> +rados_ioctx_set_namespace(*io_ctx, opts->q_namespace);
>
>  return 0;
>
> diff --git a/qapi/block-core.json b/qapi/block-core.json
> index fcb52ec24f..c6f187ec9b 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3661,6 +3661,9 @@
>  #
>  # @pool:   Ceph pool name.
>  #
> +# @namespace:  Rados namespace name in the Ceph pool.
> +#  (Since 5.0)
> +#
>  # @image:  Image name in the Ceph pool.
>  #
>  # @conf:   path to Ceph configuration file.  Values
> @@ -3687,6 +3690,7 @@
>  ##
>  { 'struct': 'BlockdevOptionsRbd',
>'data': { 'pool': 'str',
> +'*namespace': 'str',
>  'image': 'str',
>  '*conf': 'str',
>  '*snapshot': 'str',
> --
> 2.24.1
>

Reviewed-by: Jason Dillaman 

-- 
Jason




Re: [PATCH] block/rbd: Add support for ceph namespaces

2019-12-20 Thread Jason Dillaman
On Fri, Dec 20, 2019 at 9:11 AM Florian Florensa  wrote:
>
> Hello Stefano and Jason,
>
> First of all thanks for the quick reply,
> Response inline belowe
> > Hi Florian,
> >
> > I think we need to add (Since: 5.0).
>
> Are you implying by that (Since: 5.0) that we need to specify its
> availability target is qemu 5.0 ?

FWIW, I took this as just a comment to add some documentation that the
field is only valid starting w/ qemu v5.

> I guess that maybe a version check would be better ? Like try to do
> namespaces stuff only if we have a recent enough librbd in the system ?
> Using something like :
>
> int rbd_major;
>
> rbd_version(&rbd_major, NULL, NULL);
> /*
>  * Target only nautilus+ librbd for namespace support
> */
> if (rbd_major >= 14) // tar
>  

Unfortunately, those versions weren't updated in either the Mimic or the
Nautilus release, so it would still return 1/12 (whoops). I think that
means you would need to add a probe in "configure" to test for librbd
namespace support (e.g. test for the existence of the `rbd_list2`
function or the `rbd_linked_image_spec_t` structure). I'll fix this
before the forthcoming Octopus release.
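
A minimal compile test for such a probe might look like this (just a
sketch; the probe QEMU eventually grows may check a different symbol):

    #include <rbd/librbd.h>

    int main(void)
    {
        rbd_image_spec_t spec;      /* namespace-era image spec type */
        (void)spec;
        (void)rbd_list2;            /* namespace-aware listing function */
        return 0;
    }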

> > The patch LGTM, but I'd like to use 'namespace' instead of cryptic
> > 'nspace'. (as BlockdevOptionsNVMe did)
> > What do you think?
> >
> Yes no worries, I can rename it to 'rbd_namespace' to avoid any possible
> confusion, is this Ok for you ?

We use "pool_namespace" in the rbd CLI if you are trying to avoid the
word "namespace".

> > With those fixed:
> >
> > Reviewed-by: Stefano Garzarella 
> >
> > Thanks,
> > Stefano
>
> Regards,
> Florian

-- 
Jason




Re: [PATCH] block/rbd: Add support for ceph namespaces

2019-12-19 Thread Jason Dillaman
a/qapi/block-core.json b/qapi/block-core.json
> index 0cf68fea14..9ebc020e93 100644
> --- a/qapi/block-core.json
> +++ b/qapi/block-core.json
> @@ -3657,6 +3657,8 @@
>  #
>  # @pool:   Ceph pool name.
>  #
> +# @nspace: Rados namespace name in the Ceph pool.
> +#
>  # @image:  Image name in the Ceph pool.
>  #
>  # @conf:   path to Ceph configuration file.  Values
> @@ -3683,6 +3685,7 @@
>  ##
>  { 'struct': 'BlockdevOptionsRbd',
>'data': { 'pool': 'str',
> +'nspace': 'str',
>  'image': 'str',
>  '*conf': 'str',
>  '*snapshot': 'str',
> --
> 2.24.1
>

Thanks for tackling this. I had this and msgr v2 support on my todo
list for QEMU but I haven't had a chance to work on them yet. The
changes look good to me and they work as expected during CLI
play-testing.

Reviewed-by: Jason Dillaman 




[Bug 1843941] Re: RBD Namespaces are not supported

2019-12-19 Thread Jason Dillaman
Thanks for adding the support. I was actually already play-testing your
patch. I'll respond to the mailing list soon.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1843941

Title:
  RBD Namespaces are not supported

Status in QEMU:
  New

Bug description:
  Ceph Nautilus (v14.2.0) introduced the Namespaces concept for RADOS
  Block Devices. This provides a logical separation within a RADOS Pool
  for RBD images which enables granular access control. See
  https://docs.ceph.com/docs/nautilus/releases/nautilus/ for additional
  details.

  librados and librbd support this, however qemu does not. The rbd man
  page defines how rbd images within a namespace can be referenced.
  https://docs.ceph.com/docs/nautilus/man/8/rbd/#image-snap-group-and-
  journal-specs

  Adding support for RBD namespaces would be beneficial for security and
  reducing the impact of a hypervisor being compromised and putting an
  entire Ceph pool or cluster at risk.

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1843941/+subscriptions



Re: [Qemu-devel] [PATCH v3] block/rbd: add preallocation support

2019-07-29 Thread Jason Dillaman
On Mon, Jul 29, 2019 at 5:40 AM Stefano Garzarella  wrote:
>
> On Fri, Jul 26, 2019 at 08:46:56AM -0400, Jason Dillaman wrote:
> > On Fri, Jul 26, 2019 at 4:48 AM Stefano Garzarella  
> > wrote:
> > >
> > > On Thu, Jul 25, 2019 at 09:30:30AM -0400, Jason Dillaman wrote:
> > > > On Thu, Jul 25, 2019 at 4:13 AM Stefano Garzarella 
> > > >  wrote:
> > > > >
> > > > > On Wed, Jul 24, 2019 at 01:48:42PM -0400, Jason Dillaman wrote:
> > > > > > On Tue, Jul 23, 2019 at 3:13 AM Stefano Garzarella 
> > > > > >  wrote:
> > > > > > >
> > > > > > > This patch adds the support of preallocation (off/full) for the 
> > > > > > > RBD
> > > > > > > block driver.
> > > > > > > If rbd_writesame() is available and supports zeroed buffers, we 
> > > > > > > use
> > > > > > > it to quickly fill the image when full preallocation is required.
> > > > > > >
> > > > > > > Signed-off-by: Stefano Garzarella 
> > > > > > > ---
> > > > > > > v3:
> > > > > > >  - rebased on master
> > > > > > >  - filled with zeroed buffer [Max]
> > > > > > >  - used rbd_writesame() only when we can disable the discard of 
> > > > > > > zeroed
> > > > > > >buffers
> > > > > > >  - added 'since: 4.2' in qapi/block-core.json [Max]
> > > > > > >  - used buffer as large as the "stripe unit"
> > > > > > > ---
> > > > > > >  block/rbd.c  | 202 
> > > > > > > ---
> > > > > > >  qapi/block-core.json |   5 +-
> > > > > > >  2 files changed, 192 insertions(+), 15 deletions(-)
> > > > > > >
> > > > > > > diff --git a/block/rbd.c b/block/rbd.c
> > > > > > > index 59757b3120..d923a5a26c 100644
> > > > > > > --- a/block/rbd.c
> > > > > > > +++ b/block/rbd.c
> > > > > > > @@ -64,6 +64,7 @@
> > > > > > >  #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
> > > > > > >
> > > > > > >  #define RBD_MAX_SNAPS 100
> > > > > > > +#define RBD_DEFAULT_CONCURRENT_OPS 10
> > > > > > >
> > > > > > >  /* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
> > > > > > >  #ifdef LIBRBD_SUPPORTS_IOVEC
> > > > > > > @@ -104,6 +105,7 @@ typedef struct BDRVRBDState {
> > > > > > >  char *image_name;
> > > > > > >  char *snap;
> > > > > > >  uint64_t image_size;
> > > > > > > +bool ws_zero_supported; /* rbd_writesame() supports zeroed 
> > > > > > > buffers */
> > > > > > >  } BDRVRBDState;
> > > > > > >
> > > > > > >  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t 
> > > > > > > *io_ctx,
> > > > > > > @@ -333,6 +335,155 @@ static void qemu_rbd_memset(RADOSCB *rcb, 
> > > > > > > int64_t offs)
> > > > > > >  }
> > > > > > >  }
> > > > > > >
> > > > > > > +static int qemu_rbd_get_max_concurrent_ops(rados_t cluster)
> > > > > > > +{
> > > > > > > +char buf[16];
> > > > > > > +int ret, max_concurrent_ops;
> > > > > > > +
> > > > > > > +ret = rados_conf_get(cluster, 
> > > > > > > "rbd_concurrent_management_ops", buf,
> > > > > > > + sizeof(buf));
> > > > > > > +if (ret < 0) {
> > > > > > > +return RBD_DEFAULT_CONCURRENT_OPS;
> > > > > > > +}
> > > > > > > +
> > > > > > > +ret = qemu_strtoi(buf, NULL, 10, &max_concurrent_ops);
> > > > > > > +if (ret < 0) {
> > > > > > > +return RBD_DEFAULT_CONCURRENT_OPS;
> > > > > > > +}
> > > > > > > +
> > > > > > > +return max_concurrent_ops;
> > > > > > > +}
> > > > > > > +

Re: [Qemu-devel] [PATCH v3] block/rbd: add preallocation support

2019-07-26 Thread Jason Dillaman
On Fri, Jul 26, 2019 at 4:48 AM Stefano Garzarella  wrote:
>
> On Thu, Jul 25, 2019 at 09:30:30AM -0400, Jason Dillaman wrote:
> > On Thu, Jul 25, 2019 at 4:13 AM Stefano Garzarella  
> > wrote:
> > >
> > > On Wed, Jul 24, 2019 at 01:48:42PM -0400, Jason Dillaman wrote:
> > > > On Tue, Jul 23, 2019 at 3:13 AM Stefano Garzarella 
> > > >  wrote:
> > > > >
> > > > > This patch adds the support of preallocation (off/full) for the RBD
> > > > > block driver.
> > > > > If rbd_writesame() is available and supports zeroed buffers, we use
> > > > > it to quickly fill the image when full preallocation is required.
> > > > >
> > > > > Signed-off-by: Stefano Garzarella 
> > > > > ---
> > > > > v3:
> > > > >  - rebased on master
> > > > >  - filled with zeroed buffer [Max]
> > > > >  - used rbd_writesame() only when we can disable the discard of zeroed
> > > > >buffers
> > > > >  - added 'since: 4.2' in qapi/block-core.json [Max]
> > > > >  - used buffer as large as the "stripe unit"
> > > > > ---
> > > > >  block/rbd.c  | 202 
> > > > > ---
> > > > >  qapi/block-core.json |   5 +-
> > > > >  2 files changed, 192 insertions(+), 15 deletions(-)
> > > > >
> > > > > diff --git a/block/rbd.c b/block/rbd.c
> > > > > index 59757b3120..d923a5a26c 100644
> > > > > --- a/block/rbd.c
> > > > > +++ b/block/rbd.c
> > > > > @@ -64,6 +64,7 @@
> > > > >  #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
> > > > >
> > > > >  #define RBD_MAX_SNAPS 100
> > > > > +#define RBD_DEFAULT_CONCURRENT_OPS 10
> > > > >
> > > > >  /* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
> > > > >  #ifdef LIBRBD_SUPPORTS_IOVEC
> > > > > @@ -104,6 +105,7 @@ typedef struct BDRVRBDState {
> > > > >  char *image_name;
> > > > >  char *snap;
> > > > >  uint64_t image_size;
> > > > > +bool ws_zero_supported; /* rbd_writesame() supports zeroed 
> > > > > buffers */
> > > > >  } BDRVRBDState;
> > > > >
> > > > >  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> > > > > @@ -333,6 +335,155 @@ static void qemu_rbd_memset(RADOSCB *rcb, 
> > > > > int64_t offs)
> > > > >  }
> > > > >  }
> > > > >
> > > > > +static int qemu_rbd_get_max_concurrent_ops(rados_t cluster)
> > > > > +{
> > > > > +char buf[16];
> > > > > +int ret, max_concurrent_ops;
> > > > > +
> > > > > +ret = rados_conf_get(cluster, "rbd_concurrent_management_ops", 
> > > > > buf,
> > > > > + sizeof(buf));
> > > > > +if (ret < 0) {
> > > > > +return RBD_DEFAULT_CONCURRENT_OPS;
> > > > > +}
> > > > > +
> > > > > +ret = qemu_strtoi(buf, NULL, 10, &max_concurrent_ops);
> > > > > +if (ret < 0) {
> > > > > +return RBD_DEFAULT_CONCURRENT_OPS;
> > > > > +}
> > > > > +
> > > > > +return max_concurrent_ops;
> > > > > +}
> > > > > +
> > > > > +static int qemu_rbd_do_truncate(rados_t cluster, rbd_image_t image,
> > > > > +int64_t offset, PreallocMode 
> > > > > prealloc,
> > > > > +bool ws_zero_supported, Error **errp)
> > > > > +{
> > > > > +uint64_t current_length;
> > > > > +char *buf = NULL;
> > > > > +int ret;
> > > > > +
> > > > > +ret = rbd_get_size(image, &current_length);
> > > > > +if (ret < 0) {
> > > > > +error_setg_errno(errp, -ret, "Failed to get file length");
> > > > > +goto out;
> > > > > +}
> > > > > +
> > > > > +if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> > > > > +error_setg(errp, "Cannot

Re: [Qemu-devel] [PATCH v3] block/rbd: add preallocation support

2019-07-25 Thread Jason Dillaman
On Thu, Jul 25, 2019 at 4:13 AM Stefano Garzarella  wrote:
>
> On Wed, Jul 24, 2019 at 01:48:42PM -0400, Jason Dillaman wrote:
> > On Tue, Jul 23, 2019 at 3:13 AM Stefano Garzarella  
> > wrote:
> > >
> > > This patch adds the support of preallocation (off/full) for the RBD
> > > block driver.
> > > If rbd_writesame() is available and supports zeroed buffers, we use
> > > it to quickly fill the image when full preallocation is required.
> > >
> > > Signed-off-by: Stefano Garzarella 
> > > ---
> > > v3:
> > >  - rebased on master
> > >  - filled with zeroed buffer [Max]
> > >  - used rbd_writesame() only when we can disable the discard of zeroed
> > >buffers
> > >  - added 'since: 4.2' in qapi/block-core.json [Max]
> > >  - used buffer as large as the "stripe unit"
> > > ---
> > >  block/rbd.c  | 202 ---
> > >  qapi/block-core.json |   5 +-
> > >  2 files changed, 192 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/block/rbd.c b/block/rbd.c
> > > index 59757b3120..d923a5a26c 100644
> > > --- a/block/rbd.c
> > > +++ b/block/rbd.c
> > > @@ -64,6 +64,7 @@
> > >  #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
> > >
> > >  #define RBD_MAX_SNAPS 100
> > > +#define RBD_DEFAULT_CONCURRENT_OPS 10
> > >
> > >  /* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
> > >  #ifdef LIBRBD_SUPPORTS_IOVEC
> > > @@ -104,6 +105,7 @@ typedef struct BDRVRBDState {
> > >  char *image_name;
> > >  char *snap;
> > >  uint64_t image_size;
> > > +bool ws_zero_supported; /* rbd_writesame() supports zeroed buffers */
> > >  } BDRVRBDState;
> > >
> > >  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> > > @@ -333,6 +335,155 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t 
> > > offs)
> > >  }
> > >  }
> > >
> > > +static int qemu_rbd_get_max_concurrent_ops(rados_t cluster)
> > > +{
> > > +char buf[16];
> > > +int ret, max_concurrent_ops;
> > > +
> > > +ret = rados_conf_get(cluster, "rbd_concurrent_management_ops", buf,
> > > + sizeof(buf));
> > > +if (ret < 0) {
> > > +return RBD_DEFAULT_CONCURRENT_OPS;
> > > +}
> > > +
> > > +ret = qemu_strtoi(buf, NULL, 10, &max_concurrent_ops);
> > > +if (ret < 0) {
> > > +return RBD_DEFAULT_CONCURRENT_OPS;
> > > +}
> > > +
> > > +return max_concurrent_ops;
> > > +}
> > > +
> > > +static int qemu_rbd_do_truncate(rados_t cluster, rbd_image_t image,
> > > +int64_t offset, PreallocMode prealloc,
> > > +bool ws_zero_supported, Error **errp)
> > > +{
> > > +uint64_t current_length;
> > > +char *buf = NULL;
> > > +int ret;
> > > +
> > > +ret = rbd_get_size(image, &current_length);
> > > +if (ret < 0) {
> > > +error_setg_errno(errp, -ret, "Failed to get file length");
> > > +goto out;
> > > +}
> > > +
> > > +if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> > > +error_setg(errp, "Cannot use preallocation for shrinking files");
> > > +ret = -ENOTSUP;
> > > +goto out;
> > > +}
> > > +
> > > +switch (prealloc) {
> > > +case PREALLOC_MODE_FULL: {
> > > +uint64_t buf_size, current_offset = current_length;
> > > +ssize_t bytes;
> > > +
> > > +ret = rbd_get_stripe_unit(image, &buf_size);
> > > +if (ret < 0) {
> > > +error_setg_errno(errp, -ret, "Failed to get stripe unit");
> > > +goto out;
> > > +}
> > > +
> > > +ret = rbd_resize(image, offset);
> > > +if (ret < 0) {
> > > +error_setg_errno(errp, -ret, "Failed to resize file");
> > > +goto out;
> > > +}
> > > +
> > > +buf = g_malloc0(buf_size);
> > > +
> > > +#ifdef LIBRBD_SUPPORTS_WRITESAME
> > > +if (ws_zero_supported) {
> > > +uint64_t writ

Re: [Qemu-devel] [PATCH v3] block/rbd: add preallocation support

2019-07-24 Thread Jason Dillaman
On Tue, Jul 23, 2019 at 3:13 AM Stefano Garzarella  wrote:
>
> This patch adds the support of preallocation (off/full) for the RBD
> block driver.
> If rbd_writesame() is available and supports zeroed buffers, we use
> it to quickly fill the image when full preallocation is required.
>
> Signed-off-by: Stefano Garzarella 
> ---
> v3:
>  - rebased on master
>  - filled with zeroed buffer [Max]
>  - used rbd_writesame() only when we can disable the discard of zeroed
>buffers
>  - added 'since: 4.2' in qapi/block-core.json [Max]
>  - used buffer as large as the "stripe unit"
> ---
>  block/rbd.c  | 202 ---
>  qapi/block-core.json |   5 +-
>  2 files changed, 192 insertions(+), 15 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 59757b3120..d923a5a26c 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -64,6 +64,7 @@
>  #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
>
>  #define RBD_MAX_SNAPS 100
> +#define RBD_DEFAULT_CONCURRENT_OPS 10
>
>  /* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
>  #ifdef LIBRBD_SUPPORTS_IOVEC
> @@ -104,6 +105,7 @@ typedef struct BDRVRBDState {
>  char *image_name;
>  char *snap;
>  uint64_t image_size;
> +bool ws_zero_supported; /* rbd_writesame() supports zeroed buffers */
>  } BDRVRBDState;
>
>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> @@ -333,6 +335,155 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
>  }
>  }
>
> +static int qemu_rbd_get_max_concurrent_ops(rados_t cluster)
> +{
> +char buf[16];
> +int ret, max_concurrent_ops;
> +
> +ret = rados_conf_get(cluster, "rbd_concurrent_management_ops", buf,
> + sizeof(buf));
> +if (ret < 0) {
> +return RBD_DEFAULT_CONCURRENT_OPS;
> +}
> +
> +ret = qemu_strtoi(buf, NULL, 10, &max_concurrent_ops);
> +if (ret < 0) {
> +return RBD_DEFAULT_CONCURRENT_OPS;
> +}
> +
> +return max_concurrent_ops;
> +}
> +
> +static int qemu_rbd_do_truncate(rados_t cluster, rbd_image_t image,
> +int64_t offset, PreallocMode prealloc,
> +bool ws_zero_supported, Error **errp)
> +{
> +uint64_t current_length;
> +char *buf = NULL;
> +int ret;
> +
> +ret = rbd_get_size(image, &current_length);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Failed to get file length");
> +goto out;
> +}
> +
> +if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> +error_setg(errp, "Cannot use preallocation for shrinking files");
> +ret = -ENOTSUP;
> +goto out;
> +}
> +
> +switch (prealloc) {
> +case PREALLOC_MODE_FULL: {
> +uint64_t buf_size, current_offset = current_length;
> +ssize_t bytes;
> +
> +ret = rbd_get_stripe_unit(image, &buf_size);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Failed to get stripe unit");
> +goto out;
> +}
> +
> +ret = rbd_resize(image, offset);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Failed to resize file");
> +goto out;
> +}
> +
> +buf = g_malloc0(buf_size);
> +
> +#ifdef LIBRBD_SUPPORTS_WRITESAME
> +if (ws_zero_supported) {
> +uint64_t writesame_max_size;
> +int max_concurrent_ops;
> +
> +max_concurrent_ops = qemu_rbd_get_max_concurrent_ops(cluster);
> +/*
> + * We limit the rbd_writesame() size to avoid spawning more than
> + * 'rbd_concurrent_management_ops' concurrent operations.
> + */
> +writesame_max_size = MIN(buf_size * max_concurrent_ops, INT_MAX);

In the most efficient world, the 'buf_size' would be some small, fixed
power of 2 value (like 512 bytes) since there isn't much need to send
extra zeroes. You would then want to writesame the full stripe period
(if possible), where a stripe period is the data block object size
(defaults to 4MiB and is available via 'rbd_stat') * the stripe count.
In this case, the stripe count becomes the number of in-flight IOs.
Therefore, you could substitute its value w/ the max_concurrent_ops to
ensure you are issuing exactly max_concurrent_ops IOs per
rbd_writesame call.
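
For illustration, a rough sketch of the sizing described above (not part of the
posted patch; the helper name and its parameters are hypothetical):

    /* Sketch: writesame length = one "stripe period", with max_concurrent_ops
     * standing in for the stripe count, so each call keeps roughly that many
     * object-sized IOs in flight.  buf_size stays a small power of two. */
    static int64_t qemu_rbd_writesame_len(rbd_image_t image,
                                          int max_concurrent_ops,
                                          uint64_t buf_size)
    {
        rbd_image_info_t info;
        uint64_t len;
        int ret;

        ret = rbd_stat(image, &info, sizeof(info));  /* obj_size, 4MiB default */
        if (ret < 0) {
            return ret;
        }

        len = MIN(info.obj_size * max_concurrent_ops, (uint64_t)INT_MAX);
        return len - (len % buf_size);  /* writesame needs a multiple of buf_size */
    }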

> +
> +while (offset - current_offset > buf_size) {
> +bytes = MIN(offset - current_offset, writesame_max_size);
> +/*
> + * rbd_writesame() supports only request where the size of 
> the
> + * operation is multiple of buffer size.
> + */
> +bytes -= bytes % buf_size;
> +
> +bytes = rbd_writesame(image, current_offset, bytes, buf,
> +  buf_size, 0);

If the RBD in-memory cache is enabled during this operation, the
writesame will effectively just be turned into a write. Therefore,
when 

Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-07-09 Thread Jason Dillaman
On Tue, Jul 9, 2019 at 11:32 AM Max Reitz  wrote:
>
> On 09.07.19 15:09, Stefano Garzarella wrote:
> > On Tue, Jul 09, 2019 at 08:55:19AM -0400, Jason Dillaman wrote:
> >> On Tue, Jul 9, 2019 at 5:45 AM Max Reitz  wrote:
> >>>
> >>> On 09.07.19 10:55, Max Reitz wrote:
> >>>> On 09.07.19 05:08, Jason Dillaman wrote:
> >>>>> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella  
> >>>>> wrote:
> >>>>>>
> >>>>>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>>>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>>>>
> >>>>>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>>>>> to calculate the allocated size for the image.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Stefano Garzarella 
> >>>>>>>> ---
> >>>>>>>> v3:
> >>>>>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>>>>> [John, Jason]
> >>>>>>>> v2:
> >>>>>>>>   - calculate the actual usage only if the fast-diff feature is
> >>>>>>>> enabled [Jason]
> >>>>>>>> ---
> >>>>>>>>  block/rbd.c | 54 
> >>>>>>>> +
> >>>>>>>>  1 file changed, 54 insertions(+)
> >>>>>>>
> >>>>>>> Well, the librbd documentation is non-existing as always, but while
> >>>>>>> googling, I at least found that libvirt has exactly the same code.  
> >>>>>>> So I
> >>>>>>> suppose it must be quite correct, then.
> >>>>>>>
> >>>>>>
> >>>>>> While I wrote this code I took a look at libvirt implementation and 
> >>>>>> also
> >>>>>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> >>>>>> src/tools/rbd/action/DiskUsage.cc
> >>>>>>
> >>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>>>>> index 59757b3120..b6bed683e5 100644
> >>>>>>>> --- a/block/rbd.c
> >>>>>>>> +++ b/block/rbd.c
> >>>>>>>> @@ -1084,6 +1084,59 @@ static int64_t 
> >>>>>>>> qemu_rbd_getlength(BlockDriverState *bs)
> >>>>>>>>  return info.size;
> >>>>>>>>  }
> >>>>>>>>
> >>>>>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int 
> >>>>>>>> exists,
> >>>>>>>> + void *arg)
> >>>>>>>> +{
> >>>>>>>> +int64_t *alloc_size = (int64_t *) arg;
> >>>>>>>> +
> >>>>>>>> +if (exists) {
> >>>>>>>> +(*alloc_size) += len;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +return 0;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState 
> >>>>>>>> *bs)
> >>>>>>>> +{
> >>>>>>>> +BDRVRBDState *s = bs->opaque;
> >>>>>>>> +uint64_t flags, features;
> >>>>>>>> +int64_t alloc_size = 0;
> >>>>>>>> +int r;
> >>>>>>>> +
> >>>>>>>> +r = rbd_get_flags(s->image, &flags);
> >>>>>>>> +if (r < 0) {
> >>>>>>>> +return r;
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +r = rbd_get_features(s->image, &features);
> >>>>>>>> +if (r < 0) {
> >>>>>>>> +return r;
> >>>>>>>> +}
> 

Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-07-09 Thread Jason Dillaman
On Tue, Jul 9, 2019 at 5:45 AM Max Reitz  wrote:
>
> On 09.07.19 10:55, Max Reitz wrote:
> > On 09.07.19 05:08, Jason Dillaman wrote:
> >> On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella  
> >> wrote:
> >>>
> >>> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> >>>> On 05.07.19 11:32, Stefano Garzarella wrote:
> >>>>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>>>> the RBD images that have the fast-diff feature enabled.
> >>>>>
> >>>>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>>>> to calculate the allocated size for the image.
> >>>>>
> >>>>> Signed-off-by: Stefano Garzarella 
> >>>>> ---
> >>>>> v3:
> >>>>>   - return -ENOTSUP instead of -1 when fast-diff is not available
> >>>>> [John, Jason]
> >>>>> v2:
> >>>>>   - calculate the actual usage only if the fast-diff feature is
> >>>>> enabled [Jason]
> >>>>> ---
> >>>>>  block/rbd.c | 54 +
> >>>>>  1 file changed, 54 insertions(+)
> >>>>
> >>>> Well, the librbd documentation is non-existing as always, but while
> >>>> googling, I at least found that libvirt has exactly the same code.  So I
> >>>> suppose it must be quite correct, then.
> >>>>
> >>>
> >>> While I wrote this code I took a look at libvirt implementation and also
> >>> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> >>> src/tools/rbd/action/DiskUsage.cc
> >>>
> >>>>> diff --git a/block/rbd.c b/block/rbd.c
> >>>>> index 59757b3120..b6bed683e5 100644
> >>>>> --- a/block/rbd.c
> >>>>> +++ b/block/rbd.c
> >>>>> @@ -1084,6 +1084,59 @@ static int64_t 
> >>>>> qemu_rbd_getlength(BlockDriverState *bs)
> >>>>>  return info.size;
> >>>>>  }
> >>>>>
> >>>>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int 
> >>>>> exists,
> >>>>> + void *arg)
> >>>>> +{
> >>>>> +int64_t *alloc_size = (int64_t *) arg;
> >>>>> +
> >>>>> +if (exists) {
> >>>>> +(*alloc_size) += len;
> >>>>> +}
> >>>>> +
> >>>>> +return 0;
> >>>>> +}
> >>>>> +
> >>>>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> >>>>> +{
> >>>>> +BDRVRBDState *s = bs->opaque;
> >>>>> +uint64_t flags, features;
> >>>>> +int64_t alloc_size = 0;
> >>>>> +int r;
> >>>>> +
> >>>>> +r = rbd_get_flags(s->image, &flags);
> >>>>> +if (r < 0) {
> >>>>> +return r;
> >>>>> +}
> >>>>> +
> >>>>> +r = rbd_get_features(s->image, &features);
> >>>>> +if (r < 0) {
> >>>>> +return r;
> >>>>> +}
> >>>>> +
> >>>>> +/*
> >>>>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> >>>>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> >>>>> + * very slow on a big image.
> >>>>> + */
> >>>>> +if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>>>> +(flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>>>> +return -ENOTSUP;
> >>>>> +}
> >>>>> +
> >>>>> +/*
> >>>>> + * rbd_diff_iterate2(), if the source snapshot name is NULL, 
> >>>>> invokes
> >>>>> + * the callback on all allocated regions of the image.
> >>>>> + */
> >>>>> +r = rbd_diff_iterate2(s->image, NULL, 0,
> >>>>> +  bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> >>>>> +  &rbd_allocated_size_cb, &alloc_size);
> >>>>
> >>>> But I have a qu

Re: [Qemu-devel] [PATCH v3] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-07-08 Thread Jason Dillaman
On Fri, Jul 5, 2019 at 6:43 AM Stefano Garzarella  wrote:
>
> On Fri, Jul 05, 2019 at 11:58:43AM +0200, Max Reitz wrote:
> > On 05.07.19 11:32, Stefano Garzarella wrote:
> > > This patch allows 'qemu-img info' to show the 'disk size' for
> > > the RBD images that have the fast-diff feature enabled.
> > >
> > > If this feature is enabled, we use the rbd_diff_iterate2() API
> > > to calculate the allocated size for the image.
> > >
> > > Signed-off-by: Stefano Garzarella 
> > > ---
> > > v3:
> > >   - return -ENOTSUP instead of -1 when fast-diff is not available
> > > [John, Jason]
> > > v2:
> > >   - calculate the actual usage only if the fast-diff feature is
> > > enabled [Jason]
> > > ---
> > >  block/rbd.c | 54 +
> > >  1 file changed, 54 insertions(+)
> >
> > Well, the librbd documentation is non-existing as always, but while
> > googling, I at least found that libvirt has exactly the same code.  So I
> > suppose it must be quite correct, then.
> >
>
> While I wrote this code I took a look at libvirt implementation and also
> at the "rbd" tool in the ceph repository: compute_image_disk_usage() in
> src/tools/rbd/action/DiskUsage.cc
>
> > > diff --git a/block/rbd.c b/block/rbd.c
> > > index 59757b3120..b6bed683e5 100644
> > > --- a/block/rbd.c
> > > +++ b/block/rbd.c
> > > @@ -1084,6 +1084,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState 
> > > *bs)
> > >  return info.size;
> > >  }
> > >
> > > +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> > > + void *arg)
> > > +{
> > > +int64_t *alloc_size = (int64_t *) arg;
> > > +
> > > +if (exists) {
> > > +(*alloc_size) += len;
> > > +}
> > > +
> > > +return 0;
> > > +}
> > > +
> > > +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> > > +{
> > > +BDRVRBDState *s = bs->opaque;
> > > +uint64_t flags, features;
> > > +int64_t alloc_size = 0;
> > > +int r;
> > > +
> > > +r = rbd_get_flags(s->image, &flags);
> > > +if (r < 0) {
> > > +return r;
> > > +}
> > > +
> > > +r = rbd_get_features(s->image, &features);
> > > +if (r < 0) {
> > > +return r;
> > > +}
> > > +
> > > +/*
> > > + * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> > > + * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> > > + * very slow on a big image.
> > > + */
> > > +if (!(features & RBD_FEATURE_FAST_DIFF) ||
> > > +(flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> > > +return -ENOTSUP;
> > > +}
> > > +
> > > +/*
> > > + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> > > + * the callback on all allocated regions of the image.
> > > + */
> > > +r = rbd_diff_iterate2(s->image, NULL, 0,
> > > +  bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> > > +  &rbd_allocated_size_cb, &alloc_size);
> >
> > But I have a question.  This is basically block_status, right?  So it
> > gives us information on which areas are allocated and which are not.
> > The result thus gives us a lower bound on the allocation size, but is it
> > really exactly the allocation size?
> >
> > There are two things I’m concerned about:
> >
> > 1. What about metadata?
>
> Good question, I don't think it includes the size used by metadata and I
> don't know if there is a way to know it. I'll check better.

It does not include the size of metadata, the "rbd_diff_iterate2"
function is literally just looking for touched data blocks within the
RBD image.

> >
> > 2. If you have multiple snapshots, this will only report the overall
> > allocation information, right?  So say there is something like this:
> >
> > (“A” means an allocated MB, “-” is an unallocated MB)
> >
> > Snapshot 1: ---
> > Snapshot 2: --A
> > Snapshot 3: ---
> >
> > I think the allocated data size is the number of As in total (13 MB).
> > But I suppose this API will just return 7 MB, because it looks on
> > everything an it sees the whole image range (7 MB) to be allocated.  It
> > doesn’t report in how many snapshots some region is allocated.

It should return 13 dirty data blocks (multiplied by the size of the
data block) since when you don't provide a "from snapshot" name, it
will iterate from the first snapshot to the HEAD revision.

> Looking at the documentation of rbd_diff_iterate2() [1] they says:
>
>  *If the source snapshot name is NULL, we
>  * interpret that as the beginning of time and return all allocated
>  * regions of the image.
>
> But I don't know the answer of your question (maybe Jason can help
> here).
> I should check better the implementation to understand if I can cycle
> on all snapshot to get the exact allocated data size.
>
> https://github.com/ceph/ceph/blob/master/src/include/rbd/librbd.h#L925
>
> I'll back when I have more details on the rbd 

Re: [Qemu-devel] [Qemu-block] [PATCH v2] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-07-02 Thread Jason Dillaman
On Fri, Jun 28, 2019 at 4:59 AM Stefano Garzarella  wrote:
>
> On Thu, Jun 27, 2019 at 03:43:04PM -0400, Jason Dillaman wrote:
> > On Thu, Jun 27, 2019 at 1:24 PM John Snow  wrote:
> > > On 6/27/19 4:48 AM, Stefano Garzarella wrote:
> > > > On Wed, Jun 26, 2019 at 05:04:25PM -0400, John Snow wrote:
> > > >> It looks like this has hit a 30 day expiration without any reviews or
> > > >> being merged; do we still want this? If so, can you please resend?
> > > >
> > > > Yes, I think we still want :)
> > > >
> > > > Is it okay if I send a v3 following your comments?
> > > >
> > >
> > > Yes, but I don't know who is responsible for final approval; I guess
> > > that's Josh Durgin?
> >
> > I'm the new (for the past several years) upstream PTL for RBD, so feel
> > free to tag me.
> >
> > > >>
> > > >> On 5/10/19 11:33 AM, Stefano Garzarella wrote:
> > > >>> This patch allows 'qemu-img info' to show the 'disk size' for
> > > >>> the RBD images that have the fast-diff feature enabled.
> > > >>>
> > > >>> If this feature is enabled, we use the rbd_diff_iterate2() API
> > > >>> to calculate the allocated size for the image.
> > > >>>
> > > >>> Signed-off-by: Stefano Garzarella 
> > > >>> ---
> > > >>> v2:
> > > >>>   - calculate the actual usage only if the fast-diff feature is
> > > >>> enabled [Jason]
> > > >>> ---
> > > >>>  block/rbd.c | 54 
> > > >>> +
> > > >>>  1 file changed, 54 insertions(+)
> > > >>>
> > > >>> diff --git a/block/rbd.c b/block/rbd.c
> > > >>> index 0c549c9935..f1bc76ab80 100644
> > > >>> --- a/block/rbd.c
> > > >>> +++ b/block/rbd.c
> > > >>> @@ -1046,6 +1046,59 @@ static int64_t 
> > > >>> qemu_rbd_getlength(BlockDriverState *bs)
> > > >>>  return info.size;
> > > >>>  }
> > > >>>
> > > >>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int 
> > > >>> exists,
> > > >>> + void *arg)
> > > >>> +{
> > > >>> +int64_t *alloc_size = (int64_t *) arg;
> > > >>> +
> > > >>> +if (exists) {
> > > >>> +(*alloc_size) += len;
> > > >>> +}
> > > >>> +
> > > >>> +return 0;
> > > >>> +}
> > > >>> +
> > > >>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> > > >>> +{
> > > >>> +BDRVRBDState *s = bs->opaque;
> > > >>> +uint64_t flags, features;
> > > >>> +int64_t alloc_size = 0;
> > > >>> +int r;
> > > >>> +
> > > >>> +r = rbd_get_flags(s->image, &flags);
> > > >>> +if (r < 0) {
> > > >>> +return r;
> > > >>> +}
> > > >>> +
> > > >>
> > > >> Do you know where rbd_get_flags is documented? I can't seem to quickly
> > > >> find a reference that tells me what to expect from calling it. It
> > > >> returns an int, I guess an error code, but how can I confirm this?
> > > >>
> > > >> *clones the ceph repository*
> > > >>
> > > >> src/librbd/internal.cc get_flags convinces me it probably works like I
> > > >> think, but is there not a reference here?
> > > >>
> > > >
> > > > Good question!
> > > > I didn't find any docs, but looking in the ceph tests 
> > > > test/librbd/fsx.cc,
> > > > they print an error message if the return value is less than 0.
> > > >
> > > > A 'get_flags' implemented in cls/rbd/cls_rbd.cc for example returns 0 
> > > > at the
> > > > end and -EINVAL in a try/catch. It also uses 'read_key()' that in some 
> > > > cases
> > > > returns -EIO, so I hope that the error returned by rbd_get_flags() is 
> > > > one of
> > > > the errors defined in errno.h
> > > >
> > > >>> +r = rbd_get_f

Re: [Qemu-devel] [Qemu-block] [PATCH v2] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-06-27 Thread Jason Dillaman
On Thu, Jun 27, 2019 at 3:45 PM John Snow  wrote:
>
>
>
> On 6/27/19 3:43 PM, Jason Dillaman wrote:
> > On Thu, Jun 27, 2019 at 1:24 PM John Snow  wrote:
> >>
> >>
> >>
> >> On 6/27/19 4:48 AM, Stefano Garzarella wrote:
> >>> On Wed, Jun 26, 2019 at 05:04:25PM -0400, John Snow wrote:
> >>>> It looks like this has hit a 30 day expiration without any reviews or
> >>>> being merged; do we still want this? If so, can you please resend?
> >>>
> >>> Yes, I think we still want :)
> >>>
> >>> Is it okay if I send a v3 following your comments?
> >>>
> >>
> >> Yes, but I don't know who is responsible for final approval; I guess
> >> that's Josh Durgin?
> >
> > I'm the new (for the past several years) upstream PTL for RBD, so feel
> > free to tag me.
> >
>
> I got Josh's name out of MAINTAINERS, does it need an update?
>
> > RBD
> > M: Josh Durgin 
> > L: qemu-bl...@nongnu.org
> > S: Supported
> > F: block/rbd.c

Yes, I'll submit a patch to update that tomorrow.

-- 
Jason



Re: [Qemu-devel] [Qemu-block] [PATCH v2] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-06-27 Thread Jason Dillaman
On Thu, Jun 27, 2019 at 1:24 PM John Snow  wrote:
>
>
>
> On 6/27/19 4:48 AM, Stefano Garzarella wrote:
> > On Wed, Jun 26, 2019 at 05:04:25PM -0400, John Snow wrote:
> >> It looks like this has hit a 30 day expiration without any reviews or
> >> being merged; do we still want this? If so, can you please resend?
> >
> > Yes, I think we still want :)
> >
> > Is it okay if I send a v3 following your comments?
> >
>
> Yes, but I don't know who is responsible for final approval; I guess
> that's Josh Durgin?

I'm the new (for the past several years) upstream PTL for RBD, so feel
free to tag me.

> >>
> >> On 5/10/19 11:33 AM, Stefano Garzarella wrote:
> >>> This patch allows 'qemu-img info' to show the 'disk size' for
> >>> the RBD images that have the fast-diff feature enabled.
> >>>
> >>> If this feature is enabled, we use the rbd_diff_iterate2() API
> >>> to calculate the allocated size for the image.
> >>>
> >>> Signed-off-by: Stefano Garzarella 
> >>> ---
> >>> v2:
> >>>   - calculate the actual usage only if the fast-diff feature is
> >>> enabled [Jason]
> >>> ---
> >>>  block/rbd.c | 54 +
> >>>  1 file changed, 54 insertions(+)
> >>>
> >>> diff --git a/block/rbd.c b/block/rbd.c
> >>> index 0c549c9935..f1bc76ab80 100644
> >>> --- a/block/rbd.c
> >>> +++ b/block/rbd.c
> >>> @@ -1046,6 +1046,59 @@ static int64_t qemu_rbd_getlength(BlockDriverState 
> >>> *bs)
> >>>  return info.size;
> >>>  }
> >>>
> >>> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> >>> + void *arg)
> >>> +{
> >>> +int64_t *alloc_size = (int64_t *) arg;
> >>> +
> >>> +if (exists) {
> >>> +(*alloc_size) += len;
> >>> +}
> >>> +
> >>> +return 0;
> >>> +}
> >>> +
> >>> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> >>> +{
> >>> +BDRVRBDState *s = bs->opaque;
> >>> +uint64_t flags, features;
> >>> +int64_t alloc_size = 0;
> >>> +int r;
> >>> +
> >>> +r = rbd_get_flags(s->image, &flags);
> >>> +if (r < 0) {
> >>> +return r;
> >>> +}
> >>> +
> >>
> >> Do you know where rbd_get_flags is documented? I can't seem to quickly
> >> find a reference that tells me what to expect from calling it. It
> >> returns an int, I guess an error code, but how can I confirm this?
> >>
> >> *clones the ceph repository*
> >>
> >> src/librbd/internal.cc get_flags convinces me it probably works like I
> >> think, but is there not a reference here?
> >>
> >
> > Good question!
> > I didn't find any docs, but looking in the ceph tests test/librbd/fsx.cc,
> > they print an error message if the return value is less than 0.
> >
> > A 'get_flags' implemented in cls/rbd/cls_rbd.cc for example returns 0 at the
> > end and -EINVAL in a try/catch. It also uses 'read_key()' that in some cases
> > returns -EIO, so I hope that the error returned by rbd_get_flags() is one of
> > the errors defined in errno.h
> >
> >>> +r = rbd_get_features(s->image, &features);
> >>> +if (r < 0) {
> >>> +return r;
> >>> +}
> >>> +
> >>> +/*
> >>> + * We use rbd_diff_iterate2() only if the RBD image have fast-diff
> >>> + * feature enabled. If it is disabled, rbd_diff_iterate2() could be
> >>> + * very slow on a big image.
> >>> + */
> >>> +if (!(features & RBD_FEATURE_FAST_DIFF) ||
> >>> +(flags & RBD_FLAG_FAST_DIFF_INVALID)) {
> >>> +return -1;
> >>> +}
> >>> +
> >>
> >> (Is there a reference for the list of flags to make sure there aren't
> >> other cases we might want to skip this?)
> >
> > Unfortunately no :(
> > As Jason suggested, I followed what libvirt did in the
> > volStorageBackendRBDUseFastDiff() [src/storage/storage_backend_rbd.c]

These are the only ones.

> >>
> >> It looks reasonable at a glance, but maybe let's return -ENOTSUP instead
> >> of -1, based on the idea that bdrv_get_allocated_file_size returns
> >> -ENOMEDIUM in a prominent error case -- let's match that error convention.
> >
> > Sure, -ENOTSUP is absolutely better!
> >
> >>
> >> (Well, I wonder what the librbd calls are returning and if THOSE mean
> >> anything.)
> >
> > I hope they return an errno.h errors, but I'm not sure if the meaning
> > make sense for us.
> >
> > Do you think is better to return -ENOTSUP or -EIO when librbd calls
> > fail?
> >
>
> I'll be honest, I have no idea because I don't know what failure of
> these calls means _at all_, so I don't know if it should be something
> severe like EIO or something more mundane.
>
> I guess just leave them alone in absence of better information, honestly.

It looks like QEMU treats any negative error code like the actual size
isn't available. However, for clarity I would vote for -ENOTSUP since
that's what would be returned if the driver doesn't support it.

> >
> > Thanks for your comments,
> > Stefano
> >
>
> Thank you for trying to patch rbd :)


-- 
Jason



Re: [Qemu-devel] [PATCH v2] block/rbd: increase dynamically the image size

2019-05-03 Thread Jason Dillaman
On Fri, May 3, 2019 at 12:30 PM Stefano Garzarella  wrote:
>
> RBD APIs don't allow us to write more than the size set with
> rbd_create() or rbd_resize().
> In order to support growing images (eg. qcow2), we resize the
> image before write operations that exceed the current size.
>
> Signed-off-by: Stefano Garzarella 
> ---
> v2:
>   - use bs->total_sectors instead of adding a new field [Kevin]
>   - resize the image only during write operation [Kevin]
> for read operation, the bdrv_aligned_preadv() already handles reads
> that exceed the length returned by bdrv_getlength(), so IMHO we can
> avoid to handle it in the rbd driver
> ---
>  block/rbd.c | 14 +-
>  1 file changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 0c549c9935..613e8f4982 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -934,13 +934,25 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
>  }
>
>  switch (cmd) {
> -case RBD_AIO_WRITE:
> +case RBD_AIO_WRITE: {
> +/*
> + * RBD APIs don't allow us to write more than actual size, so in 
> order
> + * to support growing images, we resize the image before write
> + * operations that exceed the current size.
> + */
> +if (off + size > bs->total_sectors * BDRV_SECTOR_SIZE) {

When will "bs->total_sectors" be refreshed to represent the correct
current size? You wouldn't want a future write whose extent was
greater than the original image size but less than a previous IO that
expanded the image to attempt to shrink the image.
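
For what it's worth, a hedged sketch of the kind of guard being asked about,
assuming the driver caches the last known size in an s->image_size field (as
the earlier RFC revision of this patch did) and refreshes it on open and
truncate:

    /* Sketch: only grow, never shrink, based on the cached size rather than
     * bs->total_sectors, so a smaller later write cannot shrink the image. */
    if (off + size > s->image_size) {
        r = rbd_resize(s->image, off + size);
        if (r < 0) {
            goto failed_completion;
        }
        s->image_size = off + size;
    }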

> +r = rbd_resize(s->image, off + size);
> +if (r < 0) {
> +goto failed_completion;
> +}
> +}
>  #ifdef LIBRBD_SUPPORTS_IOVEC
>  r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
>  #else
>  r = rbd_aio_write(s->image, off, size, rcb->buf, c);
>  #endif
>  break;
> +}
>  case RBD_AIO_READ:
>  #ifdef LIBRBD_SUPPORTS_IOVEC
>  r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
> --
> 2.20.1
>
>


-- 
Jason



Re: [Qemu-devel] [PATCH] block/rbd: implement .bdrv_get_allocated_file_size callback

2019-05-03 Thread Jason Dillaman
On Fri, May 3, 2019 at 7:02 AM Stefano Garzarella  wrote:
>
> This patch allows 'qemu-img info' to show the 'disk size' for
> rbd images. We use the rbd_diff_iterate2() API to calculate the
> allocated size for the image.
>
> Signed-off-by: Stefano Garzarella 
> ---
>  block/rbd.c | 33 +
>  1 file changed, 33 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 0c549c9935..61447bc0cb 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1046,6 +1046,38 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
>  return info.size;
>  }
>
> +static int rbd_allocated_size_cb(uint64_t offset, size_t len, int exists,
> + void *arg)
> +{
> +int64_t *alloc_size = (int64_t *) arg;
> +
> +if (exists) {
> +(*alloc_size) += len;
> +}
> +
> +return 0;
> +}
> +
> +static int64_t qemu_rbd_get_allocated_file_size(BlockDriverState *bs)
> +{
> +BDRVRBDState *s = bs->opaque;
> +int64_t alloc_size = 0;
> +int r;
> +
> +/*
> + * rbd_diff_iterate2(), if the source snapshot name is NULL, invokes
> + * the callback on all allocated regions of the image.
> + */
> +r = rbd_diff_iterate2(s->image, NULL, 0,
> +  bs->total_sectors * BDRV_SECTOR_SIZE, 0, 1,
> +  &rbd_allocated_size_cb, &alloc_size);

Is there any concern that running this on very large images will take
a very long time since it needs to iterate through each individual
4MiB (by default) backing object in the image? In libvirt, it only
attempts to calculate the actual usage if the fast-diff feature is
enabled, and recently it also got a new control to optionally disable
the functionality entirely since even with fast-diff it can be very
slow to compute over hundreds of images in a libvirt storage pool.

> +if (r < 0) {
> +return r;
> +}
> +
> +return alloc_size;
> +}
> +
>  static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
>   int64_t offset,
>   PreallocMode prealloc,
> @@ -1254,6 +1286,7 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_get_info  = qemu_rbd_getinfo,
>  .create_opts= _rbd_create_opts,
>  .bdrv_getlength = qemu_rbd_getlength,
> +.bdrv_get_allocated_file_size = qemu_rbd_get_allocated_file_size,
>  .bdrv_co_truncate   = qemu_rbd_co_truncate,
>  .protocol_name  = "rbd",
>
> --
> 2.20.1
>


-- 
Jason



Re: [Qemu-devel] [Qemu-block] [PATCH] block/rbd: add preallocation support

2019-04-29 Thread Jason Dillaman
On Mon, Apr 29, 2019 at 8:47 AM Stefano Garzarella  wrote:
>
> On Sat, Apr 27, 2019 at 08:43:26AM -0400, Jason Dillaman wrote:
> > On Sat, Apr 27, 2019 at 7:37 AM Stefano Garzarella  
> > wrote:
> > >
> > > This patch adds the support of preallocation (off/full) for the RBD
> > > block driver.
> > > If available, we use rbd_writesame() to quickly fill the image when
> > > full preallocation is required.
> > >
> > > Signed-off-by: Stefano Garzarella 
> > > ---
> > >  block/rbd.c  | 149 ++-
> > >  qapi/block-core.json |   4 +-
> > >  2 files changed, 136 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/block/rbd.c b/block/rbd.c
> > > index 0c549c9935..29dd1bb040 100644
> > > --- a/block/rbd.c
> > > +++ b/block/rbd.c
> > > @@ -13,6 +13,7 @@
> > >
> > >  #include "qemu/osdep.h"
> > >
> > > +#include "qemu/units.h"
> > >  #include <rbd/librbd.h>
> > >  #include "qapi/error.h"
> > >  #include "qemu/error-report.h"
> > > @@ -331,6 +332,110 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t 
> > > offs)
> > >  }
> > >  }
> > >
> > > +static int qemu_rbd_do_truncate(rbd_image_t image, int64_t offset,
> > > +PreallocMode prealloc, Error **errp)
> > > +{
> > > +uint64_t current_length;
> > > +char *buf = NULL;
> > > +int ret;
> > > +
> > > +ret = rbd_get_size(image, &current_length);
> > > +if (ret < 0) {
> > > +error_setg_errno(errp, -ret, "Failed to get file length");
> > > +goto out;
> > > +}
> > > +
> > > +if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> > > +error_setg(errp, "Cannot use preallocation for shrinking files");
> > > +ret = -ENOTSUP;
> > > +goto out;
> > > +}
> > > +
> > > +switch (prealloc) {
> > > +case PREALLOC_MODE_FULL: {
> > > +uint64_t current_offset = current_length;
> > > +int buf_size = 64 * KiB;
> >
> > This should probably be 512B or 4KiB (which is the smallest striped
> > unit size). Not only will this avoid sending unnecessary zeroes to the
> > backing cluster, writesame silently turns into a standard write if
> > your buffer isn't properly aligned with the min(object size, stripe
> > unit size).
> >
>
> Okay, I'll change it on v2.
> Should we query about the "stripe_unit" size or we simply use the
> smallest allowed?

Technically we don't prevent a user from choosing terrible stripe unit
sizes (e.g. 1 byte), so you are probably safe to just use 4KiB.

> > > +ssize_t bytes;
> > > +
> > > +buf = g_malloc(buf_size);
> > > +/*
> > > + * Some versions of rbd_writesame() discards writes of buffers 
> > > with
> > > + * all zeroes. In order to avoid this behaviour, we set the 
> > > first byte
> > > + * to one.
> > > + */
> > > +buf[0] = 1;
> >
> > You could also use "rados_conf_set(cluster,
> > "rbd_discard_on_zeroed_write_same", "false").
> >
>
> I tried it, but it is not supported on all versions. (eg. I have Ceph
> v12.2.11 on my Fedora 29 and it is not supported, but rbd_writesame() is
> available)
>
> Maybe we can use both: "rbd_discard_on_zeroed_write_same = false" and
> "buf[0] = 1"

Probably not worth the effort if it's not supported across all releases.

> > > +ret = rbd_resize(image, offset);
> > > +if (ret < 0) {
> > > +error_setg_errno(errp, -ret, "Failed to resize file");
> > > +goto out;
> > > +}
> > > +
> > > +#ifdef LIBRBD_SUPPORTS_WRITESAME
> > > +while (offset - current_offset > buf_size) {
> > > +/*
> > > + * rbd_writesame() supports only request where the size of 
> > > the
> > > + * operation is multiple of buffer size and it must be less 
> > > or
> > > + * equal to INT_MAX.
> > > + */
> > > +bytes = MIN(offset - current_offset, INT_MAX);
> > > +bytes -= bytes % buf_size;
> >
> > Using the default object size of 4MiB, this write size would result in
> > up to 512 concurrent ops to the backing cluster. Perhaps the size
> > should be bounded such that only a dozen or so concurrent requests are
> > issued per write, always rounded next largest object / stripe period
> > size. librbd and the rbd CLI usually try to bound themselves to the
> > value in the "rbd_concurrent_management_ops" configuration setting
> > (currently defaults to 10).
> >
>
> Do you suggest to use "rbd_concurrent_management_ops" to limit
> concurrent requests or use a new QEMU parameters for the RBD driver?

I think it would be nicer to just query the
"rbd_concurrent_management_ops" limit to derive your writesame size
since the Ceph cluster admin can globally set that option to match the
available parallelism of the cluster.

> Thanks for your comments,
> Stefano


-- 
Jason



Re: [Qemu-devel] [Qemu-block] [PATCH] block/rbd: add preallocation support

2019-04-27 Thread Jason Dillaman
On Sat, Apr 27, 2019 at 7:37 AM Stefano Garzarella  wrote:
>
> This patch adds the support of preallocation (off/full) for the RBD
> block driver.
> If available, we use rbd_writesame() to quickly fill the image when
> full preallocation is required.
>
> Signed-off-by: Stefano Garzarella 
> ---
>  block/rbd.c  | 149 ++-
>  qapi/block-core.json |   4 +-
>  2 files changed, 136 insertions(+), 17 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 0c549c9935..29dd1bb040 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -13,6 +13,7 @@
>
>  #include "qemu/osdep.h"
>
> +#include "qemu/units.h"
>  #include <rbd/librbd.h>
>  #include "qapi/error.h"
>  #include "qemu/error-report.h"
> @@ -331,6 +332,110 @@ static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
>  }
>  }
>
> +static int qemu_rbd_do_truncate(rbd_image_t image, int64_t offset,
> +PreallocMode prealloc, Error **errp)
> +{
> +uint64_t current_length;
> +char *buf = NULL;
> +int ret;
> +
> +ret = rbd_get_size(image, &current_length);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Failed to get file length");
> +goto out;
> +}
> +
> +if (current_length > offset && prealloc != PREALLOC_MODE_OFF) {
> +error_setg(errp, "Cannot use preallocation for shrinking files");
> +ret = -ENOTSUP;
> +goto out;
> +}
> +
> +switch (prealloc) {
> +case PREALLOC_MODE_FULL: {
> +uint64_t current_offset = current_length;
> +int buf_size = 64 * KiB;

This should probably be 512B or 4KiB (which is the smallest striped
unit size). Not only will this avoid sending unnecessary zeroes to the
backing cluster, writesame silently turns into a standard write if
your buffer isn't properly aligned with the min(object size, stripe
unit size).

> +ssize_t bytes;
> +
> +buf = g_malloc(buf_size);
> +/*
> + * Some versions of rbd_writesame() discards writes of buffers with
> + * all zeroes. In order to avoid this behaviour, we set the first 
> byte
> + * to one.
> + */
> +buf[0] = 1;

You could also use "rados_conf_set(cluster,
"rbd_discard_on_zeroed_write_same", "false").

> +ret = rbd_resize(image, offset);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Failed to resize file");
> +goto out;
> +}
> +
> +#ifdef LIBRBD_SUPPORTS_WRITESAME
> +while (offset - current_offset > buf_size) {
> +/*
> + * rbd_writesame() supports only request where the size of the
> + * operation is multiple of buffer size and it must be less or
> + * equal to INT_MAX.
> + */
> +bytes = MIN(offset - current_offset, INT_MAX);
> +bytes -= bytes % buf_size;

Using the default object size of 4MiB, this write size would result in
up to 512 concurrent ops to the backing cluster. Perhaps the size
should be bounded such that only a dozen or so concurrent requests are
issued per write, always rounded next largest object / stripe period
size. librbd and the rbd CLI usually try to bound themselves to the
value in the "rbd_concurrent_management_ops" configuration setting
(currently defaults to 10).

> +bytes = rbd_writesame(image, current_offset, bytes, buf, 
> buf_size,
> +  0);
> +if (bytes < 0) {
> +ret = bytes;
> +error_setg_errno(errp, -ret,
> + "Failed to write for preallocation");
> +goto out;
> +}
> +
> +current_offset += bytes;
> +}
> +#endif /* LIBRBD_SUPPORTS_WRITESAME */
> +
> +while (current_offset < offset) {
> +bytes = rbd_write(image, current_offset,
> +  MIN(offset - current_offset, buf_size), buf);
> +if (bytes < 0) {
> +ret = bytes;
> +error_setg_errno(errp, -ret,
> + "Failed to write for preallocation");
> +goto out;
> +}
> +
> +current_offset += bytes;
> +}
> +
> +ret = rbd_flush(image);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Failed to flush the file");
> +goto out;
> +}
> +
> +break;
> +}
> +case PREALLOC_MODE_OFF:
> +ret = rbd_resize(image, offset);
> +if (ret < 0) {
> +error_setg_errno(errp, -ret, "Failed to resize file");
> +goto out;
> +}
> +break;
> +default:
> +error_setg(errp, "Unsupported preallocation mode: %s",
> +   PreallocMode_str(prealloc));
> +ret = -ENOTSUP;
> +goto out;
> +}
> +
> +ret = 0;
> +
> +out:
> +g_free(buf);
> +return ret;
> +}
> +
>  static QemuOptsList runtime_opts = {
>  

Re: [Qemu-devel] [PATCH RFC 1/1] block/rbd: increase dynamically the image size

2019-04-14 Thread Jason Dillaman
On Sun, Apr 14, 2019 at 9:20 AM Stefano Garzarella  wrote:
>
> On Thu, Apr 11, 2019 at 01:06:49PM -0400, Jason Dillaman wrote:
> > On Thu, Apr 11, 2019 at 9:02 AM Stefano Garzarella  
> > wrote:
> > >
> > > On Thu, Apr 11, 2019 at 08:35:44AM -0400, Jason Dillaman wrote:
> > > > On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella 
> > > >  wrote:
> > > > >
> > > > > RBD APIs don't allow us to write more than the size set with 
> > > > > rbd_create()
> > > > > or rbd_resize().
> > > > > In order to support growing images (eg. qcow2), we resize the image
> > > > > before RW operations that exceed the current size.
> > > >
> > > > What's the use-case for storing qcow2 images within a RBD image? RBD
> > > > images are already thinly provisioned, they support snapshots, they
> > > > can form a parent/child linked image hierarchy.
> > > >
> > >
> > > Hi Jason,
> > > I understand your point of view, maybe one use case could be if you have
> > > a qcow2 image and you want to put it in the rdb pool without convert it.
> > >
> > > I'm going through this BZ [1] and I'll ask if they have other
> > > use cases in mind.
> >
> > Assuming no good use-cases, perhaps it would just be better to make
> > the qemu-img error messages more clear.
> >
>
> Hi Jason,
> I asked about use-cases and they want to use qcow2 on rbd in order to
> take advantage of these qcow2 features [1]: external snapshots,
> Copy-on-write, and optional compression and encryption.
>
> Maybe the more interesting are external snapshots and Copy-on-write,

Copy-on-write is natively supported by RBD. The concept of external
snapshots seems similar to just automating the process of creating a
new copy-on-write image. Compression is also supported by Ceph on the
cluster side in recent releases.
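
For reference, the native copy-on-write flow referred to here looks roughly
like this with the librbd C API (error handling omitted; the pool context and
image/snapshot names are placeholders):

    /* Sketch: snapshot + protect + clone gives a copy-on-write child image. */
    rbd_image_t parent;
    int order = 0;

    rbd_open(io_ctx, "base-image", &parent, NULL);
    rbd_snap_create(parent, "gold");       /* point-in-time snapshot   */
    rbd_snap_protect(parent, "gold");      /* required before cloning  */
    rbd_close(parent);

    /* The child shares all unmodified data with the parent snapshot. */
    rbd_clone(io_ctx, "base-image", "gold", io_ctx, "vm-disk-1",
              RBD_FEATURE_LAYERING, &order);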

> since encryption can be achieved with LUKS and rbd should support
> compression for a specified pool.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1171007#c13
>
> Cheers,
> Stefano



-- 
Jason



Re: [Qemu-devel] [PATCH RFC 1/1] block/rbd: increase dynamically the image size

2019-04-11 Thread Jason Dillaman
On Thu, Apr 11, 2019 at 9:02 AM Stefano Garzarella  wrote:
>
> On Thu, Apr 11, 2019 at 08:35:44AM -0400, Jason Dillaman wrote:
> > On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella  
> > wrote:
> > >
> > > RBD APIs don't allow us to write more than the size set with rbd_create()
> > > or rbd_resize().
> > > In order to support growing images (eg. qcow2), we resize the image
> > > before RW operations that exceed the current size.
> >
> > What's the use-case for storing qcow2 images within a RBD image? RBD
> > images are already thinly provisioned, they support snapshots, they
> > can form a parent/child linked image hierarchy.
> >
>
> Hi Jason,
> I understand your point of view, maybe one use case could be if you have
> a qcow2 image and you want to put it in the rbd pool without converting it.
>
> I'm going through this BZ [1] and I'll ask if they have other
> use cases in mind.

Assuming no good use-cases, perhaps it would just be better to make
the qemu-img error messages more clear.

> Anyway, if we want to support only raw format on RBD driver, maybe we
> should return something different from current behaviour, also avoiding
> to create the image:
>
> $ qemu-img create -f qcow2 rbd:test/qcow.img 1G
>   qemu-img: rbd:test/qcow.img: Could not write qcow2 header: Invalid argument
>
> What do you think?
>
> Thanks,
> Stefano
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1171007



-- 
Jason



Re: [Qemu-devel] [PATCH RFC 1/1] block/rbd: increase dynamically the image size

2019-04-11 Thread Jason Dillaman
On Thu, Apr 11, 2019 at 7:00 AM Stefano Garzarella  wrote:
>
> RBD APIs don't allow us to write more than the size set with rbd_create()
> or rbd_resize().
> In order to support growing images (eg. qcow2), we resize the image
> before RW operations that exceed the current size.

What's the use-case for storing qcow2 images within a RBD image? RBD
images are already thinly provisioned, they support snapshots, they
can form a parent/child linked image hierarchy.

> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=1171007
> Signed-off-by: Stefano Garzarella 
> ---
>  block/rbd.c | 25 +
>  1 file changed, 25 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 0c549c9935..228658e20a 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -102,6 +102,7 @@ typedef struct BDRVRBDState {
>  rbd_image_t image;
>  char *image_name;
>  char *snap;
> +uint64_t image_size;
>  } BDRVRBDState;
>
>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
> @@ -777,6 +778,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  goto failed_open;
>  }
>
> +r = rbd_get_size(s->image, &s->image_size);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error reading image size from %s",
> + s->image_name);
> +rbd_close(s->image);
> +goto failed_open;
> +}
> +
>  /* If we are using an rbd snapshot, we must be r/o, otherwise
>   * leave as-is */
>  if (s->snap != NULL) {
> @@ -921,6 +930,20 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
>  rcb->buf = acb->bounce;
>  }
>
> +/*
> + * RBD APIs don't allow us to write more than actual size, so in order
> + * to support growing images, we resize the image before RW operations
> + * that exceed the current size.
> + */
> +if (s->image_size < off + size) {
> +r = rbd_resize(s->image, off + size);
> +if (r < 0) {
> +goto failed;
> +}
> +
> +s->image_size = off + size;
> +}
> +
>  acb->ret = 0;
>  acb->error = 0;
>  acb->s = s;
> @@ -1066,6 +1089,8 @@ static int coroutine_fn 
> qemu_rbd_co_truncate(BlockDriverState *bs,
>  return r;
>  }
>
> +s->image_size = offset;
> +
>  return 0;
>  }
>
> --
> 2.20.1
>
>


-- 
Jason



[Qemu-devel] [Bug 1811720] Re: storage physical_block_size is restricted to uint16_t

2019-01-14 Thread Jason Dillaman
I think the SCSI spec is limited to 16 bits for representing the block
length (in bytes) (see READ CAPACITY(10) command). It's also probably
sub-optimal to force a full 4MiB write even for small IOs. You might
achieve what you are looking for by setting the minimal and optimal IO
size hints to 4MiB (exposed via SCSI block limits VPD page) using
"min_io_size" and "opt_io_size" settings.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1811720

Title:
  storage physical_block_size is restricted to uint16_t

Status in QEMU:
  New

Bug description:
  It is desirable to set -global scsi-hd.physical_block_size=4194304 for
  RBD-backed storage.

  But unfortunatelly, this values is restricted with uint16_t, i.e.
  65536 maximum.

  For example, scsi-hd.discard_granularity=4194304 is not so restricted
  (and works as expected)

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1811720/+subscriptions



[Qemu-devel] [Bug 1701449] Re: high memory usage when using rbd with client caching

2018-10-01 Thread Jason Dillaman
@Nick: if you can recreate the librbd memory growth, any chance you can
help test a potential fix [1]?

[1] https://github.com/ceph/ceph/pull/24297

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1701449

Title:
  high memory usage when using rbd with client caching

Status in QEMU:
  New

Bug description:
  Hi,
  we are experiencing a quite high memory usage of a single qemu (used with 
KVM) process when using RBD with client caching as a disk backend. We are 
testing with 3GB memory qemu virtual machines and 128MB RBD client cache. When 
running 'fio' in the virtual machine you can see that after some time the 
machine uses a lot more memory (RSS) on the hypervisor than she should. We have 
seen values (in real production machines, no artificially fio tests) of 250% 
memory overhead. I reproduced this with qemu version 2.9 as well.

  Here the contents of our ceph.conf on the hypervisor:
  """
  [client]
  rbd cache writethrough until flush = False
  rbd cache max dirty = 100663296
  rbd cache size = 134217728
  rbd cache target dirty = 50331648
  """

  How to reproduce:
  * create a virtual machine with a RBD backed disk (100GB or so)
  * install a linux distribution on it (we are using Ubuntu)
  * install fio (apt-get install fio)
  * run fio multiple times with (e.g.) the following test file:
  """
  # This job file tries to mimic the Intel IOMeter File Server Access Pattern
  [global]
  description=Emulation of Intel IOmeter File Server Access Pattern
  randrepeat=0
  filename=/root/test.dat
  # IOMeter defines the server loads as the following:
  # iodepth=1 Linear
  # iodepth=4 Very Light
  # iodepth=8 Light
  # iodepth=64Moderate
  # iodepth=256   Heavy
  iodepth=8
  size=80g
  direct=0
  ioengine=libaio

  [iometer]
  stonewall
  bs=4M
  rw=randrw

  [iometer_just_write]
  stonewall
  bs=4M
  rw=write

  [iometer_just_read]
  stonewall
  bs=4M
  rw=read
  """

  You can measure the virtual machine RSS usage on the hypervisor with:
virsh dommemstat  | grep rss
  or if you are not using libvirt:
grep RSS /proc//status

  When switching off the RBD client cache, all is ok again, as the
  process does not use so much memory anymore.

  There is already a ticket on the ceph bug tracker for this ([1]).
  However I can reproduce that memory behaviour only when using qemu
  (maybe it is using librbd in a special way?). Running directly 'fio'
  with the rbd engine does not result in that high memory usage.

  [1] http://tracker.ceph.com/issues/20054

To manage notifications about this bug go to:
https://bugs.launchpad.net/qemu/+bug/1701449/+subscriptions



Re: [Qemu-devel] [Qemu-block] [PATCH v2 4/6] rbd: Switch to byte-based callbacks

2018-04-24 Thread Jason Dillaman
On Tue, Apr 24, 2018 at 3:25 PM, Eric Blake  wrote:
> We are gradually moving away from sector-based interfaces, towards
> byte-based.  Make the change for the last few sector-based callbacks
> in the rbd driver.
>
> Note that the driver was already using byte-based calls for
> performing actual I/O, so this just gets rid of a round trip
> of scaling; however, as I don't know if RBD is tolerant of
> non-sector AIO operations, I went with the conservative approach
> of adding .bdrv_refresh_limits to override the block layer
> defaults back to the pre-patch value of 512.
>
> Signed-off-by: Eric Blake 
>
> ---
> v2: override new block layer default alignment [Kevin]
> ---
>  block/rbd.c | 44 
>  1 file changed, 24 insertions(+), 20 deletions(-)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index c9359d0ad84..638ecf8d986 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -231,6 +231,13 @@ done:
>  }
>
>
> +static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
> +{
> +/* XXX Does RBD support AIO on less than 512-byte alignment? */

Yes, librbd internally supports 1-byte alignment for IO, but the
optimal alignment/length would be object size * stripe count.
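
A hedged sketch of what the callback could advertise based on that, if one
wanted to expose it (the posted patch deliberately stays conservative; the use
of rbd_stat()/rbd_get_stripe_count() here is illustrative only):

    static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
    {
        BDRVRBDState *s = bs->opaque;
        rbd_image_info_t info;
        uint64_t stripe_count;

        bs->bl.request_alignment = 1;   /* librbd accepts byte-aligned IO */

        /* Optimal transfer hint: object size * stripe count, per the above. */
        if (rbd_stat(s->image, &info, sizeof(info)) == 0 &&
            rbd_get_stripe_count(s->image, &stripe_count) == 0) {
            bs->bl.opt_transfer = MIN(info.obj_size * stripe_count, INT_MAX);
        }
    }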

> +bs->bl.request_alignment = 512;
> +}
> +
> +
>  static int qemu_rbd_set_auth(rados_t cluster, const char *secretid,
>   Error **errp)
>  {
> @@ -899,27 +906,23 @@ failed:
>  return NULL;
>  }
>
> -static BlockAIOCB *qemu_rbd_aio_readv(BlockDriverState *bs,
> -  int64_t sector_num,
> -  QEMUIOVector *qiov,
> -  int nb_sectors,
> -  BlockCompletionFunc *cb,
> -  void *opaque)
> -{
> -return rbd_start_aio(bs, sector_num << BDRV_SECTOR_BITS, qiov,
> - (int64_t) nb_sectors << BDRV_SECTOR_BITS, cb, 
> opaque,
> - RBD_AIO_READ);
> -}
> -
> -static BlockAIOCB *qemu_rbd_aio_writev(BlockDriverState *bs,
> -   int64_t sector_num,
> -   QEMUIOVector *qiov,
> -   int nb_sectors,
> +static BlockAIOCB *qemu_rbd_aio_preadv(BlockDriverState *bs,
> +   uint64_t offset, uint64_t bytes,
> +   QEMUIOVector *qiov, int flags,
> BlockCompletionFunc *cb,
> void *opaque)
>  {
> -return rbd_start_aio(bs, sector_num << BDRV_SECTOR_BITS, qiov,
> - (int64_t) nb_sectors << BDRV_SECTOR_BITS, cb, 
> opaque,
> +return rbd_start_aio(bs, offset, qiov, bytes, cb, opaque,
> + RBD_AIO_READ);
> +}
> +
> +static BlockAIOCB *qemu_rbd_aio_pwritev(BlockDriverState *bs,
> +uint64_t offset, uint64_t bytes,
> +QEMUIOVector *qiov, int flags,
> +BlockCompletionFunc *cb,
> +void *opaque)
> +{
> +return rbd_start_aio(bs, offset, qiov, bytes, cb, opaque,
>   RBD_AIO_WRITE);
>  }
>
> @@ -1158,6 +1161,7 @@ static BlockDriver bdrv_rbd = {
>  .format_name= "rbd",
>  .instance_size  = sizeof(BDRVRBDState),
>  .bdrv_parse_filename= qemu_rbd_parse_filename,
> +.bdrv_refresh_limits= qemu_rbd_refresh_limits,
>  .bdrv_file_open = qemu_rbd_open,
>  .bdrv_close = qemu_rbd_close,
>  .bdrv_reopen_prepare= qemu_rbd_reopen_prepare,
> @@ -1170,8 +1174,8 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_truncate  = qemu_rbd_truncate,
>  .protocol_name  = "rbd",
>
> -.bdrv_aio_readv = qemu_rbd_aio_readv,
> -.bdrv_aio_writev= qemu_rbd_aio_writev,
> +.bdrv_aio_preadv= qemu_rbd_aio_preadv,
> +.bdrv_aio_pwritev   = qemu_rbd_aio_pwritev,
>
>  #ifdef LIBRBD_SUPPORTS_AIO_FLUSH
>  .bdrv_aio_flush = qemu_rbd_aio_flush,
> --
> 2.14.3
>
>



-- 
Jason



Re: [Qemu-devel] [Qemu-block] [PATCH] rbd: Detect rbd image resizes and propagate them

2017-09-13 Thread Jason Dillaman
On Wed, 2017-09-13 at 10:44 -0600, Adam Wolfe Gordon wrote:
> Register a watcher with rbd so that we get notified when an image is
> resized. Propagate resizes to parent block devices so that guest devices
> get resized without user intervention.
> 
> Signed-off-by: Adam Wolfe Gordon 
> ---
> Hello!
> 
> We are using this patch in production at DigitalOcean and have tested it 
> fairly
> extensively. However, we use the same block device configuration everywhere, 
> so
> I'd be interested to get an answer to the question I've left in the code:
> 
> > /* NOTE: This assumes there's only one layer between us and the
> >block-backend. Is this always true? */
> 
> If that isn't true, this patch will need a bit of work to traverse up the 
> block
> device tree and find the block-backend.
> 
> Of course, any other feedback would be welcome too - this is my first foray 
> into
> the qemu code.
> 
> Regards,
> Adam
> 
>  block/rbd.c | 80 
> +
>  1 file changed, 80 insertions(+)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 144f350e1f..1c9fcbec1f 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -96,6 +96,8 @@ typedef struct BDRVRBDState {
>  rbd_image_t image;
>  char *image_name;
>  char *snap;
> +uint64_t watcher_handle;
> +uint64_t size_bytes;
>  } BDRVRBDState;
>  
>  static char *qemu_rbd_next_tok(char *src, char delim, char **p)
> @@ -540,6 +542,67 @@ out:
>  return rados_str;
>  }
>  
> +/* BH for when rbd notifies us of an update. */
> +static void qemu_rbd_update_bh(void *arg)
> +{
> +BlockDriverState *parent, *bs = arg;
> +BDRVRBDState *s = bs->opaque;
> +bool was_variable_length;
> +uint64_t new_size_bytes;
> +int64_t new_parent_len;
> +BdrvChild *c;
> +int r;
> +
> +r = rbd_get_size(s->image, &new_size_bytes);
> +if (r < 0) {
> +error_report("error reading size for %s: %s", s->name, strerror(-r));
> +return;
> +}
> +
> +/* Avoid no-op resizes on non-resize notifications. */
> +if (new_size_bytes == s->size_bytes) {
> +error_printf("skipping non-resize rbd cb\n");

Image update callbacks can be invoked for any number of reasons, not
just resize events. Therefore, you probably don't want to have an error
message printed out here.

> +return;
> +}
> +
> +/* NOTE: This assumes there's only one layer between us and the
> +   block-backend. Is this always true? */

I don't think that can be assumed to be true.

> +parent = bs->inherits_from;
> +if (parent == NULL) {
> +error_report("bs %s does not have parent", 
> bdrv_get_device_or_node_name(bs));
> +return;
> +}
> +
> +/* Force parents to re-read our size. */

Can you just invoke blk_truncate instead? 

> +was_variable_length = bs->drv->has_variable_length;
> +bs->drv->has_variable_length = true;
> +new_parent_len = bdrv_getlength(parent);
> +if (new_parent_len < 0) {
> +error_report("getlength failed on parent %s", 
> bdrv_get_device_or_node_name(parent));
> +bs->drv->has_variable_length = was_variable_length;
> +return;
> +}
> +bs->drv->has_variable_length = was_variable_length;
> +
> +/* Notify block backends that that we have resized.
> +   Copied from bdrv_parent_cb_resize. */
> +QLIST_FOREACH(c, &bs->parents, next_parent) {
> +if (c->role->resize) {
> +c->role->resize(c);
> +}
> +}
> +
> +s->size_bytes = new_size_bytes;
> +}
> +
> +/* Called from non-qemu thread - careful! */
> +static void qemu_rbd_update_cb(void *arg)
> +{
> +BlockDriverState *bs = arg;
> +
> +aio_bh_schedule_oneshot(bdrv_get_aio_context(bs), qemu_rbd_update_bh, 
> bs);
> +}
> +
>  static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
>   Error **errp)
>  {
> @@ -672,9 +735,25 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
> *options, int flags,
>  }
>  }
>  
> +r = rbd_update_watch(s->image, &s->watcher_handle, qemu_rbd_update_cb, 
> bs);

This API method was only added in the Ceph Jewel release just over a
year ago. This should probably gracefully degrade if compiled against an
older version of the librbd API.
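
A sketch of the graceful degradation being asked for, in the same style the
driver already uses for other optional librbd calls; the version threshold
below is a placeholder and would need to be checked against the librbd.h that
first shipped rbd_update_watch():

    /* Placeholder threshold -- verify the real minimum in librbd.h. */
    #if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(1, 12, 0)
    #define LIBRBD_SUPPORTS_UPDATE_WATCH
    #endif

    #ifdef LIBRBD_SUPPORTS_UPDATE_WATCH
        r = rbd_update_watch(s->image, &s->watcher_handle, qemu_rbd_update_cb, bs);
        if (r < 0) {
            error_setg_errno(errp, -r, "error registering watcher on %s", s->name);
            goto failed_watch;
        }
    #endif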

> +if (r < 0) {
> +error_setg_errno(errp, -r, "error registering watcher on %s", 
> s->name);
> +goto failed_watch;
> +}
> +
> +r = rbd_get_size(s->image, &s->size_bytes);
> +if (r < 0) {
> +error_setg_errno(errp, -r, "error reading size for %s", s->name);
> +goto failed_sz;
> +}
> +
>  qemu_opts_del(opts);
>  return 0;
>  
> +failed_sz:
> +rbd_update_unwatch(s->image, s->watcher_handle);
> +failed_watch:
> +rbd_close(s->image);
>  failed_open:
>  rados_ioctx_destroy(s->io_ctx);
>  failed_shutdown:
> @@ -712,6 +791,7 @@ static void qemu_rbd_close(BlockDriverState *bs)
>  {
>  BDRVRBDState *s = 

Re: [Qemu-devel] [RFC v5] RBD: Add support readv,writev for rbd

2017-02-16 Thread Jason Dillaman
On Thu, Feb 16, 2017 at 10:13 AM, Alexandre DERUMIER
 wrote:
> Hi, I would like to bench it with small 4k read/write.
>
> On the ceph side,do we need this PR ? :
> https://github.com/ceph/ceph/pull/13447

Yes, that is the correct PR for the client-side librbd changes. You
should be able to test it against Hammer/Jewel-release clusters.

-- 
Jason



Re: [Qemu-devel] [PATCH] RBD: Add support readv,writev for rbd

2016-11-05 Thread Jason Dillaman
On Sat, Nov 5, 2016 at 1:17 AM,   wrote:
> From: tianqing 
>
> Rbd can do readv and writev directly, so wo do not need to transform
> iov to buf or vice versa any more.
>
> Signed-off-by: tianqing 
> ---
>  block/rbd.c | 124 
> 
>  1 file changed, 124 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index a57b3e3..a405c02 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -53,6 +53,13 @@
>  #undef LIBRBD_SUPPORTS_DISCARD
>  #endif
>
> +/* rbd_aio_readv, rbd_aio_writev added in 0.1.11 */
> +#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 11)
> +#define LIBRBD_SUPPORTS_IOV
> +#else
> +#undef LIBRBD_SUPPORTS_IOV
> +#endif
> +

The LIBRBD_SUPPORTS_IOVEC define is already defined by librbd.h if
it's supported.
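
So the version probing above could arguably be reduced to (sketch):

    /* librbd.h already defines LIBRBD_SUPPORTS_IOVEC when iovec IO exists,
     * so key off it directly instead of re-deriving it from the version. */
    #ifdef LIBRBD_SUPPORTS_IOVEC
    #define LIBRBD_SUPPORTS_IOV
    #endif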

>  #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
>
>  #define RBD_MAX_CONF_NAME_SIZE 128
> @@ -73,7 +80,10 @@ typedef struct RBDAIOCB {
>  BlockAIOCB common;
>  int64_t ret;
>  QEMUIOVector *qiov;
> +#ifdef LIBRBD_SUPPORTS_IOV
> +#else
>  char *bounce;
> +#endif

Could use "#ifndef" to avoid the empty section.

>  RBDAIOCmd cmd;
>  int error;
>  struct BDRVRBDState *s;
> @@ -83,7 +93,10 @@ typedef struct RADOSCB {
>  RBDAIOCB *acb;
>  struct BDRVRBDState *s;
>  int64_t size;
> +#ifdef LIBRBD_SUPPORTS_IOV
> +#else
>  char *buf;
> +#endif

Same comment here re: just using #ifndef

>  int64_t ret;
>  } RADOSCB;
>
> @@ -406,6 +419,48 @@ shutdown:
>  return ret;
>  }
>
> +
> +#ifdef LIBRBD_SUPPORTS_IOV
> +/*
> + * This aio completion is being called from rbd_finish_bh() and runs in qemu
> + * BH context.
> + */
> +static void qemu_rbd_complete_aio(RADOSCB *rcb)
> +{
> +RBDAIOCB *acb = rcb->acb;
> +int64_t r;
> +
> +r = rcb->ret;
> +
> +if (acb->cmd != RBD_AIO_READ) {
> +if (r < 0) {
> +acb->ret = r;
> +acb->error = 1;
> +} else if (!acb->error) {
> +acb->ret = rcb->size;
> +}
> +} else {
> +if (r < 0) {
> +iov_memset(acb->qiov->iov, acb->qiov->niov, 0, 0, 
> acb->qiov->size);
> +acb->ret = r;
> +acb->error = 1;
> +} else if (r < rcb->size) {
> +iov_memset(acb->qiov->iov, acb->qiov->niov,
> +   rcb->size - r, 0, acb->qiov->size);
> +if (!acb->error) {
> +acb->ret = rcb->size;
> +}
> +} else if (!acb->error) {
> +acb->ret = r;
> +}
> +}
> +
> +g_free(rcb);
> +acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
> +
> +qemu_aio_unref(acb);
> +}

Instead of duplicating the existing qemu_rbd_complete_aio and changing
a couple lines, perhaps just #ifdef/#ifndef around the few lines that
need changes (e.g. the bounce buffer vs iov buffer).
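
A sketch of what a single, merged completion routine could look like, guarding
only the buffer-specific lines (this follows the structure of the code quoted
above and is illustrative, not the final implementation):

static void qemu_rbd_complete_aio(RADOSCB *rcb)
{
    RBDAIOCB *acb = rcb->acb;
    int64_t r = rcb->ret;

    if (acb->cmd != RBD_AIO_READ) {
        if (r < 0) {
            acb->ret = r;
            acb->error = 1;
        } else if (!acb->error) {
            acb->ret = rcb->size;
        }
    } else {
        if (r < 0) {
            /* zero the whole destination on a failed read */
#ifdef LIBRBD_SUPPORTS_IOVEC
            iov_memset(acb->qiov->iov, acb->qiov->niov, 0, 0, acb->qiov->size);
#else
            memset(rcb->buf, 0, rcb->size);
#endif
            acb->ret = r;
            acb->error = 1;
        } else if (r < rcb->size) {
            /* zero the tail of a short read */
#ifdef LIBRBD_SUPPORTS_IOVEC
            iov_memset(acb->qiov->iov, acb->qiov->niov,
                       rcb->size - r, 0, acb->qiov->size);
#else
            memset(rcb->buf + r, 0, rcb->size - r);
#endif
            if (!acb->error) {
                acb->ret = rcb->size;
            }
        } else if (!acb->error) {
            acb->ret = r;
        }
    }

    g_free(rcb);

#ifndef LIBRBD_SUPPORTS_IOVEC
    if (acb->cmd == RBD_AIO_READ) {
        qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
    }
    qemu_vfree(acb->bounce);
#endif

    acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
    qemu_aio_unref(acb);
}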

> +#else
>  /*
>   * This aio completion is being called from rbd_finish_bh() and runs in qemu
>   * BH context.
> @@ -449,6 +504,7 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
>
>  qemu_aio_unref(acb);
>  }
> +#endif
>
>  /* TODO Convert to fine grained options */
>  static QemuOptsList runtime_opts = {
> @@ -644,6 +700,73 @@ static int rbd_aio_flush_wrapper(rbd_image_t image,
>  #endif
>  }
>
> +#ifdef LIBRBD_SUPPORTS_IOV
> +static BlockAIOCB *rbd_start_aio_vec(BlockDriverState *bs,
> + int64_t off,
> + QEMUIOVector *qiov,
> + int64_t size,
> + BlockCompletionFunc *cb,
> + void *opaque,
> + RBDAIOCmd cmd)
> +{

Is this method being invoked? I would also recommend just
intermingling the bounce buffer vs iov differences into the existing
rbd_start_aio method instead of duplicating a lot of code.
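
The submission side can be folded into the existing rbd_start_aio() the same
way, guarding only the actual librbd call (sketch only; rcb->buf is the bounce
buffer that only exists on the non-iovec path):

    switch (cmd) {
    case RBD_AIO_WRITE:
#ifdef LIBRBD_SUPPORTS_IOVEC
        r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
#else
        r = rbd_aio_write(s->image, off, size, rcb->buf, c);
#endif
        break;
    case RBD_AIO_READ:
#ifdef LIBRBD_SUPPORTS_IOVEC
        r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
#else
        r = rbd_aio_read(s->image, off, size, rcb->buf, c);
#endif
        break;
    case RBD_AIO_DISCARD:
        r = rbd_aio_discard_wrapper(s->image, off, size, c);
        break;
    case RBD_AIO_FLUSH:
        r = rbd_aio_flush_wrapper(s->image, c);
        break;
    default:
        r = -EINVAL;
    }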

> +RBDAIOCB *acb;
> +RADOSCB *rcb = NULL;
> +rbd_completion_t c;
> +int r;
> +
> +BDRVRBDState *s = bs->opaque;
> +
> +acb = qemu_aio_get(&rbd_aiocb_info, bs, cb, opaque);
> +acb->cmd = cmd;
> +acb->qiov = qiov;
> +assert(!qiov || qiov->size == size);
> +
> +acb->ret = 0;
> +acb->error = 0;
> +acb->s = s;
> +
> +rcb = g_new(RADOSCB, 1);
> +rcb->acb = acb;
> +rcb->s = acb->s;
> +rcb->size = size;
> +r = rbd_aio_create_completion(rcb, (rbd_callback_t) rbd_finish_aiocb, &c);
> +if (r < 0) {
> +goto failed;
> +}
> +
> +switch (cmd) {
> +case RBD_AIO_WRITE:
> +r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
> +break;
> +case RBD_AIO_READ:
> +r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
> +break;
> +case RBD_AIO_DISCARD:
> +r = rbd_aio_discard_wrapper(s->image, off, size, c);
> +break;
> +case RBD_AIO_FLUSH:
> +r = rbd_aio_flush_wrapper(s->image, c);
> +   

Re: [Qemu-devel] [PATCH v6 04/22] block: Introduce image file locking

2016-06-07 Thread Jason Dillaman
On Fri, Jun 3, 2016 at 4:48 AM, Fam Zheng  wrote:
> +typedef enum {
> +/* The values are ordered so that lower number implies higher restriction.
> + * Starting from 1 to make 0 an invalid value.
> + * */
> +BDRV_LOCKF_EXCLUSIVE = 1,
> +BDRV_LOCKF_SHARED,
> +BDRV_LOCKF_UNLOCK,
> +} BdrvLockfCmd;
> +

We started to talk about new APIs in librbd to support this feature
where we don't need to worry about admin action should QEMU crash
while holding the lock.

Any chance for separating the UNLOCK enum into the exclusive vs shared
case? We could do some magic in the rbd block driver to guess how it
was locked but it seems like it would be cleaner (at least for us) to
explicitly call out what type of unlock you are requesting since it
will involve different API methods.
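
One way to express that split in the enum (the names below are only
suggestions, not an existing QEMU interface) and consume it in block/rbd.c:

typedef enum {
    /* Lower number implies higher restriction; 0 stays invalid. */
    BDRV_LOCKF_EXCLUSIVE = 1,
    BDRV_LOCKF_SHARED,
    BDRV_LOCKF_UNLOCK_EXCLUSIVE,
    BDRV_LOCKF_UNLOCK_SHARED,
} BdrvLockfCmd;

static int qemu_rbd_lockf(BlockDriverState *bs, BdrvLockfCmd cmd)
{
    BDRVRBDState *s = bs->opaque;

    switch (cmd) {
    case BDRV_LOCKF_EXCLUSIVE:
        return rbd_lock_exclusive(s->image, NULL);
    case BDRV_LOCKF_SHARED:
        return rbd_lock_shared(s->image, NULL, NULL);
    case BDRV_LOCKF_UNLOCK_EXCLUSIVE:
    case BDRV_LOCKF_UNLOCK_SHARED:
        /* Both map to rbd_unlock() today; a future librbd API could take a
         * different path for each unlock variant. */
        return rbd_unlock(s->image, NULL);
    default:
        abort();
    }
}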

-- 
Jason



Re: [Qemu-devel] [PATCH] block/rbd: add .bdrv_reopen_prepare() stub

2016-05-18 Thread Jason Dillaman
On Wed, May 18, 2016 at 4:19 AM, Kevin Wolf  wrote:
>> Updating this setting on an open image won't do anything, but if you
>> rbd_close() and rbd_open() it again the setting will take effect.
>> rbd_close() will force a flush of any pending I/O in librbd and
>> free the memory for librbd's ImageCtx, which may or may not be desired
>> here.
>
> First rbd_close() and then rbd_open() risks that the rbd_open() fails
> and we end up with no usable image at all. Can we open a second instance
> of the image first and only close the first one if that succeeded?
>
> We already flush all requests before calling this, so that part
> shouldn't make a difference.


You can open the same image twice without issue.  It's perfectly fine
to rbd_open() with the new settings and upon success rbd_close() the
original image handle.
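
A minimal sketch of that open-new-then-close-old sequence, assuming the usual
BDRVRBDState fields (cluster, io_ctx, image, name, snap); the helper name and
the rbd_cache toggle are made up for illustration:

static int qemu_rbd_reopen_image(BDRVRBDState *s, bool enable_cache,
                                 Error **errp)
{
    rbd_image_t new_image;
    int r;

    rados_conf_set(s->cluster, "rbd_cache", enable_cache ? "true" : "false");

    /* Open a second handle to the same image with the new settings... */
    r = rbd_open(s->io_ctx, s->name, &new_image, s->snap);
    if (r < 0) {
        error_setg_errno(errp, -r, "error reopening image %s", s->name);
        return r;   /* the original handle is untouched and still usable */
    }

    /* ...and only once that has succeeded, drop the original handle. */
    rbd_close(s->image);
    s->image = new_image;
    return 0;
}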

-- 
Jason



Re: [Qemu-devel] [PATCH] block/rbd: add .bdrv_reopen_prepare() stub

2016-05-17 Thread Jason Dillaman
On Tue, May 17, 2016 at 6:03 AM, Sebastian Färber  wrote:
> Hi Kevin,
>
>> A correct reopen implementation must consider all options and flags that
>> .bdrv_open() looked at.
>>
>> The options are okay, as both "filename" and "password-secret" aren't
>> things that we want to allow a reopen to change. However, in the flags
>> BDRV_O_NOCACHE makes a difference:
>>
>> if (flags & BDRV_O_NOCACHE) {
>> rados_conf_set(s->cluster, "rbd_cache", "false");
>> } else {
>> rados_conf_set(s->cluster, "rbd_cache", "true");
>> }
>>
>> A reopen must either update the setting, or if it can't (e.g. because
>> librbd doesn't support it) any attempt to change the flag must fail.
>
> Thanks for the feedback.
> As far as i can tell it's not possible to update the cache settings
> without reconnecting. I've added a check in the following patch.
> Would be great if someone who knows the internals of ceph/rbd could
> have a look as well.

Correct, you cannot currently enable or disable the cache dynamically.
You would need to close the image and re-open it to change the cache
settings.

> Sebastian
>
> -- >8 --
> Subject: [PATCH] block/rbd: add .bdrv_reopen_prepare() stub
>
> Add support for reopen() by adding the .bdrv_reopen_prepare() stub
>
> Signed-off-by: Sebastian Färber 
> ---
>  block/rbd.c | 14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 5bc5b32..8ecf096 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -577,6 +577,19 @@ failed_opts:
>  return r;
>  }
>
> +/* Note that this will not re-establish a connection with the Ceph cluster
> +   - it is effectively a NOP.  */
> +static int qemu_rbd_reopen_prepare(BDRVReopenState *state,
> +   BlockReopenQueue *queue, Error **errp)
> +{
> +if (state->flags & BDRV_O_NOCACHE &&
> +((state->bs->open_flags & BDRV_O_NOCACHE) == 0)) {
> +error_setg(errp, "Cannot turn off rbd_cache during reopen");
> +return -EINVAL;
> +}
> +return 0;
> +}
> +
>  static void qemu_rbd_close(BlockDriverState *bs)
>  {
>  BDRVRBDState *s = bs->opaque;
> @@ -976,6 +989,7 @@ static BlockDriver bdrv_rbd = {
>  .instance_size  = sizeof(BDRVRBDState),
>  .bdrv_needs_filename = true,
>  .bdrv_file_open = qemu_rbd_open,
> +.bdrv_reopen_prepare = qemu_rbd_reopen_prepare,
>  .bdrv_close = qemu_rbd_close,
>  .bdrv_create= qemu_rbd_create,
>  .bdrv_has_zero_init = bdrv_has_zero_init_1,
> --
> 1.8.3.1
>
>



-- 
Jason



[Qemu-devel] [Bug 1581334] Re: qemu + librbd takes high %sy cpu under high random io workload

2016-05-14 Thread Jason Dillaman
Any chance you can re-test with a more recent kernel on the hypervisor
host? If the spin-lock was coming from user-space, I would expect
futex_wait_setup and futex_wake to be much higher.

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1581334

Title:
  qemu + librbd takes high %sy cpu under high random io workload

Status in Ceph:
  New
Status in Linux:
  New
Status in QEMU:
  New

Bug description:
  I have an IO problem. When running QEMU + Ceph (using librbd) and doing a 
random IO benchmark or some other high-load random IO test, it exhausts all my 
host CPU in %sy.
  It doesn't happen all the time, but once it appears it reproduces every time 
I start a random IO benchmark (tested with fio).
  The only way to fix the problem is to shut down my VM and start it again, but 
the problem comes back under high random IO load.

  Some information:
  Vendor  : HP
  Product : HP ProLiant BL460c Gen9
  Kernel  : 3.16.0-4-amd64
  Distro  : Debian
  Version : 8.4
  Arch    : amd64
  Qemu    : 2.1 ~ 2.6 (yes, I already tested the latest qemu 2.6 version, but 
still got the problem)
  Ceph    : Hammer 0.94.5
  Librbd  : 0.94.5 ~ 10.2 (I rebuilt librbd with the ceph 10.2 source code, but 
the problem is still there)
  Qemu config : cache=none
  Qemu cpu    : 4 cores, 8GB

  How can I reproduce the problem?

  while :; do bash randwrite.sh ; sleep 3600; done >test.log 2>&1 &
  (The sleep 3600 is the key to reproducing my problem. I don't know exactly 
how long the sleep needs to be, but one hour is enough. The problem reproduces 
easily after a long sleep; if I keep the benchmark running without sleeping, I 
can't reproduce it.)

  My randwrite.sh script
  --
  #!/bin/sh
  sync
  echo 3 > /proc/sys/vm/drop_caches

  FILENAME=/dev/vdc
  RUNTIME=100
  BLOCKSIZE=4K
  IOENGINE=libaio
  RESULTFILE=fio-randwrite.log
  IODEPTH=32
  RAMP_TIME=5
  SIZE=100G

  fio --numjobs 10 --norandommap --randrepeat=0 --readwrite=randwrite --ramp_time=$RAMP_TIME --bs=$BLOCKSIZE --runtime=$RUNTIME --iodepth=$IODEPTH --filename=$FILENAME --ioengine=$IOENGINE --direct=1 --name=iops_randwrite --group_reporting | tee $RESULTFILE
  --

  What happens after the problem appears?
  My VM gets a huge IOPS drop. In my case it drops from 15000 IOPS to 3500 
IOPS. On top of that, my host CPU gets exhausted in %sy. The top output looks 
like this.

  Qemu Fio benchmark
  
  Tasks: 284 total,   2 running, 282 sleeping,   0 stopped,   0 zombie
  %Cpu0  : 11.8 us, 66.7 sy,  0.0 ni, 21.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu1  : 12.7 us, 64.9 sy,  0.0 ni, 22.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu2  : 13.7 us, 64.5 sy,  0.0 ni, 21.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu3  : 13.2 us, 64.1 sy,  0.0 ni, 22.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu4  : 11.7 us, 65.4 sy,  0.0 ni, 22.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu5  : 13.2 us, 64.4 sy,  0.0 ni, 22.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu6  : 12.4 us, 65.1 sy,  0.0 ni, 22.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu7  : 13.6 us, 63.8 sy,  0.0 ni, 22.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu8  :  9.8 us, 73.0 sy,  0.0 ni, 17.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu9  :  7.8 us, 74.5 sy,  0.0 ni, 17.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu10 :  6.0 us, 81.4 sy,  0.0 ni,  6.6 id,  0.0 wa,  0.0 hi,  6.0 si,  0.0 st
  %Cpu11 :  8.4 us, 79.5 sy,  0.0 ni,  8.8 id,  0.0 wa,  0.0 hi,  3.4 si,  0.0 st
  %Cpu12 :  7.6 us, 80.7 sy,  0.0 ni,  7.0 id,  0.0 wa,  0.0 hi,  4.7 si,  0.0 st
  %Cpu13 :  7.4 us, 79.9 sy,  0.0 ni,  7.7 id,  0.0 wa,  0.0 hi,  5.0 si,  0.0 st
  %Cpu14 :  9.8 us, 75.4 sy,  0.0 ni, 11.4 id,  0.0 wa,  0.0 hi,  3.4 si,  0.0 st
  %Cpu15 :  6.7 us, 80.1 sy,  0.0 ni, 10.1 id,  0.0 wa,  0.0 hi,  3.0 si,  0.0 st
  %Cpu16 :  9.2 us, 69.2 sy,  0.0 ni, 17.5 id,  0.0 wa,  0.0 hi,  4.1 si,  0.0 st
  %Cpu17 :  9.9 us, 66.6 sy,  0.0 ni, 20.1 id,  0.0 wa,  0.0 hi,  3.4 si,  0.0 st
  %Cpu18 : 16.6 us, 49.0 sy,  0.0 ni, 34.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu19 : 16.7 us, 46.4 sy,  0.0 ni, 36.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu20 : 13.0 us, 50.8 sy,  0.0 ni, 36.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu21 : 18.9 us, 46.2 sy,  0.0 ni, 34.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu22 : 12.1 us, 52.9 sy,  0.0 ni, 35.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu23 : 15.9 us, 47.6 sy,  0.0 ni, 36.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu24 :  6.7 us, 62.0 sy,  0.0 ni, 31.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu25 :  7.6 us, 63.7 sy,  0.0 ni, 28.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu26 :  8.1 us, 75.8 sy,  0.0 ni, 16.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu27 :  6.7 us, 73.6 sy,

[Qemu-devel] [Bug 1581334] Re: qemu + librbd takes high %sy cpu under high random io workload

2016-05-13 Thread Jason Dillaman
Can you run 'perf top' against just the QEMU process?  There was an
email chain from nearly a year ago about tcmalloc causing extremely high
'_raw_spin_lock' calls under high IOPS scenarios.
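
Something along these lines should narrow it down (the process name used for
the PID lookup is an assumption; adjust it to the actual QEMU binary on the
host):

  perf top -p "$(pgrep -f qemu-system-x86_64 | head -n 1)"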

-- 
You received this bug notification because you are a member of qemu-
devel-ml, which is subscribed to QEMU.
https://bugs.launchpad.net/bugs/1581334


Re: [Qemu-devel] [PATCH for-2.7 v2 07/17] rbd: Implement image locking

2016-04-27 Thread Jason Dillaman
On Tue, Apr 26, 2016 at 7:20 PM, Fam Zheng <f...@redhat.com> wrote:
> On Tue, 04/26 10:42, Jason Dillaman wrote:
>> On Sun, Apr 24, 2016 at 7:42 PM, Fam Zheng <f...@redhat.com> wrote:
>> > On Fri, 04/22 21:57, Jason Dillaman wrote:
>> >> Since this cannot automatically recover from a crashed QEMU client with an
>> >> RBD image, perhaps this RBD locking should not default to enabled.
>> >> Additionally, this will conflict with the "exclusive-lock" feature
>> >> available since the Ceph Hammer-release since both utilize the same 
>> >> locking
>> >> construct.
>> >>
>> >> As a quick background, the optional exclusive-lock feature can be
>> >> enabled/disabled on image and safely/automatically handles the case of
>> >> recovery from a crashed client.  Under normal conditions, the RBD
>> >> exclusive-lock feature automatically acquires the lock upon the first
>> >> attempt to write to the image and transparently transitions ownership of
>> >> the lock between two or more clients -- used for QEMU live-migration.
>> >
>> > Is it enabled by default?
>> >
>>
>> Starting with the Jewel release of Ceph it is enabled by default.
>
> OK, then I'll leave rbd in this QEMU series for now.

Without exposing some new API methods, this patch will, unfortunately,
directly conflict with the new Jewel rbd defaults and will actually
result in the image becoming read-only since librbd won't be able to
acquire the exclusive lock when QEMU owns the advisory lock.

We can probably get the new API methods upstream within a week or two
[1].  If that's too long of a delay, I'd recommend dropping rbd
locking from the series for now.

[1] http://tracker.ceph.com/issues/15632
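
For reference, the explicit acquire/release calls that grew out of that ticket
look roughly like this in later librbd releases (quoted from memory, so
double-check against the installed librbd.h before relying on them):

#include <rbd/librbd.h>

/* Explicitly take and drop the managed exclusive lock instead of relying on
 * the implicit acquisition on first write. */
int r = rbd_lock_acquire(image, RBD_LOCK_MODE_EXCLUSIVE);
if (r == 0) {
    /* ... perform I/O that requires ownership ... */
    rbd_lock_release(image);
}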

-- 
Jason



Re: [Qemu-devel] [PATCH for-2.7 v2 07/17] rbd: Implement image locking

2016-04-26 Thread Jason Dillaman
On Sun, Apr 24, 2016 at 7:42 PM, Fam Zheng <f...@redhat.com> wrote:
> On Fri, 04/22 21:57, Jason Dillaman wrote:
>> Since this cannot automatically recover from a crashed QEMU client with an
>> RBD image, perhaps this RBD locking should not default to enabled.
>> Additionally, this will conflict with the "exclusive-lock" feature
>> available since the Ceph Hammer-release since both utilize the same locking
>> construct.
>>
>> As a quick background, the optional exclusive-lock feature can be
>> enabled/disabled on image and safely/automatically handles the case of
>> recovery from a crashed client.  Under normal conditions, the RBD
>> exclusive-lock feature automatically acquires the lock upon the first
>> attempt to write to the image and transparently transitions ownership of
>> the lock between two or more clients -- used for QEMU live-migration.
>
> Is it enabled by default?
>

Starting with the Jewel release of Ceph it is enabled by default.

>>
>> I'd be happy to add a new librbd API method to explicitly expose acquiring
>> and releasing the RBD exclusive lock since it certainly solves a couple
>> compromises in our current QEMU integration.
>
> Does the API do enable/disable or acquire/relase? Could you explain the
> differences between it and rbd_lock_exclusive?

There is already an API for enabling/disabling the exclusive-lock
feature (and associated CLI tooling).  This would be a new API method
to explicitly acquire / release the exclusive-lock (instead of
implicitly acquiring the lock upon first write request as it currently
does in order to support QEMU live-migration).

The rbd_lock_exclusive/rbd_lock_shared are essentially advisory locks.
Nothing stops a client from accessing the image if it doesn't attempt
to acquire the lock (even if another client already has the image
locked exclusively).  It also doesn't support automatic recovery from
failed clients.
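
In practice, cleaning up after a crashed client that still holds one of these
advisory locks currently means manual intervention with the rbd CLI, e.g.
(pool/image names, lock id and locker are placeholders):

  rbd lock list mypool/myimage
  rbd lock remove mypool/myimage <lock-id> <locker>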

-- 
Jason



Re: [Qemu-devel] [PATCH for-2.7 v2 07/17] rbd: Implement image locking

2016-04-22 Thread Jason Dillaman
Since this cannot automatically recover from a crashed QEMU client with an
RBD image, perhaps this RBD locking should not default to enabled.
Additionally, this will conflict with the "exclusive-lock" feature
available since the Ceph Hammer-release since both utilize the same locking
construct.

As a quick background, the optional exclusive-lock feature can be
enabled/disabled on image and safely/automatically handles the case of
recovery from a crashed client.  Under normal conditions, the RBD
exclusive-lock feature automatically acquires the lock upon the first
attempt to write to the image and transparently transitions ownership of
the lock between two or more clients -- used for QEMU live-migration.

I'd be happy to add a new librbd API method to explicitly expose acquiring
and releasing the RBD exclusive lock since it certainly solves a couple
compromises in our current QEMU integration.

On Thu, Apr 14, 2016 at 11:27 PM, Fam Zheng  wrote:

> librbd has the locking API that can be used to implement .bdrv_lockf.
>
> Signed-off-by: Fam Zheng 
> ---
>  block/rbd.c | 25 +
>  1 file changed, 25 insertions(+)
>
> diff --git a/block/rbd.c b/block/rbd.c
> index 5bc5b32..a495083 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -810,6 +810,30 @@ static int qemu_rbd_truncate(BlockDriverState *bs,
> int64_t offset)
>  return 0;
>  }
>
> +static int qemu_rbd_lockf(BlockDriverState *bs, BdrvLockfCmd cmd)
> +{
> +int ret;
> +BDRVRBDState *s = bs->opaque;
> +
> +/* XXX: RBD locks are not released automatically when program exits, which
> + * means if QEMU dies it cannot open the image next time until manually
> + * unlocked. */
> +switch (cmd) {
> +case BDRV_LOCKF_RWLOCK:
> +ret = rbd_lock_exclusive(s->image, NULL);
> +break;
> +case BDRV_LOCKF_ROLOCK:
> +ret = rbd_lock_shared(s->image, NULL, NULL);
> +break;
> +case BDRV_LOCKF_UNLOCK:
> +ret = rbd_unlock(s->image, NULL);
> +break;
> +default:
> +abort();
> +}
> +return ret;
> +}
> +
>  static int qemu_rbd_snap_create(BlockDriverState *bs,
>  QEMUSnapshotInfo *sn_info)
>  {
> @@ -998,6 +1022,7 @@ static BlockDriver bdrv_rbd = {
>  .bdrv_aio_discard   = qemu_rbd_aio_discard,
>  #endif
>
> +.bdrv_lockf = qemu_rbd_lockf,
>  .bdrv_snapshot_create   = qemu_rbd_snap_create,
>  .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
>  .bdrv_snapshot_list = qemu_rbd_snap_list,
> --
> 2.8.0
>
>
>

-- 

Jason


Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop is hanging forever

2015-11-09 Thread Jason Dillaman
Can you reproduce with Ceph debug logging enabled (i.e. debug rbd=20 in your 
ceph.conf)?  If you could attach the log to the Ceph tracker ticket I opened 
[1], that would be very helpful.

[1] http://tracker.ceph.com/issues/13726
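
For reference, enabling that logging on the hypervisor usually just means
adding something like the following to ceph.conf (the log path is only an
example):

[client]
    debug rbd = 20
    log file = /var/log/ceph/qemu-rbd.$pid.log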

Thanks,
Jason 


- Original Message -
> From: "Alexandre DERUMIER" 
> To: "ceph-devel" 
> Cc: "qemu-devel" , jdur...@redhat.com
> Sent: Monday, November 9, 2015 5:48:45 AM
> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and 
> vm_stop is hanging forever
> 
> adding to ceph.conf
> 
> [client]
> rbd_non_blocking_aio = false
> 
> 
> fix the problem for me (with rbd_cache=false)
> 
> 
> (@cc jdur...@redhat.com)
> 
> 
> 
> - Original Message -
> From: "Denis V. Lunev" 
> To: "aderumier" , "ceph-devel"
> , "qemu-devel" 
> Sent: Monday, November 9, 2015 08:22:34
> Subject: Re: [Qemu-devel] qemu : rbd block driver internal snapshot and vm_stop
> is hanging forever
> 
> On 11/09/2015 10:19 AM, Denis V. Lunev wrote:
> > On 11/09/2015 06:10 AM, Alexandre DERUMIER wrote:
> >> Hi,
> >> 
> >> with qemu (2.4.1), if I do an internal snapshot of an rbd device,
> >> then I pause the vm with vm_stop,
> >> 
> >> the qemu process is hanging forever
> >> 
> >> 
> >> monitor commands to reproduce:
> >> 
> >> 
> >> # snapshot_blkdev_internal drive-virtio0 yoursnapname
> >> # stop
> >> 
> >> 
> >> 
> >> 
> >> I don't see this with qcow2 or sheepdog block driver for example.
> >> 
> >> 
> >> Regards,
> >> 
> >> Alexandre
> >> 
> > this could look like the problem I have recently been trying to
> > fix with dataplane enabled. The patch series is named
> > 
> > [PATCH for 2.5 v6 0/10] dataplane snapshot fixes
> > 
> > Den
> 
> anyway, even if the above does not help, can you collect gdb
> traces from all threads in the QEMU process? Maybe I'll be
> able to give a hint.
> 
> Den
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>



Re: [Qemu-devel] is there a limit on the number of in-flight I/O operations?

2015-09-09 Thread Jason Dillaman
>> Bumping this...
>>
>> For now, we are rarely suffering with an unlimited cache growth issue
>> which can be observed on all post-1.4 versions of qemu with rbd
>> backend in a writeback mode and certain pattern of a guest operations.
>> The issue is confirmed for virtio and can be re-triggered by issuing
>> excessive amount of write requests without completing returned acks
>> from a emulator` cache timely. Since most applications behave in a
>> right way, the oom issue is very rare (and we developed an ugly
>> workaround for such situations long ago). If anybody is interested in
>> fixing this, I can send a prepared image for a reproduction or
>> instructions to make one, whichever is preferable.
>>
>> Thanks!
>
>A gentle bump: for at least rbd backend with writethrough/writeback
>cache it is possible to achieve unlimited growth with lot of large
>unfinished ops, what can be considered as a DoS. Usually it is
>triggered by poorly written applications in the wild, like proprietary
>KV databases or MSSQL under Windows, but regular applications,
>primarily OSS databases, can trigger the RSS growth for hundreds of
>megabytes just easily. There is probably no straight way to limit
>in-flight request size by re-chunking it, as supposedly malicious
>guest can inflate it up to very high numbers, but it`s fine to crash
>such a guest, saving real-world stuff with simple in-flight op count
>limiter looks like more achievable option.

Any chance you can provide the reproducer VM image via ceph-post-file [1]?  
Using the latest Firefly release with QEMU 2.3.1, I was unable to reproduce 
unlimited growth while hammering the VM with a randwrite fio job with 
iodepth=256, blocksize=4k.   

[1] http://ceph.com/docs/master/man/8/ceph-post-file/
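
For comparison, the non-reproducing job was roughly the following fio
invocation (device path and job name are assumptions):

  fio --name=randwrite --filename=/dev/vdb --ioengine=libaio --direct=1 \
      --readwrite=randwrite --bs=4k --iodepth=256 --runtime=300 --time_based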

-- Jason