[PATCH] MAINTAINERS: update email of Peter Lieven

2023-01-05 Thread Peter Lieven
I will leave KAMP in the next few days. Update my email address to stay reachable.

Signed-off-by: Peter Lieven 
---
 MAINTAINERS | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index b270eb8e5b..995f1156f9 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3428,7 +3428,7 @@ F: block/vmdk.c

 RBD
 M: Ilya Dryomov 
-R: Peter Lieven 
+R: Peter Lieven 
 L: qemu-bl...@nongnu.org
 S: Supported
 F: block/rbd.c
@@ -3454,7 +3454,7 @@ F: block/blkio.c
 iSCSI
 M: Ronnie Sahlberg 
 M: Paolo Bonzini 
-M: Peter Lieven 
+M: Peter Lieven 
 L: qemu-bl...@nongnu.org
 S: Odd Fixes
 F: block/iscsi.c
@@ -3477,7 +3477,7 @@ T: git https://repo.or.cz/qemu/ericb.git nbd
 T: git https://gitlab.com/vsementsov/qemu.git block

 NFS
-M: Peter Lieven 
+M: Peter Lieven 
 L: qemu-bl...@nongnu.org
 S: Maintained
 F: block/nfs.c
--
2.34.1










Re: [PATCH] block/rbd: fix write zeroes with growing images

2022-03-24 Thread Peter Lieven

On 22.03.22 at 10:38, Hanna Reitz wrote:
> On 21.03.22 09:31, Stefano Garzarella wrote:
>> On Sat, Mar 19, 2022 at 04:15:33PM +0100, Peter Lieven wrote:
>>> On 18.03.2022 at 17:47, Stefano Garzarella wrote:
>>>> On Fri, Mar 18, 2022 at 04:48:18PM +0100, Peter Lieven wrote:
>>>>> On 18.03.2022 at 09:25, Stefano Garzarella wrote:
>>>>>> On Thu, Mar 17, 2022 at 07:27:05PM +0100, Peter Lieven wrote:
>>>>>>> On 17.03.2022 at 17:26, Stefano Garzarella wrote:
>>>>>>>> Commit d24f80234b ("block/rbd: increase dynamically the image size")
>>>>>>>> added a workaround to support growing images (e.g. qcow2), resizing
>>>>>>>> the image before write operations that exceed the current size.
>>>>>>>>
>>>>>>>> We recently added support for write zeroes and without the
>>>>>>>> workaround we can have problems with qcow2.
>>>>>>>>
>>>>>>>> So let's move the resize into qemu_rbd_start_co() and do it when
>>>>>>>> the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.
>>>>>>>>
>>>>>>>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
>>>>>>>> Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
>>>>>>>> Signed-off-by: Stefano Garzarella
>>>>>>>> ---
>>>>>>>> block/rbd.c | 26 ++
>>>>>>>> 1 file changed, 14 insertions(+), 12 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 8f183eba2a..6caf35cbba 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1107,6 +1107,20 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
>>>>>>>>
>>>>>>>>      assert(!qiov || qiov->size == bytes);
>>>>>>>>
>>>>>>>> +    if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
>>>>>>>> +        /*
>>>>>>>> +         * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>>> +         * to support growing images, we resize the image before write
>>>>>>>> +         * operations that exceed the current size.
>>>>>>>> +         */
>>>>>>>> +        if (offset + bytes > s->image_size) {
>>>>>>>> +            int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>>> +            if (r < 0) {
>>>>>>>> +                return r;
>>>>>>>> +            }
>>>>>>>> +        }
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>>      r = rbd_aio_create_completion(,
>>>>>>>>                                    (rbd_callback_t) qemu_rbd_completion_cb, );
>>>>>>>>      if (r < 0) {
>>>>>>>> @@ -1182,18 +1196,6 @@ coroutine_fn qemu_rbd_co_pwritev(BlockDriverState *bs, int64_t offset,
>>>>>>>>                                   int64_t bytes, QEMUIOVector *qiov,
>>>>>>>>                                   BdrvRequestFlags flags)
>>>>>>>>  {
>>>>>>>> -    BDRVRBDState *s = bs->opaque;
>>>>>>>> -    /*
>>>>>>>> -     * RBD APIs don't allow us to write more than actual size, so in order
>>>>>>>> -     * to support growing images, we resize the image before write
>>>>>>>> -     * operations that exceed the current size.
>>>>>>>> -     */
>>>>>>>> -    if (offset + bytes > s->image_size) {
>>>>>>>> -        int r = qemu_rbd_resize(bs, offset + bytes);
>>>>>>>> -        if (r < 0) {
>>>>>>>> -            return r;
>>>>>>>> -        }
>>>>>>>> -    }
>>>>>>>>      return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
>>>>>>>>  }
>>>>>>>>
>>>>>>>> --
>>>>>>>> 2.35.1
>>>>>>>
>>>>>>> Do we really have a use case for growing rbd images?
>>>>>>
>>>>>> The use case is to have a qcow2 image on rbd.
>>>>>> I don't think it's very common, but some people use it and here [1] we had
>>>>>> a little discussion about features that could be interesting (e.g.
>>>>>> persistent dirty bitmaps for incremental backup).
>>>>>>
>>>>>> In any case the support is quite simple and does not affect other use
>>>>>> cases since we only increase the size when we go beyond the current size.
>>>>>>
>>>>>> IMHO we can have it in :-)
>>>>>
>>>>> The QCOW2 alone doesn’t make much sense, but additional metadata might be
>>>>> a use case.
>>>>
>>>> Yep.
>>>>
>>>>> Be aware that the current approach will serialize requests. If there is a
>>>>> real use case, we might think of a better solution.
>>>>
>>>> Good point, but it only happens when we have to resize, so maybe it's okay
>>>> for now, but I agree we could do better ;-)
>>>
>>> There might also be a problem if a write for a higher offset past eof will
>>> be executed shortly before a write to a slightly lower offset past eof. The
>>> second resize will fail as it would shrink the image. We would need proper
>>> locking to avoid this. Maybe we need to check if we write past eof. If yes,
>>> take a lock around the resize op and then check again if it’s still eof and
>>> only resize if true.
>>
>> I thought rbd_resize() was synchronous. Indeed when you said this could
>> serialize writes it sounded like confirmation to me.
>>
>> Since we call rbd_resize() before rbd_aio_writev(), I thought this case
>> could not occur.
>>
>> Can you please elaborate?
>
> Seconding this request, because if rbd_resize() is allowed to shrink data, it
> being asynchronous might cause data corruption.
>
> I’ll keep your patch because I find this highly unlikely, though:
> qemu_rbd_resize() itself is definitely synchronous, it can’t invoke
> qemu_coroutine_yield().
>
> The only other possibility that comes to my mind is that rbd_resize() might
> delay the actual resize operation, but I would still expect consecutive
> resize requests to be executed in order, and since we call
> rbd_aio_writev()/rbd_aio_write_zeroes() immediately after the rbd_resize()
> (with no yielding in between), everything should be executed in the order
> that we expect.

Maybe my assumption of parallelism here was wrong. I was thinking of:

Request A: write at offset (EOF + 4k).


Re: [PATCH] block/rbd: fix write zeroes with growing images

2022-03-19 Thread Peter Lieven



> On 18.03.2022 at 17:47, Stefano Garzarella wrote:
> 
> On Fri, Mar 18, 2022 at 04:48:18PM +0100, Peter Lieven wrote:
>> 
>> 
>>>> On 18.03.2022 at 09:25, Stefano Garzarella wrote:
>>> 
>>> On Thu, Mar 17, 2022 at 07:27:05PM +0100, Peter Lieven wrote:
>>>> 
>>>> 
>>>>>> On 17.03.2022 at 17:26, Stefano Garzarella wrote:
>>>>> 
>>>>> Commit d24f80234b ("block/rbd: increase dynamically the image size")
>>>>> added a workaround to support growing images (eg. qcow2), resizing
>>>>> the image before write operations that exceed the current size.
>>>>> 
>>>>> We recently added support for write zeroes and without the
>>>>> workaround we can have problems with qcow2.
>>>>> 
>>>>> So let's move the resize into qemu_rbd_start_co() and do it when
>>>>> the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.
>>>>> 
>>>>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
>>>>> Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
>>>>> Signed-off-by: Stefano Garzarella 
>>>>> ---
>>>>> block/rbd.c | 26 ++
>>>>> 1 file changed, 14 insertions(+), 12 deletions(-)
>>>>> 
>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>> index 8f183eba2a..6caf35cbba 100644
>>>>> --- a/block/rbd.c
>>>>> +++ b/block/rbd.c
>>>>> @@ -1107,6 +1107,20 @@ static int coroutine_fn 
>>>>> qemu_rbd_start_co(BlockDriverState *bs,
>>>>> 
>>>>>   assert(!qiov || qiov->size == bytes);
>>>>> 
>>>>> +if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
>>>>> +/*
>>>>> + * RBD APIs don't allow us to write more than actual size, so in 
>>>>> order
>>>>> + * to support growing images, we resize the image before write
>>>>> + * operations that exceed the current size.
>>>>> + */
>>>>> +if (offset + bytes > s->image_size) {
>>>>> +int r = qemu_rbd_resize(bs, offset + bytes);
>>>>> +if (r < 0) {
>>>>> +return r;
>>>>> +}
>>>>> +}
>>>>> +}
>>>>> +
>>>>>   r = rbd_aio_create_completion(,
>>>>> (rbd_callback_t) qemu_rbd_completion_cb, 
>>>>> );
>>>>>   if (r < 0) {
>>>>> @@ -1182,18 +1196,6 @@ coroutine_fn qemu_rbd_co_pwritev(BlockDriverState 
>>>>> *bs, int64_t offset,
>>>>>int64_t bytes, QEMUIOVector *qiov,
>>>>>BdrvRequestFlags flags)
>>>>> {
>>>>> -BDRVRBDState *s = bs->opaque;
>>>>> -/*
>>>>> - * RBD APIs don't allow us to write more than actual size, so in 
>>>>> order
>>>>> - * to support growing images, we resize the image before write
>>>>> - * operations that exceed the current size.
>>>>> - */
>>>>> -if (offset + bytes > s->image_size) {
>>>>> -int r = qemu_rbd_resize(bs, offset + bytes);
>>>>> -if (r < 0) {
>>>>> -return r;
>>>>> -}
>>>>> -}
>>>>>   return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
>>>>> }
>>>>> 
>>>>> --
>>>>> 2.35.1
>>>>> 
>>>> 
>>>> Do we really have a use case for growing rbd images?
>>> 
>>> The use case is to have a qcow2 image on rbd.
>>> I don't think it's very common, but some people use it and here [1] we had 
>>> a little discussion about features that could be interesting (e.g.  
>>> persistent dirty bitmaps for incremental backup).
>>> 
>>> In any case the support is quite simple and does not affect other use cases 
>>> since we only increase the size when we go beyond the current size.
>>> 
>>> IMHO we can have it in :-)
>>> 
>> 
>> The QCOW2 alone doesn’t make much sense, but additional metadata might be a 
>> use case.
> 
> Yep.
> 
>> Be aware that the current approach will serialize requests. If there is a 
>> real use case, we might think of a better solution.
> 
> Good point, but it only happens when we have to resize, so maybe it's okay 
> for now, but I agree we could do better ;-)

There might also be a problem if a write to a higher offset past EOF is
executed shortly before a write to a slightly lower offset past EOF. The second
resize would fail because it would shrink the image. We would need proper
locking to avoid this. Maybe we need to check whether we write past EOF; if so,
take a lock around the resize operation, then check again whether the write is
still past EOF and only resize if it is.

Peter
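
The check, lock, re-check idea described above could look roughly like the
following sketch. It is not QEMU code: image_t, backend_resize() and
grow_for_write() are hypothetical names, and the pthread mutex only stands in
for whatever lock the block driver would really use. The point is that the size
is compared again under the lock, so a request for a lower offset can never
shrink an image that a concurrent request has already grown.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver state; not the real BDRVRBDState. */
typedef struct {
    pthread_mutex_t resize_lock;
    uint64_t image_size;    /* current image size in bytes */
    unsigned resize_calls;  /* counts simulated backend resizes */
} image_t;

/* Simulated backend resize; a real driver would call into librbd here. */
static int backend_resize(image_t *img, uint64_t new_size)
{
    img->image_size = new_size;
    img->resize_calls++;
    return 0;
}

/*
 * Make sure [offset, offset + bytes) fits into the image, growing it if
 * needed. The image is never shrunk. The unlocked first check is a
 * simplification that assumes all callers run in one thread (as coroutines
 * would); truly parallel callers would need an atomic read here.
 */
static int grow_for_write(image_t *img, uint64_t offset, uint64_t bytes)
{
    uint64_t end = offset + bytes;
    int r = 0;

    if (end <= img->image_size) {
        return 0;                    /* fast path: write is inside the image */
    }

    pthread_mutex_lock(&img->resize_lock);
    if (end > img->image_size) {     /* re-check: someone may have grown it */
        r = backend_resize(img, end);
    }
    pthread_mutex_unlock(&img->resize_lock);
    return r;
}
```

With a 4 KiB image, a write at offset 8 KiB grows the image to 12 KiB, and a
later write at offset 4 KiB leaves the size untouched instead of shrinking it.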

> 
> Thanks,
> Stefano
> 





Re: [PATCH] block/rbd: fix write zeroes with growing images

2022-03-18 Thread Peter Lieven



> On 18.03.2022 at 09:25, Stefano Garzarella wrote:
> 
> On Thu, Mar 17, 2022 at 07:27:05PM +0100, Peter Lieven wrote:
>> 
>> 
>>>> On 17.03.2022 at 17:26, Stefano Garzarella wrote:
>>> 
>>> Commit d24f80234b ("block/rbd: increase dynamically the image size")
>>> added a workaround to support growing images (eg. qcow2), resizing
>>> the image before write operations that exceed the current size.
>>> 
>>> We recently added support for write zeroes and without the
>>> workaround we can have problems with qcow2.
>>> 
>>> So let's move the resize into qemu_rbd_start_co() and do it when
>>> the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.
>>> 
>>> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
>>> Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
>>> Signed-off-by: Stefano Garzarella 
>>> ---
>>> block/rbd.c | 26 ++
>>> 1 file changed, 14 insertions(+), 12 deletions(-)
>>> 
>>> diff --git a/block/rbd.c b/block/rbd.c
>>> index 8f183eba2a..6caf35cbba 100644
>>> --- a/block/rbd.c
>>> +++ b/block/rbd.c
>>> @@ -1107,6 +1107,20 @@ static int coroutine_fn 
>>> qemu_rbd_start_co(BlockDriverState *bs,
>>> 
>>>assert(!qiov || qiov->size == bytes);
>>> 
>>> +if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
>>> +/*
>>> + * RBD APIs don't allow us to write more than actual size, so in 
>>> order
>>> + * to support growing images, we resize the image before write
>>> + * operations that exceed the current size.
>>> + */
>>> +if (offset + bytes > s->image_size) {
>>> +int r = qemu_rbd_resize(bs, offset + bytes);
>>> +if (r < 0) {
>>> +return r;
>>> +}
>>> +}
>>> +}
>>> +
>>>r = rbd_aio_create_completion(,
>>>  (rbd_callback_t) qemu_rbd_completion_cb, 
>>> );
>>>if (r < 0) {
>>> @@ -1182,18 +1196,6 @@ coroutine_fn qemu_rbd_co_pwritev(BlockDriverState 
>>> *bs, int64_t offset,
>>> int64_t bytes, QEMUIOVector *qiov,
>>> BdrvRequestFlags flags)
>>> {
>>> -BDRVRBDState *s = bs->opaque;
>>> -/*
>>> - * RBD APIs don't allow us to write more than actual size, so in order
>>> - * to support growing images, we resize the image before write
>>> - * operations that exceed the current size.
>>> - */
>>> -if (offset + bytes > s->image_size) {
>>> -int r = qemu_rbd_resize(bs, offset + bytes);
>>> -if (r < 0) {
>>> -return r;
>>> -}
>>> -}
>>>return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
>>> }
>>> 
>>> --
>>> 2.35.1
>>> 
>> 
>> Do we really have a use case for growing rbd images?
> 
> The use case is to have a qcow2 image on rbd.
> I don't think it's very common, but some people use it and here [1] we had a 
> little discussion about features that could be interesting (e.g.  persistent 
> dirty bitmaps for incremental backup).
> 
> In any case the support is quite simple and does not affect other use cases 
> since we only increase the size when we go beyond the current size.
> 
> IMHO we can have it in :-)
> 

The QCOW2 alone doesn’t make much sense, but additional metadata might be a use 
case.
Be aware that the current approach will serialize requests. If there is a real 
use case, we might think of a better solution.

Peter

> Thanks,
> Stefano
> 
> [1] https://lore.kernel.org/all/20190415080452.GA6031@localhost.localdomain/
> 





Re: [PATCH] block/rbd: fix write zeroes with growing images

2022-03-17 Thread Peter Lieven



> On 17.03.2022 at 17:26, Stefano Garzarella wrote:
> 
> Commit d24f80234b ("block/rbd: increase dynamically the image size")
> added a workaround to support growing images (eg. qcow2), resizing
> the image before write operations that exceed the current size.
> 
> We recently added support for write zeroes and without the
> workaround we can have problems with qcow2.
> 
> So let's move the resize into qemu_rbd_start_co() and do it when
> the command is RBD_AIO_WRITE or RBD_AIO_WRITE_ZEROES.
> 
> Buglink: https://bugzilla.redhat.com/show_bug.cgi?id=2020993
> Fixes: c56ac27d2a ("block/rbd: add write zeroes support")
> Signed-off-by: Stefano Garzarella 
> ---
> block/rbd.c | 26 ++
> 1 file changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/block/rbd.c b/block/rbd.c
> index 8f183eba2a..6caf35cbba 100644
> --- a/block/rbd.c
> +++ b/block/rbd.c
> @@ -1107,6 +1107,20 @@ static int coroutine_fn 
> qemu_rbd_start_co(BlockDriverState *bs,
> 
> assert(!qiov || qiov->size == bytes);
> 
> +if (cmd == RBD_AIO_WRITE || cmd == RBD_AIO_WRITE_ZEROES) {
> +/*
> + * RBD APIs don't allow us to write more than actual size, so in 
> order
> + * to support growing images, we resize the image before write
> + * operations that exceed the current size.
> + */
> +if (offset + bytes > s->image_size) {
> +int r = qemu_rbd_resize(bs, offset + bytes);
> +if (r < 0) {
> +return r;
> +}
> +}
> +}
> +
> r = rbd_aio_create_completion(,
>   (rbd_callback_t) qemu_rbd_completion_cb, 
> );
> if (r < 0) {
> @@ -1182,18 +1196,6 @@ coroutine_fn qemu_rbd_co_pwritev(BlockDriverState *bs, 
> int64_t offset,
>  int64_t bytes, QEMUIOVector *qiov,
>  BdrvRequestFlags flags)
> {
> -BDRVRBDState *s = bs->opaque;
> -/*
> - * RBD APIs don't allow us to write more than actual size, so in order
> - * to support growing images, we resize the image before write
> - * operations that exceed the current size.
> - */
> -if (offset + bytes > s->image_size) {
> -int r = qemu_rbd_resize(bs, offset + bytes);
> -if (r < 0) {
> -return r;
> -}
> -}
> return qemu_rbd_start_co(bs, offset, bytes, qiov, flags, RBD_AIO_WRITE);
> }
> 
> -- 
> 2.35.1
> 

Do we really have a use case for growing rbd images?

Peter
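
For readers skimming the thread, the change in the quoted patch can be
summarised by the sketch below: the "grow before writing past the end" check
lives in the common submission path and applies to both write-like commands.
This is a simplified model, not the driver itself. cmd_t, image_t and
start_request() are hypothetical, the resize is simulated by an assignment, and
rejecting out-of-range reads with -1 is an assumption made only to keep the
example complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical command set, loosely mirroring the RBD_AIO_* names. */
typedef enum {
    CMD_READ,
    CMD_WRITE,
    CMD_DISCARD,
    CMD_FLUSH,
    CMD_WRITE_ZEROES,
} cmd_t;

/* Hypothetical image state; only the size matters for this sketch. */
typedef struct {
    uint64_t image_size;
} image_t;

/* True for the commands that are allowed to extend the image. */
static bool cmd_can_grow(cmd_t cmd)
{
    return cmd == CMD_WRITE || cmd == CMD_WRITE_ZEROES;
}

/*
 * Common submission path: grow the image first if the request ends beyond
 * the current size, then issue the request. Returns 0 on success and -1
 * (an assumption of this sketch) for out-of-range non-growing requests.
 */
static int start_request(image_t *img, cmd_t cmd,
                         uint64_t offset, uint64_t bytes)
{
    uint64_t end = offset + bytes;

    if (end > img->image_size) {
        if (!cmd_can_grow(cmd)) {
            return -1;
        }
        img->image_size = end;  /* stands in for the real resize call */
    }

    /* ... the actual I/O would be submitted here ... */
    return 0;
}
```

Because the check sits in one place, a write-zeroes request past the end grows
the image exactly like a plain write does.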




Re: [PATCH V2 for-6.2 0/2] fixes for bdrv_co_block_status

2022-02-03 Thread Peter Lieven
On 01.02.22 at 15:39, Kevin Wolf wrote:
> On 13.01.2022 at 15:44, Peter Lieven wrote:
>> V1->V2:
>>  Patch 1: Treat a hole just like an unallocated area. [Ilya]
>>  Patch 2: Apply workaround only for pre-Quincy librbd versions and
>>   ensure default striping and non child images. [Ilya]
>>
>> Peter Lieven (2):
>>   block/rbd: fix handling of holes in .bdrv_co_block_status
>>   block/rbd: workaround for ceph issue #53784
> Thanks, applied to the block branch.
>
> Kevin
>
Hi Kevin,


Thanks for taking care of this. I was out of the office for a few days.

@Stefano: it seems Kevin has addressed your comments, which should have gone
into a V3.


Best,

Peter





Re: [PATCH V2 for-6.2 0/2] fixes for bdrv_co_block_status

2022-01-20 Thread Peter Lieven
On 19.01.22 at 15:57, Stefano Garzarella wrote:
> On Fri, Jan 14, 2022 at 11:58:40AM +0100, Ilya Dryomov wrote:
>> On Thu, Jan 13, 2022 at 3:44 PM Peter Lieven  wrote:
>>>
>>> V1->V2:
>>>  Patch 1: Treat a hole just like an unallocated area. [Ilya]
>>>  Patch 2: Apply workaround only for pre-Quincy librbd versions and
>>>   ensure default striping and non child images. [Ilya]
>>>
>>> Peter Lieven (2):
>>>   block/rbd: fix handling of holes in .bdrv_co_block_status
>>>   block/rbd: workaround for ceph issue #53784
>>>
>>>  block/rbd.c | 52 +---
>>>  1 file changed, 45 insertions(+), 7 deletions(-)
>>>
>>> -- 
>>> 2.25.1
>>>
>>>
>>
>> These patches have both "for-6.2" in the subject and
>> Cc: qemu-sta...@nongnu.org in the description, which is a little
>> confusing.  Just want to clarify that they should go into master
>> and be backported to 6.2.
>
> Yeah, a bit confusing. These are for 7.0, so @Kevin can these patches go with 
> your tree?


Yes, sorry, my fault. It should be 7.0.


Peter






[PATCH V2 for-6.2 2/2] block/rbd: workaround for ceph issue #53784

2022-01-13 Thread Peter Lieven
librbd had a bug until early 2022 that affected all versions of ceph that
supported fast-diff. This bug results in reporting of incorrect offsets
if the offset parameter to rbd_diff_iterate2 is not object aligned.

This patch works around this bug for pre-Quincy versions of librbd.

Cc: qemu-sta...@nongnu.org
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 20bb896c4a..d174d51659 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1320,6 +1320,7 @@ static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
     int status, r;
     RBDDiffIterateReq req = { .offs = offset };
     uint64_t features, flags;
+    uint64_t head = 0;
 
     assert(offset + bytes <= s->image_size);
 
@@ -1347,7 +1348,43 @@ static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
         return status;
     }
 
-    r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
+#if LIBRBD_VERSION_CODE < LIBRBD_VERSION(1, 17, 0)
+    /*
+     * librbd had a bug until early 2022 that affected all versions of ceph that
+     * supported fast-diff. This bug results in reporting of incorrect offsets
+     * if the offset parameter to rbd_diff_iterate2 is not object aligned.
+     * Work around this bug by rounding down the offset to object boundaries.
+     * This is OK because we call rbd_diff_iterate2 with whole_object = true.
+     * However, this workaround only works for non cloned images with default
+     * striping.
+     *
+     * See: https://tracker.ceph.com/issues/53784
+     */
+
+    /*  check if RBD image has non-default striping enabled */
+    if (features & RBD_FEATURE_STRIPINGV2) {
+        return status;
+    }
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+    /*
+     * check if RBD image is a clone (= has a parent).
+     *
+     * rbd_get_parent_info is deprecated from Nautilus onwards, but the
+     * replacement rbd_get_parent is not present in Luminous and Mimic.
+     */
+    if (rbd_get_parent_info(s->image, NULL, 0, NULL, 0, NULL, 0) != -ENOENT) {
+        return status;
+    }
+#pragma GCC diagnostic pop
+
+    head = req.offs & (s->object_size - 1);
+    req.offs -= head;
+    bytes += head;
+#endif
+
+    r = rbd_diff_iterate2(s->image, NULL, req.offs, bytes, true, true,
                           qemu_rbd_diff_iterate_cb, );
     if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
         return status;
@@ -1366,7 +1403,8 @@ static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
         status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
     }
 
-    *pnum = req.bytes;
+    assert(req.bytes > head);
+    *pnum = req.bytes - head;
     return status;
 }
 
-- 
2.25.1





[PATCH V2 for-6.2 1/2] block/rbd: fix handling of holes in .bdrv_co_block_status

2022-01-13 Thread Peter Lieven
The assumption that we can't hit a hole if we do not diff against a snapshot
was wrong.

We can see a hole in an image when we diff against base if an older snapshot
of the image exists and we have discarded blocks in the image where the
snapshot has data.

Fix this by simply handling a hole like an unallocated area. There are no
callbacks for unallocated areas, so just bail out if we hit a hole.

Fixes: 0347a8fd4c3faaedf119be04c197804be40a384b
Suggested-by: Ilya Dryomov 
Cc: qemu-sta...@nongnu.org
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index def96292e0..20bb896c4a 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1279,11 +1279,11 @@ static int qemu_rbd_diff_iterate_cb(uint64_t offs, size_t len,
     RBDDiffIterateReq *req = opaque;
 
     assert(req->offs + req->bytes <= offs);
-    /*
-     * we do not diff against a snapshot so we should never receive a callback
-     * for a hole.
-     */
-    assert(exists);
+
+    /* treat a hole like an unallocated area and bail out */
+    if (!exists) {
+        return 0;
+    }
 
     if (!req->exists && offs > req->offs) {
         /*
-- 
2.25.1
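
To illustrate what "treat a hole like an unallocated area" means for the
callback, here is a reduced, standalone model of it. The names extent_req_t,
extent_cb() and STOP_ITERATION are hypothetical, and the bookkeeping around the
hole check (how bytes is accumulated and when iteration stops) is an assumption
about the surrounding logic rather than a copy of the driver. Returning 0 for a
hole leaves the request state untouched, exactly as if no callback had been
made for that range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel returned to stop the iteration early. */
#define STOP_ITERATION 9000

/* Hypothetical request state: the first uniform extent of the range. */
typedef struct {
    uint64_t offs;   /* start of the queried range */
    uint64_t bytes;  /* length of the extent found so far */
    bool exists;     /* whether that extent is allocated */
} extent_req_t;

/* Called once per reported extent, in ascending offset order. */
static int extent_cb(uint64_t offs, size_t len, int exists, void *opaque)
{
    extent_req_t *req = opaque;

    assert(req->offs + req->bytes <= offs);

    /* A hole carries no data: handle it like "no callback at all". */
    if (!exists) {
        return 0;
    }

    if (!req->exists && offs > req->offs) {
        /* Started in an unallocated area and reached the first data. */
        req->bytes = offs - req->offs;
        return STOP_ITERATION;
    }

    if (req->exists && offs > req->offs + req->bytes) {
        /* Started in data and skipped over an unallocated area or hole. */
        return STOP_ITERATION;
    }

    req->bytes += len;
    req->exists = true;
    return 0;
}
```

Feeding it data at 0..4 KiB, a hole at 4..8 KiB and data at 8..12 KiB yields a
single allocated extent of 4 KiB: the hole is skipped silently and the next
data extent ends the iteration.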





[PATCH V2 for-6.2 0/2] fixes for bdrv_co_block_status

2022-01-13 Thread Peter Lieven
V1->V2:
 Patch 1: Treat a hole just like an unallocated area. [Ilya]
 Patch 2: Apply workaround only for pre-Quincy librbd versions and
  ensure default striping and non child images. [Ilya]

Peter Lieven (2):
  block/rbd: fix handling of holes in .bdrv_co_block_status
  block/rbd: workaround for ceph issue #53784

 block/rbd.c | 52 +---
 1 file changed, 45 insertions(+), 7 deletions(-)

-- 
2.25.1
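
The condition mentioned for patch 2 (default striping, no child images) can
be written down as a small predicate. This is only a sketch of the condition
as stated in the review (stripe_unit == object_size && stripe_count == 1,
standalone image); ImageLayout and workaround_is_reliable are hypothetical
names, and how the values are obtained from librbd is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the image properties the workaround depends on. */
typedef struct {
    uint64_t object_size;   /* size of one RADOS object in bytes */
    uint64_t stripe_unit;   /* striping unit in bytes */
    uint64_t stripe_count;  /* number of objects striped over */
    bool has_parent;        /* true if the image is a clone (child image) */
} ImageLayout;

/*
 * Rounding the offset down to an object boundary is only assumed to be
 * reliable for standalone images with the default striping pattern.
 */
static bool workaround_is_reliable(const ImageLayout *layout)
{
    assert(layout != NULL);
    return !layout->has_parent &&
           layout->stripe_unit == layout->object_size &&
           layout->stripe_count == 1;
}
```

If the predicate is false, the suggestion in the review was to behave as if
fast-diff were disabled or invalid, i.e. report everything as allocated.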





Re: [PATCH 1/2] block/rbd: fix handling of holes in .bdrv_co_block_status

2022-01-12 Thread Peter Lieven


> Am 12.01.2022 um 22:06 schrieb Ilya Dryomov :
> 
> On Wed, Jan 12, 2022 at 9:39 PM Peter Lieven  wrote:
>> 
>>> Am 12.01.22 um 10:05 schrieb Ilya Dryomov:
>>> On Mon, Jan 10, 2022 at 12:42 PM Peter Lieven  wrote:
>>>> the assumption that we can't hit a hole if we do not diff against a
>>>> snapshot was wrong.
>>>>
>>>> We can see a hole in an image if we diff against base if there exists an
>>>> older snapshot of the image and we have discarded blocks in the image
>>>> where the snapshot has data.
>>>> 
>>>> Fixes: 0347a8fd4c3faaedf119be04c197804be40a384b
>>>> Cc: qemu-sta...@nongnu.org
>>>> Signed-off-by: Peter Lieven 
>>>> ---
>>>> block/rbd.c | 55 +
>>>> 1 file changed, 34 insertions(+), 21 deletions(-)
>>>> 
>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>> index def96292e0..5e9dc91d81 100644
>>>> --- a/block/rbd.c
>>>> +++ b/block/rbd.c
>>>> @@ -1279,13 +1279,24 @@ static int qemu_rbd_diff_iterate_cb(uint64_t offs, 
>>>> size_t len,
>>>> RBDDiffIterateReq *req = opaque;
>>>> 
>>>> assert(req->offs + req->bytes <= offs);
>>>> -/*
>>>> - * we do not diff against a snapshot so we should never receive a 
>>>> callback
>>>> - * for a hole.
>>>> - */
>>>> -assert(exists);
>>>> 
>>>> -if (!req->exists && offs > req->offs) {
>>>> +if (req->exists && offs > req->offs + req->bytes) {
>>>> +/*
>>>> + * we started in an allocated area and jumped over an unallocated 
>>>> area,
>>>> + * req->bytes contains the length of the allocated area before the
>>>> + * unallocated area. stop further processing.
>>>> + */
>>>> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
>>>> +}
>>>> +if (req->exists && !exists) {
>>>> +/*
>>>> + * we started in an allocated area and reached a hole. req->bytes
>>>> + * contains the length of the allocated area before the hole.
>>>> + * stop further processing.
>>>> + */
>>>> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
>>>> +}
>>>> +if (!req->exists && exists && offs > req->offs) {
>>>> /*
>>>>  * we started in an unallocated area and hit the first allocated
>>>>  * block. req->bytes must be set to the length of the unallocated 
>>>> area
>>>> @@ -1295,17 +1306,19 @@ static int qemu_rbd_diff_iterate_cb(uint64_t offs, 
>>>> size_t len,
>>>> return QEMU_RBD_EXIT_DIFF_ITERATE2;
>>>> }
>>>> 
>>>> -if (req->exists && offs > req->offs + req->bytes) {
>>>> -/*
>>>> - * we started in an allocated area and jumped over an unallocated 
>>>> area,
>>>> - * req->bytes contains the length of the allocated area before the
>>>> - * unallocated area. stop further processing.
>>>> - */
>>>> -return QEMU_RBD_EXIT_DIFF_ITERATE2;
>>>> -}
>>>> +/*
>>>> + * assert that we caught all cases above and allocation state has not
>>>> + * changed during callbacks.
>>>> + */
>>>> +assert(exists == req->exists || !req->bytes);
>>>> +req->exists = exists;
>>>> 
>>>> -req->bytes += len;
>>>> -req->exists = true;
>>>> +/*
>>>> + * assert that we either return an unallocated block or have got 
>>>> callbacks
>>>> + * for all allocated blocks present.
>>>> + */
>>>> +assert(!req->exists || offs == req->offs + req->bytes);
>>>> +req->bytes = offs + len - req->offs;
>>>> 
>>>> return 0;
>>>> }
>>>> @@ -1354,13 +1367,13 @@ static int coroutine_fn 
>>>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>>> }
>>>> assert(req.bytes <= bytes);
>>>> if (!req.exists) {
>>>> -if (r == 0) {
>>>>

Re: [PATCH 1/2] block/rbd: fix handling of holes in .bdrv_co_block_status

2022-01-12 Thread Peter Lieven
Am 12.01.22 um 10:05 schrieb Ilya Dryomov:
> On Mon, Jan 10, 2022 at 12:42 PM Peter Lieven  wrote:
>> the assumption that we can't hit a hole if we do not diff against a
>> snapshot was wrong.
>>
>> We can see a hole in an image if we diff against base if there exists an
>> older snapshot of the image and we have discarded blocks in the image
>> where the snapshot has data.
>>
>> Fixes: 0347a8fd4c3faaedf119be04c197804be40a384b
>> Cc: qemu-sta...@nongnu.org
>> Signed-off-by: Peter Lieven 
>> ---
>>  block/rbd.c | 55 +
>>  1 file changed, 34 insertions(+), 21 deletions(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index def96292e0..5e9dc91d81 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -1279,13 +1279,24 @@ static int qemu_rbd_diff_iterate_cb(uint64_t offs, 
>> size_t len,
>>  RBDDiffIterateReq *req = opaque;
>>
>>  assert(req->offs + req->bytes <= offs);
>> -/*
>> - * we do not diff against a snapshot so we should never receive a 
>> callback
>> - * for a hole.
>> - */
>> -assert(exists);
>>
>> -if (!req->exists && offs > req->offs) {
>> +if (req->exists && offs > req->offs + req->bytes) {
>> +/*
>> + * we started in an allocated area and jumped over an unallocated 
>> area,
>> + * req->bytes contains the length of the allocated area before the
>> + * unallocated area. stop further processing.
>> + */
>> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
>> +}
>> +if (req->exists && !exists) {
>> +/*
>> + * we started in an allocated area and reached a hole. req->bytes
>> + * contains the length of the allocated area before the hole.
>> + * stop further processing.
>> + */
>> +return QEMU_RBD_EXIT_DIFF_ITERATE2;
>> +}
>> +if (!req->exists && exists && offs > req->offs) {
>>  /*
>>   * we started in an unallocated area and hit the first allocated
>>   * block. req->bytes must be set to the length of the unallocated 
>> area
>> @@ -1295,17 +1306,19 @@ static int qemu_rbd_diff_iterate_cb(uint64_t offs, 
>> size_t len,
>>  return QEMU_RBD_EXIT_DIFF_ITERATE2;
>>  }
>>
>> -if (req->exists && offs > req->offs + req->bytes) {
>> -/*
>> - * we started in an allocated area and jumped over an unallocated 
>> area,
>> - * req->bytes contains the length of the allocated area before the
>> - * unallocated area. stop further processing.
>> - */
>> -return QEMU_RBD_EXIT_DIFF_ITERATE2;
>> -}
>> +/*
>> + * assert that we caught all cases above and allocation state has not
>> + * changed during callbacks.
>> + */
>> +assert(exists == req->exists || !req->bytes);
>> +req->exists = exists;
>>
>> -req->bytes += len;
>> -req->exists = true;
>> +/*
>> + * assert that we either return an unallocated block or have got 
>> callbacks
>> + * for all allocated blocks present.
>> + */
>> +assert(!req->exists || offs == req->offs + req->bytes);
>> +req->bytes = offs + len - req->offs;
>>
>>  return 0;
>>  }
>> @@ -1354,13 +1367,13 @@ static int coroutine_fn 
>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>  }
>>  assert(req.bytes <= bytes);
>>  if (!req.exists) {
>> -if (r == 0) {
>> +if (r == 0 && !req.bytes) {
>>  /*
>> - * rbd_diff_iterate2 does not invoke callbacks for unallocated
>> - * areas. This here catches the case where no callback was
>> - * invoked at all (req.bytes == 0).
>> + * rbd_diff_iterate2 does not invoke callbacks for unallocated 
>> areas
>> + * except for the case where an overlay has a hole where the 
>> parent
>> + * or an older snapshot of the image has not. This here catches 
>> the
>> + * case where no callback was invoked at all.
>>   */
>> -assert(req.bytes == 0);
>>  req.bytes = bytes;
>>  }
>>  status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
>> --
>> 2.25.1
>>
>>
> Hi Peter,
>
> Can we just skip these "holes" by replacing the existing assert with
> an if statement that would simply bail from the callback on !exists?
>
> Just trying to keep the logic as simple as possible since as it turns
> out we get to contend with ages-old librbd bugs here...


I'm afraid this would not work. Consider qemu-img convert: if we bail out, we
would immediately call get_block_status again with the offset where we stopped
and hit the !exists case again.


Peter




Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784

2022-01-12 Thread Peter Lieven
Am 12.01.22 um 10:59 schrieb Ilya Dryomov:
> On Mon, Jan 10, 2022 at 12:43 PM Peter Lieven  wrote:
>> librbd had a bug until early 2022 that affected all versions of ceph that
>> supported fast-diff. This bug results in reporting of incorrect offsets
>> if the offset parameter to rbd_diff_iterate2 is not object aligned.
>> Work around this bug by rounding down the offset to object boundaries.
>>
>> Fixes: https://tracker.ceph.com/issues/53784
> I don't think the Fixes tag is appropriate here.  Linking librbd
> ticket is fine but this patch doesn't really fix anything.


Okay, I will change that to See:


>
>> Cc: qemu-sta...@nongnu.org
>> Signed-off-by: Peter Lieven 
>> ---
>>  block/rbd.c | 17 -
>>  1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index 5e9dc91d81..260cb9f4b4 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -1333,6 +1333,7 @@ static int coroutine_fn 
>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>  int status, r;
>>  RBDDiffIterateReq req = { .offs = offset };
>>  uint64_t features, flags;
>> +int64_t head;
>>
>>  assert(offset + bytes <= s->image_size);
>>
>> @@ -1360,6 +1361,19 @@ static int coroutine_fn 
>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>  return status;
>>  }
>>
>> +/*
>> + * librbd had a bug until early 2022 that affected all versions of ceph 
>> that
>> + * supported fast-diff. This bug results in reporting of incorrect 
>> offsets
>> + * if the offset parameter to rbd_diff_iterate2 is not object aligned.
>> + * Work around this bug by rounding down the offset to object 
>> boundaries.
>> + *
>> + * See: https://tracker.ceph.com/issues/53784
>> + */
>> +head = offset & (s->object_size - 1);
>> +offset -= head;
>> +req.offs -= head;
>> +bytes += head;
> So it looks like the intention is to have more or less a permanent
> workaround since all librbd versions are affected, right?  For that,
> I think we would need to also reject custom striping patterns and
> clones.  For the above to be reliable, the image has to be standalone
> and have a default striping pattern (stripe_unit == object_size &&
> stripe_count == 1).  Otherwise, behave as if fast-diff is disabled or
> invalid.


Do you have a feeling for how many users use a striping pattern other than the
default?

What about EC backed pools?

Do you have another idea how we can detect if the librbd version is broken?


>
>> +
> Nit: I'd replace { .offs = offset } initialization at the top with {}
> and assign to req.offs here, right before calling rbd_diff_iterate2().
>
>>  r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
>>qemu_rbd_diff_iterate_cb, &req);
>>  if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
>> @@ -1379,7 +1393,8 @@ static int coroutine_fn 
>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>  status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
>>  }
>>
>> -*pnum = req.bytes;
>> +assert(req.bytes > head);
> I'd expand the workaround comment with an explanation of why it's OK
> to round down the offset -- because rbd_diff_iterate2() is called with
> whole_object=true.  If that wasn't the case, on top of inconsistent
> results for different offsets within an object, this assert could be
> triggered.

Sure, you are right. I had this in mind. This also does not change the
complexity, since the offset stays within the same object. I will mention both.


Peter
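
The arithmetic of the workaround discussed above can be checked on its own.
A minimal sketch, assuming the object size is a power of two (which the mask
in the patch relies on); AlignedRange, align_down_to_object and trim_head are
hypothetical helper names, not part of block/rbd.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the range actually passed to the diff call. */
typedef struct {
    int64_t offset;  /* offset rounded down to an object boundary */
    int64_t bytes;   /* length extended by the bytes added in front */
    int64_t head;    /* number of bytes added in front of the request */
} AlignedRange;

/* Round the start of a request down to the containing object boundary. */
static AlignedRange align_down_to_object(int64_t offset, int64_t bytes,
                                         int64_t object_size)
{
    /* the mask below only works for power-of-two object sizes */
    assert(object_size > 0 && (object_size & (object_size - 1)) == 0);

    int64_t head = offset & (object_size - 1);
    AlignedRange r = { offset - head, bytes + head, head };
    return r;
}

/* Strip the added head again from the length reported by the diff. */
static int64_t trim_head(int64_t reported_bytes, int64_t head)
{
    /*
     * With whole_object=true the diff reports whole objects, so the
     * reported extent is expected to reach past the original offset.
     */
    assert(reported_bytes > head);
    return reported_bytes - head;
}
```

For a request at 4 MiB + 4 KiB on an image with 4 MiB objects, head is 4 KiB,
the diff starts at 4 MiB, and a reported extent of one full object is trimmed
to 4 MiB - 4 KiB before it is returned in *pnum.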






Re: [PATCH 2/2] block/rbd: workaround for ceph issue #53784

2022-01-11 Thread Peter Lieven
Am 10.01.22 um 15:18 schrieb Stefano Garzarella:
> On Mon, Jan 10, 2022 at 12:41:54PM +0100, Peter Lieven wrote:
>> librbd had a bug until early 2022 that affected all versions of ceph that
>> supported fast-diff. This bug results in reporting of incorrect offsets
>> if the offset parameter to rbd_diff_iterate2 is not object aligned.
>> Work around this bug by rounding down the offset to object boundaries.
>>
>> Fixes: https://tracker.ceph.com/issues/53784
>> Cc: qemu-sta...@nongnu.org
>> Signed-off-by: Peter Lieven 
>> ---
>> block/rbd.c | 17 -
>> 1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index 5e9dc91d81..260cb9f4b4 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -1333,6 +1333,7 @@ static int coroutine_fn 
>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>     int status, r;
>>     RBDDiffIterateReq req = { .offs = offset };
>>     uint64_t features, flags;
>> +    int64_t head;
>>
>>     assert(offset + bytes <= s->image_size);
>>
>> @@ -1360,6 +1361,19 @@ static int coroutine_fn 
>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>     return status;
>>     }
>>
>> +    /*
>> + * librbd had a bug until early 2022 that affected all versions of ceph 
>> that
>> + * supported fast-diff. This bug results in reporting of incorrect 
>> offsets
>> + * if the offset parameter to rbd_diff_iterate2 is not object aligned.
>> + * Work around this bug by rounding down the offset to object 
>> boundaries.
>> + *
>> + * See: https://tracker.ceph.com/issues/53784
>> + */
>> +    head = offset & (s->object_size - 1);
>> +    offset -= head;
>> +    req.offs -= head;
>> +    bytes += head;
>> +
>>     r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
>>   qemu_rbd_diff_iterate_cb, &req);
>>     if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
>> @@ -1379,7 +1393,8 @@ static int coroutine_fn 
>> qemu_rbd_co_block_status(BlockDriverState *bs,
>>     status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
>>     }
>>
>> -    *pnum = req.bytes;
>> +    assert(req.bytes > head);
>> +    *pnum = req.bytes - head;
>>     return status;
>> }
>
> Thanks for the workaround!
>
> I just tested this patch for the issue reported in this BZ [1] and the test 
> now works correctly!
>
> Tested-by: Stefano Garzarella 
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=2034791
>


Hi Stefano,


Thanks for the feedback. Please note that you also need the other patch, or you
will sooner or later run into another assertion as soon as rbd snapshots are
involved.


Regarding the workaround, I need confirmation from Ilya that it covers all
cases. I do not know whether it works if striping or EC is configured on the
pool.


Best,

Peter






[PATCH 1/2] block/rbd: fix handling of holes in .bdrv_co_block_status

2022-01-10 Thread Peter Lieven
The assumption that we can't hit a hole if we do not diff against a snapshot
was wrong.

A hole can show up when diffing against the base if an older snapshot of the
image exists and blocks have been discarded in the image where the snapshot
still has data.

Fixes: 0347a8fd4c3faaedf119be04c197804be40a384b
Cc: qemu-sta...@nongnu.org
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 55 +
 1 file changed, 34 insertions(+), 21 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index def96292e0..5e9dc91d81 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1279,13 +1279,24 @@ static int qemu_rbd_diff_iterate_cb(uint64_t offs, 
size_t len,
 RBDDiffIterateReq *req = opaque;
 
 assert(req->offs + req->bytes <= offs);
-/*
- * we do not diff against a snapshot so we should never receive a callback
- * for a hole.
- */
-assert(exists);
 
-if (!req->exists && offs > req->offs) {
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (!req->exists && exists && offs > req->offs) {
 /*
  * we started in an unallocated area and hit the first allocated
  * block. req->bytes must be set to the length of the unallocated area
@@ -1295,17 +1306,19 @@ static int qemu_rbd_diff_iterate_cb(uint64_t offs, 
size_t len,
 return QEMU_RBD_EXIT_DIFF_ITERATE2;
 }
 
-if (req->exists && offs > req->offs + req->bytes) {
-/*
- * we started in an allocated area and jumped over an unallocated area,
- * req->bytes contains the length of the allocated area before the
- * unallocated area. stop further processing.
- */
-return QEMU_RBD_EXIT_DIFF_ITERATE2;
-}
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
 
-req->bytes += len;
-req->exists = true;
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
 
 return 0;
 }
@@ -1354,13 +1367,13 @@ static int coroutine_fn 
qemu_rbd_co_block_status(BlockDriverState *bs,
 }
 assert(req.bytes <= bytes);
 if (!req.exists) {
-if (r == 0) {
+if (r == 0 && !req.bytes) {
 /*
- * rbd_diff_iterate2 does not invoke callbacks for unallocated
- * areas. This here catches the case where no callback was
- * invoked at all (req.bytes == 0).
+ * rbd_diff_iterate2 does not invoke callbacks for unallocated 
areas
+ * except for the case where an overlay has a hole where the parent
+ * or an older snapshot of the image has not. This here catches the
+ * case where no callback was invoked at all.
  */
-assert(req.bytes == 0);
 req.bytes = bytes;
 }
 status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
-- 
2.25.1





[PATCH 0/2] block/rbd: fixes for bdrv_co_block_status

2022-01-10 Thread Peter Lieven
Peter Lieven (2):
  block/rbd: fix handling of holes in .bdrv_co_block_status
  block/rbd: workaround for ceph issue #53784

 block/rbd.c | 72 +
 1 file changed, 50 insertions(+), 22 deletions(-)

-- 
2.25.1





[PATCH 2/2] block/rbd: workaround for ceph issue #53784

2022-01-10 Thread Peter Lieven
librbd had a bug until early 2022 that affected all versions of ceph that
supported fast-diff. This bug results in reporting of incorrect offsets
if the offset parameter to rbd_diff_iterate2 is not object aligned.
Work around this bug by rounding down the offset to object boundaries.

Fixes: https://tracker.ceph.com/issues/53784
Cc: qemu-sta...@nongnu.org
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 5e9dc91d81..260cb9f4b4 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1333,6 +1333,7 @@ static int coroutine_fn 
qemu_rbd_co_block_status(BlockDriverState *bs,
 int status, r;
 RBDDiffIterateReq req = { .offs = offset };
 uint64_t features, flags;
+int64_t head;
 
 assert(offset + bytes <= s->image_size);
 
@@ -1360,6 +1361,19 @@ static int coroutine_fn 
qemu_rbd_co_block_status(BlockDriverState *bs,
 return status;
 }
 
+/*
+ * librbd had a bug until early 2022 that affected all versions of ceph 
that
+ * supported fast-diff. This bug results in reporting of incorrect offsets
+ * if the offset parameter to rbd_diff_iterate2 is not object aligned.
+ * Work around this bug by rounding down the offset to object boundaries.
+ *
+ * See: https://tracker.ceph.com/issues/53784
+ */
+head = offset & (s->object_size - 1);
+offset -= head;
+req.offs -= head;
+bytes += head;
+
 r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
   qemu_rbd_diff_iterate_cb, &req);
 if (r < 0 && r != QEMU_RBD_EXIT_DIFF_ITERATE2) {
@@ -1379,7 +1393,8 @@ static int coroutine_fn 
qemu_rbd_co_block_status(BlockDriverState *bs,
 status = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID;
 }
 
-*pnum = req.bytes;
+assert(req.bytes > head);
+*pnum = req.bytes - head;
 return status;
 }
 
-- 
2.25.1





Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2022-01-08 Thread Peter Lieven
Am 06.01.22 um 17:01 schrieb Ilya Dryomov:
> On Thu, Jan 6, 2022 at 4:27 PM Peter Lieven  wrote:
>> Am 05.10.21 um 10:36 schrieb Ilya Dryomov:
>>> On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:
>>>> Am 05.10.21 um 09:54 schrieb Ilya Dryomov:
>>>>> On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:
>>>>>> the qemu rbd driver currently lacks support for bdrv_co_block_status.
>>>>>> This results mainly in incorrect progress during block operations (e.g.
>>>>>> qemu-img convert with an rbd image as source).
>>>>>>
>>>>>> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
>>>>>> allocated and unallocated (all zero areas).
>>>>>>
>>>>>> To avoid querying the ceph OSDs for the answer this is only done if
>>>>>> the image has the fast-diff feature which depends on the object-map and
>>>>>> exclusive-lock features. In this case it is guaranteed that the 
>>>>>> information
>>>>>> is present in memory in the librbd client and thus very fast.
>>>>>>
>>>>>> If fast-diff is not available all areas are reported to be allocated
>>>>>> which is the current behaviour if bdrv_co_block_status is not 
>>>>>> implemented.
>>>>>>
>>>>>> Signed-off-by: Peter Lieven 
>>>>>> ---
>>>>>> V2->V3:
>>>>>> - check rbd_flags every time (they can change during runtime) [Ilya]
>>>>>> - also check for fast-diff invalid flag [Ilya]
>>>>>> - *map and *file cant be NULL [Ilya]
>>>>>> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
>>>>>> unallocated area [Ilya]
>>>>>> - typo: catched -> caught [Ilya]
>>>>>> - changed wording about fast-diff, object-map and exclusive lock in
>>>>>> commit msg [Ilya]
>>>>>>
>>>>>> V1->V2:
>>>>>> - add commit comment [Stefano]
>>>>>> - use failed_post_open [Stefano]
>>>>>> - remove redundant assert [Stefano]
>>>>>> - add macro+comment for the magic -9000 value [Stefano]
>>>>>> - always set *file if its non NULL [Stefano]
>>>>>>
>>>>>>block/rbd.c | 126 
>>>>>>1 file changed, 126 insertions(+)
>>>>>>
>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>> index dcf82b15b8..3cb24f9981 100644
>>>>>> --- a/block/rbd.c
>>>>>> +++ b/block/rbd.c
>>>>>> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
>>>>>> *qemu_rbd_get_specific_info(BlockDriverState *bs,
>>>>>>return spec_info;
>>>>>>}
>>>>>>
>>>>>> +typedef struct rbd_diff_req {
>>>>>> +uint64_t offs;
>>>>>> +uint64_t bytes;
>>>>>> +int exists;
>>>>> Hi Peter,
>>>>>
>>>>> Nit: make exists a bool.  The one in the callback has to be an int
>>>>> because of the callback signature but let's not spread that.
>>>>>
>>>>>> +} rbd_diff_req;
>>>>>> +
>>>>>> +/*
>>>>>> + * rbd_diff_iterate2 allows to interrupt the execution by returning a
>>>>>> negative
>>>>>> + * value in the callback routine. Choose a value that does not conflict 
>>>>>> with
>>>>>> + * an existing exitcode and return it if we want to prematurely stop the
>>>>>> + * execution because we detected a change in the allocation status.
>>>>>> + */
>>>>>> +#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
>>>>>> +
>>>>>> +static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
>>>>>> +   int exists, void *opaque)
>>>>>> +{
>>>>>> +struct rbd_diff_req *req = opaque;
>>>>>> +
>>>>>> +assert(req->offs + req->bytes <= offs);
>>>>>> +
>>>>>> +if (req->exists && offs > req->offs + req->bytes) {
>>>>>> +/*
>>>>>> + * we started in an allocated area and jumped 

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2022-01-06 Thread Peter Lieven
Am 06.01.22 um 18:47 schrieb Ilya Dryomov:
> On Thu, Jan 6, 2022 at 5:33 PM Peter Lieven  wrote:
>> Am 06.01.22 um 17:01 schrieb Ilya Dryomov:
>>> On Thu, Jan 6, 2022 at 4:27 PM Peter Lieven  wrote:
>>>> Am 05.10.21 um 10:36 schrieb Ilya Dryomov:
>>>>> On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:
>>>>>> Am 05.10.21 um 09:54 schrieb Ilya Dryomov:
>>>>>>> On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:
>>>>>>>> the qemu rbd driver currently lacks support for bdrv_co_block_status.
>>>>>>>> This results mainly in incorrect progress during block operations (e.g.
>>>>>>>> qemu-img convert with an rbd image as source).
>>>>>>>>
>>>>>>>> This patch utilizes the rbd_diff_iterate2 call from librbd to detect
>>>>>>>> allocated and unallocated (all zero areas).
>>>>>>>>
>>>>>>>> To avoid querying the ceph OSDs for the answer this is only done if
>>>>>>>> the image has the fast-diff feature which depends on the object-map and
>>>>>>>> exclusive-lock features. In this case it is guaranteed that the 
>>>>>>>> information
>>>>>>>> is present in memory in the librbd client and thus very fast.
>>>>>>>>
>>>>>>>> If fast-diff is not available all areas are reported to be allocated
>>>>>>>> which is the current behaviour if bdrv_co_block_status is not 
>>>>>>>> implemented.
>>>>>>>>
>>>>>>>> Signed-off-by: Peter Lieven 
>>>>>>>> ---
>>>>>>>> V2->V3:
>>>>>>>> - check rbd_flags every time (they can change during runtime) [Ilya]
>>>>>>>> - also check for fast-diff invalid flag [Ilya]
>>>>>>>> - *map and *file cant be NULL [Ilya]
>>>>>>>> - set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
>>>>>>>>  unallocated area [Ilya]
>>>>>>>> - typo: catched -> caught [Ilya]
>>>>>>>> - changed wording about fast-diff, object-map and exclusive lock in
>>>>>>>>  commit msg [Ilya]
>>>>>>>>
>>>>>>>> V1->V2:
>>>>>>>> - add commit comment [Stefano]
>>>>>>>> - use failed_post_open [Stefano]
>>>>>>>> - remove redundant assert [Stefano]
>>>>>>>> - add macro+comment for the magic -9000 value [Stefano]
>>>>>>>> - always set *file if its non NULL [Stefano]
>>>>>>>>
>>>>>>>> block/rbd.c | 126 
>>>>>>>> 
>>>>>>>> 1 file changed, 126 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index dcf82b15b8..3cb24f9981 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
>>>>>>>> *qemu_rbd_get_specific_info(BlockDriverState *bs,
>>>>>>>> return spec_info;
>>>>>>>> }
>>>>>>>>
>>>>>>>> +typedef struct rbd_diff_req {
>>>>>>>> +uint64_t offs;
>>>>>>>> +uint64_t bytes;
>>>>>>>> +int exists;
>>>>>>> Hi Peter,
>>>>>>>
>>>>>>> Nit: make exists a bool.  The one in the callback has to be an int
>>>>>>> because of the callback signature but let's not spread that.
>>>>>>>
>>>>>>>> +} rbd_diff_req;
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * rbd_diff_iterate2 allows to interrupt the execution by returning a
>>>>>>>> negative
>>>>>>>> + * value in the callback routine. Choose a value that does not 
>>>>>>>> conflict with
>>>>>>>> + * an existing exitcode and return it if we want to prematurely stop 
>>>>>>>> the
>>>>>>>> + * execution because we detected a change in the allocation 

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2022-01-06 Thread Peter Lieven

Am 06.01.22 um 17:01 schrieb Ilya Dryomov:

On Thu, Jan 6, 2022 at 4:27 PM Peter Lieven  wrote:

Am 05.10.21 um 10:36 schrieb Ilya Dryomov:

On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:

Am 05.10.21 um 09:54 schrieb Ilya Dryomov:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
 unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
 commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

block/rbd.c | 126 
1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
return spec_info;
}

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?

That would happen if you diff from a snapshot. The question is whether it can
also happen if the image is a clone of a snapshot.



+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ B

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2022-01-06 Thread Peter Lieven

Am 05.10.21 um 10:36 schrieb Ilya Dryomov:

On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:

Am 05.10.21 um 09:54 schrieb Ilya Dryomov:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file can't be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if it's non-NULL [Stefano]

   block/rbd.c | 126 
   1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
   return spec_info;
   }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows interrupting the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?


That would happen if you diff from a snapshot. The question is whether it can
also happen if the image is a clone of a snapshot.



+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret t

Re: [PATCH v2 0/2] qemu-img convert: Fix sparseness detection

2021-12-21 Thread Peter Lieven
On 17.12.21 at 17:46, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
>
> 01: only update test output rebasing on master
> 02: replaced with my proposed solution.
>
> Kevin Wolf (1):
>   iotests: Test qemu-img convert of zeroed data cluster
>
> Vladimir Sementsov-Ogievskiy (1):
>   qemu-img: make is_allocated_sectors() more efficient
>
>  qemu-img.c | 23 +++
>  tests/qemu-iotests/122 |  1 +
>  tests/qemu-iotests/122.out |  2 ++
>  3 files changed, 22 insertions(+), 4 deletions(-)
>
Tested-by: Peter Lieven 





Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-12-17 Thread Peter Lieven
On 04.12.21 at 00:04, Vladimir Sementsov-Ogievskiy wrote:
> 03.12.2021 14:17, Peter Lieven wrote:
>> On 19.05.21 at 18:48, Kevin Wolf wrote:
>>> On 19.05.2021 at 15:24, Peter Lieven wrote:
>>>> On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 20.04.2021 18:04, Kevin Wolf wrote:
>>>>>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>>>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed 
>>>>>>>> areas
>>>>>>>> like non-zero data if the end of the checked area isn't aligned. This
>>>>>>>> can improve the efficiency of the conversion and was introduced in
>>>>>>>> commit 8dcd3c9b91a.
>>>>>>>>
>>>>>>>> However, it comes with a correctness problem: qemu-img convert is
>>>>>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>>>>>> any more. It turns out that this even happens when not only the
>>>>>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>>>>>> the bug report, conversion of a fragmented 10G image containing only
>>>>>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>>>>>> size is only 4 KiB.
>>>>>>>>
>>>>>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>>>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>>>>>> keep them as such, even if it may mean additional RMW cycles.
>>>>>>>>
>>>>>>> Hmm.. If I understand correctly, we are going to do unaligned
>>>>>>> write-zero. And that helps.
>>>>>> This can happen (mostly raw images on block devices, I think?), but
>>>>>> usually it just means skipping the write because we know that the target
>>>>>> image is already zeroed.
>>>>>>
>>>>>> What it does mean is that if the next part is data, we'll have an
>>>>>> unaligned data write.
>>>>>>
>>>>>>> Doesn't that mean that alignment is wrongly detected?
>>>>>> The problem is that you can have bdrv_block_status_above() return the
>>>>>> same allocation status multiple times in a row, but *pnum can be
>>>>>> unaligned for the conversion.
>>>>>>
>>>>>> We only look at a single range returned by it when detecting the
>>>>>> alignment, so it could be that we have zero buffers for both 0-11 and
>>>>>> 12-16 and detect two misaligned ranges, when both together are a
>>>>>> perfectly aligned zeroed range.
>>>>>>
>>>>>> In theory we could try to do some lookahead and merge ranges where
>>>>>> possible, which should give us the perfect result, but it would make the
>>>>>> code considerably more complicated. (Whether we want to merge them
>>>>>> doesn't only depend on the block status, but possibly also on the
>>>>>> content of a DATA range.)
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>> Oh, I understand now the problem, thanks for explanation.
>>>>>
>>>>> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors 
>>>>> must not align it down, to be possibly "merged" with next chunk if it is 
>>>>> zero too.
>>>>>
>>>>> But it's still good to align zeroes down, if data starts somewhere inside 
>>>>> the buf, isn't it?
>>>>>
>>>>> what about something like this:
>>>>>
>>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>>> index babb5573ab..d1704584a0 100644
>>>>> --- a/qemu-img.c
>>>>> +++ b/qemu-img.c
>>>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>>>>>   }
>>>>>   }
>>>>>   +    if (i == n) {
>>>>> +    /*
>>>>> + * The whole buf is the same.
>>>>> + *
>>>>> + * if it's data, just return it. It's the old behavior.
>>>>

Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-12-04 Thread Peter Lieven



> On 04.12.2021 at 00:04, Vladimir Sementsov-Ogievskiy wrote:
> 
> 03.12.2021 14:17, Peter Lieven wrote:
>>> On 19.05.21 at 18:48, Kevin Wolf wrote:
>>> On 19.05.2021 at 15:24, Peter Lieven wrote:
>>>> On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 20.04.2021 18:04, Kevin Wolf wrote:
>>>>>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>>>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed 
>>>>>>>> areas
>>>>>>>> like non-zero data if the end of the checked area isn't aligned. This
>>>>>>>> can improve the efficiency of the conversion and was introduced in
>>>>>>>> commit 8dcd3c9b91a.
>>>>>>>> 
>>>>>>>> However, it comes with a correctness problem: qemu-img convert is
>>>>>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>>>>>> any more. It turns out that this even happens when not only the
>>>>>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>>>>>> the bug report, conversion of a fragmented 10G image containing only
>>>>>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>>>>>> size is only 4 KiB.
>>>>>>>> 
>>>>>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>>>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>>>>>> keep them as such, even if it may mean additional RMW cycles.
>>>>>>>> 
>>>>>>> Hmm.. If I understand correctly, we are going to do unaligned
>>>>>>> write-zero. And that helps.
>>>>>> This can happen (mostly raw images on block devices, I think?), but
>>>>>> usually it just means skipping the write because we know that the target
>>>>>> image is already zeroed.
>>>>>> 
>>>>>> What it does mean is that if the next part is data, we'll have an
>>>>>> unaligned data write.
>>>>>> 
>>>>>>> Doesn't that mean that alignment is wrongly detected?
>>>>>> The problem is that you can have bdrv_block_status_above() return the
>>>>>> same allocation status multiple times in a row, but *pnum can be
>>>>>> unaligned for the conversion.
>>>>>> 
>>>>>> We only look at a single range returned by it when detecting the
>>>>>> alignment, so it could be that we have zero buffers for both 0-11 and
>>>>>> 12-16 and detect two misaligned ranges, when both together are a
>>>>>> perfectly aligned zeroed range.
>>>>>> 
>>>>>> In theory we could try to do some lookahead and merge ranges where
>>>>>> possible, which should give us the perfect result, but it would make the
>>>>>> code considerably more complicated. (Whether we want to merge them
>>>>>> doesn't only depend on the block status, but possibly also on the
>>>>>> content of a DATA range.)
>>>>>> 
>>>>>> Kevin
>>>>>> 
>>>>> Oh, I understand now the problem, thanks for explanation.
>>>>> 
>>>>> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors 
>>>>> must not align it down, to be possibly "merged" with next chunk if it is 
>>>>> zero too.
>>>>> 
>>>>> But it's still good to align zeroes down, if data starts somewhere inside 
>>>>> the buf, isn't it?
>>>>> 
>>>>> what about something like this:
>>>>> 
>>>>> diff --git a/qemu-img.c b/qemu-img.c
>>>>> index babb5573ab..d1704584a0 100644
>>>>> --- a/qemu-img.c
>>>>> +++ b/qemu-img.c
>>>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>>>>>  }
>>>>>  }
>>>>>  +if (i == n) {
>>>>> +/*
>>>>> + * The whole buf is the same.
>>>>> + *
>>>>> + * if it's data, just return it. 

Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-12-03 Thread Peter Lieven
On 19.05.21 at 18:48, Kevin Wolf wrote:
> On 19.05.2021 at 15:24, Peter Lieven wrote:
>> On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
>>> 20.04.2021 18:04, Kevin Wolf wrote:
>>>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas
>>>>>> like non-zero data if the end of the checked area isn't aligned. This
>>>>>> can improve the efficiency of the conversion and was introduced in
>>>>>> commit 8dcd3c9b91a.
>>>>>>
>>>>>> However, it comes with a correctness problem: qemu-img convert is
>>>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>>>> any more. It turns out that this even happens when not only the
>>>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>>>> the bug report, conversion of a fragmented 10G image containing only
>>>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>>>> size is only 4 KiB.
>>>>>>
>>>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>>>> keep them as such, even if it may mean additional RMW cycles.
>>>>>>
>>>>> Hmm.. If I understand correctly, we are going to do unaligned
>>>>> write-zero. And that helps.
>>>> This can happen (mostly raw images on block devices, I think?), but
>>>> usually it just means skipping the write because we know that the target
>>>> image is already zeroed.
>>>>
>>>> What it does mean is that if the next part is data, we'll have an
>>>> unaligned data write.
>>>>
>>>>> Doesn't that mean that alignment is wrongly detected?
>>>> The problem is that you can have bdrv_block_status_above() return the
>>>> same allocation status multiple times in a row, but *pnum can be
>>>> unaligned for the conversion.
>>>>
>>>> We only look at a single range returned by it when detecting the
>>>> alignment, so it could be that we have zero buffers for both 0-11 and
>>>> 12-16 and detect two misaligned ranges, when both together are a
>>>> perfectly aligned zeroed range.
>>>>
>>>> In theory we could try to do some lookahead and merge ranges where
>>>> possible, which should give us the perfect result, but it would make the
>>>> code considerably more complicated. (Whether we want to merge them
>>>> doesn't only depend on the block status, but possibly also on the
>>>> content of a DATA range.)
>>>>
>>>> Kevin
>>>>
>>> Oh, I understand now the problem, thanks for explanation.
>>>
>>> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors 
>>> must not align it down, to be possibly "merged" with next chunk if it is 
>>> zero too.
>>>
>>> But it's still good to align zeroes down, if data starts somewhere inside 
>>> the buf, isn't it?
>>>
>>> what about something like this:
>>>
>>> diff --git a/qemu-img.c b/qemu-img.c
>>> index babb5573ab..d1704584a0 100644
>>> --- a/qemu-img.c
>>> +++ b/qemu-img.c
>>> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>>>  }
>>>  }
>>>  
>>> +    if (i == n) {
>>> +    /*
>>> + * The whole buf is the same.
>>> + *
>>> + * if it's data, just return it. It's the old behavior.
>>> + *
>>> + * if it's zero, just return too. It will work well if target is already
>>> + * zeroed. And if next chunk is zero too we'll have no RMW and no reason
>>> + * to write data.
>>> + */
>>> +    *pnum = i;
>>> +    return !is_zero;
>>> +    }
>>> +
>>>  tail = (sector_num + i) & (alignment - 1);
>>>  if (tail) {
>>>  if (is_zero && i <= tail) {
>>> -    /* treat unallocated areas which only consist
>>> - * of a small tail as allocated. */
>>> 
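
The alignment problem Kevin describes in the quoted thread can be tried out in isolation. The following is only a sketch: `range_tail` and `range_is_aligned` are hypothetical helper names, and the alignment of 8 sectors is chosen purely for illustration. Only the expression `(sector_num + i) & (alignment - 1)` is taken from the quoted diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Number of sectors by which the end of [start, start + n) overshoots
 * the previous alignment boundary. Same expression as the "tail"
 * computation in the quoted diff; alignment must be a power of two.
 */
static int64_t range_tail(int64_t start, int64_t n, int64_t alignment)
{
    assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
    return (start + n) & (alignment - 1);
}

/* Hypothetical helper: true if both ends of the range are aligned. */
static bool range_is_aligned(int64_t start, int64_t n, int64_t alignment)
{
    return (start & (alignment - 1)) == 0 &&
           range_tail(start, n, alignment) == 0;
}
```

With an alignment of 8, a zeroed range of 12 sectors starting at sector 0 has a tail of 4, and the following range of 4 sectors starting at sector 12 has an unaligned start. Looked at one at a time, neither range is aligned, although the merged range of 16 sectors starting at 0 is.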

Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-11-15 Thread Peter Lieven

On 26.10.21 at 16:53, Peter Lieven wrote:

On 25.10.21 at 14:58, Kevin Wolf wrote:

On 25.10.2021 at 13:39, Peter Lieven wrote:

On 16.09.21 at 14:34, Peter Lieven wrote:

On 09.07.21 at 12:21, Kevin Wolf wrote:

On 08.07.2021 at 20:23, Peter Lieven wrote:

On 08.07.2021 at 14:18, Kevin Wolf wrote:

On 07.07.2021 at 20:13, Peter Lieven wrote:

On 06.07.2021 at 17:25, Kevin Wolf wrote:
On 06.07.2021 at 16:55, Peter Lieven wrote:

I will have a decent look after my vacation.

Sounds good, thanks. Enjoy your vacation!

As I had to fire up my laptop to look into another issue anyway, I
have sent two patches for updating MAINTAINERS and to fix the int vs.
bool mix for task->complete.

I think you need to reevaluate your definition of vacation. ;-)

Let's talk about this when the kids are grown up. Sometimes sending
patches can be quite relaxing :-)

Heh, fair enough. :-)


But thanks anyway.


As Paolo's fix (5f50be9b5) is relatively new and there may be other
non-obvious problems when removing the BH indirection and we are close
to soft freeze I would leave the BH removal change for 6.2.

Sure, code cleanups aren't urgent.

Isn’t the indirection also a slight performance drop?

Yeah, I guess technically it is, though I doubt it's measurable.

As promised I was trying to remove the indirection through the BH after Qemu 
6.1 release.

However, if I remove the BH I run into the following assertion while running 
some fio tests:


qemu-system-x86_64: ../block/block-backend.c:1197: blk_wait_while_drained: Assertion 
`blk->in_flight > 0' failed.


Any idea?


This is what I changed:


diff --git a/block/rbd.c b/block/rbd.c
index 3cb24f9981..bc1dbc20f7 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1063,13 +1063,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
  return 0;
  }

-static void qemu_rbd_finish_bh(void *opaque)
-{
-    RBDTask *task = opaque;
-    task->complete = true;
-    aio_co_wake(task->co);
-}
-
  /*
   * This is the completion callback function for all rbd aio calls
   * started from qemu_rbd_start_co().
@@ -1083,8 +1076,8 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
  {
  task->ret = rbd_aio_get_return_value(c);
  rbd_aio_release(c);
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
-    qemu_rbd_finish_bh, task);
+    task->complete = true;
+    aio_co_wake(task->co);
  }

Kevin, Paolo, any idea?

Not really, I don't see the connection between both places.

Do you have a stack trace for the crash?


The crash seems not to be limited to that assertion. I have also seen:


qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: Assertion 
`!qiov || qiov->size == acb->bytes' failed.


Although it is harder to trigger, I caught this backtrace in gdb:


qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: Assertion 
`!qiov || qiov->size == acb->bytes' failed.
[Switching to Thread 0x77fa8f40 (LWP 17852)]

Thread 1 "qemu-system-x86" hit Breakpoint 1, __GI_abort () at abort.c:49
49    abort.c: No such file or directory.
(gdb) bt
#0  0x742567e0 in __GI_abort () at abort.c:49
#1  0x7424648a in __assert_fail_base (fmt=0x743cd750 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", 
assertion=assertion@entry=0x55e638e0 "!qiov || qiov->size == acb->bytes", file=file@entry=0x55e634b2 
"../block/block-backend.c", line=line@entry=1497, function=function@entry=0x55e63b20 
<__PRETTY_FUNCTION__.32450> "blk_aio_write_entry") at assert.c:92
#2  0x74246502 in __GI___assert_fail (assertion=assertion@entry=0x55e638e0 "!qiov || qiov->size == 
acb->bytes", file=file@entry=0x55e634b2 "../block/block-backend.c", line=line@entry=1497, 
function=function@entry=0x55e63b20 <__PRETTY_FUNCTION__.32450> "blk_aio_write_entry") at assert.c:101
#3  0x55becc78 in blk_aio_write_entry (opaque=0x56b534f0) at 
../block/block-backend.c:1497
#4  0x55cf0e4c in coroutine_trampoline (i0=, i1=) at ../util/coroutine-ucontext.c:173
#5  0x7426e7b0 in __start_context () at /lib/x86_64-linux-gnu/libc.so.6
#6  0x7fffd5a0 in  ()
#7  0x in  ()




any ideas? Or should we just abandon the idea of removing the BH?


Peter
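
For readers unfamiliar with the term, the "BH indirection" discussed above means that the completion callback does not finish the task itself but schedules a one-shot bottom half that does so later from the event loop. The following toy model only illustrates that pattern. It does not use or reproduce the QEMU AioContext API, all `Toy*` and `toy_*` names are hypothetical, and it makes no claim about the cause of the assertion failures reported above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_PENDING 16

typedef void ToyBHFunc(void *opaque);

/* Toy event loop: just a queue of one-shot bottom halves. */
typedef struct ToyLoop {
    ToyBHFunc *fn[TOY_MAX_PENDING];
    void *opaque[TOY_MAX_PENDING];
    size_t count;
} ToyLoop;

typedef struct ToyTask {
    int ret;
    bool complete;
} ToyTask;

static void toy_bh_schedule_oneshot(ToyLoop *loop, ToyBHFunc *fn, void *opaque)
{
    assert(loop->count < TOY_MAX_PENDING);
    loop->fn[loop->count] = fn;
    loop->opaque[loop->count] = opaque;
    loop->count++;
}

/* Run the bottom halves that were pending when this call started. */
static void toy_loop_run_pending(ToyLoop *loop)
{
    size_t n = loop->count;

    for (size_t i = 0; i < n; i++) {
        loop->fn[i](loop->opaque[i]);
    }
    /* keep anything that was scheduled while the loop above ran */
    size_t rest = loop->count - n;
    for (size_t j = 0; j < rest; j++) {
        loop->fn[j] = loop->fn[n + j];
        loop->opaque[j] = loop->opaque[n + j];
    }
    loop->count = rest;
}

static void toy_finish_bh(void *opaque)
{
    ToyTask *task = opaque;
    task->complete = true;
}

/* With indirection: record the result, finish later from the loop. */
static void toy_completion_deferred(ToyLoop *loop, ToyTask *task, int ret)
{
    task->ret = ret;
    toy_bh_schedule_oneshot(loop, toy_finish_bh, task);
}

/* Without indirection: finish in the context of the callback itself. */
static void toy_completion_direct(ToyTask *task, int ret)
{
    task->ret = ret;
    task->complete = true;
}
```

In the deferred variant the task is still incomplete when the callback returns and only becomes complete once the loop runs its pending bottom halves. In the direct variant it is complete as soon as the callback returns, in whatever context the callback was invoked.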






Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-10-26 Thread Peter Lieven
On 25.10.21 at 14:58, Kevin Wolf wrote:
> On 25.10.2021 at 13:39, Peter Lieven wrote:
>> On 16.09.21 at 14:34, Peter Lieven wrote:
>>> On 09.07.21 at 12:21, Kevin Wolf wrote:
>>>> On 08.07.2021 at 20:23, Peter Lieven wrote:
>>>>> On 08.07.2021 at 14:18, Kevin Wolf wrote:
>>>>>> On 07.07.2021 at 20:13, Peter Lieven wrote:
>>>>>>>> On 06.07.2021 at 17:25, Kevin Wolf wrote:
>>>>>>>> On 06.07.2021 at 16:55, Peter Lieven wrote:
>>>>>>>>> I will have a decent look after my vacation.
>>>>>>>> Sounds good, thanks. Enjoy your vacation!
>>>>>>> As I had to fire up my laptop to look into another issue anyway, I
>>>>>>> have sent two patches for updating MAINTAINERS and to fix the int vs.
>>>>>>> bool mix for task->complete.
>>>>>> I think you need to reevaluate your definition of vacation. ;-)
>>>>> Let's talk about this when the kids are grown up. Sometimes sending
>>>>> patches can be quite relaxing :-)
>>>> Heh, fair enough. :-)
>>>>
>>>>>> But thanks anyway.
>>>>>>
>>>>>>> As Paolo's fix (5f50be9b5) is relatively new and there may be other
>>>>>>> non-obvious problems when removing the BH indirection and we are close
>>>>>>> to soft freeze I would leave the BH removal change for 6.2.
>>>>>> Sure, code cleanups aren't urgent.
>>>>> Isn’t the indirection also a slight performance drop?
>>>> Yeah, I guess technically it is, though I doubt it's measurable.
>>>
>>> As promised I was trying to remove the indirection through the BH after 
>>> Qemu 6.1 release.
>>>
>>> However, if I remove the BH I run into the following assertion while 
>>> running some fio tests:
>>>
>>>
>>> qemu-system-x86_64: ../block/block-backend.c:1197: blk_wait_while_drained: 
>>> Assertion `blk->in_flight > 0' failed.
>>>
>>>
>>> Any idea?
>>>
>>>
>>> This is what I changed:
>>>
>>>
>>> diff --git a/block/rbd.c b/block/rbd.c
>>> index 3cb24f9981..bc1dbc20f7 100644
>>> --- a/block/rbd.c
>>> +++ b/block/rbd.c
>>> @@ -1063,13 +1063,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
>>>  return 0;
>>>  }
>>>
>>> -static void qemu_rbd_finish_bh(void *opaque)
>>> -{
>>> -    RBDTask *task = opaque;
>>> -    task->complete = true;
>>> -    aio_co_wake(task->co);
>>> -}
>>> -
>>>  /*
>>>   * This is the completion callback function for all rbd aio calls
>>>   * started from qemu_rbd_start_co().
>>> @@ -1083,8 +1076,8 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
>>>  {
>>>  task->ret = rbd_aio_get_return_value(c);
>>>  rbd_aio_release(c);
>>> -    aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
>>> -    qemu_rbd_finish_bh, task);
>>> +    task->complete = true;
>>> +    aio_co_wake(task->co);
>>>  }
>> Kevin, Paolo, any idea?
> Not really, I don't see the connection between both places.
>
> Do you have a stack trace for the crash?


The crash seems not to be limited to that assertion. I have also seen:


qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: 
Assertion `!qiov || qiov->size == acb->bytes' failed.


Although it is harder to trigger, I caught this backtrace in gdb:


qemu-system-x86_64: ../block/block-backend.c:1497: blk_aio_write_entry: 
Assertion `!qiov || qiov->size == acb->bytes' failed.
[Switching to Thread 0x77fa8f40 (LWP 17852)]

Thread 1 "qemu-system-x86" hit Breakpoint 1, __GI_abort () at abort.c:49
49    abort.c: No such file or directory.
(gdb) bt
#0  0x742567e0 in __GI_abort () at abort.c:49
#1  0x7424648a in __assert_fail_base (fmt=0x743cd750 "%s%s%s:%u: 
%s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x55e638e0 
"!qiov || qiov->size == acb->bytes", file=file@entry=0x55e634b2 
"../block/block-backend.c", line=line@entry=1497, 
function=function@entry=0x55e63b20 <__PRETTY_FUNCTION__.32450> 
"blk_aio_write_entry") at assert.c:92
#2  0x74246502 in __GI___assert_fail 
(assertion=assertion@entry=0x55e638e0 "!qiov || qiov->size == acb->byte

Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-10-25 Thread Peter Lieven

On 16.09.21 at 14:34, Peter Lieven wrote:

On 09.07.21 at 12:21, Kevin Wolf wrote:

On 08.07.2021 at 20:23, Peter Lieven wrote:

On 08.07.2021 at 14:18, Kevin Wolf wrote:

On 07.07.2021 at 20:13, Peter Lieven wrote:

On 06.07.2021 at 17:25, Kevin Wolf wrote:
On 06.07.2021 at 16:55, Peter Lieven wrote:

I will have a decent look after my vacation.

Sounds good, thanks. Enjoy your vacation!

As I had to fire up my laptop to look into another issue anyway, I
have sent two patches for updating MAINTAINERS and to fix the int vs.
bool mix for task->complete.

I think you need to reevaluate your definition of vacation. ;-)

Let's talk about this when the kids are grown up. Sometimes sending
patches can be quite relaxing :-)

Heh, fair enough. :-)


But thanks anyway.


As Paolo's fix (5f50be9b5) is relatively new and there may be other
non-obvious problems when removing the BH indirection and we are close
to soft freeze I would leave the BH removal change for 6.2.

Sure, code cleanups aren't urgent.

Isn’t the indirection also a slight performance drop?

Yeah, I guess technically it is, though I doubt it's measurable.



As promised I was trying to remove the indirection through the BH after Qemu 
6.1 release.

However, if I remove the BH I run into the following assertion while running 
some fio tests:


qemu-system-x86_64: ../block/block-backend.c:1197: blk_wait_while_drained: Assertion 
`blk->in_flight > 0' failed.


Any idea?


This is what I changed:


diff --git a/block/rbd.c b/block/rbd.c
index 3cb24f9981..bc1dbc20f7 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1063,13 +1063,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
 return 0;
 }

-static void qemu_rbd_finish_bh(void *opaque)
-{
-    RBDTask *task = opaque;
-    task->complete = true;
-    aio_co_wake(task->co);
-}
-
 /*
  * This is the completion callback function for all rbd aio calls
  * started from qemu_rbd_start_co().
@@ -1083,8 +1076,8 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
 task->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
-    qemu_rbd_finish_bh, task);
+    task->complete = true;
+    aio_co_wake(task->co);
 }


Peter




Kevin, Paolo, any idea?


Thanks,

Peter







[PATCH V5] block/rbd: implement bdrv_co_block_status

2021-10-12 Thread Peter Lieven
the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all-zero) areas.

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 112 
 1 file changed, 112 insertions(+)

V4->V5:
 - rename rbd_diff_req to RBDDiffIterateReq, use typedef and move
   definition to top [Ilya]
 - rename callback to qemu_rbd_diff_iterate_cb [Ilya]
 - assert that req.bytes == 0 if !req.exists and r == 0 [Ilya]

V3->V4:
 - make req.exists a bool [Ilya]
 - simplify callback under the assumption that we never receive a cb
   for a hole since we do not diff against a snapshot [Ilya]
 - remove out label [Ilya]
 - rename ret to status [Ilya]

V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file can't be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
  unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
  commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if it's non-NULL [Stefano]


diff --git a/block/rbd.c b/block/rbd.c
index 701fbf2b0c..def96292e0 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -97,6 +97,12 @@ typedef struct RBDTask {
 int64_t ret;
 } RBDTask;
 
+typedef struct RBDDiffIterateReq {
+uint64_t offs;
+uint64_t bytes;
+bool exists;
+} RBDDiffIterateReq;
+
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
 BlockdevOptionsRbd *opts, bool cache,
 const char *keypairs, const char *secretid,
@@ -1259,6 +1265,111 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
 return spec_info;
 }
 
+/*
+ * rbd_diff_iterate2 allows interrupting the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_diff_iterate_cb(uint64_t offs, size_t len,
+int exists, void *opaque)
+{
+RBDDiffIterateReq *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+/*
+ * we do not diff against a snapshot so we should never receive a callback
+ * for a hole.
+ */
+assert(exists);
+
+if (!req->exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+req->bytes += len;
+req->exists = true;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int status, r;
+RBDDiffIterateReq req = { .offs = offset };
+uint64_t features, flags;
+
+assert(offset + bytes <= s->image_size);
+
+/* default to all sectors allocated */
+status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+*map = offset;
+*file = bs;
+*pnum = bytes;
+
+/* check if RBD image supports fast-diff */
+r = rbd_get_features(s->image, &features);
+if (r < 0) {
+  
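
The behaviour of the V5 callback can be checked without a Ceph cluster. The sketch below copies the callback logic from the patch above and drives it with a stand-in for rbd_diff_iterate2. `Extent`, `fake_diff_iterate` and `query_status` are hypothetical names, and the real librbd call is not used. The stand-in only models what the V4 changelog assumes: callbacks arrive in ascending order and only for allocated extents. The handling of the "no callback at all" case follows the V5 changelog entry about asserting req.bytes == 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same role as QEMU_RBD_EXIT_DIFF_ITERATE2 in the patch. */
#define EXIT_DIFF_ITERATE (-9000)

typedef struct DiffReq {
    uint64_t offs;   /* start of the queried range */
    uint64_t bytes;  /* length of the first uniform area */
    bool exists;     /* true if that area is allocated */
} DiffReq;

/* Hypothetical description of one allocated extent of an image. */
typedef struct Extent {
    uint64_t offs;
    uint64_t len;
} Extent;

typedef int DiffCb(uint64_t offs, size_t len, int exists, void *opaque);

/* Callback logic as posted in V5, without the QEMU types. */
static int diff_cb(uint64_t offs, size_t len, int exists, void *opaque)
{
    DiffReq *req = opaque;

    assert(req->offs + req->bytes <= offs);
    assert(exists); /* no diff against a snapshot, so no holes */

    if (!req->exists && offs > req->offs) {
        /* unallocated area ends at the first allocated extent */
        req->bytes = offs - req->offs;
        return EXIT_DIFF_ITERATE;
    }
    if (req->exists && offs > req->offs + req->bytes) {
        /* gap after the allocated area, stop here */
        return EXIT_DIFF_ITERATE;
    }
    req->bytes += len;
    req->exists = true;
    return 0;
}

/* Stand-in for rbd_diff_iterate2: sorted extents, clipped to the range. */
static int fake_diff_iterate(const Extent *ext, size_t n, uint64_t offs,
                             uint64_t len, DiffCb *cb, void *opaque)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t start = ext[i].offs;
        uint64_t end = ext[i].offs + ext[i].len;

        if (end <= offs || start >= offs + len) {
            continue;
        }
        if (start < offs) {
            start = offs;
        }
        if (end > offs + len) {
            end = offs + len;
        }
        int r = cb(start, (size_t)(end - start), 1, opaque);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}

static DiffReq query_status(const Extent *ext, size_t n, uint64_t offs,
                            uint64_t bytes)
{
    DiffReq req = { .offs = offs };
    int r = fake_diff_iterate(ext, n, offs, bytes, diff_cb, &req);

    assert(r == 0 || r == EXIT_DIFF_ITERATE);
    if (!req.exists && r == 0) {
        /* no callback at all: the whole range is unallocated */
        assert(req.bytes == 0);
        req.bytes = bytes;
    }
    return req;
}
```

With allocated extents at [4096, 8192), [8192, 12288) and [20480, 24576), a query starting at offset 0 reports 4096 unallocated bytes, a query starting at 4096 reports 8192 allocated bytes (the two adjacent extents are merged), and a query for the 4096 bytes starting at 12288 reports the whole range as unallocated.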

[PATCH V4] block/rbd: implement bdrv_co_block_status

2021-10-07 Thread Peter Lieven
the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all-zero) areas.

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 111 
 1 file changed, 111 insertions(+)

V3->V4:
 - make req.exists a bool [Ilya]
 - simplify callback under the assumption that we never receive a cb
   for a hole since we do not diff against a snapshot [Ilya]
 - remove out label [Ilya]
 - rename ret to status [Ilya]

V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file can't be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
  unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
  commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

diff --git a/block/rbd.c b/block/rbd.c
index 701fbf2b0c..b9fa8e78eb 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,116 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
 return spec_info;
 }
 
+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+bool exists;
+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+/*
+ * we do not diff against a snapshot so we should never receive a callback
+ * for a hole.
+ */
+assert(exists);
+
+if (!req->exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+req->bytes += len;
+req->exists = true;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int status, r;
+struct rbd_diff_req req = { .offs = offset };
+uint64_t features, flags;
+
+assert(offset + bytes <= s->image_size);
+
+/* default to all sectors allocated */
+status = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+*map = offset;
+*file = bs;
+*pnum = bytes;
+
+/* check if RBD image supports fast-diff */
+r = rbd_get_features(s->image, &features);
+if (r < 0) {
+return status;
+}
+if (!(features & RBD_FEATURE_FAST_DIFF)) {
+return status;
+}
+
+/* check if RBD fast-diff result is valid */
+r = rbd_get_flags(s->image, &flags);
+if (r < 0) {
+return status;
+}
+if (flags & RBD_FLAG_FAST_DIFF_INVALID) {
+return status;
+}
+
+r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
+  qemu_rbd_co_block_status_cb, &req);
+if (r < 0 && r != QEMU_RBD_EXIT_DIFF_
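
Illustrative sketch (not part of the patch): the V4 callback logic can be exercised without librbd by driving it from a mock iterator. Everything below is hypothetical scaffolding: `diff_req`, `extent`, `mock_diff_iterate` and `query_run` are stand-ins invented for this example, and the handling of "no callback at all means the whole range is unallocated" is an assumption, since the patch text above is cut off before that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's rbd_diff_req. */
typedef struct {
    uint64_t offs;
    uint64_t bytes;
    bool exists;
} diff_req;

/* Hypothetical allocated extent; the mock issues one callback per extent. */
typedef struct {
    uint64_t offs;
    size_t len;
} extent;

#define EXIT_DIFF_ITERATE (-9000)

/* Same decisions as the V4 callback, minus the librbd types. */
static int block_status_cb(uint64_t offs, size_t len, int exists, void *opaque)
{
    diff_req *req = opaque;

    assert(req->offs + req->bytes <= offs);
    assert(exists); /* no snapshot diff, so holes are never reported */

    if (!req->exists && offs > req->offs) {
        /* started in a hole: report the hole length and stop */
        req->bytes = offs - req->offs;
        return EXIT_DIFF_ITERATE;
    }
    if (req->exists && offs > req->offs + req->bytes) {
        /* gap after an allocated run: stop, req->bytes is already final */
        return EXIT_DIFF_ITERATE;
    }
    req->bytes += len;
    req->exists = true;
    return 0;
}

/*
 * Mock iterator (assumption: extents are sorted and do not overlap).
 * Calls cb for every extent intersecting [offset, offset + bytes),
 * clipped to that range, and stops on a negative return value.
 */
static int mock_diff_iterate(const extent *ext, size_t n,
                             uint64_t offset, uint64_t bytes,
                             int (*cb)(uint64_t, size_t, int, void *),
                             void *opaque)
{
    uint64_t end = offset + bytes;

    for (size_t i = 0; i < n; i++) {
        uint64_t s = ext[i].offs;
        uint64_t e = ext[i].offs + ext[i].len;

        if (e <= offset || s >= end) {
            continue;
        }
        if (s < offset) {
            s = offset;
        }
        if (e > end) {
            e = end;
        }
        int r = cb(s, (size_t)(e - s), 1, opaque);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}

/* Returns true if the run starting at offset is allocated; *pnum is its length. */
static bool query_run(const extent *ext, size_t n,
                      uint64_t offset, uint64_t bytes, uint64_t *pnum)
{
    diff_req req = { .offs = offset };
    int r = mock_diff_iterate(ext, n, offset, bytes, block_status_cb, &req);

    assert(r == 0 || r == EXIT_DIFF_ITERATE);
    if (!req.exists && r == 0) {
        /* assumed tail handling: no callback means the range is one hole */
        req.bytes = bytes;
    }
    *pnum = req.bytes;
    return req.exists;
}
```

With allocated extents at 4096..12288 and 20480..24576, a query starting at 0 reports a 4096-byte hole, and a query starting at 4096 reports an 8192-byte allocated run, because two adjacent extents are merged before the gap triggers the early exit.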

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-07 Thread Peter Lieven

On 05.10.21 at 10:36, Ilya Dryomov wrote:

On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:

On 05.10.21 at 09:54, Ilya Dryomov wrote:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

   block/rbd.c | 126 
   1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
   return spec_info;
   }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?


That would happen if you diff from a snapshot. Can it also happen if the image
is a clone of a snapshot?



+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret t

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Peter Lieven

On 05.10.21 at 10:36, Ilya Dryomov wrote:

On Tue, Oct 5, 2021 at 10:19 AM Peter Lieven  wrote:

On 05.10.21 at 09:54, Ilya Dryomov wrote:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

   block/rbd.c | 126 
   1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
   return spec_info;
   }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?


That would happen if you diff from a snapshot. Can it also happen if the image
is a clone of a snapshot?



+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret t

Re: [PATCH V3] block/rbd: implement bdrv_co_block_status

2021-10-05 Thread Peter Lieven

On 05.10.21 at 09:54, Ilya Dryomov wrote:

On Thu, Sep 16, 2021 at 2:21 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
   unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
   commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

  block/rbd.c | 126 
  1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
  return spec_info;
  }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;

Hi Peter,

Nit: make exists a bool.  The one in the callback has to be an int
because of the callback signature but let's not spread that.


+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;

Do you have a test case for when this branch is taken?



That would happen if you diff from a snapshot. Can it also happen if the image
is a clone of a snapshot?





+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;

Nit: I would rename ret to status or something like that to make
it clear(er) that it is an actual value and never an error.  Or,

Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-09-16 Thread Peter Lieven

On 09.07.21 at 12:21, Kevin Wolf wrote:

On 08.07.2021 at 20:23, Peter Lieven wrote:

On 08.07.2021 at 14:18, Kevin Wolf wrote:

On 07.07.2021 at 20:13, Peter Lieven wrote:

On 06.07.2021 at 17:25, Kevin Wolf wrote:
On 06.07.2021 at 16:55, Peter Lieven wrote:

I will have a decent look after my vacation.

Sounds good, thanks. Enjoy your vacation!

As I had to fire up my laptop to look into another issue anyway, I
have sent two patches for updating MAINTAINERS and to fix the int vs.
bool mix for task->complete.

I think you need to reevaluate your definition of vacation. ;-)

Let's talk about this when the kids are grown up. Sometimes sending
patches can be quite relaxing :-)

Heh, fair enough. :-)


But thanks anyway.


As Paolo's fix (5f50be9b5) is relatively new, there may be other non-obvious
problems when removing the BH indirection, and we are close to soft freeze,
I would leave the BH removal change for 6.2.

Sure, code cleanups aren't urgent.

Isn’t the indirection also a slight performance drop?

Yeah, I guess technically it is, though I doubt it's measurable.



As promised, I tried to remove the indirection through the BH after the
QEMU 6.1 release.

However, if I remove the BH, I run into the following assertion failure while
running some fio tests:


qemu-system-x86_64: ../block/block-backend.c:1197: blk_wait_while_drained: Assertion `blk->in_flight > 0' failed.


Any idea?


This is what I changed:


diff --git a/block/rbd.c b/block/rbd.c
index 3cb24f9981..bc1dbc20f7 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1063,13 +1063,6 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
 return 0;
 }

-static void qemu_rbd_finish_bh(void *opaque)
-{
-    RBDTask *task = opaque;
-    task->complete = true;
-    aio_co_wake(task->co);
-}
-
 /*
  * This is the completion callback function for all rbd aio calls
  * started from qemu_rbd_start_co().
@@ -1083,8 +1076,8 @@ static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
 task->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
-    aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
-    qemu_rbd_finish_bh, task);
+    task->complete = true;
+    aio_co_wake(task->co);
 }


Peter






[PATCH V3] block/rbd: implement bdrv_co_block_status

2021-09-16 Thread Peter Lieven
the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff feature which depends on the object-map and
exclusive-lock features. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V2->V3:
- check rbd_flags every time (they can change during runtime) [Ilya]
- also check for fast-diff invalid flag [Ilya]
- *map and *file cant be NULL [Ilya]
- set ret = BDRV_BLOCK_ZERO | BDRV_BLOCK_OFFSET_VALID in case of an
  unallocated area [Ilya]
- typo: catched -> caught [Ilya]
- changed wording about fast-diff, object-map and exclusive lock in
  commit msg [Ilya]

V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

 block/rbd.c | 126 
 1 file changed, 126 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..3cb24f9981 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1259,6 +1259,131 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
 return spec_info;
 }
 
+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;
+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we caught all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;
+struct rbd_diff_req req = { .offs = offset };
+uint64_t features, flags;
+
+assert(offset + bytes <= s->image_size);
+
+/* default to all sectors allocated */
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+*map = offset;
+*file = bs;
+*pnum = bytes;
+
+/* check if RBD image supports fast-diff */
+r = rbd_get_features(s->image, &features);
+if (r < 0) {
+goto out;
+}
+if (!(features & RBD_FEATURE_FAST_DIFF)) {
+goto out;
+}
+
+/* check if RBD fast-diff result is 
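
Illustrative sketch (not part of the patch): the V2->V3 change moves the feature and flag checks into every block-status call, and any failure falls back to "everything allocated". The predicate below condenses that decision. The `EX_` constants are stand-ins for `RBD_FEATURE_FAST_DIFF` and `RBD_FLAG_FAST_DIFF_INVALID` with assumed values (real code should take them from the librbd header), and `fast_diff_usable` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed, illustrative values; real code must use the constants
 * provided by <rbd/librbd.h> instead of redefining them.
 */
#define EX_FEATURE_FAST_DIFF      (UINT64_C(1) << 4)
#define EX_FLAG_FAST_DIFF_INVALID (UINT64_C(1) << 1)

/*
 * features_rc and flags_rc model the return codes of the two librbd
 * queries (negative means error). Returns true only if the fast-diff
 * data may be trusted for this call; otherwise the caller keeps the
 * default status "all sectors allocated".
 */
static bool fast_diff_usable(int features_rc, uint64_t features,
                             int flags_rc, uint64_t flags)
{
    if (features_rc < 0 || !(features & EX_FEATURE_FAST_DIFF)) {
        return false;
    }
    if (flags_rc < 0 || (flags & EX_FLAG_FAST_DIFF_INVALID)) {
        return false;
    }
    return true;
}
```

Because the predicate is evaluated per request, an image whose object map becomes invalid, or whose fast-diff feature is disabled while it is open, silently degrades to the conservative answer on the next call.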

Re: [PATCH V2] block/rbd: implement bdrv_co_block_status

2021-09-02 Thread Peter Lieven

On 24.08.21 at 22:39, Ilya Dryomov wrote:

On Mon, Aug 23, 2021 at 11:38 AM Peter Lieven  wrote:

On 22.08.21 at 23:02, Ilya Dryomov wrote:

On Tue, Aug 10, 2021 at 3:41 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff features which depends on the object-map

Hi Peter,

Nit: "has the fast-diff feature which depends on the object-map and
exclusive-lock features"


will reword in V3.



and exclusive-lock. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

   block/rbd.c | 125 
   1 file changed, 125 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..8692e76f40 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -88,6 +88,7 @@ typedef struct BDRVRBDState {
   char *namespace;
   uint64_t image_size;
   uint64_t object_size;
+uint64_t features;
   } BDRVRBDState;

   typedef struct RBDTask {
@@ -983,6 +984,13 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
   s->image_size = info.size;
   s->object_size = info.obj_size;

+r = rbd_get_features(s->image, &s->features);
+if (r < 0) {
+error_setg_errno(errp, -r, "error getting image features from %s",
+ s->image_name);
+goto failed_post_open;
+}

The object-map and fast-diff features can be enabled/disabled while the
image is open so this should probably go to qemu_rbd_co_block_status().


+
   /* If we are using an rbd snapshot, we must be r/o, otherwise
* leave as-is */
   if (s->snap != NULL) {
@@ -1259,6 +1267,122 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
   return spec_info;
   }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;
+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we catched all cases above and allocation state has not

catched -> caught


+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ 

Re: [PATCH V2] block/rbd: implement bdrv_co_block_status

2021-08-23 Thread Peter Lieven

On 22.08.21 at 23:02, Ilya Dryomov wrote:

On Tue, Aug 10, 2021 at 3:41 PM Peter Lieven  wrote:

the qemu rbd driver currently lacks support for bdrv_co_block_status.
This results mainly in incorrect progress during block operations (e.g.
qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all zero areas).

To avoid querying the ceph OSDs for the answer this is only done if
the image has the fast-diff features which depends on the object-map

Hi Peter,

Nit: "has the fast-diff feature which depends on the object-map and
exclusive-lock features"



will reword in V3.





and exclusive-lock. In this case it is guaranteed that the information
is present in memory in the librbd client and thus very fast.

If fast-diff is not available all areas are reported to be allocated
which is the current behaviour if bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if its non NULL [Stefano]

  block/rbd.c | 125 
  1 file changed, 125 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..8692e76f40 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -88,6 +88,7 @@ typedef struct BDRVRBDState {
  char *namespace;
  uint64_t image_size;
  uint64_t object_size;
+uint64_t features;
  } BDRVRBDState;

  typedef struct RBDTask {
@@ -983,6 +984,13 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
  s->image_size = info.size;
  s->object_size = info.obj_size;

+r = rbd_get_features(s->image, &s->features);
+if (r < 0) {
+error_setg_errno(errp, -r, "error getting image features from %s",
+ s->image_name);
+goto failed_post_open;
+}

The object-map and fast-diff features can be enabled/disabled while the
image is open so this should probably go to qemu_rbd_co_block_status().


+
  /* If we are using an rbd snapshot, we must be r/o, otherwise
   * leave as-is */
  if (s->snap != NULL) {
@@ -1259,6 +1267,122 @@ static ImageInfoSpecific *qemu_rbd_get_specific_info(BlockDriverState *bs,
  return spec_info;
  }

+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;
+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the execution by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we catched all cases above and allocation state has not

catched -> caught


+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t offset,
+ int64_t bytes, i

[PATCH V2] block/rbd: implement bdrv_co_block_status

2021-08-10 Thread Peter Lieven
The qemu rbd driver currently lacks support for bdrv_co_block_status.
This mainly results in incorrect progress reporting during block operations
(e.g. qemu-img convert with an rbd image as source).

This patch utilizes the rbd_diff_iterate2 call from librbd to detect
allocated and unallocated (all-zero) areas.

To avoid querying the ceph OSDs for the answer, this is only done if
the image has the fast-diff feature, which depends on the object-map
and exclusive-lock features. In this case it is guaranteed that the
information is present in memory in the librbd client and thus very fast.

If fast-diff is not available, all areas are reported as allocated,
which is the current behaviour when bdrv_co_block_status is not implemented.

Signed-off-by: Peter Lieven 
---
V1->V2:
- add commit comment [Stefano]
- use failed_post_open [Stefano]
- remove redundant assert [Stefano]
- add macro+comment for the magic -9000 value [Stefano]
- always set *file if it's non-NULL [Stefano]
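
The callback logic in this patch can be exercised outside QEMU. What follows is a minimal sketch, not part of the patch: demo_diff_iterate is a hypothetical stand-in for rbd_diff_iterate2 that simply replays a fixed extent list, and demo_first_extent is an illustrative helper showing how the callback state could be turned into a length plus an allocation flag. All demo_ names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plays the role of QEMU_RBD_EXIT_DIFF_ITERATE2 in the patch. */
#define DEMO_EXIT_DIFF_ITERATE (-9000)

struct demo_diff_req {
    uint64_t offs;   /* start of the queried range */
    uint64_t bytes;  /* length of the first run found so far */
    int exists;      /* allocation state of that run */
};

struct demo_extent {
    uint64_t offs;
    size_t len;
    int exists;
};

typedef int (*demo_diff_cb)(uint64_t offs, size_t len, int exists,
                            void *opaque);

/* Grow the first run of equal allocation state; stop once it changes. */
static int demo_status_cb(uint64_t offs, size_t len, int exists, void *opaque)
{
    struct demo_diff_req *req = opaque;

    assert(req->offs + req->bytes <= offs);

    if (req->exists && offs > req->offs + req->bytes) {
        return DEMO_EXIT_DIFF_ITERATE;  /* allocated run followed by a gap */
    }
    if (req->exists && !exists) {
        return DEMO_EXIT_DIFF_ITERATE;  /* allocated run followed by a hole */
    }
    if (!req->exists && exists && offs > req->offs) {
        req->bytes = offs - req->offs;  /* unallocated run before first data */
        return DEMO_EXIT_DIFF_ITERATE;
    }

    req->exists = exists;
    req->bytes = offs + len - req->offs;
    return 0;
}

/* Hypothetical stand-in for rbd_diff_iterate2: replay extents in order and
 * abort as soon as the callback returns a negative value. */
static int demo_diff_iterate(const struct demo_extent *ext, size_t n,
                             demo_diff_cb cb, void *opaque)
{
    for (size_t i = 0; i < n; i++) {
        int r = cb(ext[i].offs, ext[i].len, ext[i].exists, opaque);
        if (r < 0) {
            return r;
        }
    }
    return 0;
}

/* Return the length of the first run starting at offset and report whether
 * it is allocated. Real errors fall back to "everything allocated". */
static uint64_t demo_first_extent(const struct demo_extent *ext, size_t n,
                                  uint64_t offset, uint64_t bytes,
                                  int *allocated)
{
    struct demo_diff_req req = { .offs = offset };
    int r = demo_diff_iterate(ext, n, demo_status_cb, &req);

    if (r < 0 && r != DEMO_EXIT_DIFF_ITERATE) {
        *allocated = 1;
        return bytes;
    }
    if (!req.exists && r == 0 && req.bytes == 0) {
        req.bytes = bytes;  /* no callback at all: whole range unallocated */
    }
    *allocated = req.exists;
    return req.bytes;
}
```

With extents at 0..8191 and 16384..20479 reported as allocated, a query starting at offset 0 yields an allocated run of 8192 bytes; with no extents reported at all, the whole queried range comes back as unallocated.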

 block/rbd.c | 125 
 1 file changed, 125 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..8692e76f40 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -88,6 +88,7 @@ typedef struct BDRVRBDState {
 char *namespace;
 uint64_t image_size;
 uint64_t object_size;
+uint64_t features;
 } BDRVRBDState;
 
 typedef struct RBDTask {
@@ -983,6 +984,13 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->image_size = info.size;
 s->object_size = info.obj_size;
 
+r = rbd_get_features(s->image, &s->features);
+if (r < 0) {
+error_setg_errno(errp, -r, "error getting image features from %s",
+ s->image_name);
+goto failed_post_open;
+}
+
 /* If we are using an rbd snapshot, we must be r/o, otherwise
  * leave as-is */
 if (s->snap != NULL) {
@@ -1259,6 +1267,122 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
 return spec_info;
 }
 
+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;
+} rbd_diff_req;
+
+/*
+ * rbd_diff_iterate2 allows to interrupt the exection by returning a negative
+ * value in the callback routine. Choose a value that does not conflict with
+ * an existing exitcode and return it if we want to prematurely stop the
+ * execution because we detected a change in the allocation status.
+ */
+#define QEMU_RBD_EXIT_DIFF_ITERATE2 -9000
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return QEMU_RBD_EXIT_DIFF_ITERATE2;
+}
+
+/*
+ * assert that we catched all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;
+struct rbd_diff_req req = { .offs = offset };
+
+assert(offset + bytes <= s->image_size);
+
+/* default to all sectors allocated */
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+if (ma

Re: [PATCH] block/rbd: implement bdrv_co_block_status

2021-08-10 Thread Peter Lieven

On 10.08.21 at 10:51, Stefano Garzarella wrote:

On Mon, Aug 09, 2021 at 03:41:36PM +0200, Peter Lieven wrote:

Please, can you add a description?
For example also describing what happens if RBD image does not support 
RBD_FEATURE_FAST_DIFF.



Sure.





Signed-off-by: Peter Lieven 
---
block/rbd.c | 119 
1 file changed, 119 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..ef1eaa6af3 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -88,6 +88,7 @@ typedef struct BDRVRBDState {
    char *namespace;
    uint64_t image_size;
    uint64_t object_size;
+    uint64_t features;
} BDRVRBDState;

typedef struct RBDTask {
@@ -983,6 +984,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
    s->image_size = info.size;
    s->object_size = info.obj_size;

+    r = rbd_get_features(s->image, &s->features);
+    if (r < 0) {
+    error_setg_errno(errp, -r, "error getting image features from %s",
+ s->image_name);
+    rbd_close(s->image);
+    goto failed_open;

  ^
You can use the `failed_post_open` label here, so you can avoid calling rbd_close().



My bad, I developed this patch against a QEMU version where failed_post_open
wasn't present...





+    }
+
    /* If we are using an rbd snapshot, we must be r/o, otherwise
 * leave as-is */
    if (s->snap != NULL) {
@@ -1259,6 +1268,115 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
    return spec_info;
}

+typedef struct rbd_diff_req {
+    uint64_t offs;
+    uint64_t bytes;
+    int exists;
+} rbd_diff_req;
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+    struct rbd_diff_req *req = opaque;
+
+    assert(req->offs + req->bytes <= offs);
+    assert(offs >= req->offs + req->bytes);


I think just one of the two asserts is enough, isn't that the same condition?



Right.





+
+    if (req->exists && offs > req->offs + req->bytes) {
+    /*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+    return -9000;

 ^
What is this magical value?

Please add a macro (with a comment) and also use it below in other places.



Will add in V2.





+    }
+    if (req->exists && !exists) {
+    /*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+    return -9000;
+    }
+    if (!req->exists && exists && offs > req->offs) {
+    /*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+    req->bytes = offs - req->offs;
+    return -9000;
+    }
+
+    /*
+ * assert that we catched all cases above and allocation state has not
+ * changed during callbacks.
+ */
+    assert(exists == req->exists || !req->bytes);
+    req->exists = exists;
+
+    /*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+    assert(!req->exists || offs == req->offs + req->bytes);
+    req->bytes = offs + len - req->offs;
+
+    return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+    BDRVRBDState *s = bs->opaque;
+    int ret, r;
+    struct rbd_diff_req req = { .offs = offset };
+
+    assert(offset + bytes <= s->image_size);
+
+    /* default to all sectors allocated */
+    ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+    if (map) {
+    *map = offset;
+    }
+    *pnum = bytes;
+
+    /* RBD image does not support fast-diff */
+    if (!(s->features & RBD_FEATURE_FAST_DIFF)) {
+    goto out;
+    }
+
+    r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
+  qemu_rbd_co_block_status_cb, &req);
+    if (r < 0 && r != -9000) {
+    goto out;
+    }
+    assert(req.bytes <= bytes);
+    if (!req.exists) {
+    if (r == 0 && !req.bytes) {
+    /*
+ * rbd_diff_iterate2 does not invoke callbacks for unallocated 
areas
+ * except for the case where an o

[PATCH] block/rbd: implement bdrv_co_block_status

2021-08-09 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 119 
 1 file changed, 119 insertions(+)

diff --git a/block/rbd.c b/block/rbd.c
index dcf82b15b8..ef1eaa6af3 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -88,6 +88,7 @@ typedef struct BDRVRBDState {
 char *namespace;
 uint64_t image_size;
 uint64_t object_size;
+uint64_t features;
 } BDRVRBDState;
 
 typedef struct RBDTask {
@@ -983,6 +984,14 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 s->image_size = info.size;
 s->object_size = info.obj_size;
 
+r = rbd_get_features(s->image, &s->features);
+if (r < 0) {
+error_setg_errno(errp, -r, "error getting image features from %s",
+ s->image_name);
+rbd_close(s->image);
+goto failed_open;
+}
+
 /* If we are using an rbd snapshot, we must be r/o, otherwise
  * leave as-is */
 if (s->snap != NULL) {
@@ -1259,6 +1268,115 @@ static ImageInfoSpecific 
*qemu_rbd_get_specific_info(BlockDriverState *bs,
 return spec_info;
 }
 
+typedef struct rbd_diff_req {
+uint64_t offs;
+uint64_t bytes;
+int exists;
+} rbd_diff_req;
+
+static int qemu_rbd_co_block_status_cb(uint64_t offs, size_t len,
+   int exists, void *opaque)
+{
+struct rbd_diff_req *req = opaque;
+
+assert(req->offs + req->bytes <= offs);
+assert(offs >= req->offs + req->bytes);
+
+if (req->exists && offs > req->offs + req->bytes) {
+/*
+ * we started in an allocated area and jumped over an unallocated area,
+ * req->bytes contains the length of the allocated area before the
+ * unallocated area. stop further processing.
+ */
+return -9000;
+}
+if (req->exists && !exists) {
+/*
+ * we started in an allocated area and reached a hole. req->bytes
+ * contains the length of the allocated area before the hole.
+ * stop further processing.
+ */
+return -9000;
+}
+if (!req->exists && exists && offs > req->offs) {
+/*
+ * we started in an unallocated area and hit the first allocated
+ * block. req->bytes must be set to the length of the unallocated area
+ * before the allocated area. stop further processing.
+ */
+req->bytes = offs - req->offs;
+return -9000;
+}
+
+/*
+ * assert that we catched all cases above and allocation state has not
+ * changed during callbacks.
+ */
+assert(exists == req->exists || !req->bytes);
+req->exists = exists;
+
+/*
+ * assert that we either return an unallocated block or have got callbacks
+ * for all allocated blocks present.
+ */
+assert(!req->exists || offs == req->offs + req->bytes);
+req->bytes = offs + len - req->offs;
+
+return 0;
+}
+
+static int coroutine_fn qemu_rbd_co_block_status(BlockDriverState *bs,
+ bool want_zero, int64_t 
offset,
+ int64_t bytes, int64_t *pnum,
+ int64_t *map,
+ BlockDriverState **file)
+{
+BDRVRBDState *s = bs->opaque;
+int ret, r;
+struct rbd_diff_req req = { .offs = offset };
+
+assert(offset + bytes <= s->image_size);
+
+/* default to all sectors allocated */
+ret = BDRV_BLOCK_DATA | BDRV_BLOCK_OFFSET_VALID;
+if (map) {
+*map = offset;
+}
+*pnum = bytes;
+
+/* RBD image does not support fast-diff */
+if (!(s->features & RBD_FEATURE_FAST_DIFF)) {
+goto out;
+}
+
+r = rbd_diff_iterate2(s->image, NULL, offset, bytes, true, true,
+  qemu_rbd_co_block_status_cb, &req);
+if (r < 0 && r != -9000) {
+goto out;
+}
+assert(req.bytes <= bytes);
+if (!req.exists) {
+if (r == 0 && !req.bytes) {
+/*
+ * rbd_diff_iterate2 does not invoke callbacks for unallocated 
areas
+ * except for the case where an overlay has a hole where the parent
+ * has not. This here catches the case where no callback was
+ * invoked at all.
+ */
+req.bytes = bytes;
+}
+ret &= ~BDRV_BLOCK_DATA;
+ret |= BDRV_BLOCK_ZERO;
+}
+*pnum = req.bytes;
+
+out:
+if (ret > 0 && ret & BDRV_BLOCK_OFFSET_VALID && file) {
+*file = bs;
+}
+return ret;
+}
+
 static int64_t qemu_rbd_getlength(BlockDriverState *bs)
 {
 BDRVRBDState *s = bs->opaque;
@@ -1494,6 +1612,7 @@ static BlockDriver bdrv_rbd = {
 #ifdef LIBRBD_SUPPORTS

Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-08 Thread Peter Lieven



> On 08.07.2021 at 14:18, Kevin Wolf wrote:
> 
> On 07.07.2021 at 20:13, Peter Lieven wrote:
>>> On 06.07.2021 at 17:25, Kevin Wolf wrote:
>>> On 06.07.2021 at 16:55, Peter Lieven wrote:
>>>> I will have a decent look after my vacation.
>>> 
>>> Sounds good, thanks. Enjoy your vacation!
>> 
>> As I had to fire up my laptop to look into another issue anyway, I
>> have sent two patches for updating MAINTAINERS and to fix the int vs.
>> bool mix for task->complete.
> 
> I think you need to reevaluate your definition of vacation. ;-)

Let's talk about this when the kids are grown up. Sometimes sending patches can
be quite relaxing :-)

> 
> But thanks anyway.
> 
>> As Paolo's fix (5f50be9b5) is relatively new, there may be other
>> non-obvious problems when removing the BH indirection, and we are close
>> to soft freeze, I would leave the BH removal change for 6.2.
> 
> Sure, code cleanups aren't urgent.

Isn’t the indirection also a slight performance drop?

Peter






Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-07 Thread Peter Lieven



> On 06.07.2021 at 17:25, Kevin Wolf wrote:
> 
> On 06.07.2021 at 16:55, Peter Lieven wrote:
>>> On 06.07.2021 at 15:19, Kevin Wolf wrote:
>>> 
>>> On 02.07.2021 at 19:23, Ilya Dryomov wrote:
>>>> This series migrates the qemu rbd driver from the old aio emulation
>>>> to native coroutines and adds write zeroes support which is important
>>>> for block operations.
>>>> 
>>>> To achieve this we first bump the librbd requirement to the already
>>>> outdated luminous release of ceph to get rid of some wrappers and
>>>> ifdef'ry in the code.
>>> 
>>> Thanks, applied to the block branch.
>>> 
>>> I've only had a very quick look at the patches, but I think there is one
>>> suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
>>> indirection is probably unnecessary now because aio_co_wake() is thread
>>> safe.
>> 
>> But this is new, isn’t it?
> 
> Not sure in what sense you mean. aio_co_wake() has always been thread
> safe, as far as I know.
> 
> Obviously, the old code didn't use aio_co_wake(), but directly called
> some callbacks, so the part that is new is your coroutine conversion
> that enables getting rid of the BH.
> 
>> We also have this indirection in iscsi and nfs drivers I think.
> 
> Indeed, the resulting code looks the same. In iscsi and nfs it doesn't
> come from an incomplete conversion to coroutines, but they both used
> qemu_coroutine_enter() originally, which resumes the coroutine in the
> current thread...
> 
>> Does it matter that the completion callback is called from an librbd
>> thread? Will the coroutine continue to run in the right thread?
> 
> ...whereas aio_co_wake() resumes the coroutine in the thread where it
> was running before.
> 
> (Before commit 5f50be9b5, this would have been buggy from an librbd
> thread, but now it should work correctly even for threads that are
> neither iothreads nor vcpu threads.)
> 
>> I will have a decent look after my vacation.
> 
> Sounds good, thanks. Enjoy your vacation!


As I had to fire up my laptop to look into another issue anyway, I have sent
two patches: one updating MAINTAINERS and one fixing the int vs. bool mix for
task->complete. As Paolo's fix (5f50be9b5) is relatively new, there may be
other non-obvious problems when removing the BH indirection, and we are close
to soft freeze, I would leave the BH removal change for 6.2.

Best,
Peter





[PATCH] block/rbd: fix type of task->complete

2021-07-07 Thread Peter Lieven
task->complete is a bool not an integer.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 01a7b94d62..dcf82b15b8 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -1066,7 +1066,7 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t 
size)
 static void qemu_rbd_finish_bh(void *opaque)
 {
 RBDTask *task = opaque;
-task->complete = 1;
+task->complete = true;
 aio_co_wake(task->co);
 }
 
-- 
2.17.1





[PATCH] MAINTAINERS: update block/rbd.c maintainer

2021-07-07 Thread Peter Lieven
adding myself as a designated reviewer.

Signed-off-by: Peter Lieven 
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 516db737d1..cfda57e825 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3058,6 +3058,7 @@ F: block/vmdk.c
 
 RBD
 M: Ilya Dryomov 
+R: Peter Lieven 
 L: qemu-bl...@nongnu.org
 S: Supported
 F: block/rbd.c
-- 
2.17.1





Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-06 Thread Peter Lieven



> On 06.07.2021 at 17:25, Kevin Wolf wrote:
> 
> On 06.07.2021 at 16:55, Peter Lieven wrote:
>>>> On 06.07.2021 at 15:19, Kevin Wolf wrote:
>>> 
>>> On 02.07.2021 at 19:23, Ilya Dryomov wrote:
>>>> This series migrates the qemu rbd driver from the old aio emulation
>>>> to native coroutines and adds write zeroes support which is important
>>>> for block operations.
>>>> 
>>>> To achieve this we first bump the librbd requirement to the already
>>>> outdated luminous release of ceph to get rid of some wrappers and
>>>> ifdef'ry in the code.
>>> 
>>> Thanks, applied to the block branch.
>>> 
>>> I've only had a very quick look at the patches, but I think there is one
>>> suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
>>> indirection is probably unnecessary now because aio_co_wake() is thread
>>> safe.
>> 
>> But this is new, isn’t it?
> 
> Not sure in what sense you mean. aio_co_wake() has always been thread
> safe, as far as I know.
> 
> Obviously, the old code didn't use aio_co_wake(), but directly called
> some callbacks, so the part that is new is your coroutine conversion
> that enables getting rid of the BH.
> 
>> We also have this indirection in iscsi and nfs drivers I think.
> 
> Indeed, the resulting code looks the same. In iscsi and nfs it doesn't
> come from an incomplete conversion to coroutines, but they both used
> qemu_coroutine_enter() originally, which resumes the coroutine in the
> current thread...

If I remember correctly, this would also serialize requests, and thus we used
BHs. libnfs and libiscsi are not thread-safe either, and they run entirely in
qemu's threads, so this wasn't the original reason.

Thanks for the hints to the relevant commit.

I will send a follow up for rbd/nfs/iscsi in about 2 weeks.

Peter

> 
>> Does it matter that the completion callback is called from an librbd
>> thread? Will the coroutine continue to run in the right thread?
> 
> ...whereas aio_co_wake() resumes the coroutine in the thread where it
> was running before.
> 
> (Before commit 5f50be9b5, this would have been buggy from an librbd
> thread, but now it should work correctly even for threads that are
> neither iothreads nor vcpu threads.)
> 
>> I will have a decent look after my vacation.
> 
> Sounds good, thanks. Enjoy your vacation!
> 
> Kevin
> 





Re: [PATCH v5 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-06 Thread Peter Lieven



> On 06.07.2021 at 15:19, Kevin Wolf wrote:
> 
> On 02.07.2021 at 19:23, Ilya Dryomov wrote:
>> This series migrates the qemu rbd driver from the old aio emulation
>> to native coroutines and adds write zeroes support which is important
>> for block operations.
>> 
>> To achieve this we first bump the librbd requirement to the already
>> outdated luminous release of ceph to get rid of some wrappers and
>> ifdef'ry in the code.
> 
> Thanks, applied to the block branch.
> 
> I've only had a very quick look at the patches, but I think there is one
> suggestion for a cleanup I can make: The qemu_rbd_finish_bh()
> indirection is probably unnecessary now because aio_co_wake() is thread
> safe.

But this is new, isn’t it?

We also have this indirection in iscsi and nfs drivers I think.

Does it matter that the completion callback is called from an librbd thread? 
Will the coroutine continue to run in the right thread?

I will have a decent look after my vacation.

Anyway, Thanks for applying,
Peter

> 
> (Also, if I were the responsible maintainer, I would prefer true/false
> rather than 0/1 for bools, but that's minor. :-))
> 
> Kevin
> 





Re: [PATCH V4 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-02 Thread Peter Lieven



> On 02.07.2021 at 14:46, Ilya Dryomov wrote:
> 
> On Fri, Jul 2, 2021 at 11:09 AM Peter Lieven  wrote:
>> 
>> this series migrates the qemu rbd driver from the old aio emulation
>> to native coroutines and adds write zeroes support which is important
>> for block operations.
>> 
>> To achieve this we first bump the librbd requirement to the already
>> outdated luminous release of ceph to get rid of some wrappers and
>> ifdef'ry in the code.
>> 
>> V3->V4:
>> - this patch is now rebased on top of current master
>> - Patch 1: just mention librbd, tweak version numbers [Ilya]
>> - Patch 3: use rbd_get_size instead of rbd_stat [Ilya]
>> - Patch 4: retain comment about using a BH in the callback [Ilya]
>> - Patch 5: set BDRV_REQ_NO_FALLBACK and silently ignore BDRV_REQ_MAY_UNMAP 
>> [Ilya]
>> 
>> V2->V3:
>> - this patch is now rebased on top of current master
>> - Patch 1: only use cc.links and not cc.run to not break
>>   cross-compiling. [Kevin]
>>   Since Qemu 6.1 it's okay to rely on librbd >= 12.x since RHEL-7
>>   support was dropped [Daniel]
>> - Patch 4: dropped
>> - Patch 5: store BDS in RBDTask and use bdrv_get_aio_context() [Kevin]
>> 
>> V1->V2:
>> - this patch is now rebased on top of current master with Paolos
>>   upcoming fixes for the meson.build script included:
>>- meson: accept either shared or static libraries if --disable-static
>>- meson: honor --enable-rbd if cc.links test fails
>> - Patch 1: adjusted to meson.build script
>> - Patch 2: unchanged
>> - Patch 3: new patch
>> - Patch 4: do not implement empty detach_aio_context callback [Jason]
>> - Patch 5: - fix aio completion cleanup in error case [Jason]
>>- return error codes from librbd
>> - Patch 6: - add support for thick provisioning [Jason]
>>- do not set write zeroes alignment
>> - Patch 7: new patch
>> 
>> Peter Lieven (6):
>>  block/rbd: bump librbd requirement to luminous release
>>  block/rbd: store object_size in BDRVRBDState
>>  block/rbd: update s->image_size in qemu_rbd_getlength
>>  block/rbd: migrate from aio to coroutines
>>  block/rbd: add write zeroes support
>>  block/rbd: drop qemu_rbd_refresh_limits
>> 
>> block/rbd.c | 406 
>> meson.build |   7 +-
>> 2 files changed, 128 insertions(+), 285 deletions(-)
>> 
>> --
>> 2.17.1
>> 
>> 
> 
> Looks good to me!
> 
> Kevin picked up Or's encryption patch, so there are a few simple
> conflicts with https://repo.or.cz/qemu/kevin.git block now.  Do you
> want to rebase on top of Kevin's block branch and repost with
> "Based-on: <20210627114635.39326-1-...@il.ibm.com>" or some such in
> the cover letter or should I?
> 

Please do, I am already out of office and off on vacation. I wasn’t aware of a
conflict in Kevin’s git repo, sorry.

Peter

> Thanks,
> 
>Ilya





[PATCH V4 5/6] block/rbd: add write zeroes support

2021-07-02 Thread Peter Lieven
This patch deliberately sets BDRV_REQ_NO_FALLBACK and silently ignores
BDRV_REQ_MAY_UNMAP for older librbd versions.

The rationale for this is as follows (citing Ilya Dryomov, the current RBD
maintainer):
---8<---
a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
   and as a consequence always unmap if librbd is too old

   It's not clear what qemu's expectation is but in general Write
   Zeroes is allowed to unmap.  The only guarantee is that subsequent
   reads return zeroes, everything else is a hint.  This is how it is
   specified in the kernel and in the NVMe spec.

   In particular, block/nvme.c implements it as follows:

   if (flags & BDRV_REQ_MAY_UNMAP) {
   cdw12 |= (1 << 25);
   }

   This sets the Deallocate bit.  But if it's not set, the device may
   still deallocate:

   """
   If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
   command, and the namespace supports clearing all bytes to 0h in the
   values read (e.g., bits 2:0 in the DLFEAT field are set to 001b)
   from a deallocated logical block and its metadata (excluding
   protection information), then for each specified logical block, the
   controller:
   - should deallocate that logical block;

   ...

   If the Deallocate bit is cleared to '0' in a Write Zeroes command,
   and the namespace supports clearing all bytes to 0h in the values
   read (e.g., bits 2:0 in the DLFEAT field are set to 001b) from
   a deallocated logical block and its metadata (excluding protection
   information), then, for each specified logical block, the
   controller:
   - may deallocate that logical block;
   """

   
https://nvmexpress.org/wp-content/uploads/NVM-Express-NVM-Command-Set-Specification-2021.06.02-Ratified-1.pdf

b) set BDRV_REQ_NO_FALLBACK in supported_zero_flags

   Again, it's not clear what qemu expects here, but without it we end
   up in a ridiculous situation where specifying the "don't allow slow
   fallback" switch immediately fails all efficient zeroing requests on
   a device where Write Zeroes is always efficient:

   $ qemu-io -c 'help write' | grep -- '-[zun]'
-n, -- with -z, don't allow slow fallback
-u, -- with -z, allow unmapping
-z, -- write zeroes using blk_co_pwrite_zeroes

   $ qemu-io -f rbd -c 'write -z -u -n 0 1M' rbd:foo/bar
   write failed: Operation not supported
--->8---

Signed-off-by: Peter Lieven 
---
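
The flag handling argued for above boils down to a small mapping from the request flags to the librbd write-zeroes flags. A minimal sketch of that mapping, under stated assumptions: the DEMO_ constants are illustrative placeholders (the real values come from the QEMU and librbd headers), and have_thick_provision stands in for the compile-time RBD_WRITE_ZEROES_FLAG_THICK_PROVISION check in the patch.

```c
#include <assert.h>

/* Illustrative placeholder values, not taken from any header. */
#define DEMO_REQ_MAY_UNMAP               0x4u
#define DEMO_REQ_NO_FALLBACK             0x100u
#define DEMO_ZEROES_FLAG_THICK_PROVISION 0x1u

/*
 * Map request flags to write-zeroes flags.
 * - If the caller did not allow unmapping and thick provisioning is
 *   available, ask for thick provisioning.
 * - If thick provisioning is not available (older librbd), the missing
 *   MAY_UNMAP is silently ignored and the backend may unmap.
 * - NO_FALLBACK needs no translation because zeroing is always efficient.
 */
static unsigned demo_zero_flags(unsigned req_flags, int have_thick_provision)
{
    unsigned zero_flags = 0;

    if (have_thick_provision && !(req_flags & DEMO_REQ_MAY_UNMAP)) {
        zero_flags |= DEMO_ZEROES_FLAG_THICK_PROVISION;
    }
    return zero_flags;
}
```

The only guarantee the caller relies on is that subsequent reads return zeroes; whether the range stays allocated is treated as a hint.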
 block/rbd.c | 32 +++-
 1 file changed, 31 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index be0471944a..149317d33c 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -63,7 +63,8 @@ typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
 RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
 } RBDAIOCmd;
 
 typedef struct BDRVRBDState {
@@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP | BDRV_REQ_NO_FALLBACK;
+#endif
+
 /* When extending regular files, we get zeros from the OS */
 bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
 
@@ -827,6 +832,18 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState 
*bs,
 case RBD_AIO_FLUSH:
 r = rbd_aio_flush(s->image, c);
 break;
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+case RBD_AIO_WRITE_ZEROES: {
+int zero_flags = 0;
+#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
+}
+#endif
+r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
+break;
+}
+#endif
 default:
 r = -EINVAL;
 }
@@ -897,6 +914,16 @@ static int coroutine_fn 
qemu_rbd_co_pdiscard(BlockDriverState *bs,
 return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+static int
+coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+  int count, BdrvRequestFlags flags)
+{
+return qemu_rbd_start_co(bs, offset, count, NULL, flags,
+ RBD_AIO_WRITE_ZEROES);
+}
+#endif
+
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
@@ -1120,6 +1147,9 @@ static BlockDriver bdrv_rbd = {
 .bdrv_co_pwritev= qemu_rbd_co_pwritev,
 .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
 .bdrv_co_pdiscard   = qemu_rbd_co_pdiscard,
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+.bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
+#endif
 
 .bdrv_snapshot_create   = qemu_rbd_snap_create,
 .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
-- 
2.17.1





[PATCH V4 6/6] block/rbd: drop qemu_rbd_refresh_limits

2021-07-02 Thread Peter Lieven
librbd supports 1 byte alignment for all aio operations.

Currently, there is no API call to query limits from the ceph backend.
So drop the bdrv_refresh_limits completely until there is such an API call.

Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
---
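
To illustrate why the 512-byte request alignment matters: a request that is not aligned has to be widened to alignment boundaries before it reaches the driver, while an alignment of 1 passes it through unchanged. A minimal sketch of that widening, assuming a simple round-down/round-up scheme (demo_align_request and struct demo_span are hypothetical names for this example, not QEMU functions):

```c
#include <assert.h>
#include <stdint.h>

struct demo_span {
    uint64_t start;
    uint64_t end;   /* exclusive */
};

/* Widen [offset, offset + bytes) to alignment boundaries; align must be > 0. */
static struct demo_span demo_align_request(uint64_t offset, uint64_t bytes,
                                           uint64_t align)
{
    struct demo_span s;
    uint64_t end = offset + bytes;

    s.start = offset - (offset % align);                       /* round down */
    s.end = (end % align) ? end + (align - end % align) : end; /* round up */
    return s;
}
```

With an alignment of 512, a 50-byte request at offset 100 is widened to the range 0..512; with an alignment of 1 it stays at 100..150.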
 block/rbd.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 149317d33c..93f4bc8b93 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -228,14 +228,6 @@ done:
 return;
 }
 
-
-static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
-{
-/* XXX Does RBD support AIO on less than 512-byte alignment? */
-bs->bl.request_alignment = 512;
-}
-
-
 static int qemu_rbd_set_auth(rados_t cluster, BlockdevOptionsRbd *opts,
  Error **errp)
 {
@@ -1130,7 +1122,6 @@ static BlockDriver bdrv_rbd = {
 .format_name= "rbd",
 .instance_size  = sizeof(BDRVRBDState),
 .bdrv_parse_filename= qemu_rbd_parse_filename,
-.bdrv_refresh_limits= qemu_rbd_refresh_limits,
 .bdrv_file_open = qemu_rbd_open,
 .bdrv_close = qemu_rbd_close,
 .bdrv_reopen_prepare= qemu_rbd_reopen_prepare,
-- 
2.17.1





[PATCH V4 1/6] block/rbd: bump librbd requirement to luminous release

2021-07-02 Thread Peter Lieven
Even luminous (version 12.2) has been unmaintained for over 3 years now.
Bump the requirement to get rid of the ifdef'ry in the code.
Qemu 6.1 dropped support for RHEL-7, which was the last supported
OS that required an older librbd.

Signed-off-by: Peter Lieven 
---
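
The checks removed by this patch, such as LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2), compare packed version numbers. A minimal sketch of that style of comparison, assuming one byte each for the minor and extra components; the DEMO_ names and the exact packing are assumptions for illustration and are not copied from librbd.h:

```c
#include <assert.h>

/* Hypothetical packing: major in the upper bits, then one byte each for
 * minor and extra, so versions compare correctly as plain integers. */
#define DEMO_VERSION(maj, min, extra) (((maj) << 16) + ((min) << 8) + (extra))

/* Return 1 if the packed version 'have' is at least maj.min.extra. */
static int demo_version_at_least(int have, int maj, int min, int extra)
{
    return have >= DEMO_VERSION(maj, min, extra);
}
```

Once the minimum required version is raised at build time, every such per-feature check below that minimum is always true and the corresponding #else branches become dead code, which is what allows the wrappers to be dropped.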
 block/rbd.c | 120 
 meson.build |   7 ++-
 2 files changed, 13 insertions(+), 114 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 26f64cce7c..6b1cbe1d75 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -55,24 +55,10 @@
  * leading "\".
  */
 
-/* rbd_aio_discard added in 0.1.2 */
-#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
-#define LIBRBD_SUPPORTS_DISCARD
-#else
-#undef LIBRBD_SUPPORTS_DISCARD
-#endif
-
 #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
 
 #define RBD_MAX_SNAPS 100
 
-/* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
-#ifdef LIBRBD_SUPPORTS_IOVEC
-#define LIBRBD_USE_IOVEC 1
-#else
-#define LIBRBD_USE_IOVEC 0
-#endif
-
 typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
@@ -84,7 +70,6 @@ typedef struct RBDAIOCB {
 BlockAIOCB common;
 int64_t ret;
 QEMUIOVector *qiov;
-char *bounce;
 RBDAIOCmd cmd;
 int error;
 struct BDRVRBDState *s;
@@ -94,7 +79,6 @@ typedef struct RADOSCB {
 RBDAIOCB *acb;
 struct BDRVRBDState *s;
 int64_t size;
-char *buf;
 int64_t ret;
 } RADOSCB;
 
@@ -342,13 +326,9 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
char *keypairs_json,
 
 static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
 {
-if (LIBRBD_USE_IOVEC) {
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-} else {
-memset(rcb->buf + offs, 0, rcb->size - offs);
-}
+RBDAIOCB *acb = rcb->acb;
+iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
+   acb->qiov->size - offs);
 }
 
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
@@ -504,13 +484,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 
 g_free(rcb);
 
-if (!LIBRBD_USE_IOVEC) {
-if (acb->cmd == RBD_AIO_READ) {
-qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
-}
-qemu_vfree(acb->bounce);
-}
-
 acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 
 qemu_aio_unref(acb);
@@ -878,28 +851,6 @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB 
*rcb)
  rbd_finish_bh, rcb);
 }
 
-static int rbd_aio_discard_wrapper(rbd_image_t image,
-   uint64_t off,
-   uint64_t len,
-   rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_DISCARD
-return rbd_aio_discard(image, off, len, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
-static int rbd_aio_flush_wrapper(rbd_image_t image,
- rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_AIO_FLUSH
-return rbd_aio_flush(image, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
 static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
  int64_t off,
  QEMUIOVector *qiov,
@@ -922,21 +873,6 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 
 rcb = g_new(RADOSCB, 1);
 
-if (!LIBRBD_USE_IOVEC) {
-if (cmd == RBD_AIO_DISCARD || cmd == RBD_AIO_FLUSH) {
-acb->bounce = NULL;
-} else {
-acb->bounce = qemu_try_blockalign(bs, qiov->size);
-if (acb->bounce == NULL) {
-goto failed;
-}
-}
-if (cmd == RBD_AIO_WRITE) {
-qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
-}
-rcb->buf = acb->bounce;
-}
-
 acb->ret = 0;
 acb->error = 0;
 acb->s = s;
@@ -950,7 +886,7 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 }
 
 switch (cmd) {
-case RBD_AIO_WRITE: {
+case RBD_AIO_WRITE:
 /*
  * RBD APIs don't allow us to write more than actual size, so in order
  * to support growing images, we resize the image before write
@@ -962,25 +898,16 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 goto failed_completion;
 }
 }
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_write(s->image, off, size, rcb->buf, c);
-#endif
+r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
 break;
-}
 case RBD_AIO_READ:
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_r

[PATCH V4 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-07-02 Thread Peter Lieven
This series migrates the qemu rbd driver from the old aio emulation
to native coroutines and adds write zeroes support, which is important
for block operations.

To achieve this we first bump the librbd requirement to the already
outdated luminous release of ceph to get rid of some wrappers and
ifdef'ry in the code.
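
As a rough illustration of why a single minimum version removes the per-feature ifdefs: once one version gate passes, every older feature check is implied. The sketch below is self-contained and does not use librbd.h. The SKETCH_* names, the bit packing, and the 1.12.0 threshold are assumptions for illustration only, modelled on the LIBRBD_VERSION(maj, min, extra) comparisons that patch 1 removes.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical stand-in for the version macro in librbd.h.
 * The packing (major << 16, minor << 8, extra) is an assumption.
 */
#define SKETCH_VERSION(maj, min, extra) \
    (((maj) << 16) + ((min) << 8) + (extra))

/* Illustrative minimum version; not taken from this series. */
#define SKETCH_MIN_VERSION SKETCH_VERSION(1, 12, 0)

/*
 * Return true if the given library version satisfies the single
 * minimum requirement, so no per-feature checks are needed later.
 */
static bool sketch_version_ok(int maj, int min, int extra)
{
    assert(maj >= 0 && min >= 0 && min < 256 && extra >= 0 && extra < 256);
    return SKETCH_VERSION(maj, min, extra) >= SKETCH_MIN_VERSION;
}
```

With a gate like this evaluated once at configure time, checks such as the old 0.1.2 test for discard support can never fail and can simply be dropped.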

V3->V4:
 - this patch is now rebased on top of current master
 - Patch 1: just mention librbd, tweak version numbers [Ilya]
 - Patch 3: use rbd_get_size instead of rbd_stat [Ilya]
 - Patch 4: retain comment about using a BH in the callback [Ilya]
 - Patch 5: set BDRV_REQ_NO_FALLBACK and silently ignore BDRV_REQ_MAY_UNMAP 
[Ilya]

V2->V3:
 - this patch is now rebased on top of current master
 - Patch 1: only use cc.links and not cc.run to not break
   cross-compiling. [Kevin]
   Since Qemu 6.1 it's okay to rely on librbd >= 12.x since RHEL-7
   support was dropped [Daniel]
 - Patch 4: dropped
 - Patch 5: store BDS in RBDTask and use bdrv_get_aio_context() [Kevin]

V1->V2:
 - this patch is now rebased on top of current master with Paolo's
   upcoming fixes for the meson.build script included:
- meson: accept either shared or static libraries if --disable-static
- meson: honor --enable-rbd if cc.links test fails
 - Patch 1: adjusted to meson.build script
 - Patch 2: unchanged
 - Patch 3: new patch
 - Patch 4: do not implement empty detach_aio_context callback [Jason]
 - Patch 5: - fix aio completion cleanup in error case [Jason]
- return error codes from librbd
 - Patch 6: - add support for thick provisioning [Jason]
- do not set write zeroes alignment
 - Patch 7: new patch

Peter Lieven (6):
  block/rbd: bump librbd requirement to luminous release
  block/rbd: store object_size in BDRVRBDState
  block/rbd: update s->image_size in qemu_rbd_getlength
  block/rbd: migrate from aio to coroutines
  block/rbd: add write zeroes support
  block/rbd: drop qemu_rbd_refresh_limits

 block/rbd.c | 406 
 meson.build |   7 +-
 2 files changed, 128 insertions(+), 285 deletions(-)

-- 
2.17.1





[PATCH V4 2/6] block/rbd: store object_size in BDRVRBDState

2021-07-02 Thread Peter Lieven
Signed-off-by: Peter Lieven 
Reviewed-by: Ilya Dryomov 
---
 block/rbd.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 6b1cbe1d75..b4caea4f1b 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -90,6 +90,7 @@ typedef struct BDRVRBDState {
 char *snap;
 char *namespace;
 uint64_t image_size;
+uint64_t object_size;
 } BDRVRBDState;
 
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
@@ -675,6 +676,7 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 const QDictEntry *e;
 Error *local_err = NULL;
 char *keypairs, *secretid;
+rbd_image_info_t info;
 int r;
 
 keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
@@ -739,13 +741,15 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 goto failed_open;
 }
 
-r = rbd_get_size(s->image, &s->image_size);
+r = rbd_stat(s->image, &info, sizeof(info));
 if (r < 0) {
-error_setg_errno(errp, -r, "error getting image size from %s",
+error_setg_errno(errp, -r, "error getting image info from %s",
  s->image_name);
 rbd_close(s->image);
 goto failed_open;
 }
+s->image_size = info.size;
+s->object_size = info.obj_size;
 
 /* If we are using an rbd snapshot, we must be r/o, otherwise
  * leave as-is */
@@ -957,15 +961,7 @@ static BlockAIOCB *qemu_rbd_aio_flush(BlockDriverState *bs,
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
-rbd_image_info_t info;
-int r;
-
-r = rbd_stat(s->image, &info, sizeof(info));
-if (r < 0) {
-return r;
-}
-
-bdi->cluster_size = info.obj_size;
+bdi->cluster_size = s->object_size;
 return 0;
 }
 
-- 
2.17.1





[PATCH V4 4/6] block/rbd: migrate from aio to coroutines

2021-07-02 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 252 +++-
 1 file changed, 90 insertions(+), 162 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 1f8dc84079..be0471944a 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -66,22 +66,6 @@ typedef enum {
 RBD_AIO_FLUSH
 } RBDAIOCmd;
 
-typedef struct RBDAIOCB {
-BlockAIOCB common;
-int64_t ret;
-QEMUIOVector *qiov;
-RBDAIOCmd cmd;
-int error;
-struct BDRVRBDState *s;
-} RBDAIOCB;
-
-typedef struct RADOSCB {
-RBDAIOCB *acb;
-struct BDRVRBDState *s;
-int64_t size;
-int64_t ret;
-} RADOSCB;
-
 typedef struct BDRVRBDState {
 rados_t cluster;
 rados_ioctx_t io_ctx;
@@ -93,6 +77,13 @@ typedef struct BDRVRBDState {
 uint64_t object_size;
 } BDRVRBDState;
 
+typedef struct RBDTask {
+BlockDriverState *bs;
+Coroutine *co;
+bool complete;
+int64_t ret;
+} RBDTask;
+
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
 BlockdevOptionsRbd *opts, bool cache,
 const char *keypairs, const char *secretid,
@@ -325,13 +316,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
char *keypairs_json,
 return ret;
 }
 
-static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
-{
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-}
-
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
 static int qemu_rbd_do_create(BlockdevCreateOptions *options,
   const char *keypairs, const char 
*password_secret,
@@ -450,46 +434,6 @@ exit:
 return ret;
 }
 
-/*
- * This aio completion is being called from rbd_finish_bh() and runs in qemu
- * BH context.
- */
-static void qemu_rbd_complete_aio(RADOSCB *rcb)
-{
-RBDAIOCB *acb = rcb->acb;
-int64_t r;
-
-r = rcb->ret;
-
-if (acb->cmd != RBD_AIO_READ) {
-if (r < 0) {
-acb->ret = r;
-acb->error = 1;
-} else if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else {
-if (r < 0) {
-qemu_rbd_memset(rcb, 0);
-acb->ret = r;
-acb->error = 1;
-} else if (r < rcb->size) {
-qemu_rbd_memset(rcb, r);
-if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else if (!acb->error) {
-acb->ret = r;
-}
-}
-
-g_free(rcb);
-
-acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
-
-qemu_aio_unref(acb);
-}
-
 static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
 {
 const char **vals;
@@ -826,89 +770,59 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t 
size)
 return 0;
 }
 
-static const AIOCBInfo rbd_aiocb_info = {
-.aiocb_size = sizeof(RBDAIOCB),
-};
-
-static void rbd_finish_bh(void *opaque)
+static void qemu_rbd_finish_bh(void *opaque)
 {
-RADOSCB *rcb = opaque;
-qemu_rbd_complete_aio(rcb);
+RBDTask *task = opaque;
+task->complete = 1;
+aio_co_wake(task->co);
 }
 
 /*
- * This is the callback function for rbd_aio_read and _write
+ * This is the completion callback function for all rbd aio calls
+ * started from qemu_rbd_start_co().
  *
  * Note: this function is being called from a non qemu thread so
  * we need to be careful about what we do here. Generally we only
  * schedule a BH, and do the rest of the io completion handling
- * from rbd_finish_bh() which runs in a qemu context.
+ * from qemu_rbd_finish_bh() which runs in a qemu context.
  */
-static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
+static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
-RBDAIOCB *acb = rcb->acb;
-
-rcb->ret = rbd_aio_get_return_value(c);
+task->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
-
-replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
- rbd_finish_bh, rcb);
+aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
+qemu_rbd_finish_bh, task);
 }
 
-static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
- int64_t off,
- QEMUIOVector *qiov,
- int64_t size,
- BlockCompletionFunc *cb,
- void *opaque,
- RBDAIOCmd cmd)
+static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
+  uint64_t offset,
+  uint64_t bytes,
+  QEMUIOVector *qiov,
+  in

[PATCH V4 3/6] block/rbd: update s->image_size in qemu_rbd_getlength

2021-07-02 Thread Peter Lieven
While at it, just call rbd_get_size and avoid rbd_stat.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index b4caea4f1b..1f8dc84079 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -968,15 +968,14 @@ static int qemu_rbd_getinfo(BlockDriverState *bs, 
BlockDriverInfo *bdi)
 static int64_t qemu_rbd_getlength(BlockDriverState *bs)
 {
 BDRVRBDState *s = bs->opaque;
-rbd_image_info_t info;
 int r;
 
-r = rbd_stat(s->image, &info, sizeof(info));
+r = rbd_get_size(s->image, &s->image_size);
 if (r < 0) {
 return r;
 }
 
-return info.size;
+return s->image_size;
 }
 
 static int coroutine_fn qemu_rbd_co_truncate(BlockDriverState *bs,
-- 
2.17.1





Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-27 Thread Peter Lieven



> On 26.06.2021 at 17:57, Ilya Dryomov wrote:
> 
> On Mon, Jun 21, 2021 at 10:49 AM Peter Lieven  wrote:
>> 
>>> On 18.06.21 at 12:34, Ilya Dryomov wrote:
>>> On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven  wrote:
>>>> On 16.06.21 at 14:34, Ilya Dryomov wrote:
>>>>> On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:
>>>>>> Signed-off-by: Peter Lieven 
>>>>>> ---
>>>>>>  block/rbd.c | 37 -
>>>>>>  1 file changed, 36 insertions(+), 1 deletion(-)
>>>>>> 
>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>> index 0d8612a988..ee13f08a74 100644
>>>>>> --- a/block/rbd.c
>>>>>> +++ b/block/rbd.c
>>>>>> @@ -63,7 +63,8 @@ typedef enum {
>>>>>>  RBD_AIO_READ,
>>>>>>  RBD_AIO_WRITE,
>>>>>>  RBD_AIO_DISCARD,
>>>>>> -RBD_AIO_FLUSH
>>>>>> +RBD_AIO_FLUSH,
>>>>>> +RBD_AIO_WRITE_ZEROES
>>>>>>  } RBDAIOCmd;
>>>>>> 
>>>>>>  typedef struct BDRVRBDState {
>>>>>> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, 
>>>>>> QDict *options, int flags,
>>>>>>  }
>>>>>>  }
>>>>>> 
>>>>>> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>>>>>> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
>>>>> I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
>>>>> does not really have a notion of non-efficient explicit zeroing.
>>>> 
>>>> This is only true if thick provisioning is supported which is in Octopus 
>>>> onwards, right?
>>> Since Pacific, I think.
>>> 
>>>> So it would only be correct to set this if thick provisioning is supported;
>>>> otherwise we could fail with ENOTSUP and then qemu emulates the zeroing
>>>> with plain writes.
>>> I actually had a question about that.  Why are you returning ENOTSUP
>>> in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
>>> because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?
>>> 
>>> My understanding has always been that BDRV_REQ_MAY_UNMAP is just
>>> a hint.  Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
>>> but should be perfectly acceptable.  It is certainly better than
>>> returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
>>> zeroing.
>> 
>> 
>> I think this was introduced to support different provisioning modes. If
>> BDRV_REQ_MAY_UNMAP is not set, the caller of bdrv_write_zeroes expects that
>> the driver does thick provisioning. If the driver cannot handle that
>> (efficiently), qemu does a plain zero write.
>> 
>> I am still not fully understanding the meaning of the BDRV_REQ_NO_FALLBACK
>> flag. The original commit states that it was introduced for qemu-img to
>> efficiently zero out the target and avoid the slow fallback. When I last
>> worked on qemu-img convert I remember that there was a call to zero out the
>> target if bdrv_has_zero_init is not 1. It seems that meanwhile a
>> target_is_zero cmdline switch for qemu-img convert was added to let the user
>> assert that a preexisting target is zero.
>> 
>> Maybe someone can help here if it would be right to set BDRV_REQ_NO_FALLBACK
>> for rbd in either of the 2 cases (thick provisioning is supported or not)?
> 
> Since no one spoke up I think we should
> 
> a) remove the BDRV_REQ_MAY_UNMAP check in qemu_rbd_co_pwrite_zeroes()
>   and as a consequence always unmap if librbd is too old
> 
>   It's not clear what qemu's expectation is but in general Write
>   Zeroes is allowed to unmap.  The only guarantee is that subsequent
>   reads return zeroes, everything else is a hint.  This is how it is
>   specified in the kernel and in the NVMe spec.
> 
>   In particular, block/nvme.c implements it as follows:
> 
>   if (flags & BDRV_REQ_MAY_UNMAP) {
>   cdw12 |= (1 << 25);
>   }
> 
>   This sets the Deallocate bit.  But if it's not set, the device may
>   still deallocate:
> 
>   """
>   If the Deallocate bit (CDW12.DEAC) is set to '1' in a Write Zeroes
>   command, and the namespace supports clearing all bytes to 0h in the
>   values read (e.g., bits 2:0

Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-21 Thread Peter Lieven

On 18.06.21 at 12:34, Ilya Dryomov wrote:

On Fri, Jun 18, 2021 at 11:00 AM Peter Lieven  wrote:

On 16.06.21 at 14:34, Ilya Dryomov wrote:

On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:

Signed-off-by: Peter Lieven 
---
  block/rbd.c | 37 -
  1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 0d8612a988..ee13f08a74 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -63,7 +63,8 @@ typedef enum {
  RBD_AIO_READ,
  RBD_AIO_WRITE,
  RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
  } RBDAIOCmd;

  typedef struct BDRVRBDState {
@@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
  }
  }

+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;

I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
does not really have a notion of non-efficient explicit zeroing.


This is only true if thick provisioning is supported which is in Octopus 
onwards, right?

Since Pacific, I think.


So it would only be correct to set this if thick provisioning is supported;
otherwise we could fail with ENOTSUP and then qemu emulates the zeroing with
plain writes.

I actually had a question about that.  Why are you returning ENOTSUP
in case BDRV_REQ_MAY_UNMAP is not specified and that can't be fulfilled
because librbd is too old for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION?

My understanding has always been that BDRV_REQ_MAY_UNMAP is just
a hint.  Deallocating if BDRV_REQ_MAY_UNMAP is specified is not nice
but should be perfectly acceptable.  It is certainly better than
returning ENOTSUP, particularly if ENOTSUP causes Qemu to do plain
zeroing.



I think this was introduced to support different provisioning modes. If
BDRV_REQ_MAY_UNMAP is not set, the caller of bdrv_write_zeroes expects that
the driver does thick provisioning. If the driver cannot handle that
(efficiently), qemu does a plain zero write.


I am still not fully understanding the meaning of the BDRV_REQ_NO_FALLBACK
flag. The original commit states that it was introduced for qemu-img to
efficiently zero out the target and avoid the slow fallback. When I last
worked on qemu-img convert I remember that there was a call to zero out the
target if bdrv_has_zero_init is not 1. It seems that meanwhile a
target_is_zero cmdline switch for qemu-img convert was added to let the user
assert that a preexisting target is zero.

Maybe someone can help here if it would be right to set BDRV_REQ_NO_FALLBACK
for rbd in either of the 2 cases (thick provisioning is supported or not)?
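
To make the behaviour under discussion concrete, here is a minimal self-contained sketch of the flag decision as the V3 patch implements it. It is not the driver code: the SK_* constants and sk_zero_flags() are hypothetical names, and the compile-time check for RBD_WRITE_ZEROES_FLAG_THICK_PROVISION is modelled as a runtime boolean so both cases can be exercised.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values for this sketch only. */
#define SK_REQ_MAY_UNMAP           (1 << 0)
#define SK_ZEROES_THICK_PROVISION  (1 << 0)

/*
 * Decide which write-zeroes flags to hand to the library.
 * Returns 0 and fills *zero_flags on success, or -ENOTSUP when the
 * caller did not allow unmapping and thick provisioning is unavailable.
 */
static int sk_zero_flags(int req_flags, bool have_thick, int *zero_flags)
{
    assert(zero_flags != NULL);
    *zero_flags = 0;

    if (req_flags & SK_REQ_MAY_UNMAP) {
        return 0;                /* deallocating is permitted */
    }
    if (!have_thick) {
        return -ENOTSUP;         /* generic layer falls back to plain writes */
    }
    *zero_flags = SK_ZEROES_THICK_PROVISION;
    return 0;
}
```

Ilya's proposal would amount to removing the -ENOTSUP branch, so that an old librbd always unmaps regardless of the request flags.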


Thanks

Peter







Re: [PATCH V3 4/6] block/rbd: migrate from aio to coroutines

2021-06-18 Thread Peter Lieven
On 17.06.21 at 16:43, Ilya Dryomov wrote:
> On Wed, May 19, 2021 at 4:27 PM Peter Lieven  wrote:
>> Signed-off-by: Peter Lieven 
>> ---
>>  block/rbd.c | 255 ++--
>>  1 file changed, 87 insertions(+), 168 deletions(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index 97a2ae4c84..0d8612a988 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -66,22 +66,6 @@ typedef enum {
>>  RBD_AIO_FLUSH
>>  } RBDAIOCmd;
>>
>> -typedef struct RBDAIOCB {
>> -BlockAIOCB common;
>> -int64_t ret;
>> -QEMUIOVector *qiov;
>> -RBDAIOCmd cmd;
>> -int error;
>> -struct BDRVRBDState *s;
>> -} RBDAIOCB;
>> -
>> -typedef struct RADOSCB {
>> -RBDAIOCB *acb;
>> -struct BDRVRBDState *s;
>> -int64_t size;
>> -int64_t ret;
>> -} RADOSCB;
>> -
>>  typedef struct BDRVRBDState {
>>  rados_t cluster;
>>  rados_ioctx_t io_ctx;
>> @@ -93,6 +77,13 @@ typedef struct BDRVRBDState {
>>  uint64_t object_size;
>>  } BDRVRBDState;
>>
>> +typedef struct RBDTask {
>> +BlockDriverState *bs;
>> +Coroutine *co;
>> +bool complete;
>> +int64_t ret;
>> +} RBDTask;
>> +
>>  static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
>>  BlockdevOptionsRbd *opts, bool cache,
>>  const char *keypairs, const char *secretid,
>> @@ -325,13 +316,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
>> char *keypairs_json,
>>  return ret;
>>  }
>>
>> -static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
>> -{
>> -RBDAIOCB *acb = rcb->acb;
>> -iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
>> -   acb->qiov->size - offs);
>> -}
>> -
>>  /* FIXME Deprecate and remove keypairs or make it available in QMP. */
>>  static int qemu_rbd_do_create(BlockdevCreateOptions *options,
>>const char *keypairs, const char 
>> *password_secret,
>> @@ -450,46 +434,6 @@ exit:
>>  return ret;
>>  }
>>
>> -/*
>> - * This aio completion is being called from rbd_finish_bh() and runs in qemu
>> - * BH context.
>> - */
>> -static void qemu_rbd_complete_aio(RADOSCB *rcb)
>> -{
>> -RBDAIOCB *acb = rcb->acb;
>> -int64_t r;
>> -
>> -r = rcb->ret;
>> -
>> -if (acb->cmd != RBD_AIO_READ) {
>> -if (r < 0) {
>> -acb->ret = r;
>> -acb->error = 1;
>> -} else if (!acb->error) {
>> -acb->ret = rcb->size;
>> -}
>> -} else {
>> -if (r < 0) {
>> -qemu_rbd_memset(rcb, 0);
>> -acb->ret = r;
>> -acb->error = 1;
>> -} else if (r < rcb->size) {
>> -qemu_rbd_memset(rcb, r);
>> -if (!acb->error) {
>> -acb->ret = rcb->size;
>> -}
>> -} else if (!acb->error) {
>> -acb->ret = r;
>> -}
>> -}
>> -
>> -g_free(rcb);
>> -
>> -acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
>> -
>> -qemu_aio_unref(acb);
>> -}
>> -
>>  static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
>>  {
>>  const char **vals;
>> @@ -826,89 +770,50 @@ static int qemu_rbd_resize(BlockDriverState *bs, 
>> uint64_t size)
>>  return 0;
>>  }
>>
>> -static const AIOCBInfo rbd_aiocb_info = {
>> -.aiocb_size = sizeof(RBDAIOCB),
>> -};
>> -
>> -static void rbd_finish_bh(void *opaque)
>> +static void qemu_rbd_finish_bh(void *opaque)
>>  {
>> -RADOSCB *rcb = opaque;
>> -qemu_rbd_complete_aio(rcb);
>> +RBDTask *task = opaque;
>> +task->complete = 1;
>> +aio_co_wake(task->co);
>>  }
>>
>> -/*
>> - * This is the callback function for rbd_aio_read and _write
>> - *
>> - * Note: this function is being called from a non qemu thread so
>> - * we need to be careful about what we do here. Generally we only
>> - * schedule a BH, and do the rest of the io completion handling
>> - * from rbd_finish_bh() which runs in a qemu context.
>> - */
> I would adapt this comment ins

Re: [PATCH V3 5/6] block/rbd: add write zeroes support

2021-06-18 Thread Peter Lieven
On 16.06.21 at 14:34, Ilya Dryomov wrote:
> On Wed, May 19, 2021 at 4:28 PM Peter Lieven  wrote:
>> Signed-off-by: Peter Lieven 
>> ---
>>  block/rbd.c | 37 -
>>  1 file changed, 36 insertions(+), 1 deletion(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index 0d8612a988..ee13f08a74 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -63,7 +63,8 @@ typedef enum {
>>  RBD_AIO_READ,
>>  RBD_AIO_WRITE,
>>  RBD_AIO_DISCARD,
>> -RBD_AIO_FLUSH
>> +RBD_AIO_FLUSH,
>> +RBD_AIO_WRITE_ZEROES
>>  } RBDAIOCmd;
>>
>>  typedef struct BDRVRBDState {
>> @@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
>> *options, int flags,
>>  }
>>  }
>>
>> +#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
>> +bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
> I wonder if we should also set BDRV_REQ_NO_FALLBACK here since librbd
> does not really have a notion of non-efficient explicit zeroing.


This is only true if thick provisioning is supported which is in Octopus 
onwards, right?

So it would only be correct to set this if thick provisioning is supported;
otherwise we could fail with ENOTSUP and then qemu emulates the zeroing with
plain writes.
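
The fallback mentioned here can be pictured with a small self-contained sketch. It is only a model and not QEMU's generic block layer: the image is a byte array, and sk_zero_fn, sk_zero_unsupported() and sk_write_zeroes() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical driver hook: returns 0 on success or a negative errno. */
typedef int (*sk_zero_fn)(unsigned char *img, size_t off, size_t len);

/* Driver stub that has no efficient way to zero a range. */
static int sk_zero_unsupported(unsigned char *img, size_t off, size_t len)
{
    (void)img;
    (void)off;
    (void)len;
    return -ENOTSUP;
}

/*
 * Try the driver's efficient zeroing first. Only if it reports
 * -ENOTSUP, emulate the request with a plain write of zero bytes.
 */
static int sk_write_zeroes(unsigned char *img, size_t img_len,
                           size_t off, size_t len, sk_zero_fn fn)
{
    assert(img != NULL && off <= img_len && len <= img_len - off);

    int r = fn ? fn(img, off, len) : -ENOTSUP;
    if (r != -ENOTSUP) {
        return r;               /* success or a real error */
    }
    memset(img + off, 0, len);  /* stands in for writing a zero buffer */
    return 0;
}
```

A flag like BDRV_REQ_NO_FALLBACK would correspond to skipping the memset path and returning the -ENOTSUP to the caller instead.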


Peter






Re: [PATCH V3 1/6] block/rbd: bump librbd requirement to luminous release

2021-06-18 Thread Peter Lieven
On 16.06.21 at 14:26, Ilya Dryomov wrote:
> On Wed, May 19, 2021 at 4:26 PM Peter Lieven  wrote:
>> Even luminous (version 12.2) has been unmaintained for over 3 years now.
>> Bump the requirement to get rid of the ifdef'ry in the code.
>> Qemu 6.1 dropped the support for RHEL-7 which was the last supported
>> OS that required an older librbd.
>>
>> Signed-off-by: Peter Lieven 
>> ---
>>  block/rbd.c | 120 
>>  meson.build |   7 ++-
>>  2 files changed, 13 insertions(+), 114 deletions(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index 26f64cce7c..6b1cbe1d75 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -55,24 +55,10 @@
>>   * leading "\".
>>   */
>>
>> -/* rbd_aio_discard added in 0.1.2 */
>> -#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
>> -#define LIBRBD_SUPPORTS_DISCARD
>> -#else
>> -#undef LIBRBD_SUPPORTS_DISCARD
>> -#endif
>> -
>>  #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
>>
>>  #define RBD_MAX_SNAPS 100
>>
>> -/* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
>> -#ifdef LIBRBD_SUPPORTS_IOVEC
>> -#define LIBRBD_USE_IOVEC 1
>> -#else
>> -#define LIBRBD_USE_IOVEC 0
>> -#endif
>> -
>>  typedef enum {
>>  RBD_AIO_READ,
>>  RBD_AIO_WRITE,
>> @@ -84,7 +70,6 @@ typedef struct RBDAIOCB {
>>  BlockAIOCB common;
>>  int64_t ret;
>>  QEMUIOVector *qiov;
>> -char *bounce;
>>  RBDAIOCmd cmd;
>>  int error;
>>  struct BDRVRBDState *s;
>> @@ -94,7 +79,6 @@ typedef struct RADOSCB {
>>  RBDAIOCB *acb;
>>  struct BDRVRBDState *s;
>>  int64_t size;
>> -char *buf;
>>  int64_t ret;
>>  } RADOSCB;
>>
>> @@ -342,13 +326,9 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
>> char *keypairs_json,
>>
>>  static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
>>  {
>> -if (LIBRBD_USE_IOVEC) {
>> -RBDAIOCB *acb = rcb->acb;
>> -iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
>> -   acb->qiov->size - offs);
>> -} else {
>> -memset(rcb->buf + offs, 0, rcb->size - offs);
>> -}
>> +RBDAIOCB *acb = rcb->acb;
>> +iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
>> +   acb->qiov->size - offs);
>>  }
>>
>>  /* FIXME Deprecate and remove keypairs or make it available in QMP. */
>> @@ -504,13 +484,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
>>
>>  g_free(rcb);
>>
>> -if (!LIBRBD_USE_IOVEC) {
>> -if (acb->cmd == RBD_AIO_READ) {
>> -qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
>> -}
>> -qemu_vfree(acb->bounce);
>> -}
>> -
>>  acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
>>
>>  qemu_aio_unref(acb);
>> @@ -878,28 +851,6 @@ static void rbd_finish_aiocb(rbd_completion_t c, 
>> RADOSCB *rcb)
>>   rbd_finish_bh, rcb);
>>  }
>>
>> -static int rbd_aio_discard_wrapper(rbd_image_t image,
>> -   uint64_t off,
>> -   uint64_t len,
>> -   rbd_completion_t comp)
>> -{
>> -#ifdef LIBRBD_SUPPORTS_DISCARD
>> -return rbd_aio_discard(image, off, len, comp);
>> -#else
>> -return -ENOTSUP;
>> -#endif
>> -}
>> -
>> -static int rbd_aio_flush_wrapper(rbd_image_t image,
>> - rbd_completion_t comp)
>> -{
>> -#ifdef LIBRBD_SUPPORTS_AIO_FLUSH
>> -return rbd_aio_flush(image, comp);
>> -#else
>> -return -ENOTSUP;
>> -#endif
>> -}
>> -
>>  static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
>>   int64_t off,
>>   QEMUIOVector *qiov,
>> @@ -922,21 +873,6 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
>>
>>  rcb = g_new(RADOSCB, 1);
>>
>> -if (!LIBRBD_USE_IOVEC) {
>> -if (cmd == RBD_AIO_DISCARD || cmd == RBD_AIO_FLUSH) {
>> -acb->bounce = NULL;
>> -} else {
>> -acb->bounce = qemu_try_blockalign(bs, qiov->size);
>> -if (acb->bounce == NULL) {
>> - 

[PATCH V3 2/6] block/rbd: store object_size in BDRVRBDState

2021-05-19 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 6b1cbe1d75..b4caea4f1b 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -90,6 +90,7 @@ typedef struct BDRVRBDState {
 char *snap;
 char *namespace;
 uint64_t image_size;
+uint64_t object_size;
 } BDRVRBDState;
 
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
@@ -675,6 +676,7 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 const QDictEntry *e;
 Error *local_err = NULL;
 char *keypairs, *secretid;
+rbd_image_info_t info;
 int r;
 
 keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
@@ -739,13 +741,15 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 goto failed_open;
 }
 
-r = rbd_get_size(s->image, &s->image_size);
+r = rbd_stat(s->image, &info, sizeof(info));
 if (r < 0) {
-error_setg_errno(errp, -r, "error getting image size from %s",
+error_setg_errno(errp, -r, "error getting image info from %s",
  s->image_name);
 rbd_close(s->image);
 goto failed_open;
 }
+s->image_size = info.size;
+s->object_size = info.obj_size;
 
 /* If we are using an rbd snapshot, we must be r/o, otherwise
  * leave as-is */
@@ -957,15 +961,7 @@ static BlockAIOCB *qemu_rbd_aio_flush(BlockDriverState *bs,
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
-rbd_image_info_t info;
-int r;
-
-r = rbd_stat(s->image, &info, sizeof(info));
-if (r < 0) {
-return r;
-}
-
-bdi->cluster_size = info.obj_size;
+bdi->cluster_size = s->object_size;
 return 0;
 }
 
-- 
2.17.1





[PATCH V3 5/6] block/rbd: add write zeroes support

2021-05-19 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 37 -
 1 file changed, 36 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index 0d8612a988..ee13f08a74 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -63,7 +63,8 @@ typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
 RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
 } RBDAIOCmd;
 
 typedef struct BDRVRBDState {
@@ -705,6 +706,10 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict 
*options, int flags,
 }
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+#endif
+
 /* When extending regular files, we get zeros from the OS */
 bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
 
@@ -818,6 +823,18 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState 
*bs,
 case RBD_AIO_FLUSH:
 r = rbd_aio_flush(s->image, c);
 break;
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+case RBD_AIO_WRITE_ZEROES: {
+int zero_flags = 0;
+#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
+}
+#endif
+r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
+break;
+}
+#endif
 default:
 r = -EINVAL;
 }
@@ -888,6 +905,21 @@ static int coroutine_fn 
qemu_rbd_co_pdiscard(BlockDriverState *bs,
 return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+static int
+coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+  int count, BdrvRequestFlags flags)
+{
+#ifndef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+return -ENOTSUP;
+}
+#endif
+return qemu_rbd_start_co(bs, offset, count, NULL, flags,
+ RBD_AIO_WRITE_ZEROES);
+}
+#endif
+
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
@@ -1113,6 +1145,9 @@ static BlockDriver bdrv_rbd = {
 .bdrv_co_pwritev= qemu_rbd_co_pwritev,
 .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
 .bdrv_co_pdiscard   = qemu_rbd_co_pdiscard,
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+.bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
+#endif
 
 .bdrv_snapshot_create   = qemu_rbd_snap_create,
 .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
-- 
2.17.1





[PATCH V3 4/6] block/rbd: migrate from aio to coroutines

2021-05-19 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 255 ++--
 1 file changed, 87 insertions(+), 168 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 97a2ae4c84..0d8612a988 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -66,22 +66,6 @@ typedef enum {
 RBD_AIO_FLUSH
 } RBDAIOCmd;
 
-typedef struct RBDAIOCB {
-BlockAIOCB common;
-int64_t ret;
-QEMUIOVector *qiov;
-RBDAIOCmd cmd;
-int error;
-struct BDRVRBDState *s;
-} RBDAIOCB;
-
-typedef struct RADOSCB {
-RBDAIOCB *acb;
-struct BDRVRBDState *s;
-int64_t size;
-int64_t ret;
-} RADOSCB;
-
 typedef struct BDRVRBDState {
 rados_t cluster;
 rados_ioctx_t io_ctx;
@@ -93,6 +77,13 @@ typedef struct BDRVRBDState {
 uint64_t object_size;
 } BDRVRBDState;
 
+typedef struct RBDTask {
+BlockDriverState *bs;
+Coroutine *co;
+bool complete;
+int64_t ret;
+} RBDTask;
+
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
 BlockdevOptionsRbd *opts, bool cache,
 const char *keypairs, const char *secretid,
@@ -325,13 +316,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
char *keypairs_json,
 return ret;
 }
 
-static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
-{
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-}
-
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
 static int qemu_rbd_do_create(BlockdevCreateOptions *options,
   const char *keypairs, const char 
*password_secret,
@@ -450,46 +434,6 @@ exit:
 return ret;
 }
 
-/*
- * This aio completion is being called from rbd_finish_bh() and runs in qemu
- * BH context.
- */
-static void qemu_rbd_complete_aio(RADOSCB *rcb)
-{
-RBDAIOCB *acb = rcb->acb;
-int64_t r;
-
-r = rcb->ret;
-
-if (acb->cmd != RBD_AIO_READ) {
-if (r < 0) {
-acb->ret = r;
-acb->error = 1;
-} else if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else {
-if (r < 0) {
-qemu_rbd_memset(rcb, 0);
-acb->ret = r;
-acb->error = 1;
-} else if (r < rcb->size) {
-qemu_rbd_memset(rcb, r);
-if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else if (!acb->error) {
-acb->ret = r;
-}
-}
-
-g_free(rcb);
-
-acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
-
-qemu_aio_unref(acb);
-}
-
 static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
 {
 const char **vals;
@@ -826,89 +770,50 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t 
size)
 return 0;
 }
 
-static const AIOCBInfo rbd_aiocb_info = {
-.aiocb_size = sizeof(RBDAIOCB),
-};
-
-static void rbd_finish_bh(void *opaque)
+static void qemu_rbd_finish_bh(void *opaque)
 {
-RADOSCB *rcb = opaque;
-qemu_rbd_complete_aio(rcb);
+RBDTask *task = opaque;
+task->complete = 1;
+aio_co_wake(task->co);
 }
 
-/*
- * This is the callback function for rbd_aio_read and _write
- *
- * Note: this function is being called from a non qemu thread so
- * we need to be careful about what we do here. Generally we only
- * schedule a BH, and do the rest of the io completion handling
- * from rbd_finish_bh() which runs in a qemu context.
- */
-static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
+static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
-RBDAIOCB *acb = rcb->acb;
-
-rcb->ret = rbd_aio_get_return_value(c);
+task->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
-
-replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
- rbd_finish_bh, rcb);
+aio_bh_schedule_oneshot(bdrv_get_aio_context(task->bs),
+qemu_rbd_finish_bh, task);
 }
 
-static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
- int64_t off,
- QEMUIOVector *qiov,
- int64_t size,
- BlockCompletionFunc *cb,
- void *opaque,
- RBDAIOCmd cmd)
+static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
+  uint64_t offset,
+  uint64_t bytes,
+  QEMUIOVector *qiov,
+  int flags,
+  RBDAIOCmd cmd)
 {
-RBDAIOCB *acb;
-RADOSCB *rcb = NULL;
+BDRVRBDState *s = bs->opaque;
+RBDTask task 

[PATCH V3 6/6] block/rbd: drop qemu_rbd_refresh_limits

2021-05-19 Thread Peter Lieven
librbd supports 1 byte alignment for all aio operations.

Currently, there is no API call to query limits from the ceph backend.
So drop the bdrv_refresh_limits completely until there is such an API call.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index ee13f08a74..368a674aa0 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -228,14 +228,6 @@ done:
 return;
 }
 
-
-static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
-{
-/* XXX Does RBD support AIO on less than 512-byte alignment? */
-bs->bl.request_alignment = 512;
-}
-
-
 static int qemu_rbd_set_auth(rados_t cluster, BlockdevOptionsRbd *opts,
  Error **errp)
 {
@@ -1128,7 +1120,6 @@ static BlockDriver bdrv_rbd = {
 .format_name= "rbd",
 .instance_size  = sizeof(BDRVRBDState),
 .bdrv_parse_filename= qemu_rbd_parse_filename,
-.bdrv_refresh_limits= qemu_rbd_refresh_limits,
 .bdrv_file_open = qemu_rbd_open,
 .bdrv_close = qemu_rbd_close,
 .bdrv_reopen_prepare= qemu_rbd_reopen_prepare,
-- 
2.17.1





[PATCH V3 0/6] block/rbd: migrate to coroutines and add write zeroes support

2021-05-19 Thread Peter Lieven
This series migrates the QEMU rbd driver from the old aio emulation
to native coroutines and adds write zeroes support, which is important
for block operations.

To achieve this, we first bump the librbd requirement to the already
outdated luminous release of ceph to get rid of some wrappers and
ifdef'ry in the code.

V2->V3:
 - this patch is now rebased on top of current master
 - Patch 1: only use cc.links and not cc.run to not break
   cross-compiling. [Kevin]
   Since QEMU 6.1 it's okay to rely on librbd >= 12.x since RHEL-7
   support was dropped [Daniel]
 - Patch 4: dropped
 - Patch 5: store BDS in RBDTask and use bdrv_get_aio_context() [Kevin]

V1->V2:
 - this patch is now rebased on top of current master with Paolo's
   upcoming fixes for the meson.build script included:
    - meson: accept either shared or static libraries if --disable-static
    - meson: honor --enable-rbd if cc.links test fails
 - Patch 1: adjusted to meson.build script
 - Patch 2: unchanged
 - Patch 3: new patch
 - Patch 4: do not implement empty detach_aio_context callback [Jason]
 - Patch 5: - fix aio completion cleanup in error case [Jason]
            - return error codes from librbd
 - Patch 6: - add support for thick provisioning [Jason]
            - do not set write zeroes alignment
 - Patch 7: new patch

Peter Lieven (6):
  block/rbd: bump librbd requirement to luminous release
  block/rbd: store object_size in BDRVRBDState
  block/rbd: update s->image_size in qemu_rbd_getlength
  block/rbd: migrate from aio to coroutines
  block/rbd: add write zeroes support
  block/rbd: drop qemu_rbd_refresh_limits

 block/rbd.c | 408 
 meson.build |   7 +-
 2 files changed, 128 insertions(+), 287 deletions(-)

-- 
2.17.1





[PATCH V3 3/6] block/rbd: update s->image_size in qemu_rbd_getlength

2021-05-19 Thread Peter Lieven
In case the image size changed, we should adjust our internally stored size as
well.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/rbd.c b/block/rbd.c
index b4caea4f1b..97a2ae4c84 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -976,6 +976,7 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
 return r;
 }
 
+s->image_size = info.size;
 return info.size;
 }
 
-- 
2.17.1





[PATCH V3 1/6] block/rbd: bump librbd requirement to luminous release

2021-05-19 Thread Peter Lieven
Even luminous (version 12.2) has been unmaintained for over 3 years now.
Bump the requirement to get rid of the ifdef'ry in the code.
QEMU 6.1 dropped the support for RHEL-7, which was the last supported
OS that required an older librbd.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 120 
 meson.build |   7 ++-
 2 files changed, 13 insertions(+), 114 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 26f64cce7c..6b1cbe1d75 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -55,24 +55,10 @@
  * leading "\".
  */
 
-/* rbd_aio_discard added in 0.1.2 */
-#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
-#define LIBRBD_SUPPORTS_DISCARD
-#else
-#undef LIBRBD_SUPPORTS_DISCARD
-#endif
-
 #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
 
 #define RBD_MAX_SNAPS 100
 
-/* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
-#ifdef LIBRBD_SUPPORTS_IOVEC
-#define LIBRBD_USE_IOVEC 1
-#else
-#define LIBRBD_USE_IOVEC 0
-#endif
-
 typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
@@ -84,7 +70,6 @@ typedef struct RBDAIOCB {
 BlockAIOCB common;
 int64_t ret;
 QEMUIOVector *qiov;
-char *bounce;
 RBDAIOCmd cmd;
 int error;
 struct BDRVRBDState *s;
@@ -94,7 +79,6 @@ typedef struct RADOSCB {
 RBDAIOCB *acb;
 struct BDRVRBDState *s;
 int64_t size;
-char *buf;
 int64_t ret;
 } RADOSCB;
 
@@ -342,13 +326,9 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const 
char *keypairs_json,
 
 static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
 {
-if (LIBRBD_USE_IOVEC) {
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-} else {
-memset(rcb->buf + offs, 0, rcb->size - offs);
-}
+RBDAIOCB *acb = rcb->acb;
+iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
+   acb->qiov->size - offs);
 }
 
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
@@ -504,13 +484,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 
 g_free(rcb);
 
-if (!LIBRBD_USE_IOVEC) {
-if (acb->cmd == RBD_AIO_READ) {
-qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
-}
-qemu_vfree(acb->bounce);
-}
-
 acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 
 qemu_aio_unref(acb);
@@ -878,28 +851,6 @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB 
*rcb)
  rbd_finish_bh, rcb);
 }
 
-static int rbd_aio_discard_wrapper(rbd_image_t image,
-   uint64_t off,
-   uint64_t len,
-   rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_DISCARD
-return rbd_aio_discard(image, off, len, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
-static int rbd_aio_flush_wrapper(rbd_image_t image,
- rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_AIO_FLUSH
-return rbd_aio_flush(image, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
 static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
  int64_t off,
  QEMUIOVector *qiov,
@@ -922,21 +873,6 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 
 rcb = g_new(RADOSCB, 1);
 
-if (!LIBRBD_USE_IOVEC) {
-if (cmd == RBD_AIO_DISCARD || cmd == RBD_AIO_FLUSH) {
-acb->bounce = NULL;
-} else {
-acb->bounce = qemu_try_blockalign(bs, qiov->size);
-if (acb->bounce == NULL) {
-goto failed;
-}
-}
-if (cmd == RBD_AIO_WRITE) {
-qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
-}
-rcb->buf = acb->bounce;
-}
-
 acb->ret = 0;
 acb->error = 0;
 acb->s = s;
@@ -950,7 +886,7 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 }
 
 switch (cmd) {
-case RBD_AIO_WRITE: {
+case RBD_AIO_WRITE:
 /*
  * RBD APIs don't allow us to write more than actual size, so in order
  * to support growing images, we resize the image before write
@@ -962,25 +898,16 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 goto failed_completion;
 }
 }
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_write(s->image, off, size, rcb->buf, c);
-#endif
+r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
 break;
-}
 case RBD_AIO_READ:
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_r

Re: [RFC PATCH 2/2] qemu-img convert: Fix sparseness detection

2021-05-19 Thread Peter Lieven
On 20.04.21 at 18:52, Vladimir Sementsov-Ogievskiy wrote:
> 20.04.2021 18:04, Kevin Wolf wrote:
>> On 20.04.2021 at 16:31, Vladimir Sementsov-Ogievskiy wrote:
>>> 15.04.2021 18:22, Kevin Wolf wrote:
>>>> In order to avoid RMW cycles, is_allocated_sectors() treats zeroed areas
>>>> like non-zero data if the end of the checked area isn't aligned. This
>>>> can improve the efficiency of the conversion and was introduced in
>>>> commit 8dcd3c9b91a.
>>>>
>>>> However, it comes with a correctness problem: qemu-img convert is
>>>> supposed to sparsify areas that contain only zeros, which it doesn't do
>>>> any more. It turns out that this even happens when not only the
>>>> unaligned area is zeroed, but also the blocks before and after it. In
>>>> the bug report, conversion of a fragmented 10G image containing only
>>>> zeros resulted in an image consuming 2.82 GiB even though the expected
>>>> size is only 4 KiB.
>>>>
>>>> As a tradeoff between both, let's ignore zeroed sectors only after
>>>> non-zero data to fix the alignment, but if we're only looking at zeros,
>>>> keep them as such, even if it may mean additional RMW cycles.

>>>
>>> Hmm.. If I understand correctly, we are going to do unaligned
>>> write-zero. And that helps.
>>
>> This can happen (mostly raw images on block devices, I think?), but
>> usually it just means skipping the write because we know that the target
>> image is already zeroed.
>>
>> What it does mean is that if the next part is data, we'll have an
>> unaligned data write.
>>
>>> Doesn't that mean that alignment is wrongly detected?
>>
>> The problem is that you can have bdrv_block_status_above() return the
>> same allocation status multiple times in a row, but *pnum can be
>> unaligned for the conversion.
>>
>> We only look at a single range returned by it when detecting the
>> alignment, so it could be that we have zero buffers for both 0-11 and
>> 12-16 and detect two misaligned ranges, when both together are a
>> perfectly aligned zeroed range.
>>
>> In theory we could try to do some lookahead and merge ranges where
>> possible, which should give us the perfect result, but it would make the
>> code considerably more complicated. (Whether we want to merge them
>> doesn't only depend on the block status, but possibly also on the
>> content of a DATA range.)
>>
>> Kevin
>>
>
> Oh, I understand now the problem, thanks for explanation.
>
> Hmm, yes that means, that if the whole buf is zero, is_allocated_sectors must 
> not align it down, to be possibly "merged" with next chunk if it is zero too.
>
> But it's still good to align zeroes down, if data starts somewhere inside the 
> buf, isn't it?
>
> what about something like this:
>
> diff --git a/qemu-img.c b/qemu-img.c
> index babb5573ab..d1704584a0 100644
> --- a/qemu-img.c
> +++ b/qemu-img.c
> @@ -1167,19 +1167,39 @@ static int is_allocated_sectors(const uint8_t *buf, int n, int *pnum,
>  }
>  }
>  
> +    if (i == n) {
> +    /*
> + * The whole buf is the same.
> + *
> + * if it's data, just return it. It's the old behavior.
> + *
> + * if it's zero, just return too. It will work well if target is already
> + * zeroed. And if next chunk is zero too we'll have no RMW and no reason
> + * to write data.
> +    *pnum = i;
> +    return !is_zero;
> +    }
> +
>  tail = (sector_num + i) & (alignment - 1);
>  if (tail) {
>  if (is_zero && i <= tail) {
> -    /* treat unallocated areas which only consist
> - * of a small tail as allocated. */
> +    /*
> + * For sure next sector after i is data, and it will rewrite this
> + * tail anyway due to RMW. So, let's just write data now.
> + */
>  is_zero = false;
>  }
>  if (!is_zero) {
> -    /* align up end offset of allocated areas. */
> +    /* If possible, align up end offset of allocated areas. */
>  i += alignment - tail;
>  i = MIN(i, n);
>  } else {
> -    /* align down end offset of zero areas. */
> +    /*
> + * For sure next sector after i is data, and it will rewrite this
> + * tail anyway due to RMW. Better to avoid RMW and write zeroes up
> + * to aligned bound.
> + */
>  i -= tail;
>  }
>  }
>
>

I think we forgot to follow up on this. Has anyone tested this suggestion?

Otherwise, I would try to rerun the tests I did with my old change and Kevin's
suggestion.


Peter






Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Peter Lieven




> On 19.04.2021 at 14:31, Kevin Wolf wrote:
> 
> On 19.04.2021 at 11:13, Peter Lieven wrote:
>> 
>> 
>>>> On 19.04.2021 at 10:36, Peter Lieven wrote:
>>> 
>>> 
>>> 
>>>> On 15.04.2021 at 17:22, Kevin Wolf wrote:
>>>> 
>>>> Peter, three years ago you changed 'qemu-img convert' to sacrifice some
>>>> sparsification in order to get aligned requests on the target image. At
>>>> the time, I thought the impact would be small, but it turns out that
>>>> this can end up wasting gigabytes of storage (like converting a fully
>>>> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
>>>> 
>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1882917
>>>> 
>>>> I'm not entirely sure how to attack this best since this is a tradeoff,
>>>> but maybe the approach in this series is still good enough for the case
>>>> that you wanted to fix back then?
>>>> 
>>>> Of course, it would be possible to have a more complete fix like looking
>>>> forward a few blocks more before writing data, but that would probably
>>>> not be entirely trivial because you would have to merge blocks with ZERO
>>>> block status with DATA blocks that contain only zeros. I'm not sure if
>>>> it's worth this complication of the code.
>>> 
>>> I will try to look into this asap.
>> 
>> Besides the reproducer described in the ticket, I retried my old
>> conversion test in our environment:
>> 
>> Before commit 8dcd3c9b91: reads 4608 writes 14959
>> After commit 8dcd3c9b91: reads 0 writes 14924
>> With Kevin's patch: reads 110 writes 14924
>> 
>> I think this is a good result if it avoids other issues.
> 
> Sounds like a promising way to make the tradeoff. Thanks for testing!

Is this something for 6.0-rc4?

Peter






Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Peter Lieven


> On 19.04.2021 at 10:36, Peter Lieven wrote:
> 
> 
> 
>> On 15.04.2021 at 17:22, Kevin Wolf wrote:
>> 
>> Peter, three years ago you changed 'qemu-img convert' to sacrifice some
>> sparsification in order to get aligned requests on the target image. At
>> the time, I thought the impact would be small, but it turns out that
>> this can end up wasting gigabytes of storage (like converting a fully
>> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
>> 
>> https://bugzilla.redhat.com/show_bug.cgi?id=1882917
>> 
>> I'm not entirely sure how to attack this best since this is a tradeoff,
>> but maybe the approach in this series is still good enough for the case
>> that you wanted to fix back then?
>> 
>> Of course, it would be possible to have a more complete fix like looking
>> forward a few blocks more before writing data, but that would probably
>> not be entirely trivial because you would have to merge blocks with ZERO
>> block status with DATA blocks that contain only zeros. I'm not sure if
>> it's worth this complication of the code.
> 
> I will try to look into this asap.

Besides the reproducer described in the ticket, I retried my old
conversion test in our environment:

Before commit 8dcd3c9b91: reads 4608 writes 14959
After commit 8dcd3c9b91: reads 0 writes 14924
With Kevin's patch: reads 110 writes 14924

I think this is a good result if it avoids other issues.

Peter



Re: [RFC PATCH 0/2] qemu-img convert: Fix sparseness detection

2021-04-19 Thread Peter Lieven



> On 15.04.2021 at 17:22, Kevin Wolf wrote:
> 
> Peter, three years ago you changed 'qemu-img convert' to sacrifice some
> sparsification in order to get aligned requests on the target image. At
> the time, I thought the impact would be small, but it turns out that
> this can end up wasting gigabytes of storage (like converting a fully
> zeroed 10 GB image taking 2.8 GB instead of a few kilobytes).
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1882917
> 
> I'm not entirely sure how to attack this best since this is a tradeoff,
> but maybe the approach in this series is still good enough for the case
> that you wanted to fix back then?
> 
> Of course, it would be possible to have a more complete fix like looking
> forward a few blocks more before writing data, but that would probably
> not be entirely trivial because you would have to merge blocks with ZERO
> block status with DATA blocks that contain only zeros. I'm not sure if
> it's worth this complication of the code.

I will try to look into this asap.

Is there a hint as to which FS I need in order to set the extent hint when
creating the raw image? I was not able to do that.

Peter





Re: QEMU RBD is slow with QCOW2 images

2021-03-03 Thread Peter Lieven
On 03.03.21 at 19:47, Jason Dillaman wrote:
> On Wed, Mar 3, 2021 at 12:41 PM Stefano Garzarella  
> wrote:
>> Hi Jason,
>> as reported in this BZ [1], when qemu-img creates a QCOW2 image on RBD
>> writing data is very slow compared to a raw file.
>>
>> Comparing raw vs QCOW2 image creation with RBD I found that we use a
>> different object size, for the raw file I see '4 MiB objects', for QCOW2
>> I see '64 KiB objects' as reported on comment 14 [2].
>> This should be the main issue of slowness, indeed forcing in the code 4
>> MiB object size also for QCOW2 increased the speed a lot.
>>
>> Looking better I discovered that for raw files, we call rbd_create()
>> with obj_order = 0 (if 'cluster_size' options is not defined), so the
>> default object size is used.
>> Instead for QCOW2, we use obj_order = 16, since the default
>> 'cluster_size' defined for QCOW2, is 64 KiB.
>>
>> Using '-o cluster_size=2M' with qemu-img changed only the qcow2 cluster
>> size, since in qcow2_co_create_opts() we remove the 'cluster_size' from
>> QemuOpts calling qemu_opts_to_qdict_filtered().
>> For some reason that I have yet to understand, after this deletion,
>> however remains in QemuOpts the default value of 'cluster_size' for
>> qcow2 (64 KiB), that it's used in qemu_rbd_co_create_opts()
>>
>> At this point my doubts are:
>> Does it make sense to use the same cluster_size as qcow2 as object_size
>> in RBD?
> No, not really. But it also doesn't really make any sense to put a
> QCOW2 image within an RBD image. To clarify from the BZ, OpenStack
> does not put QCOW2 images on RBD, it converts QCOW2 images into raw
> images to store in RBD.


As discussed earlier, the only reasonable format for an rbd image is raw.

What is the idea behind putting a qcow2 on an rbd pool?

During the review of the rbd driver rewrite I posted earlier, Jason and I
even briefly discussed whether it was ok to drop support for writing past
the end of file.

Anyway, the reason why it is so slow is that write requests serialize if the
qcow2 file grows. If there is a sane reason why we need qcow2 on rbd,
we need to implement at least preallocation mode = full to overcome
the serialization.


Peter





Re: block/throttle and burst bucket

2021-03-01 Thread Peter Lieven
On 01.03.21 at 11:59, Kevin Wolf wrote:
> On 26.02.2021 at 13:33, Peter Lieven wrote:
>> On 26.02.21 at 10:27, Alberto Garcia wrote:
>>> On Thu 25 Feb 2021 06:34:48 PM CET, Peter Lieven  wrote:
>>>> I was wondering if there is a way to check from outside (qmp etc.) if
>>>> a throttled block device has exceeded the iops_max_length seconds of
>>>> time bursting up to iops_max and is now hard limited to the iops limit
>>>> that is supplied?
>>>>
>>>> Would it be also a good idea to extend the accounting to account for
>>>> requests that must have waited before being sent out to the backend
>>>> device?
>>> No, there's no such interface as far as I'm aware. I think one problem
>>> is that throttling is now done using a filter, that can be inserted
>>> anywhere in the node graph, and accounting is done at the BlockBackend
>>> level.
>>>
>>> We don't even have a query-block-throttle function. I actually started
>>> to write one six years ago but it was never finished.
>>
>> A quick idea that came to my mind was to add an option to emit a QMP
>> event if the burst_bucket is exhausted and hard limits are enforced.
> Do you actually need to do something every time that it's exceeded, so
> QEMU needs to be the active part sending out an event, or is it
> something that you need to check in specific places and could reasonably
> query on demand?
>
> For the latter, my idea would have been adding a new read-only QOM
> property to the throttle group object that exposes how much is still
> left. When it becomes 0, the hard limits are enforced.
>
>> There seems to be something wrong in the throttling code anyway.
>> Throttling causes addtional i/o latency always even if the actual iops
>> rate is far away from the limits and ever more far away from the burst
>> limits. I will dig into this.
>>
>> My wishlist:
>>
>>  - have a possibility to query the throttling state.
>>  - have counters for no of delayed ops and for how long they were delayed.
>>  - have counters for unthrottled <= 4k request performance for a backend
>> storage device.
>>
>> The latter two seem not trivial as you mentioned.
> Do you need the information per throttle node or per throttle group? For
> the latter, the same QOM property approach would work.


Hi Kevin,

per throttle-group information would be sufficient. So you would expose the
level of the bucket and additionally a counter for throttled vs. total ops
and total delay?

While we talk about throttling, I still do not understand the following part
of the function throttle_compute_wait in util/throttle.c:

    if (!bkt->max) {
        /* If bkt->max is 0 we still want to allow short bursts of I/O
         * from the guest, otherwise every other request will be throttled
         * and performance will suffer considerably. */
        bucket_size = (double) bkt->avg / 10;
        burst_bucket_size = 0;
    } else {
        /* If we have a burst limit then we have to wait until all I/O
         * at burst rate has finished before throttling to bkt->avg */
        bucket_size = bkt->max * bkt->burst_length;
        burst_bucket_size = (double) bkt->max / 10;
    }

Why burst_bucket_size = bkt->max / 10?

From what I understand it should be bkt->max. Otherwise we compare the "extra"
against a tenth of the bucket capacity and schedule a timer where it is not
necessary.

What am I missing here?



Peter






Re: block/throttle and burst bucket

2021-02-26 Thread Peter Lieven
On 26.02.21 at 10:27, Alberto Garcia wrote:
> On Thu 25 Feb 2021 06:34:48 PM CET, Peter Lieven  wrote:
>> I was wondering if there is a way to check from outside (qmp etc.) if
>> a throttled block device has exceeded the iops_max_length seconds of
>> time bursting up to iops_max and is now hard limited to the iops limit
>> that is supplied?
>>
>> Would it be also a good idea to extend the accounting to account for
>> requests that must have waited before being sent out to the backend
>> device?
> No, there's no such interface as far as I'm aware. I think one problem
> is that throttling is now done using a filter, that can be inserted
> anywhere in the node graph, and accounting is done at the BlockBackend
> level.
>
> We don't even have a query-block-throttle function. I actually started
> to write one six years ago but it was never finished.


A quick idea that came to my mind was to add an option to emit a QMP event if
the burst_bucket is exhausted and hard limits are enforced.

There seems to be something wrong in the throttling code anyway. Throttling
always causes additional i/o latency, even if the actual iops rate is far away
from the limits and even further away from the burst limits. I will dig into
this.

My wishlist:

 - have a possibility to query the throttling state.
 - have counters for the number of delayed ops and for how long they were
   delayed.
 - have counters for unthrottled <= 4k request performance for a backend
   storage device.

The latter two seem not trivial as you mentioned.


Peter






block/throttle and burst bucket

2021-02-25 Thread Peter Lieven
Hi,


I was wondering if there is a way to check from outside (qmp etc.) if a 
throttled block device has exceeded the iops_max_length seconds of time 
bursting up to iops_max and is now hard limited to the iops limit that is 
supplied?


Would it also be a good idea to extend the accounting to account for requests
that must have waited before being sent out to the backend device?


Thanks,

Peter





Re: [PATCH V2 1/7] block/rbd: bump librbd requirement to luminous release

2021-02-15 Thread Peter Lieven

On 15.02.21 at 13:13, Kevin Wolf wrote:
> On 15.02.2021 at 12:45, Peter Lieven wrote:
>> On 15.02.21 at 12:41, Daniel P. Berrangé wrote:
>>> On Mon, Feb 15, 2021 at 12:32:24PM +0100, Peter Lieven wrote:
>>>> On 15.02.21 at 11:24, Daniel P. Berrangé wrote:
>>>>> On Tue, Jan 26, 2021 at 12:25:34PM +0100, Peter Lieven wrote:
>>>>>> even luminous (version 12.2) is unmaintained for over 3 years now.
>>>>>> Bump the requirement to get rid of the ifdef'ry in the code.
>>>>> We have clear rules on when we bump minimum versions, determined by
>>>>> the OS platforms we target:
>>>>>
>>>>>  https://qemu.readthedocs.io/en/latest/system/build-platforms.html
>>>>>
>>>>> At this time RHEL-7 is usually the oldest platform, and it
>>>>> builds with RBD 10.2.5, so we can't bump the version to 12.2.
>>>>>
>>>>> I'm afraid this patch has to be dropped.
>>>> I have asked exactly this question before I started work on this series
>>>> and got a reply from Jason that he sees no problem in bumping to a
>>>> release which is already unmaintained for 3 years.
>>> I'm afraid Jason is wrong here.  It doesn't matter what the upstream
>>> consider the support status to be. QEMU targets what the OS vendors
>>> ship, and they still consider this to be a supported version.
>>
>> Okay, but the whole coroutine stuff would get a total mess with all
>> the ifdef'ry.
> Hm, but how are these ifdefs even related to the coroutine conversion?
> It's a bit more code that you're moving around, but shouldn't it be
> unchanged from the old code, just moving from an AIO callback to a
> coroutine? Or am I missing some complications?

No, the ifdefs only come back in for the write zeroes part.

>> Would it be an option to make a big ifdef in the rbd driver? One with
>> old code for < 12.0.0 and one with new code for >= 12.0.0?
> I don't think this is a good idea, this would be a huge mess to
> maintain.
>
> The conversion is probably a good idea in general, simply because it's
> more in line with the rest of the block layer, but I don't think it adds
> anything per se, so it's hard to justify such duplication with the
> benefits it brings.

I would wait for Jason's comment on the rbd part of the series and then spin
a V3 with a for-6.1 tag.


Peter




Re: [PATCH V2 1/7] block/rbd: bump librbd requirement to luminous release

2021-02-15 Thread Peter Lieven

On 15.02.21 at 12:51, Daniel P. Berrangé wrote:
> On Mon, Feb 15, 2021 at 12:45:01PM +0100, Peter Lieven wrote:
>> On 15.02.21 at 12:41, Daniel P. Berrangé wrote:
>>> On Mon, Feb 15, 2021 at 12:32:24PM +0100, Peter Lieven wrote:
>>>> On 15.02.21 at 11:24, Daniel P. Berrangé wrote:
>>>>> On Tue, Jan 26, 2021 at 12:25:34PM +0100, Peter Lieven wrote:
>>>>>> even luminous (version 12.2) is unmaintained for over 3 years now.
>>>>>> Bump the requirement to get rid of the ifdef'ry in the code.
>>>>> We have clear rules on when we bump minimum versions, determined by
>>>>> the OS platforms we target:
>>>>>
>>>>>  https://qemu.readthedocs.io/en/latest/system/build-platforms.html
>>>>>
>>>>> At this time RHEL-7 is usually the oldest platform, and it
>>>>> builds with RBD 10.2.5, so we can't bump the version to 12.2.
>>>>>
>>>>> I'm afraid this patch has to be dropped.
>>>> I have asked exactly this question before I started work on this series
>>>> and got a reply from Jason that he sees no problem in bumping to a
>>>> release which is already unmaintained for 3 years.
>>> I'm afraid Jason is wrong here.  It doesn't matter what the upstream
>>> consider the support status to be. QEMU targets what the OS vendors
>>> ship, and they still consider this to be a supported version.
>>
>> Okay, but the whole coroutine stuff would get a total mess with all the
>> ifdef'ry.
> Doesn't seem like the write zeros code is adding much more compared to
> the ifdefs that already exist...

Yes, I don't like it either, but write zeroes support was only added in
Nautilus (14.x), and the thick provisioning that Jason asked me to add
came only with Octopus (15.x).

>> Would it be an option to make a big ifdef in the rbd driver? One with old
>> code for < 12.0.0 and one with new code for >= 12.0.0?
> ..but I don't have a strong opinion on that, since I'm not maintaining this
> driver.
>
> BTW, we will be free to drop RHEL-7 in the next development cycle of
> QEMU, starting after the forthcoming 6.0.0 release is out, as it will
> fall out of our OS support matrix.

Thanks for that hint. I would say let's hold this series back until QEMU 6.1.

Where can I find the OS support matrix for 6.1? Maybe we can bump the
requirement to nautilus to reduce the ifdef'ry further.

Peter






Re: [PATCH V2 4/7] block/rbd: add bdrv_attach_aio_context

2021-02-15 Thread Peter Lieven

On 15.02.21 at 11:20, Kevin Wolf wrote:
> On 26.01.2021 at 12:25, Peter Lieven wrote:
>> Signed-off-by: Peter Lieven
>> ---
>>   block/rbd.c | 15 +--
>>   1 file changed, 13 insertions(+), 2 deletions(-)
>>
>> diff --git a/block/rbd.c b/block/rbd.c
>> index f68ebcf240..7abd0252c9 100644
>> --- a/block/rbd.c
>> +++ b/block/rbd.c
>> @@ -91,6 +91,7 @@ typedef struct BDRVRBDState {
>>   char *namespace;
>>   uint64_t image_size;
>>   uint64_t object_size;
>> +AioContext *aio_context;
>>   } BDRVRBDState;
> A commit message explaining the why would be helpful here.
>
> This is already stored in BlockDriverState, which should be available
> everywhere. Keeping redundant information needs a good justification,
> which seems unlikely when BlockDriverState and BDRVRBDState are already
> connected through the BlockDriverState.opaque pointer.
>
> The rest of the series doesn't seem to make more use of it either.

You are right. I was not aware that the aio_context is already there.

We keep a local copy of aio_context in the iscsi and nfs drivers as well. That
is where I got it from. I will change it if we don't drop the series completely.


Peter






Re: [PATCH V2 1/7] block/rbd: bump librbd requirement to luminous release

2021-02-15 Thread Peter Lieven

Am 15.02.21 um 12:41 schrieb Daniel P. Berrangé:

On Mon, Feb 15, 2021 at 12:32:24PM +0100, Peter Lieven wrote:

Am 15.02.21 um 11:24 schrieb Daniel P. Berrangé:

On Tue, Jan 26, 2021 at 12:25:34PM +0100, Peter Lieven wrote:

even luminous (version 12.2) is unmaintained for over 3 years now.
Bump the requirement to get rid of the ifdef'ry in the code.

We have clear rules on when we bump minimum versions, determined by
the OS platforms we target:

https://qemu.readthedocs.io/en/latest/system/build-platforms.html

At this time RHEL-7 is usually the oldest platform, and it
builds with RBD 10.2.5, so we can't bump the version to 12.2.

I'm afraid this patch has to be dropped.


I asked exactly this question before I started work on this series and got a reply from Jason that he sees no problem in bumping to a release which has already been unmaintained for 3 years.

I'm afraid Jason is wrong here.  It doesn't matter what the upstream
consider the support status to be. QEMU targets what the OS vendors
ship, and they still consider this to be a supported version.



Okay, but the whole coroutine stuff would become a total mess with all the ifdef'ry.

Would it be an option to make one big ifdef in the rbd driver? One branch with the old code for < 12.0.0 and one with the new code for >= 12.0.0?


Peter





Re: [PATCH V2 1/7] block/rbd: bump librbd requirement to luminous release

2021-02-15 Thread Peter Lieven

On 15.02.21 at 11:19, Daniel P. Berrangé wrote:

On Mon, Feb 15, 2021 at 11:11:23AM +0100, Kevin Wolf wrote:

On 26.01.2021 at 12:25, Peter Lieven wrote:

Even luminous (version 12.2) has been unmaintained for over 3 years now.
Bump the requirement to get rid of the ifdef'ry in the code.

Signed-off-by: Peter Lieven 
diff --git a/meson.build b/meson.build
index 5943aa8a51..02d263ad33 100644
--- a/meson.build
+++ b/meson.build
@@ -691,19 +691,24 @@ if not get_option('rbd').auto() or have_block
 required: get_option('rbd'),
 kwargs: static_kwargs)
if librados.found() and librbd.found()
-if cc.links('''
+result = cc.run('''

Doesn't running compiled binaries break cross compilation?


#include 
#include 
int main(void) {
  rados_t cluster;
  rados_create(&cluster, NULL);
+rados_shutdown(cluster);
+#if LIBRBD_VERSION_CODE < LIBRBD_VERSION(1, 12, 0)
+return 1;
+#endif
  return 0;

Would #error achieve what you want without running the binary?

But most, if not all, other version checks use pkg-config instead of
trying to compile code, so that's probably what we should be doing here,
too.

Yep, for something that is merely a version number check there's no
need to compile anything. pkg-config can just validate the version
straightup.



I would have loved to, but at least the Ubuntu/Debian packages do not contain pkg-config files. I can switch to #error, of course. My initial version of the patch distinguished between "can't compile" and "version is too old". With #error we can only say "doesn't compile", but I think this would be ok.
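
A minimal sketch of the #error approach, for illustration. The stand-in macro definitions are assumptions so that the snippet builds without the Ceph headers; in the real check they would come from rbd/librbd.h. librbd_version_code() is a hypothetical helper that only exposes the value for inspection:

```c
#include <assert.h>

/*
 * Stand-ins for the version macros normally provided by rbd/librbd.h.
 * They are assumptions of this sketch and only take effect when the
 * real header has not been included.
 */
#ifndef LIBRBD_VERSION
#define LIBRBD_VER_MAJOR 1
#define LIBRBD_VER_MINOR 12
#define LIBRBD_VER_EXTRA 0
#define LIBRBD_VERSION(maj, min, extra) \
    (((maj) << 16) + ((min) << 8) + (extra))
#define LIBRBD_VERSION_CODE \
    LIBRBD_VERSION(LIBRBD_VER_MAJOR, LIBRBD_VER_MINOR, LIBRBD_VER_EXTRA)
#endif

/*
 * Fail at compile time when the library is too old. Nothing has to be
 * executed, so a compile or link test is enough and cross compilation
 * keeps working.
 */
#if LIBRBD_VERSION_CODE < LIBRBD_VERSION(1, 12, 0)
#error "librbd is older than 1.12.0 (luminous)"
#endif

/* Hypothetical helper: returns the version code the build was checked against. */
static int librbd_version_code(void)
{
    return LIBRBD_VERSION_CODE;
}
```

As noted above, the build system then only sees "does not compile" and cannot tell a missing header from an outdated version.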


Peter





Re: [PATCH V2 1/7] block/rbd: bump librbd requirement to luminous release

2021-02-15 Thread Peter Lieven

On 15.02.21 at 11:24, Daniel P. Berrangé wrote:

On Tue, Jan 26, 2021 at 12:25:34PM +0100, Peter Lieven wrote:

Even luminous (version 12.2) has been unmaintained for over 3 years now.
Bump the requirement to get rid of the ifdef'ry in the code.

We have clear rules on when we bump minimum versions, determined by
the OS platforms we target:

   https://qemu.readthedocs.io/en/latest/system/build-platforms.html

At this time RHEL-7 is usually the oldest platform, and it
builds with RBD 10.2.5, so we can't bump the version to 12.2.

I'm afraid this patch has to be dropped.



I asked exactly this question before I started work on this series and got a reply from Jason that he sees no problem in bumping to a release which has already been unmaintained for 3 years.


If QEMU 6.0 is required to build on RHEL-7, then I am afraid we can abandon the whole series.


Peter





Re: [PATCH] qemu-img: add seek and -n option to dd command

2021-02-05 Thread Peter Lieven
On 05.02.21 at 09:18, Max Reitz wrote:
> On 04.02.21 21:09, Peter Lieven wrote:
>> On 02.02.21 at 16:51, Eric Blake wrote:
>>> On 1/28/21 8:07 AM, Peter Lieven wrote:
>>>> Signed-off-by: Peter Lieven 
>>> Your commit message says 'what', but not 'why'.  Generally, the one-line
>>> 'what' works well as the subject line, but you want the commit body to
>>> give an argument why your patch should be applied, rather than blank.
>>>
>>> Here's the last time we tried to improve qemu-img dd:
>>> https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg02618.html
>>
>>
>> I was not aware of that story. My use case is that I want to be
>> able to "patch" an image that QEMU is able to handle by overwriting
>> certain sectors. And I especially do not want to "mount" that image
>> via qemu-nbd because I might not trust it. I want to completely avoid
>> having the host system analyse that image by scanning the boot sector,
>> running partprobe, LVM, and so on.
>
> qemu will have FUSE exporting as of 6.0 (didn’t quite make it into 5.2), so 
> you can do something like this:
>
> $ qemu-storage-daemon \
>     --blockdev node-name=export,driver=qcow2,\
> file.driver=file,file.filename=image.qcow2 \
>     --export fuse,id=fuse,node-name=export,mountpoint=image.qcow2
>
> This exports the image on image.qcow2 (i.e., on itself) and so by accessing 
> the image file you then get raw access to its contents (so you can use system 
> tools like dd).
>
> Doesn’t require root rights, and shouldn’t make the kernel scan anything, 
> because it’s exported as just a regular file.


Okay, but that is still more housekeeping than just invoking a single command.

Would it be an option to extend qemu-io to write data at a certain offset which it reads from STDIN?


Peter






Re: [PATCH] qemu-img: add seek and -n option to dd command

2021-02-04 Thread Peter Lieven
On 02.02.21 at 16:51, Eric Blake wrote:
> On 1/28/21 8:07 AM, Peter Lieven wrote:
>> Signed-off-by: Peter Lieven 
> Your commit message says 'what', but not 'why'.  Generally, the one-line
> 'what' works well as the subject line, but you want the commit body to
> give an argument why your patch should be applied, rather than blank.
>
> Here's the last time we tried to improve qemu-img dd:
> https://lists.gnu.org/archive/html/qemu-devel/2018-08/msg02618.html


I was not aware of that story. My use case is that I want to be able to "patch" an image that QEMU is able to handle by overwriting certain sectors. And I especially do not want to "mount" that image via qemu-nbd because I might not trust it. I want to completely avoid having the host system analyse that image by scanning the boot sector, running partprobe, LVM, and so on.


>
> where I also proposed adding seek=, and fixing skip= with count=.  Your
> patch does not do the latter.  But the bigger complaint back then was
> that 'qemu-img copy' should be able to do everything, and that qemu-img
> dd should then just be a thin shim around 'qemu-img copy', rather than
> having two parallel projects that diverge in their implementations.


Understood. I was not aware of an issue with skip and count.

The patch works for me and I wanted to share it. But having read the thread, it seems that it would be a difficult task to get it merged.
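
For reference, the seek= and skip= semantics under discussion come down to turning a block count into a byte offset. The following is only a sketch under stated assumptions: dd_blocks_to_bytes() is a hypothetical helper and not part of qemu-img. It shows the two checks such a conversion needs (negative input and multiplication overflow) before the offset is used:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: convert a dd-style block count (seek=N or skip=N)
 * into a byte offset for a given block size. Returns 0 on success and
 * -1 if the input is invalid or the product would overflow int64_t.
 */
static int dd_blocks_to_bytes(int64_t blocks, int64_t bsz, int64_t *offset)
{
    if (blocks < 0 || bsz <= 0) {
        return -1;
    }
    if (blocks > INT64_MAX / bsz) {
        return -1;
    }
    *offset = blocks * bsz;
    return 0;
}
```

With bs=512 and seek=4, the output position would be byte 2048. Whether a write at that position may extend the target is a separate question, which the help text of the patch answers with "no" for -n.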


>
> Your patch does not have the typical '---' divider and diffstat between
> the commit message and the patch proper; this may be a factor of which
> git packages you have installed, but having the diffstat present makes
> it easier to see at a glance what your patch touches without reading the
> entire email.  I had to go hunting to learn if you added iotest coverage
> of this new feature...
>
> ...and the answer was no, you didn't.  You'll need to add that in v2
> (see the link to my earlier attempt at modifying dd for an example).


I did not. Maybe I accidentally killed the '---' divider. If I make a V2, I will add an I/O test.


Thanks for your suggestions,

Peter






[PATCH] qemu-img: add seek and -n option to dd command

2021-01-28 Thread Peter Lieven
Signed-off-by: Peter Lieven 

diff --git a/docs/tools/qemu-img.rst b/docs/tools/qemu-img.rst
index b615aa8419..7d4564c2b8 100644
--- a/docs/tools/qemu-img.rst
+++ b/docs/tools/qemu-img.rst
@@ -209,6 +209,10 @@ Parameters to dd subcommand:
 
 .. program:: qemu-img-dd
 
+.. option:: -n
+
+  Skip the creation of the output file
+
 .. option:: bs=BLOCK_SIZE
 
   Defines the block size
@@ -229,6 +233,10 @@ Parameters to dd subcommand:
 
   Sets the number of input blocks to skip
 
+.. option:: seek=BLOCKS
+
+  Sets the number of blocks to seek into the output
+
 Parameters to snapshot subcommand:
 
 .. program:: qemu-img-snapshot
diff --git a/qemu-img.c b/qemu-img.c
index 8597d069af..d7f390e382 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -213,12 +213,17 @@ static void QEMU_NORETURN help(void)
"  '-s' run in Strict mode - fail on different image size or sector 
allocation\n"
"\n"
"Parameters to dd subcommand:\n"
+   "  '-n' skips the target volume creation (useful if the volume is created\n"
+   "   prior to running qemu-img). Note that the behaviour is not identical to\n"
+   "   original dd option conv=nocreat. The output is neither truncated nor\n"
+   "   is it possible to write past the end of an existing file.\n"
"  'bs=BYTES' read and write up to BYTES bytes at a time "
"(default: 512)\n"
"  'count=N' copy only N input blocks\n"
"  'if=FILE' read from FILE\n"
"  'of=FILE' write to FILE\n"
-   "  'skip=N' skip N bs-sized blocks at the start of input\n";
+   "  'skip=N' skip N bs-sized blocks at the start of input\n"
+   "  'seek=N' seek N bs-sized blocks into the output\n";
 
 printf("%s\nSupported formats:", help_msg);
 bdrv_iterate_format(format_print, NULL, false);
@@ -4885,6 +4890,7 @@ static int img_bitmap(int argc, char **argv)
 #define C_IF  04
 #define C_OF  010
 #define C_SKIP    020
+#define C_SEEK    040
 
 struct DdInfo {
 unsigned int flags;
@@ -4964,6 +4970,19 @@ static int img_dd_skip(const char *arg,
 return 0;
 }
 
+static int img_dd_seek(const char *arg,
+   struct DdIo *in, struct DdIo *out,
+   struct DdInfo *dd)
+{
+out->offset = cvtnum("seek", arg);
+
+if (out->offset < 0) {
+return 1;
+}
+
+return 0;
+}
+
 static int img_dd(int argc, char **argv)
 {
 int ret = 0;
@@ -4980,7 +4999,7 @@ static int img_dd(int argc, char **argv)
 const char *fmt = NULL;
 int64_t size = 0;
 int64_t block_count = 0, out_pos, in_pos;
-bool force_share = false;
+bool force_share = false, skip_create = false;
 struct DdInfo dd = {
 .flags = 0,
 .count = 0,
@@ -5004,6 +5023,7 @@ static int img_dd(int argc, char **argv)
 { "if", img_dd_if, C_IF },
 { "of", img_dd_of, C_OF },
 { "skip", img_dd_skip, C_SKIP },
+{ "seek", img_dd_seek, C_SEEK },
 { NULL, NULL, 0 }
 };
 const struct option long_options[] = {
@@ -5014,7 +5034,7 @@ static int img_dd(int argc, char **argv)
 { 0, 0, 0, 0 }
 };
 
-while ((c = getopt_long(argc, argv, ":hf:O:U", long_options, NULL))) {
+while ((c = getopt_long(argc, argv, ":hf:O:Un", long_options, NULL))) {
 if (c == EOF) {
 break;
 }
@@ -5037,6 +5057,9 @@ static int img_dd(int argc, char **argv)
 case 'U':
 force_share = true;
 break;
+case 'n':
+skip_create = true;
+break;
 case OPTION_OBJECT:
 if (!qemu_opts_parse_noisily(&qemu_object_opts, optarg, true)) {
 ret = -1;
@@ -5116,22 +5139,25 @@ static int img_dd(int argc, char **argv)
 ret = -1;
 goto out;
 }
-if (!drv->create_opts) {
-error_report("Format driver '%s' does not support image creation",
- drv->format_name);
-ret = -1;
-goto out;
-}
-if (!proto_drv->create_opts) {
-error_report("Protocol driver '%s' does not support image creation",
- proto_drv->format_name);
-ret = -1;
-goto out;
-}
-create_opts = qemu_opts_append(create_opts, drv->create_opts);
-create_opts = qemu_opts_append(create_opts, proto_drv->create_opts);
 
-opts = qemu_opts_create(create_opts, NULL, 0, &error_abort);
+if (!skip_create) {
+if (!drv->create_opts) {
+error_report("Format driver '%s' does not support image creation",
+ drv->format_name);
+ret = -1;
+goto out;
+}
+  

[PATCH V2 2/7] block/rbd: store object_size in BDRVRBDState

2021-01-26 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 18 +++---
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index a191c74619..1028596c68 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -90,6 +90,7 @@ typedef struct BDRVRBDState {
 char *snap;
 char *namespace;
 uint64_t image_size;
+uint64_t object_size;
 } BDRVRBDState;
 
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
@@ -663,6 +664,7 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
 const QDictEntry *e;
 Error *local_err = NULL;
 char *keypairs, *secretid;
+rbd_image_info_t info;
 int r;
 
 keypairs = g_strdup(qdict_get_try_str(options, "=keyvalue-pairs"));
@@ -727,13 +729,15 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
 goto failed_open;
 }
 
-r = rbd_get_size(s->image, &s->image_size);
+r = rbd_stat(s->image, &info, sizeof(info));
 if (r < 0) {
-error_setg_errno(errp, -r, "error getting image size from %s",
+error_setg_errno(errp, -r, "error getting image info from %s",
  s->image_name);
 rbd_close(s->image);
 goto failed_open;
 }
+s->image_size = info.size;
+s->object_size = info.obj_size;
 
 /* If we are using an rbd snapshot, we must be r/o, otherwise
  * leave as-is */
@@ -945,15 +949,7 @@ static BlockAIOCB *qemu_rbd_aio_flush(BlockDriverState *bs,
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
-rbd_image_info_t info;
-int r;
-
-r = rbd_stat(s->image, &info, sizeof(info));
-if (r < 0) {
-return r;
-}
-
-bdi->cluster_size = info.obj_size;
+bdi->cluster_size = s->object_size;
 return 0;
 }
 
-- 
2.17.1





[PATCH V2 1/7] block/rbd: bump librbd requirement to luminous release

2021-01-26 Thread Peter Lieven
Even luminous (version 12.2) has been unmaintained for over 3 years now.
Bump the requirement to get rid of the ifdef'ry in the code.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 120 
 meson.build |  13 --
 2 files changed, 17 insertions(+), 116 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 9071a00e3f..a191c74619 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -55,24 +55,10 @@
  * leading "\".
  */
 
-/* rbd_aio_discard added in 0.1.2 */
-#if LIBRBD_VERSION_CODE >= LIBRBD_VERSION(0, 1, 2)
-#define LIBRBD_SUPPORTS_DISCARD
-#else
-#undef LIBRBD_SUPPORTS_DISCARD
-#endif
-
 #define OBJ_MAX_SIZE (1UL << OBJ_DEFAULT_OBJ_ORDER)
 
 #define RBD_MAX_SNAPS 100
 
-/* The LIBRBD_SUPPORTS_IOVEC is defined in librbd.h */
-#ifdef LIBRBD_SUPPORTS_IOVEC
-#define LIBRBD_USE_IOVEC 1
-#else
-#define LIBRBD_USE_IOVEC 0
-#endif
-
 typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
@@ -84,7 +70,6 @@ typedef struct RBDAIOCB {
 BlockAIOCB common;
 int64_t ret;
 QEMUIOVector *qiov;
-char *bounce;
 RBDAIOCmd cmd;
 int error;
 struct BDRVRBDState *s;
@@ -94,7 +79,6 @@ typedef struct RADOSCB {
 RBDAIOCB *acb;
 struct BDRVRBDState *s;
 int64_t size;
-char *buf;
 int64_t ret;
 } RADOSCB;
 
@@ -332,13 +316,9 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const char *keypairs_json,
 
 static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
 {
-if (LIBRBD_USE_IOVEC) {
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-} else {
-memset(rcb->buf + offs, 0, rcb->size - offs);
-}
+RBDAIOCB *acb = rcb->acb;
+iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
+   acb->qiov->size - offs);
 }
 
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
@@ -493,13 +473,6 @@ static void qemu_rbd_complete_aio(RADOSCB *rcb)
 
 g_free(rcb);
 
-if (!LIBRBD_USE_IOVEC) {
-if (acb->cmd == RBD_AIO_READ) {
-qemu_iovec_from_buf(acb->qiov, 0, acb->bounce, acb->qiov->size);
-}
-qemu_vfree(acb->bounce);
-}
-
 acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
 
 qemu_aio_unref(acb);
@@ -866,28 +839,6 @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
  rbd_finish_bh, rcb);
 }
 
-static int rbd_aio_discard_wrapper(rbd_image_t image,
-   uint64_t off,
-   uint64_t len,
-   rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_DISCARD
-return rbd_aio_discard(image, off, len, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
-static int rbd_aio_flush_wrapper(rbd_image_t image,
- rbd_completion_t comp)
-{
-#ifdef LIBRBD_SUPPORTS_AIO_FLUSH
-return rbd_aio_flush(image, comp);
-#else
-return -ENOTSUP;
-#endif
-}
-
 static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
  int64_t off,
  QEMUIOVector *qiov,
@@ -910,21 +861,6 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 
 rcb = g_new(RADOSCB, 1);
 
-if (!LIBRBD_USE_IOVEC) {
-if (cmd == RBD_AIO_DISCARD || cmd == RBD_AIO_FLUSH) {
-acb->bounce = NULL;
-} else {
-acb->bounce = qemu_try_blockalign(bs, qiov->size);
-if (acb->bounce == NULL) {
-goto failed;
-}
-}
-if (cmd == RBD_AIO_WRITE) {
-qemu_iovec_to_buf(acb->qiov, 0, acb->bounce, qiov->size);
-}
-rcb->buf = acb->bounce;
-}
-
 acb->ret = 0;
 acb->error = 0;
 acb->s = s;
@@ -938,7 +874,7 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 }
 
 switch (cmd) {
-case RBD_AIO_WRITE: {
+case RBD_AIO_WRITE:
 /*
  * RBD APIs don't allow us to write more than actual size, so in order
  * to support growing images, we resize the image before write
@@ -950,25 +886,16 @@ static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
 goto failed_completion;
 }
 }
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_write(s->image, off, size, rcb->buf, c);
-#endif
+r = rbd_aio_writev(s->image, qiov->iov, qiov->niov, off, c);
 break;
-}
 case RBD_AIO_READ:
-#ifdef LIBRBD_SUPPORTS_IOVEC
-r = rbd_aio_readv(s->image, qiov->iov, qiov->niov, off, c);
-#else
-r = rbd_aio_read(s->image, off, size, rcb->buf, c);
-#endif
+r = rbd_aio_readv(s->image, qiov->i

[PATCH V2 3/7] block/rbd: update s->image_size in qemu_rbd_getlength

2021-01-26 Thread Peter Lieven
In case the image size changed, we should adjust our internally stored size as well.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/block/rbd.c b/block/rbd.c
index 1028596c68..f68ebcf240 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -964,6 +964,7 @@ static int64_t qemu_rbd_getlength(BlockDriverState *bs)
 return r;
 }
 
+s->image_size = info.size;
 return info.size;
 }
 
-- 
2.17.1





[PATCH V2 5/7] block/rbd: migrate from aio to coroutines

2021-01-26 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 253 ++--
 1 file changed, 86 insertions(+), 167 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 7abd0252c9..d11a3c6dd1 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -66,22 +66,6 @@ typedef enum {
 RBD_AIO_FLUSH
 } RBDAIOCmd;
 
-typedef struct RBDAIOCB {
-BlockAIOCB common;
-int64_t ret;
-QEMUIOVector *qiov;
-RBDAIOCmd cmd;
-int error;
-struct BDRVRBDState *s;
-} RBDAIOCB;
-
-typedef struct RADOSCB {
-RBDAIOCB *acb;
-struct BDRVRBDState *s;
-int64_t size;
-int64_t ret;
-} RADOSCB;
-
 typedef struct BDRVRBDState {
 rados_t cluster;
 rados_ioctx_t io_ctx;
@@ -94,6 +78,13 @@ typedef struct BDRVRBDState {
 AioContext *aio_context;
 } BDRVRBDState;
 
+typedef struct RBDTask {
+BDRVRBDState *s;
+Coroutine *co;
+bool complete;
+int64_t ret;
+} RBDTask;
+
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
 BlockdevOptionsRbd *opts, bool cache,
 const char *keypairs, const char *secretid,
@@ -316,13 +307,6 @@ static int qemu_rbd_set_keypairs(rados_t cluster, const char *keypairs_json,
 return ret;
 }
 
-static void qemu_rbd_memset(RADOSCB *rcb, int64_t offs)
-{
-RBDAIOCB *acb = rcb->acb;
-iov_memset(acb->qiov->iov, acb->qiov->niov, offs, 0,
-   acb->qiov->size - offs);
-}
-
 /* FIXME Deprecate and remove keypairs or make it available in QMP. */
 static int qemu_rbd_do_create(BlockdevCreateOptions *options,
   const char *keypairs, const char *password_secret,
@@ -440,46 +424,6 @@ exit:
 return ret;
 }
 
-/*
- * This aio completion is being called from rbd_finish_bh() and runs in qemu
- * BH context.
- */
-static void qemu_rbd_complete_aio(RADOSCB *rcb)
-{
-RBDAIOCB *acb = rcb->acb;
-int64_t r;
-
-r = rcb->ret;
-
-if (acb->cmd != RBD_AIO_READ) {
-if (r < 0) {
-acb->ret = r;
-acb->error = 1;
-} else if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else {
-if (r < 0) {
-qemu_rbd_memset(rcb, 0);
-acb->ret = r;
-acb->error = 1;
-} else if (r < rcb->size) {
-qemu_rbd_memset(rcb, r);
-if (!acb->error) {
-acb->ret = rcb->size;
-}
-} else if (!acb->error) {
-acb->ret = r;
-}
-}
-
-g_free(rcb);
-
-acb->common.cb(acb->common.opaque, (acb->ret > 0 ? 0 : acb->ret));
-
-qemu_aio_unref(acb);
-}
-
 static char *qemu_rbd_mon_host(BlockdevOptionsRbd *opts, Error **errp)
 {
 const char **vals;
@@ -817,88 +761,49 @@ static int qemu_rbd_resize(BlockDriverState *bs, uint64_t size)
 return 0;
 }
 
-static const AIOCBInfo rbd_aiocb_info = {
-.aiocb_size = sizeof(RBDAIOCB),
-};
-
-static void rbd_finish_bh(void *opaque)
+static void qemu_rbd_finish_bh(void *opaque)
 {
-RADOSCB *rcb = opaque;
-qemu_rbd_complete_aio(rcb);
+RBDTask *task = opaque;
+task->complete = 1;
+aio_co_wake(task->co);
 }
 
-/*
- * This is the callback function for rbd_aio_read and _write
- *
- * Note: this function is being called from a non qemu thread so
- * we need to be careful about what we do here. Generally we only
- * schedule a BH, and do the rest of the io completion handling
- * from rbd_finish_bh() which runs in a qemu context.
- */
-static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
+static void qemu_rbd_completion_cb(rbd_completion_t c, RBDTask *task)
 {
-RBDAIOCB *acb = rcb->acb;
-
-rcb->ret = rbd_aio_get_return_value(c);
+task->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
-
-replay_bh_schedule_oneshot_event(acb->s->aio_context, rbd_finish_bh, rcb);
+aio_bh_schedule_oneshot(task->s->aio_context, qemu_rbd_finish_bh, task);
 }
 
-static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
- int64_t off,
- QEMUIOVector *qiov,
- int64_t size,
- BlockCompletionFunc *cb,
- void *opaque,
- RBDAIOCmd cmd)
+static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
+  uint64_t offset,
+  uint64_t bytes,
+  QEMUIOVector *qiov,
+  int flags,
+  RBDAIOCmd cmd)
 {
-RBDAIOCB *acb;
-RADOSCB *rcb = NULL;
+BDRVRBDState *s = bs->opaque;
+RBDTask task = { .s = s, .co = qemu_coroutine_self() };
 rbd_completion_t c;
 int r;
 
-BDRVR

[PATCH V2 7/7] block/rbd: drop qemu_rbd_refresh_limits

2021-01-26 Thread Peter Lieven
librbd supports 1 byte alignment for all aio operations.

Currently, there is no API call to query limits from the ceph backend.
So drop the bdrv_refresh_limits completely until there is such an API call.

Signed-off-by: Peter Lieven 
---
 block/rbd.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index 35dc1dc90e..5f96fbf3d1 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -219,14 +219,6 @@ done:
 return;
 }
 
-
-static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
-{
-/* XXX Does RBD support AIO on less than 512-byte alignment? */
-bs->bl.request_alignment = 512;
-}
-
-
 static int qemu_rbd_set_auth(rados_t cluster, BlockdevOptionsRbd *opts,
  Error **errp)
 {
@@ -1124,7 +1116,6 @@ static BlockDriver bdrv_rbd = {
 .format_name= "rbd",
 .instance_size  = sizeof(BDRVRBDState),
 .bdrv_parse_filename= qemu_rbd_parse_filename,
-.bdrv_refresh_limits= qemu_rbd_refresh_limits,
 .bdrv_file_open = qemu_rbd_open,
 .bdrv_close = qemu_rbd_close,
 .bdrv_reopen_prepare= qemu_rbd_reopen_prepare,
-- 
2.17.1





[PATCH V2 6/7] block/rbd: add write zeroes support

2021-01-26 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 36 +++-
 1 file changed, 35 insertions(+), 1 deletion(-)

diff --git a/block/rbd.c b/block/rbd.c
index d11a3c6dd1..35dc1dc90e 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -63,7 +63,8 @@ typedef enum {
 RBD_AIO_READ,
 RBD_AIO_WRITE,
 RBD_AIO_DISCARD,
-RBD_AIO_FLUSH
+RBD_AIO_FLUSH,
+RBD_AIO_WRITE_ZEROES
 } RBDAIOCmd;
 
 typedef struct BDRVRBDState {
@@ -695,6 +696,9 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
 }
 
 s->aio_context = bdrv_get_aio_context(bs);
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+bs->supported_zero_flags = BDRV_REQ_MAY_UNMAP;
+#endif
 
 /* When extending regular files, we get zeros from the OS */
 bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
@@ -808,6 +812,18 @@ static int coroutine_fn qemu_rbd_start_co(BlockDriverState *bs,
 case RBD_AIO_FLUSH:
 r = rbd_aio_flush(s->image, c);
 break;
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+case RBD_AIO_WRITE_ZEROES: {
+int zero_flags = 0;
+#ifdef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+zero_flags = RBD_WRITE_ZEROES_FLAG_THICK_PROVISION;
+}
+#endif
+r = rbd_aio_write_zeroes(s->image, offset, bytes, c, zero_flags, 0);
+break;
+}
+#endif
 default:
 r = -EINVAL;
 }
@@ -878,6 +894,21 @@ static int coroutine_fn qemu_rbd_co_pdiscard(BlockDriverState *bs,
 return qemu_rbd_start_co(bs, offset, count, NULL, 0, RBD_AIO_DISCARD);
 }
 
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+static int
+coroutine_fn qemu_rbd_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
+  int count, BdrvRequestFlags flags)
+{
+#ifndef RBD_WRITE_ZEROES_FLAG_THICK_PROVISION
+if (!(flags & BDRV_REQ_MAY_UNMAP)) {
+return -ENOTSUP;
+}
+#endif
+return qemu_rbd_start_co(bs, offset, count, NULL, flags,
+ RBD_AIO_WRITE_ZEROES);
+}
+#endif
+
 static int qemu_rbd_getinfo(BlockDriverState *bs, BlockDriverInfo *bdi)
 {
 BDRVRBDState *s = bs->opaque;
@@ -1110,6 +1141,9 @@ static BlockDriver bdrv_rbd = {
 .bdrv_co_pwritev= qemu_rbd_co_pwritev,
 .bdrv_co_flush_to_disk  = qemu_rbd_co_flush,
 .bdrv_co_pdiscard   = qemu_rbd_co_pdiscard,
+#ifdef LIBRBD_SUPPORTS_WRITE_ZEROES
+.bdrv_co_pwrite_zeroes  = qemu_rbd_co_pwrite_zeroes,
+#endif
 
 .bdrv_snapshot_create   = qemu_rbd_snap_create,
 .bdrv_snapshot_delete   = qemu_rbd_snap_remove,
-- 
2.17.1





[PATCH V2 0/7] block/rbd: migrate to coroutines and add write zeroes support

2021-01-26 Thread Peter Lieven
This series migrates the qemu rbd driver from the old aio emulation
to native coroutines and adds write zeroes support which is important
for block operations.

To achieve this we first bump the librbd requirement to the already
outdated luminous release of ceph to get rid of some wrappers and
ifdef'ry in the code.

V1->V2:
 - this patch is now rebased on top of current master with Paolo's
   upcoming fixes for the meson.build script included:
    - meson: accept either shared or static libraries if --disable-static
    - meson: honor --enable-rbd if cc.links test fails
 - Patch 1: adjusted to meson.build script
 - Patch 2: unchanged
 - Patch 3: new patch
 - Patch 4: do not implement empty detach_aio_context callback [Jason]
 - Patch 5: - fix aio completion cleanup in error case [Jason]
            - return error codes from librbd
 - Patch 6: - add support for thick provisioning [Jason]
            - do not set write zeroes alignment
 - Patch 7: new patch

Peter Lieven (7):
  block/rbd: bump librbd requirement to luminous release
  block/rbd: store object_size in BDRVRBDState
  block/rbd: update s->image_size in qemu_rbd_getlength
  block/rbd: add bdrv_attach_aio_context
  block/rbd: migrate from aio to coroutines
  block/rbd: add write zeroes support
  block/rbd: drop qemu_rbd_refresh_limits

 block/rbd.c | 418 +---
 meson.build |  13 +-
 2 files changed, 142 insertions(+), 289 deletions(-)

-- 
2.17.1





[PATCH V2 4/7] block/rbd: add bdrv_attach_aio_context

2021-01-26 Thread Peter Lieven
Signed-off-by: Peter Lieven 
---
 block/rbd.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/block/rbd.c b/block/rbd.c
index f68ebcf240..7abd0252c9 100644
--- a/block/rbd.c
+++ b/block/rbd.c
@@ -91,6 +91,7 @@ typedef struct BDRVRBDState {
 char *namespace;
 uint64_t image_size;
 uint64_t object_size;
+AioContext *aio_context;
 } BDRVRBDState;
 
 static int qemu_rbd_connect(rados_t *cluster, rados_ioctx_t *io_ctx,
@@ -749,6 +750,8 @@ static int qemu_rbd_open(BlockDriverState *bs, QDict *options, int flags,
 }
 }
 
+s->aio_context = bdrv_get_aio_context(bs);
+
 /* When extending regular files, we get zeros from the OS */
 bs->supported_truncate_flags = BDRV_REQ_ZERO_WRITE;
 
@@ -839,8 +842,7 @@ static void rbd_finish_aiocb(rbd_completion_t c, RADOSCB *rcb)
 rcb->ret = rbd_aio_get_return_value(c);
 rbd_aio_release(c);
 
-replay_bh_schedule_oneshot_event(bdrv_get_aio_context(acb->common.bs),
- rbd_finish_bh, rcb);
+replay_bh_schedule_oneshot_event(acb->s->aio_context, rbd_finish_bh, rcb);
 }
 
 static BlockAIOCB *rbd_start_aio(BlockDriverState *bs,
@@ -1160,6 +1162,13 @@ static const char *const qemu_rbd_strong_runtime_opts[] = {
 NULL
 };
 
+static void qemu_rbd_attach_aio_context(BlockDriverState *bs,
+   AioContext *new_context)
+{
+BDRVRBDState *s = bs->opaque;
+s->aio_context = new_context;
+}
+
 static BlockDriver bdrv_rbd = {
 .format_name= "rbd",
 .instance_size  = sizeof(BDRVRBDState),
@@ -1189,6 +1198,8 @@ static BlockDriver bdrv_rbd = {
 .bdrv_snapshot_goto = qemu_rbd_snap_rollback,
 .bdrv_co_invalidate_cache = qemu_rbd_co_invalidate_cache,
 
+.bdrv_attach_aio_context  = qemu_rbd_attach_aio_context,
+
 .strong_runtime_opts= qemu_rbd_strong_runtime_opts,
 };
 
-- 
2.17.1





Re: [PATCH] meson: honor --enable-rbd if cc.links test fails

2021-01-26 Thread Peter Lieven
On 26.01.21 at 11:27, Paolo Bonzini wrote:
> If the link test failed, compilation proceeded with RBD disabled,
> even if --enable-rbd was used on the configure command line.
> Fix that.
>
> Signed-off-by: Paolo Bonzini 
> ---
>  meson.build | 24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/meson.build b/meson.build
> index f991d4274d..5943aa8a51 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -690,15 +690,21 @@ if not get_option('rbd').auto() or have_block
>librbd = cc.find_library('rbd', has_headers: ['rbd/librbd.h'],
> required: get_option('rbd'),
> kwargs: static_kwargs)
> -  if librados.found() and librbd.found() and cc.links('''
> -#include 
> -#include 
> -int main(void) {
> -  rados_t cluster;
> -  rados_create(&cluster, NULL);
> -  return 0;
> -}''', dependencies: [librbd, librados])
> -rbd = declare_dependency(dependencies: [librbd, librados])
> +  if librados.found() and librbd.found()
> +if cc.links('''
> +  #include 
> +  #include 
> +  int main(void) {
> +rados_t cluster;
> +rados_create(&cluster, NULL);
> +return 0;
> +  }''', dependencies: [librbd, librados])
> +  rbd = declare_dependency(dependencies: [librbd, librados])
> +elif get_option('rbd').enabled()
> +  error('could not link librados')
> +else
> +  warning('could not link librados, disabling')
> +endif
>endif
>  endif
>  


That was fast, thank you.


Tested-by: Peter Lieven 


Peter





Re: configure does not detect librados or librbd since the switch to meson

2021-01-25 Thread Peter Lieven
On 25.01.21 at 22:57, Paolo Bonzini wrote:
> On 25/01/21 20:47, Peter Lieven wrote:
>>> Can you include the meson-logs/meson-log.txt output?
>>
>> Sure: https://pastebin.com/u3XtbDvQ
>
> Does this work for you?
>
> diff --git a/meson.build b/meson.build
> index 690d48a6fd..a662772c4a 100644
> --- a/meson.build
> +++ b/meson.build
> @@ -14,6 +14,9 @@ config_host = keyval.load(meson.current_build_dir() / 'config-host.mak')
>  enable_modules = 'CONFIG_MODULES' in config_host
>  enable_static = 'CONFIG_STATIC' in config_host
>
> +# Allow both shared and static libraries unless --enable-static
> +static_kwargs = enable_static ? {'static': true} : {}
> +
>  # Temporary directory used for files created while
>  # configure runs. Since it is in the build directory
>  # we can safely blow away any previous version of it
> @@ -679,10 +682,10 @@ endif
>  rbd = not_found
>  if not get_option('rbd').auto() or have_block
>    librados = cc.find_library('rados', required: get_option('rbd'),
> - static: enable_static)
> + kwargs: static_kwargs)
>    librbd = cc.find_library('rbd', has_headers: ['rbd/librbd.h'],
>     required: get_option('rbd'),
> -   static: enable_static)
> +   kwargs: static_kwargs)
>    if librados.found() and librbd.found() and cc.links('''
>  #include 
>  #include 
> @@ -693,6 +696,9 @@ if not get_option('rbd').auto() or have_block
>  }''', dependencies: [librbd, librados])
>  rbd = declare_dependency(dependencies: [librbd, librados])
>    endif
> +  if not rbd.found() and get_option('rbd').enabled()
> +    error('could not link librbd')
> +  endif
>  endif
>
>  glusterfs = not_found
>
> (It's not a complete patch, all instances of "static: enable_static" would 
> need to be changed because other libraries could have the same issue).


Yes, it does.


Please CC me when you submit a complete patch. I will then build V2 of my rbd 
driver rewrite on top of it.


Thanks,

Peter





Re: configure does not detect librados or librbd since the switch to meson

2021-01-25 Thread Peter Lieven
On 25.01.21 at 16:24, Paolo Bonzini wrote:
> On 25/01/21 15:31, Peter Lieven wrote:
>> on Debian / Ubuntu, configure no longer detects librbd / librados
>> since the switch to meson.
>>
>> I need to add dirs: ['/usr/lib'] to the cc.find_library calls for
>> librados and librbd. But I am not familiar with meson and can't say
>> if that's the appropriate fix.
>
> Can you include the meson-logs/meson-log.txt output?


Sure: https://pastebin.com/u3XtbDvQ


>
>> Further issue: if I specify configure --enable-rbd and cc.links fails
>> the configure command succeeds and rbd support is disabled.
>
> That's a separate bug.


For the rbd check I can address this as well in the series. Sadly, librbd has 
no pkg-config (yet). So, I have to create a C file that checks for the version.
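
A minimal sketch of the comparison such a C file could perform. It assumes the
same major/minor/extra packing that the LIBRBD_VERSION() macro in rbd/librbd.h
uses; the sketch_* names and the 1.12.0 threshold are hypothetical and only
there for illustration, and the helpers are redefined locally so the block
compiles without librbd installed.

```c
#include <assert.h>

/*
 * Fold major/minor/extra into one comparable integer, mirroring the
 * packing of LIBRBD_VERSION() in rbd/librbd.h. Hypothetical helper,
 * defined here so the sketch builds without the librbd headers.
 */
int sketch_rbd_version(int maj, int min, int extra)
{
    assert(maj >= 0 && min >= 0 && min < 256 && extra >= 0 && extra < 256);
    return (maj << 16) + (min << 8) + extra;
}

/*
 * Returns 1 if the installed version code is at least the required one.
 * In a real configure probe, 'have' would be LIBRBD_VERSION_CODE and the
 * test would more likely be a preprocessor #if followed by #error, so
 * that the compile (and therefore the configure check) fails.
 */
int sketch_rbd_version_ok(int have, int need)
{
    return have >= need;
}
```

With this packing, a later release always compares greater than an earlier
one, so a single integer comparison is enough for the probe.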


Peter






Re: configure does not detect librados or librbd since the switch to meson

2021-01-25 Thread Peter Lieven
On 25.01.21 at 15:13, Peter Lieven wrote:
> Hi,
>
>
> on Debian / Ubuntu, configure no longer detects librbd / librados since 
> the switch to meson.
>
> I need to add dirs: ['/usr/lib'] to the cc.find_library calls for librados 
> and librbd. But I am not familiar with meson and can't say if that's the 
> appropriate fix.
>
>
> I would be thankful for a hint. I would create a patch to fix this and 
> include it ahead of my rbd driver rewrite, which I would like to respin 
> asap.


Further issue: if I specify configure --enable-rbd and cc.links fails, the 
configure command still succeeds and rbd support is disabled.


This seems to be an issue with all cc.links calls in the meson.build script.


Peter





configure does not detect librados or librbd since the switch to meson

2021-01-25 Thread Peter Lieven
Hi,


on Debian / Ubuntu, configure no longer detects librbd / librados since the 
switch to meson.

I need to add dirs: ['/usr/lib'] to the cc.find_library calls for librados and 
librbd. But I am not familiar with meson and can't say if that's the 
appropriate fix.


I would be thankful for a hint. I would create a patch to fix this and include 
it ahead of my rbd driver rewrite, which I would like to respin asap.


Peter





Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-21 Thread Peter Lieven
On 21.01.21 at 20:42, Jason Dillaman wrote:
> On Wed, Jan 20, 2021 at 6:01 PM Peter Lieven  wrote:
>>
>>> On 19.01.2021 at 15:20, Jason Dillaman wrote:
>>>
>>> On Tue, Jan 19, 2021 at 4:36 AM Peter Lieven  wrote:
>>>>> On 18.01.21 at 23:33, Jason Dillaman wrote:
>>>>> On Fri, Jan 15, 2021 at 10:39 AM Peter Lieven  wrote:
>>>>>> On 15.01.21 at 16:27, Jason Dillaman wrote:
>>>>>>> On Thu, Jan 14, 2021 at 2:59 PM Peter Lieven  wrote:
>>>>>>>> On 14.01.21 at 20:19, Jason Dillaman wrote:
>>>>>>>>> On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
>>>>>>>>>> since we implement byte interfaces and librbd supports aio on byte 
>>>>>>>>>> granularity we can lift
>>>>>>>>>> the 512 byte alignment.
>>>>>>>>>> Signed-off-by: Peter Lieven 
>>>>>>>>>> ---
>>>>>>>>>> block/rbd.c | 2 --
>>>>>>>>>> 1 file changed, 2 deletions(-)
>>>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>>>> index 27b4404adf..8673e8f553 100644
>>>>>>>>>> --- a/block/rbd.c
>>>>>>>>>> +++ b/block/rbd.c
>>>>>>>>>> @@ -223,8 +223,6 @@ done:
>>>>>>>>>> static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error 
>>>>>>>>>> **errp)
>>>>>>>>>> {
>>>>>>>>>>BDRVRBDState *s = bs->opaque;
>>>>>>>>>> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
>>>>>>>>>> -bs->bl.request_alignment = 512;
>>>>>>>>> Just a suggestion, but perhaps improve discard alignment, max discard,
>>>>>>>>> optimal alignment (if that's something QEMU handles internally) if not
>>>>>>>>> overridden by the user.
>>>>>>>> Qemu supports max_discard and discard_alignment. Is there a call to 
>>>>>>>> get these limits
>>>>>>>> from librbd?
>>>>>>>> What do you mean by optimal_alignment? The object size?
>>>>>>> krbd does a good job of initializing defaults [1] where optimal and
>>>>>>> discard alignment is 64KiB (can actually be 4KiB now), max IO size for
>>>>>>> writes, discards, and write-zeroes is the object size * the stripe
>>>>>>> count.
>>>>>> Okay, I will have a look at it. If qemu issues a write, discard, 
>>>>>> write_zero greater than
>>>>>> obj_size  * stripe count will librbd split it internally or will the 
>>>>>> request fail?
>>>>> librbd will handle it as needed. My goal is really just to get the
>>>>> hints down to the guest OS.
>>>>>> Regarding the alignment it seems that rbd_dev->opts->alloc_size is 
>>>>>> something that comes from the device
>>>>>> configuration and not from rbd? I don't have that information inside the 
>>>>>> Qemu RBD driver.
>>>>> librbd doesn't really have the information either. The 64KiB guess
>>>>> that krbd uses was a compromise since that was the default OSD
>>>>> allocation size for HDDs since Luminous. Starting with Pacific that
>>>>> default is going down to 4KiB.
>>>> I will try to adjust these values as far as it is possible and makes sense.
>>>> Is there a way to check the minimum supported OSD release in the backend 
>>>> from librbd / librados?
>>> It's not a minimum -- RADOS will gladly accept 1 byte writes as well.
>>> It's really just the optimal (performance and space-wise). Sadly,
>>> there is no realistic way to query this data from the backend.
>> So you would suggest advertising an optimal transfer length of 64k and max 
>> transfer length of obj size * stripe count to the guest unless we have an 
>> API in the future to query these limits from the backend?
> I'll open a Ceph tracker ticket to expose these via the API in a future 
> release.


That would be good to have!


>
>> I would leave request alignment at 1 byte as otherwise Qemu will issue RMWs 
>> for all write requests that do not align. Everything that comes from a guest 
>> OS is very likely 4k aligned anyway.
> My goal is

Re: [PATCH 7/7] block/rbd: change request alignment to 1 byte

2021-01-20 Thread Peter Lieven

> On 19.01.2021 at 15:20, Jason Dillaman wrote:
> 
> On Tue, Jan 19, 2021 at 4:36 AM Peter Lieven  wrote:
>>> On 18.01.21 at 23:33, Jason Dillaman wrote:
>>> On Fri, Jan 15, 2021 at 10:39 AM Peter Lieven  wrote:
>>>> On 15.01.21 at 16:27, Jason Dillaman wrote:
>>>>> On Thu, Jan 14, 2021 at 2:59 PM Peter Lieven  wrote:
>>>>>> On 14.01.21 at 20:19, Jason Dillaman wrote:
>>>>>>> On Sun, Dec 27, 2020 at 11:42 AM Peter Lieven  wrote:
>>>>>>>> since we implement byte interfaces and librbd supports aio on byte 
>>>>>>>> granularity we can lift
>>>>>>>> the 512 byte alignment.
>>>>>>>> Signed-off-by: Peter Lieven 
>>>>>>>> ---
>>>>>>>> block/rbd.c | 2 --
>>>>>>>> 1 file changed, 2 deletions(-)
>>>>>>>> diff --git a/block/rbd.c b/block/rbd.c
>>>>>>>> index 27b4404adf..8673e8f553 100644
>>>>>>>> --- a/block/rbd.c
>>>>>>>> +++ b/block/rbd.c
>>>>>>>> @@ -223,8 +223,6 @@ done:
>>>>>>>> static void qemu_rbd_refresh_limits(BlockDriverState *bs, Error **errp)
>>>>>>>> {
>>>>>>>>BDRVRBDState *s = bs->opaque;
>>>>>>>> -/* XXX Does RBD support AIO on less than 512-byte alignment? */
>>>>>>>> -bs->bl.request_alignment = 512;
>>>>>>> Just a suggestion, but perhaps improve discard alignment, max discard,
>>>>>>> optimal alignment (if that's something QEMU handles internally) if not
>>>>>>> overridden by the user.
>>>>>> Qemu supports max_discard and discard_alignment. Is there a call to get 
>>>>>> these limits
>>>>>> from librbd?
>>>>>> What do you mean by optimal_alignment? The object size?
>>>>> krbd does a good job of initializing defaults [1] where optimal and
>>>>> discard alignment is 64KiB (can actually be 4KiB now), max IO size for
>>>>> writes, discards, and write-zeroes is the object size * the stripe
>>>>> count.
>>>> Okay, I will have a look at it. If qemu issues a write, discard, 
>>>> write_zero greater than
>>>> obj_size  * stripe count will librbd split it internally or will the 
>>>> request fail?
>>> librbd will handle it as needed. My goal is really just to get the
>>> hints down to the guest OS.
>>>> Regarding the alignment it seems that rbd_dev->opts->alloc_size is 
>>>> something that comes from the device
>>>> configuration and not from rbd? I don't have that information inside the 
>>>> Qemu RBD driver.
>>> librbd doesn't really have the information either. The 64KiB guess
>>> that krbd uses was a compromise since that was the default OSD
>>> allocation size for HDDs since Luminous. Starting with Pacific that
>>> default is going down to 4KiB.
>> I will try to adjust these values as far as it is possible and makes sense.
>> Is there a way to check the minimum supported OSD release in the backend 
>> from librbd / librados?
> 
> It's not a minimum -- RADOS will gladly accept 1 byte writes as well.
> It's really just the optimal (performance and space-wise). Sadly,
> there is no realistic way to query this data from the backend.

So you would suggest advertising an optimal transfer length of 64k and a max 
transfer length of obj size * stripe count to the guest unless we have an API 
in the future to query these limits from the backend?
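
As a rough sketch of the limits proposed here, assuming the image's object
size and stripe count are already known to the driver. The struct and function
names below are hypothetical stand-ins and not QEMU's actual BlockLimits
fields; the 64 KiB value is the guess discussed in this thread, not something
queried from the backend.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the hints discussed in this thread. */
struct sketch_limits {
    uint32_t request_alignment; /* smallest unit the driver accepts, in bytes */
    uint32_t opt_transfer;      /* preferred I/O size hint, in bytes */
    uint64_t max_transfer;      /* largest single request hint, in bytes */
};

/* 64 KiB guess taken from the discussion; an assumption, not a queried value. */
#define SKETCH_OPT_TRANSFER (64 * 1024)

/* Derive the advertised hints from object size and stripe count. */
struct sketch_limits sketch_rbd_limits(uint64_t obj_size, uint64_t stripe_count)
{
    struct sketch_limits l;

    assert(obj_size > 0 && stripe_count > 0);
    l.request_alignment = 1;                  /* byte granularity, no RMW */
    l.opt_transfer = SKETCH_OPT_TRANSFER;
    l.max_transfer = obj_size * stripe_count; /* one full stripe set */
    return l;
}
```

For an image with 4 MiB objects and a stripe count of 1, this would advertise
a 4 MiB max transfer; these are hints only, since librbd handles larger
requests as needed.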

I would leave request alignment at 1 byte as otherwise Qemu will issue RMWs for 
all write requests that do not align. Everything that comes from a guest OS is 
very likely 4k aligned anyway.
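
To illustrate the read-modify-write cost mentioned above: a sketch that
computes how many extra bytes must be read first when a write is widened to a
given request alignment. The helper names are hypothetical and this is not how
QEMU's block layer is structured internally; it only shows the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Round x down to a multiple of align. */
uint64_t sketch_align_down(uint64_t x, uint64_t align)
{
    assert(align > 0);
    return x - x % align;
}

/* Round x up to a multiple of align. */
uint64_t sketch_align_up(uint64_t x, uint64_t align)
{
    assert(align > 0);
    return sketch_align_down(x + align - 1, align);
}

/*
 * Bytes outside [offset, offset + bytes) that have to be read and written
 * back when the request is widened to 'align'. Zero means no RMW is needed.
 */
uint64_t sketch_rmw_padding(uint64_t offset, uint64_t bytes, uint64_t align)
{
    uint64_t start = sketch_align_down(offset, align);
    uint64_t end = sketch_align_up(offset + bytes, align);

    return (end - start) - bytes;
}
```

With an alignment of 1 the padding is always zero, while a 50-byte write at
offset 100 under a 512-byte alignment drags in the rest of the sector. A
4k-aligned guest write needs no padding under either setting.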

Peter




