Re: [PATCH v5 2/4] vl: Initialise main loop earlier

2020-02-19 Thread Wolfgang Bumiller
On Tue, Feb 18, 2020 at 04:40:34PM +0100, Kevin Wolf wrote:
> We want to be able to use qemu_aio_context in the monitor
> initialisation.
> 
> Signed-off-by: Kevin Wolf 
> Reviewed-by: Marc-André Lureau 
> Reviewed-by: Markus Armbruster 
> ---
>  vl.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/vl.c b/vl.c
> index 794f2e5733..98bc51e089 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2894,6 +2894,11 @@ int main(int argc, char **argv, char **envp)
>  runstate_init();
>  precopy_infrastructure_init();
>  postcopy_infrastructure_init();
> +
> +if (qemu_init_main_loop(&main_loop_err)) {
> +error_report_err(main_loop_err);
> +exit(1);
> +}
>  monitor_init_globals();

This is a tiny bit scary: we now have around 1kloc of code between
here and os_daemonize() where a future change may accidentally cause
the aio context's on-demand thread pool to spawn before fork()ing
(silently losing the threads again - we did have exactly such an issue
in monitor_init_globals() in the past).

For *now* it should be fine, since so far that wouldn't have worked
anyway, but we may need to keep an eye out for this in future patches.
Basically, not everything we do between here and os_daemonize() way down
below is actually allowed to freely use the main loop's aio context,
even though it now already exists from this point onward.

I wonder if aio_get_thread_pool() should check the daemonization state
(maybe with some debug #ifdef)?
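
Something along these lines is what I have in mind - just a rough
sketch, not actual qemu code, and the helper asking "are we still going
to fork for -daemonize?" is made up:

/* Rough sketch only; qemu_daemonize_fork_pending() is a made-up helper. */
ThreadPool *aio_get_thread_pool(AioContext *ctx)
{
    if (!ctx->thread_pool) {
#ifdef DEBUG_AIO_THREAD_POOL
        /* Worker threads spawned before the daemonizing fork() would be
         * silently lost in the child. */
        assert(!qemu_daemonize_fork_pending());
#endif
        ctx->thread_pool = thread_pool_new(ctx);
    }
    return ctx->thread_pool;
}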

>  
>  if (qcrypto_init() < 0) {
> @@ -3811,11 +3816,6 @@ int main(int argc, char **argv, char **envp)
>  qemu_unlink_pidfile_notifier.notify = qemu_unlink_pidfile;
>  qemu_add_exit_notifier(&qemu_unlink_pidfile_notifier);
>  
> -if (qemu_init_main_loop(&main_loop_err)) {
> -error_report_err(main_loop_err);
> -exit(1);
> -}
> -
>  #ifdef CONFIG_SECCOMP
>  olist = qemu_find_opts_err("sandbox", NULL);
>  if (olist) {
> -- 
> 2.20.1




Re: backup_calculate_cluster_size does not consider source

2019-11-06 Thread Wolfgang Bumiller
On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
> On 06.11.19 09:32, Stefan Hajnoczi wrote:
> > On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> >> Example: Backup from ceph disk (rbd_cache=false) to local disk:
> >>
> >> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
> >>
> >> Then the backup job starts to read 64K blocks from ceph.
> >>
> >> But ceph always reads 4M block, so this is incredibly slow and produces
> >> way too much network traffic.
> >>
> >> Why does backup_calculate_cluster_size not consider the block size
> >> from the source disk?
> >>
> >> cluster_size = MAX(block_size_source, block_size_target)
> 
> So Ceph always transmits 4 MB over the network, no matter what is
> actually needed?  That sounds, well, interesting.

Or at least it generates that much I/O - in the end, it can slow down
the backup by up to a multi-digit factor...

> backup_calculate_cluster_size() doesn’t consider the source size because
> to my knowledge there is no other medium that behaves this way.  So I
> suppose the assumption was always that the block size of the source
> doesn’t matter, because a partial read is always possible (without
> having to read everything).

Unless you enable qemu-side caching, this only works as long as the
block/cluster size of the source does not exceed that of the target.

> What would make sense to me is to increase the buffer size in general.
> I don’t think we need to copy clusters at a time, and
> 0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
> backup writes that are triggered by guest writes.  We haven’t yet
> increased the copy size for background writes, though.  We can do that,
> of course.  (And probably should.)
> 
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general.  It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.

But guests usually have a page cache, which is why in many setups qemu
(and thereby the backup process) often doesn't use one.
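
For the record, what Dietmar suggests would roughly amount to something
like the sketch below. It is untested and simplified - the real
backup_calculate_cluster_size() only takes the target, and I'm assuming
the source driver actually reports its cluster/object size via
bdrv_get_info(), which may or may not be the case for rbd:

/* Untested sketch of the MAX(source, target) idea - not a patch. */
static int64_t backup_calculate_cluster_size(BlockDriverState *source,
                                             BlockDriverState *target)
{
    BlockDriverInfo sdi, tdi;
    int64_t cluster_size = BACKUP_CLUSTER_SIZE_DEFAULT;

    if (bdrv_get_info(target, &tdi) == 0 && tdi.cluster_size > 0) {
        cluster_size = MAX(cluster_size, tdi.cluster_size);
    }
    /* New part: also respect the source's cluster size (e.g. a 4 MB
     * rbd object size) so the job doesn't issue tiny reads against it. */
    if (bdrv_get_info(source, &sdi) == 0 && sdi.cluster_size > 0) {
        cluster_size = MAX(cluster_size, sdi.cluster_size);
    }
    return cluster_size;
}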




[Qemu-block] [PATCH] ratelimit: don't align wait time with slices

2018-02-06 Thread Wolfgang Bumiller
It is possible for rate-limited writes to keep overshooting a slice's
quota by a tiny amount, causing the slice-aligned waiting period to
effectively halve the rate.

Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
---
Copied the Ccs from the discussion thread, hope that's fine. I also just
noticed that my reply containing this snippet went to a mail that did
not yet contain those Ccs, sorry about that.
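
To make the "effectively halve the rate" part concrete, here is a small
standalone illustration with made-up numbers (not part of the patch):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

int main(void)
{
    uint64_t slice_ns = 100000000;  /* 100ms slice */
    uint64_t slice_quota = 1000;    /* bytes allowed per slice */
    uint64_t dispatched = 1001;     /* overshot the quota by one byte */

    /* Old, slice-aligned wait: round up to the next whole slice, i.e.
     * a full extra 100ms for a 0.1% overshoot. Repeated every slice
     * this halves the effective rate. */
    uint64_t old_slices = (dispatched + slice_quota - 1) / slice_quota;

    /* New wait: proportional to the actual excess, ~100.1ms. */
    double new_slices = (double)dispatched / slice_quota;

    /* Both values are offsets from slice_start_time at which sending
     * may resume. */
    printf("old: resume at slice_start + %" PRIu64 " ns\n",
           old_slices * slice_ns);
    printf("new: resume at slice_start + %" PRIu64 " ns\n",
           (uint64_t)(new_slices * slice_ns));
    return 0;
}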

 include/qemu/ratelimit.h | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/qemu/ratelimit.h b/include/qemu/ratelimit.h
index 8dece483f5..1b38291823 100644
--- a/include/qemu/ratelimit.h
+++ b/include/qemu/ratelimit.h
@@ -36,7 +36,7 @@ typedef struct {
 static inline int64_t ratelimit_calculate_delay(RateLimit *limit, uint64_t n)
 {
 int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-uint64_t delay_slices;
+double delay_slices;
 
 assert(limit->slice_quota && limit->slice_ns);
 
@@ -55,12 +55,11 @@ static inline int64_t ratelimit_calculate_delay(RateLimit *limit, uint64_t n)
 return 0;
 }
 
-/* Quota exceeded. Calculate the next time slice we may start
- * sending data again. */
-delay_slices = (limit->dispatched + limit->slice_quota - 1) /
-limit->slice_quota;
+/* Quota exceeded. Wait based on the excess amount and then start a new
+ * slice. */
+delay_slices = (double)limit->dispatched / limit->slice_quota;
 limit->slice_end_time = limit->slice_start_time +
-delay_slices * limit->slice_ns;
+(uint64_t)(delay_slices * limit->slice_ns);
 return limit->slice_end_time - now;
 }
 
-- 
2.11.0





Re: [Qemu-block] [Qemu-discuss] Converting qcow2 image to raw thin lv

2017-02-13 Thread Wolfgang Bumiller
On Mon, Feb 13, 2017 at 11:04:30AM +0100, Kevin Wolf wrote:
> Am 12.02.2017 um 01:58 hat Nir Soffer geschrieben:
> > On Sat, Feb 11, 2017 at 12:23 AM, Nir Soffer  wrote:
> > > Hi all,
> > >
> > > I'm trying to convert images (mostly qcow2) to raw format on thin lv,
> > > hoping to write only the allocated blocks on the thin lv, but
> > > it seems that qemu-img cannot write sparse image on a block
> > > device.
> > >
> > > (...)
> > 
> > So it seems that qemu-img is trying to write a sparse image.
> > 
> > I tested again with empty file:
> > 
> > truncate -s 20m empty
> > 
> > Using strace, qemu-img checks the device discard_zeroes_data:
> > 
> > ioctl(11, BLKDISCARDZEROES, 0)  = 0
> > 
> > Then it find that the source is empty:
> > 
> > lseek(10, 0, SEEK_DATA) = -1 ENXIO (No such device
> > or address)
> > 
> > Then it issues one call
> > 
> > [pid 10041] ioctl(11, BLKZEROOUT, 0x7f6049c82ba0) = 0
> > 
> > And fsync and close the destination.
> > 
> > # grep -s "" /sys/block/dm-57/queue/discard_*
> > /sys/block/dm-57/queue/discard_granularity:65536
> > /sys/block/dm-57/queue/discard_max_bytes:17179869184
> > /sys/block/dm-57/queue/discard_zeroes_data:0
> > 
> > I wonder why discard_zeroes_data is 0, while discarding
> > blocks seems to zero them.
> > 
> > Seems that this this bug:
> > https://bugzilla.redhat.com/835622
> > 
> > thin lv does promise (by default) to zero new allocated blocks,
> > and it does returns zeros when reading unallocated data, like
> > a sparse file.
> > 
> > Since qemu does not know that the thin lv is not allocated, it cannot
> > skip empty blocks safely.
> > 
> > It would be useful if it had a flag to force sparsness when the
> > user knows that this operation is safe, or maybe we need a thin lvm
> > driver?
> 
> Yes, I think your analysis is correct, I seem to remember that I've seen
> this happen before.
> 
> The Right Thing (TM) to do, however, seems to be fixing the kernel so
> that BLKDISCARDZEROES correctly returns that discard does in fact zero
> out blocks on this device. As soon as this ioctl works correctly,
> qemu-img should just automatically do what you want.
> 
> Now if it turns out it is important to support older kernels without the
> fix, we can think about a driver-specific option for the 'file' driver
> that overrides the kernel's value. But I really want to make sure that
> we use such workarounds only in addition, not instead of doing the
> proper root cause fix in the kernel.
> 
> So can you please bring it up with the LVM people?

I'm not sure it's that easy. The discard granularity of LVM thin volumes
is not equal to their reported block/sector sizes, but to the size of
the chunks they allocate.

  # blockdev --getss /dev/dm-9
  512
  # blockdev --getbsz /dev/dm-9
  4096
  # blockdev --getpbsz /dev/dm-9
  4096
  # cat /sys/block/dm-9/queue/discard_granularity
  131072
  #

I currently don't see qemu using the discard_granularity property for
this purpose. IIRC the write_zeroes() code path, for example, simply
checks the discard_zeroes flag but not the size it is trying to
zero out/discard.
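
What I have in mind is roughly the following kind of check (just a
sketch, not existing qemu code; the function and parameter names are
made up):

#include <stdbool.h>
#include <stdint.h>

/* Only rely on "discard zeroes data" when the request is actually
 * aligned to the device's discard granularity (128k for the dm-thin
 * pool above). */
static bool discard_reliably_zeroes(uint64_t offset, uint64_t bytes,
                                    uint64_t discard_granularity,
                                    bool discard_zeroes_data)
{
    if (!discard_zeroes_data || discard_granularity == 0) {
        return false;
    }
    return offset % discard_granularity == 0 &&
           bytes % discard_granularity == 0;
}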

We have an experimental, semi-complete "can-do-footshooting" 'zeroinit'
filter for this purpose, which basically sets the "has_zero_init" flag
explicitly and drops "write_zeroes()" calls for blocks at an address
greater than the highest one written up to that point.
It should use a dirty bitmap instead and is somewhat dangerous as it is,
which is why it's not on the qemu-devel list. But if this approach is at
all acceptable (despite being a hack) I could improve it and send it to
the list?
https://github.com/Blub/qemu/commit/6f6f38d2ef8f22a12f72e4d60f8a1fa978ac569a
(you'd just prefix the destination with `zeroinit:` in the qemu-img
command)
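
For example (the destination path is of course just an example):

  # qemu-img convert -p -O raw input.qcow2 zeroinit:/dev/vgthin/vm-disk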

Additionally, I'm currently still playing with the details and quirks of
various storage types (lvm/dm thin, rbd, zvols) in an attempt to create
a tool to convert between them. (I did some successful tests converting
disk images between these storage types and qcow2, together with their
snapshots, in a COW-aware way...) I'm planning on releasing some
experimental code soon-ish (there's still some polishing to do on the
documentation, the library's API and the format - and the qcow2 support
is a patch for qemu-img to use the library.)

My adventures into dm-thin metadata allow me to answer this one, though:

> > or maybe we need a thin lvm driver?

Probably not. It does not support SEEK_DATA/SEEK_HOLE and to my
knowledge also has no other sane metadata querying methods; you'd have
to read the metadata device instead. To do this properly you have to
reserve a metadata snapshot, and there can only ever be one of those per
pool, which means you could only have one such disk in total running on
a system, and no other dm-thin metadata aware tool could be used during
that time (otherwise the reserve operations will fail with an error and
qemu would have to wait a lot...).




Re: [Qemu-block] [Qemu-devel] [RFC PATCH] glusterfs: allow partial reads

2016-12-05 Thread Wolfgang Bumiller
On Fri, Dec 02, 2016 at 01:13:28PM -0600, Eric Blake wrote:
> On 12/01/2016 04:59 AM, Wolfgang Bumiller wrote:
> > Fixes #1644754.
> > 
> > Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
> > ---
> > I'm not sure what the original rationale was to treat both partial
> > reads and partial writes as I/O errors. (This seems to have happened
> > between the original glusterfs v1 and v2 series, with a note but no
> > reasoning for the read side as far as I could see.)
> > The general direction lately seems to be to move away from sector
> > based block APIs. Also eg. the NFS code allows partial reads. (It
> > does, however, have an old patch (c2eb918e3) dedicated to aligning
> > sizes to 512 byte boundaries for file creation for compatibility to
> > other parts of qemu like qcow2. This already happens in glusterfs,
> > though, but if you move a file from a different storage over to
> > glusterfs you may end up with a qcow2 file with eg. the L1 table in
> > the last 80 bytes of the file aligned to _begin_ at a 512 boundary,
> > but not _end_ at one.)
> > 
> >  block/gluster.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/block/gluster.c b/block/gluster.c
> > index 891c13b..3db0bf8 100644
> > --- a/block/gluster.c
> > +++ b/block/gluster.c
> > @@ -41,6 +41,7 @@ typedef struct GlusterAIOCB {
> >  int ret;
> >  Coroutine *coroutine;
> >  AioContext *aio_context;
> > +bool is_write;
> >  } GlusterAIOCB;
> >  
> >  typedef struct BDRVGlusterState {
> > @@ -716,8 +717,10 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
> >  acb->ret = 0; /* Success */
> >  } else if (ret < 0) {
> >  acb->ret = -errno; /* Read/Write failed */
> > +} else if (acb->is_write) {
> > +acb->ret = -EIO; /* Partial write - fail it */
> >  } else {
> > -acb->ret = -EIO; /* Partial read/write - fail it */
> > +acb->ret = 0; /* Success */
> 
> Does this properly guarantee that the portion beyond EOF reads as zero?

I'd argue this wasn't necessarily the case before either, considering
the first check starts with `!ret`:

if (!ret || ret == acb->size) {
acb->ret = 0; /* Success */

A read right at EOF would return 0 and be treated as success there, no?
IOW, it wouldn't zero out the destination buffer as far as I can see.
Come to think of it, I'm not too fond of this part of the check for the
write case either.
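
If we do want to guarantee zeroes past EOF with this approach, a
follow-up along these lines might be needed. This is only a rough,
untested sketch on top of this patch (the read_bytes field is new): the
callback would record how much was actually read, and the coroutine
would zero the tail of the buffer:

/* GlusterAIOCB gains a field: */
    size_t read_bytes;  /* bytes actually transferred */

/* qemu_gluster_co_rw(), where acb is set up: */
    acb.read_bytes = size;

/* gluster_finish_aiocb(), in the new partial-read branch: */
    acb->ret = 0;           /* Success, but short */
    acb->read_bytes = ret;

/* qemu_gluster_co_rw(), after qemu_coroutine_yield(): */
    if (!write && acb.ret == 0 && acb.read_bytes < size) {
        qemu_iovec_memset(qiov, acb.read_bytes, 0, size - acb.read_bytes);
    }

This would still leave the `!ret` case from above alone, of course.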

> Would it be better to switch to byte-based interfaces rather than
> continue to force gluster interaction in 512-byte sector chunks, since
> gluster can obviously store files that are not 512-aligned?




[Qemu-block] [RFC PATCH] glusterfs: allow partial reads

2016-12-01 Thread Wolfgang Bumiller
Fixes #1644754.

Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
---
I'm not sure what the original rationale was to treat both partial
reads and partial writes as I/O errors. (This seems to have happened
between the original glusterfs v1 and v2 series, with a note but no
reasoning for the read side as far as I could see.)
The general direction lately seems to be to move away from sector
based block APIs. Also eg. the NFS code allows partial reads. (It
does, however, have an old patch (c2eb918e3) dedicated to aligning
sizes to 512 byte boundaries for file creation for compatibility to
other parts of qemu like qcow2. This already happens in glusterfs,
though, but if you move a file from a different storage over to
glusterfs you may end up with a qcow2 file with eg. the L1 table in
the last 80 bytes of the file aligned to _begin_ at a 512 boundary,
but not _end_ at one.)

 block/gluster.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/gluster.c b/block/gluster.c
index 891c13b..3db0bf8 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -41,6 +41,7 @@ typedef struct GlusterAIOCB {
 int ret;
 Coroutine *coroutine;
 AioContext *aio_context;
+bool is_write;
 } GlusterAIOCB;
 
 typedef struct BDRVGlusterState {
@@ -716,8 +717,10 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
 acb->ret = 0; /* Success */
 } else if (ret < 0) {
 acb->ret = -errno; /* Read/Write failed */
+} else if (acb->is_write) {
+acb->ret = -EIO; /* Partial write - fail it */
 } else {
-acb->ret = -EIO; /* Partial read/write - fail it */
+acb->ret = 0; /* Success */
 }
 
 aio_bh_schedule_oneshot(acb->aio_context, qemu_gluster_complete_aio, acb);
@@ -965,6 +968,7 @@ static coroutine_fn int qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
 acb.ret = 0;
 acb.coroutine = qemu_coroutine_self();
 acb.aio_context = bdrv_get_aio_context(bs);
+acb.is_write = true;
 
ret = glfs_zerofill_async(s->fd, offset, size, gluster_finish_aiocb, &acb);
 if (ret < 0) {
@@ -1087,9 +1091,11 @@ static coroutine_fn int qemu_gluster_co_rw(BlockDriverState *bs,
 acb.aio_context = bdrv_get_aio_context(bs);
 
 if (write) {
+acb.is_write = true;
 ret = glfs_pwritev_async(s->fd, qiov->iov, qiov->niov, offset, 0,
 gluster_finish_aiocb, &acb);
 } else {
+acb.is_write = false;
 ret = glfs_preadv_async(s->fd, qiov->iov, qiov->niov, offset, 0,
gluster_finish_aiocb, &acb);
 }
@@ -1153,6 +1159,7 @@ static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
 acb.ret = 0;
 acb.coroutine = qemu_coroutine_self();
 acb.aio_context = bdrv_get_aio_context(bs);
+acb.is_write = true;
 
ret = glfs_fsync_async(s->fd, gluster_finish_aiocb, &acb);
 if (ret < 0) {
@@ -1199,6 +1206,7 @@ static coroutine_fn int qemu_gluster_co_pdiscard(BlockDriverState *bs,
 acb.ret = 0;
 acb.coroutine = qemu_coroutine_self();
 acb.aio_context = bdrv_get_aio_context(bs);
+acb.is_write = true;
 
ret = glfs_discard_async(s->fd, offset, size, gluster_finish_aiocb, &acb);
 if (ret < 0) {
-- 
2.1.4