Re: [PATCH v5 2/4] vl: Initialise main loop earlier
On Tue, Feb 18, 2020 at 04:40:34PM +0100, Kevin Wolf wrote:
> We want to be able to use qemu_aio_context in the monitor
> initialisation.
>
> Signed-off-by: Kevin Wolf
> Reviewed-by: Marc-André Lureau
> Reviewed-by: Markus Armbruster
> ---
>  vl.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/vl.c b/vl.c
> index 794f2e5733..98bc51e089 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -2894,6 +2894,11 @@ int main(int argc, char **argv, char **envp)
>      runstate_init();
>      precopy_infrastructure_init();
>      postcopy_infrastructure_init();
> +
> +    if (qemu_init_main_loop(&main_loop_err)) {
> +        error_report_err(main_loop_err);
> +        exit(1);
> +    }
>      monitor_init_globals();

This is a tiny bit scary, as we now have around 1kloc of code between
here and os_daemonize() where in the future we may accidentally cause
the aio context's on-demand thread pool to spawn before fork()ing
(silently losing the threads again; we did have such an issue right
there in monitor_init_globals() in the past).

For *now* it should be fine, since currently that wouldn't have worked,
but we may need to keep an eye out for it in future patches. Basically,
not everything we do between here and os_daemonize() way down below is
actually allowed to freely use the main loop's aio context, even though
it now already exists from this point onward.

I wonder if aio_get_thread_pool() should check the daemonization state
(maybe with some debug #ifdef)?

>
>      if (qcrypto_init() < 0) {
> @@ -3811,11 +3816,6 @@ int main(int argc, char **argv, char **envp)
>      qemu_unlink_pidfile_notifier.notify = qemu_unlink_pidfile;
>      qemu_add_exit_notifier(&qemu_unlink_pidfile_notifier);
>
> -    if (qemu_init_main_loop(&main_loop_err)) {
> -        error_report_err(main_loop_err);
> -        exit(1);
> -    }
> -
>  #ifdef CONFIG_SECCOMP
>      olist = qemu_find_opts_err("sandbox", NULL);
>      if (olist) {
> --
> 2.20.1
Re: backup_calculate_cluster_size does not consider source
On Wed, Nov 06, 2019 at 10:37:04AM +0100, Max Reitz wrote:
> On 06.11.19 09:32, Stefan Hajnoczi wrote:
> > On Tue, Nov 05, 2019 at 11:02:44AM +0100, Dietmar Maurer wrote:
> >> Example: Backup from ceph disk (rbd_cache=false) to local disk:
> >>
> >> backup_calculate_cluster_size returns 64K (correct for my local .raw image)
> >>
> >> Then the backup job starts to read 64K blocks from ceph.
> >>
> >> But ceph always reads 4M blocks, so this is incredibly slow and
> >> produces way too much network traffic.
> >>
> >> Why does backup_calculate_cluster_size not consider the block size
> >> of the source disk?
> >>
> >> cluster_size = MAX(block_size_source, block_size_target)
>
> So Ceph always transmits 4 MB over the network, no matter what is
> actually needed? That sounds, well, interesting.

Or at least it generates that much I/O; in the end, it can slow down the
backup by up to a multi-digit factor...

> backup_calculate_cluster_size() doesn’t consider the source size because
> to my knowledge there is no other medium that behaves this way. So I
> suppose the assumption was always that the block size of the source
> doesn’t matter, because a partial read is always possible (without
> having to read everything).

Unless you enable qemu-side caching, this only works until the
block/cluster size of the source exceeds that of the target.

> What would make sense to me is to increase the buffer size in general.
> I don’t think we need to copy clusters at a time, and
> 0e2402452f1f2042923a5 has indeed increased the copy size to 1 MB for
> backup writes that are triggered by guest writes. We haven’t yet
> increased the copy size for background writes, though. We can do that,
> of course. (And probably should.)
>
> The thing is, it just seems unnecessary to me to take the source cluster
> size into account in general. It seems weird that a medium only allows
> 4 MB reads, because, well, guests aren’t going to take that into account.
But guests usually have a page cache of their own, which is why in many
setups qemu (and thereby the backup process) runs without one.
[Qemu-block] [PATCH] ratelimit: don't align wait time with slices
It is possible for rate limited writes to keep overshooting a slice's
quota by a tiny amount, causing the slice-aligned waiting period to
effectively halve the rate.

Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
---
Copied the Ccs from the discussion thread, hope that's fine, as I also
just noticed that for my reply containing this snippet I had hit reply
on the mail that did not contain those Ccs yet, sorry about that.

 include/qemu/ratelimit.h | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/qemu/ratelimit.h b/include/qemu/ratelimit.h
index 8dece483f5..1b38291823 100644
--- a/include/qemu/ratelimit.h
+++ b/include/qemu/ratelimit.h
@@ -36,7 +36,7 @@ typedef struct {
 static inline int64_t ratelimit_calculate_delay(RateLimit *limit, uint64_t n)
 {
     int64_t now = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
-    uint64_t delay_slices;
+    double delay_slices;
 
     assert(limit->slice_quota && limit->slice_ns);
 
@@ -55,12 +55,11 @@ static inline int64_t ratelimit_calculate_delay(RateLimit *limit, uint64_t n)
         return 0;
     }
 
-    /* Quota exceeded. Calculate the next time slice we may start
-     * sending data again. */
-    delay_slices = (limit->dispatched + limit->slice_quota - 1) /
-        limit->slice_quota;
+    /* Quota exceeded. Wait based on the excess amount and then start a new
+     * slice. */
+    delay_slices = (double)limit->dispatched / limit->slice_quota;
     limit->slice_end_time = limit->slice_start_time +
-        delay_slices * limit->slice_ns;
+        (uint64_t)(delay_slices * limit->slice_ns);
 
     return limit->slice_end_time - now;
 }
-- 
2.11.0
Re: [Qemu-block] [Qemu-discuss] Converting qcow2 image to raw thin lv
On Mon, Feb 13, 2017 at 11:04:30AM +0100, Kevin Wolf wrote:
> On 12.02.2017 at 01:58, Nir Soffer wrote:
> > On Sat, Feb 11, 2017 at 12:23 AM, Nir Soffer wrote:
> > > Hi all,
> > >
> > > I'm trying to convert images (mostly qcow2) to raw format on thin lv,
> > > hoping to write only the allocated blocks on the thin lv, but
> > > it seems that qemu-img cannot write a sparse image on a block
> > > device.
> > >
> > > (...)
> >
> > So it seems that qemu-img is trying to write a sparse image.
> >
> > I tested again with an empty file:
> >
> >     truncate -s 20m empty
> >
> > Using strace, qemu-img checks the device discard_zeroes_data:
> >
> >     ioctl(11, BLKDISCARDZEROES, 0) = 0
> >
> > Then it finds that the source is empty:
> >
> >     lseek(10, 0, SEEK_DATA) = -1 ENXIO (No such device or address)
> >
> > Then it issues one call
> >
> >     [pid 10041] ioctl(11, BLKZEROOUT, 0x7f6049c82ba0) = 0
> >
> > and fsyncs and closes the destination.
> >
> >     # grep -s "" /sys/block/dm-57/queue/discard_*
> >     /sys/block/dm-57/queue/discard_granularity:65536
> >     /sys/block/dm-57/queue/discard_max_bytes:17179869184
> >     /sys/block/dm-57/queue/discard_zeroes_data:0
> >
> > I wonder why discard_zeroes_data is 0, while discarding
> > blocks seems to zero them.
> >
> > Seems that this is the bug:
> > https://bugzilla.redhat.com/835622
> >
> > thin lv does promise (by default) to zero newly allocated blocks,
> > and it does return zeros when reading unallocated data, like
> > a sparse file.
> >
> > Since qemu does not know that the thin lv is not allocated, it cannot
> > skip empty blocks safely.
> >
> > It would be useful if it had a flag to force sparseness when the
> > user knows that this operation is safe, or maybe we need a thin lvm
> > driver?
>
> Yes, I think your analysis is correct, I seem to remember that I've seen
> this happen before.
> The Right Thing (TM) to do, however, seems to be fixing the kernel so
> that BLKDISCARDZEROES correctly returns that discard does in fact zero
> out blocks on this device. As soon as this ioctl works correctly,
> qemu-img should just automatically do what you want.
>
> Now if it turns out it is important to support older kernels without the
> fix, we can think about a driver-specific option for the 'file' driver
> that overrides the kernel's value. But I really want to make sure that
> we use such workarounds only in addition, not instead of doing the
> proper root cause fix in the kernel.
>
> So can you please bring it up with the LVM people?

I'm not sure it's that easy. The discard granularity of LVM thin is not
equal to its reported block/sector sizes, but to the size of the chunks
it allocates:

    # blockdev --getss /dev/dm-9
    512
    # blockdev --getbsz /dev/dm-9
    4096
    # blockdev --getpbsz /dev/dm-9
    4096
    # cat /sys/block/dm-9/queue/discard_granularity
    131072
    #

I currently don't see qemu using the discard_granularity property for
this purpose. IIRC the code for write_zeroes() eg. simply checks the
discard_zeroes flag but not the size it is trying to zero-out/discard.

We have an experimental, semi-complete, "can-do-footshooting" 'zeroinit'
filter for this purpose, which basically explicitly sets the
"has_zero_init" flag and drops write_zeroes() calls for blocks at an
address greater than the highest written one up to that point. It
should use a dirty bitmap instead and is sort of dangerous this way,
which is why it's not on the qemu-devel list. But if this approach is
at all acceptable (despite being a hack) I could improve it and send it
to the list?
https://github.com/Blub/qemu/commit/6f6f38d2ef8f22a12f72e4d60f8a1fa978ac569a
(you'd just prefix the destination with `zeroinit:` in the qemu-img
command)

Additionally, I'm currently still playing with the details and quirks
of various storages (lvm/dm thin, rbd, zvols) in an attempt to create a
tool to convert between them. (I did some successful tests converting
disk images between these storages & qcow2, together with their
snapshots, in a COW-aware way...) I'm planning on releasing some
experimental code soon-ish (there's still some polishing to do, though,
on the documentation, the library's API and the format; the qcow2
support is a patch for qemu-img to use the library.)

My adventures into dm-thin metadata allow me to answer this one though:

> > or maybe we need a thin lvm driver?

Probably not. It does not support SEEK_DATA/SEEK_HOLE and to my
knowledge also has no other sane metadata querying methods. You'd have
to read the metadata device instead. To do this properly you have to
reserve a metadata snapshot, and there can only ever be one of those
per pool, which means you could only have one such disk in total
running on a system, and no other dm-thin metadata aware tool could be
used during that time (otherwise the reserve operation will fail with
an error and qemu would have to wait a lot...).
Re: [Qemu-block] [Qemu-devel] [RFC PATCH] glusterfs: allow partial reads
On Fri, Dec 02, 2016 at 01:13:28PM -0600, Eric Blake wrote:
> On 12/01/2016 04:59 AM, Wolfgang Bumiller wrote:
> > Fixes #1644754.
> >
> > Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
> > ---
> > I'm not sure what the original rationale was to treat both partial
> > reads as well as writes as I/O errors. (Seems to have happened from
> > the original glusterfs v1 to v2 series with a note but no reasoning
> > for the read side as far as I could see.)
> > The general direction lately seems to be to move away from sector
> > based block APIs. Also eg. the NFS code allows partial reads. (It
> > does, however, have an old patch (c2eb918e3) dedicated to aligning
> > sizes to 512 byte boundaries for file creation for compatibility
> > with other parts of qemu like qcow2. This already happens in
> > glusterfs, though, but if you move a file from a different storage
> > over to glusterfs you may end up with a qcow2 file with eg. the L1
> > table in the last 80 bytes of the file aligned to _begin_ at a 512
> > boundary, but not _end_ at one.)
> >
> >  block/gluster.c | 10 +-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/block/gluster.c b/block/gluster.c
> > index 891c13b..3db0bf8 100644
> > --- a/block/gluster.c
> > +++ b/block/gluster.c
> > @@ -41,6 +41,7 @@ typedef struct GlusterAIOCB {
> >      int ret;
> >      Coroutine *coroutine;
> >      AioContext *aio_context;
> > +    bool is_write;
> >  } GlusterAIOCB;
> >
> >  typedef struct BDRVGlusterState {
> > @@ -716,8 +717,10 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
> >          acb->ret = 0; /* Success */
> >      } else if (ret < 0) {
> >          acb->ret = -errno; /* Read/Write failed */
> > +    } else if (acb->is_write) {
> > +        acb->ret = -EIO; /* Partial write - fail it */
> >      } else {
> > -        acb->ret = -EIO; /* Partial read/write - fail it */
> > +        acb->ret = 0; /* Success */
>
> Does this properly guarantee that the portion beyond EOF reads as zero?
I'd argue this wasn't necessarily the case before either, considering
the first check starts with `!ret`:

    if (!ret || ret == acb->size) {
        acb->ret = 0; /* Success */

A read right at EOF would return 0 and be treated as success there, no?
Iow. it wouldn't zero out the destination buffer as far as I can see.
Come to think of it, I'm not too fond of this part of the check for the
write case either.

> Would it be better to switch to byte-based interfaces rather than
> continue to force gluster interaction in 512-byte sector chunks, since
> gluster can obviously store files that are not 512-aligned?
[Qemu-block] [RFC PATCH] glusterfs: allow partial reads
Fixes #1644754.

Signed-off-by: Wolfgang Bumiller <w.bumil...@proxmox.com>
---
I'm not sure what the original rationale was to treat both partial
reads as well as writes as I/O errors. (Seems to have happened from the
original glusterfs v1 to v2 series with a note but no reasoning for the
read side as far as I could see.)
The general direction lately seems to be to move away from sector
based block APIs. Also eg. the NFS code allows partial reads. (It does,
however, have an old patch (c2eb918e3) dedicated to aligning sizes to
512 byte boundaries for file creation for compatibility with other
parts of qemu like qcow2. This already happens in glusterfs, though,
but if you move a file from a different storage over to glusterfs you
may end up with a qcow2 file with eg. the L1 table in the last 80 bytes
of the file aligned to _begin_ at a 512 boundary, but not _end_ at
one.)

 block/gluster.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/block/gluster.c b/block/gluster.c
index 891c13b..3db0bf8 100644
--- a/block/gluster.c
+++ b/block/gluster.c
@@ -41,6 +41,7 @@ typedef struct GlusterAIOCB {
     int ret;
     Coroutine *coroutine;
     AioContext *aio_context;
+    bool is_write;
 } GlusterAIOCB;
 
 typedef struct BDRVGlusterState {
@@ -716,8 +717,10 @@ static void gluster_finish_aiocb(struct glfs_fd *fd, ssize_t ret, void *arg)
         acb->ret = 0; /* Success */
     } else if (ret < 0) {
         acb->ret = -errno; /* Read/Write failed */
+    } else if (acb->is_write) {
+        acb->ret = -EIO; /* Partial write - fail it */
     } else {
-        acb->ret = -EIO; /* Partial read/write - fail it */
+        acb->ret = 0; /* Success */
     }
 
     aio_bh_schedule_oneshot(acb->aio_context, qemu_gluster_complete_aio, acb);
@@ -965,6 +968,7 @@ static coroutine_fn int qemu_gluster_co_pwrite_zeroes(BlockDriverState *bs,
     acb.ret = 0;
     acb.coroutine = qemu_coroutine_self();
     acb.aio_context = bdrv_get_aio_context(bs);
+    acb.is_write = true;
 
     ret = glfs_zerofill_async(s->fd, offset, size, gluster_finish_aiocb, &acb);
     if (ret < 0) {
@@ -1087,9 +1091,11 @@ static coroutine_fn int qemu_gluster_co_rw(BlockDriverState *bs,
     acb.aio_context = bdrv_get_aio_context(bs);
 
     if (write) {
+        acb.is_write = true;
         ret = glfs_pwritev_async(s->fd, qiov->iov, qiov->niov, offset, 0,
                                  gluster_finish_aiocb, &acb);
     } else {
+        acb.is_write = false;
         ret = glfs_preadv_async(s->fd, qiov->iov, qiov->niov, offset, 0,
                                 gluster_finish_aiocb, &acb);
     }
@@ -1153,6 +1159,7 @@ static coroutine_fn int qemu_gluster_co_flush_to_disk(BlockDriverState *bs)
     acb.ret = 0;
     acb.coroutine = qemu_coroutine_self();
     acb.aio_context = bdrv_get_aio_context(bs);
+    acb.is_write = true;
 
     ret = glfs_fsync_async(s->fd, gluster_finish_aiocb, &acb);
     if (ret < 0) {
@@ -1199,6 +1206,7 @@ static coroutine_fn int qemu_gluster_co_pdiscard(BlockDriverState *bs,
     acb.ret = 0;
     acb.coroutine = qemu_coroutine_self();
     acb.aio_context = bdrv_get_aio_context(bs);
+    acb.is_write = true;
 
     ret = glfs_discard_async(s->fd, offset, size, gluster_finish_aiocb, &acb);
     if (ret < 0) {
-- 
2.1.4