Re: [Qemu-block] [Qemu-devel] [RFC PATCH 0/5] atapi: change unlimited recursion to while loop
On 02/23/2018 10:26 AM, Paolo Bonzini wrote:
> Real hardware doesn't have an unlimited stack, so the unlimited
> recursion in the ATAPI code smells a bit. In fact, the call to
> ide_transfer_start easily becomes a tail call with a small change
> to the code (patch 4). The remaining four patches move code around
> so as to turn the call back to ide_atapi_cmd_reply_end into
> another tail call, and then convert the (double) tail recursion into
> a while loop.
>
> I'm not sure how this can be tested, apart from adding a READ CD
> test to ahci-test (which I don't really have time for now, hence
> the RFC tag). The existing AHCI tests still pass, so patches 1-3
> aren't complete crap.
>
> Paolo
>
> Paolo Bonzini (5):
>   ide: push call to end_transfer_func out of start_transfer callback
>   ide: push end_transfer callback to ide_transfer_halt
>   ide: make ide_transfer_stop idempotent
>   atapi: call ide_set_irq before ide_transfer_start
>   ide: introduce ide_transfer_start_norecurse
>
>  hw/ide/ahci.c             | 12 +++-
>  hw/ide/atapi.c            | 37 -
>  hw/ide/core.c             | 37 +++--
>  include/hw/ide/internal.h |  3 +++
>  4 files changed, 53 insertions(+), 36 deletions(-)

ACK receipt, I will get to this soon, sorry!
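The transformation Paolo describes — two routines that end by calling each other being flattened into a single loop — can be shown in miniature. This is a hypothetical Python sketch of the control-flow pattern only (function names are illustrative, not the actual IDE code):

```python
def reply_end_recursive(remaining, chunk, done=0):
    """Mutual tail recursion: one stack frame per transferred chunk,
    like ide_atapi_cmd_reply_end -> ide_transfer_start -> back again."""
    if remaining <= 0:
        return done
    step = min(chunk, remaining)
    # ... transfer 'step' bytes here ...
    return transfer_start_recursive(remaining - step, chunk, done + step)

def transfer_start_recursive(remaining, chunk, done):
    # Tail call straight back into the reply routine.
    return reply_end_recursive(remaining, chunk, done)

def reply_end_loop(remaining, chunk):
    """Same result with O(1) stack: because every recursive call was in
    tail position, each one becomes another iteration of the loop."""
    done = 0
    while remaining > 0:
        step = min(chunk, remaining)
        # ... transfer 'step' bytes here ...
        remaining -= step
        done += step
    return done
```

The recursive form blows the Python stack for a large transfer split into tiny chunks; the loop form handles it in constant stack, which is the point of the series.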
Re: [Qemu-block] [PATCH] vl: introduce vm_shutdown()
On Tue, 02/27 15:30, Stefan Hajnoczi wrote:
> On Fri, Feb 23, 2018 at 04:20:44PM +0800, Fam Zheng wrote:
> > On Tue, 02/20 13:10, Stefan Hajnoczi wrote:
> > > 1. virtio_scsi_handle_cmd_vq() racing with iothread_stop_all() hits the
> > >    virtio_scsi_ctx_check() assertion failure because the BDS AioContext
> > >    has been modified by iothread_stop_all().
> >
> > Does this patch fix the issue completely? IIUC virtio_scsi_handle_cmd can
> > already be entered at the time of main thread calling
> > virtio_scsi_clear_aio(), so this race condition still exists:
> >
> >   main thread                        iothread
> >   ---------------------------------------------------------------------
> >   vm_shutdown
> >     ...
> >     virtio_bus_stop_ioeventfd
> >       virtio_scsi_dataplane_stop
> >         aio_poll()
> >           ...
> >                                      virtio_scsi_data_plane_handle_cmd()
> >                                        aio_context_acquire(s->ctx)
> >                                        virtio_scsi_acquire(s).enter
> >           virtio_scsi_clear_aio()
> >     aio_context_release(s->ctx)
> >                                        virtio_scsi_acquire(s).return
> >                                        virtio_scsi_handle_cmd_vq()
> >                                          ...
> >                                          virtqueue_pop()
> >
> > Is it possible that the above virtqueue_pop() still returns one element
> > that was queued before vm_shutdown() was called?
>
> No, it can't because virtio_scsi_clear_aio() invokes
> virtio_queue_host_notifier_aio_read(&vq->host_notifier) to process the
> virtqueue. By the time we get back to iothread's
> virtio_scsi_data_plane_handle_cmd() the virtqueue is already empty.
>
> Vcpus have been paused so no additional elements can slip into the
> virtqueue.

So there is:

    static void virtio_queue_host_notifier_aio_read(EventNotifier *n)
    {
        VirtQueue *vq = container_of(n, VirtQueue, host_notifier);
        if (event_notifier_test_and_clear(n)) {
            virtio_queue_notify_aio_vq(vq);
        }
    }

Guest kicks after adding an element to VQ, but we check ioeventfd before
trying virtqueue_pop(). Is that a problem? If VCPUs are paused after
enqueuing but before kicking VQ, the ioeventfd is not set, and the
virtqueue is not processed here.

Fam
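Fam's concern can be modeled abstractly: if the guest enqueues an element but its vcpus are paused before the kick, the notifier stays clear, so a test-and-clear-gated handler never looks at the queue. A hypothetical Python sketch of that shape (toy classes, not QEMU APIs):

```python
class EventNotifier:
    """Minimal stand-in for the ioeventfd: a latched 'kick' flag."""
    def __init__(self):
        self.kicked = False

    def test_and_clear(self):
        was_set, self.kicked = self.kicked, False
        return was_set

def host_notifier_aio_read(notifier, virtqueue, handler):
    # Mirrors the shape of virtio_queue_host_notifier_aio_read():
    # the queue is only drained if the notifier had been kicked.
    if notifier.test_and_clear():
        while virtqueue:
            handler(virtqueue.pop(0))

# Scenario from the email: the guest enqueues a request...
notifier, vq, handled = EventNotifier(), ["req0"], []
# ...but vcpus are paused before the kick, so notifier.kicked is False.
host_notifier_aio_read(notifier, vq, handled.append)
# The element is left sitting in the queue, unprocessed.
```

Whether this constitutes a bug in the real code depends on whether anything else drains the queue later; the sketch only demonstrates why the ordering of "enqueue" vs. "kick" matters when pausing vcpus.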
Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property
On 02/27/2018 03:57 PM, John Snow wrote:
> On 02/27/2018 03:16 PM, Eric Blake wrote:
>> On 02/23/2018 05:51 PM, John Snow wrote:
>>> Expose the "manual" property via QAPI for the backup-related jobs.
>>> As of this commit, this allows the management API to request the
>>> "concluded" and "dismiss" semantics for backup jobs.
>>>
>>> +# @manual: True to use an expanded, more explicit job control flow.
>>> +#          Jobs may transition from a running state to a pending state,
>>> +#          where they must be instructed to complete manually via
>>> +#          block-job-finalize.
>>> +#          Jobs belonging to a transaction must either all or all not use this
>>> +#          setting. Once a transaction reaches a pending state, issuing the
>>> +#          finalize command to any one job in the transaction is sufficient
>>> +#          to finalize the entire transaction.
>>
>> The previous commit message talked about mixed-manual transactions, but
>> this seems to imply it is not possible. I'm fine if we don't support
>> mixed-manual transactions, but wonder if it means any changes to the
>> series.
>>
>> Otherwise looks reasonable from the UI point of view.
>
> More seriously, this documentation I wrote doesn't address the totality
> of the expanded flow. I omitted dismiss here by accident as well. This
> is at best a partial definition of the 'manual' property.
>
> I'd like to use _this_ patch to ask the question: "What should the
> proper noun for the QEMU 2.12+ Expanded Block Job Management Flow
> Mechanism be?"

"Manual" actually doesn't sound too bad; I could also see "Explicit job
flow", as in, "within a transaction, all jobs should have the same
setting for the choice of Explicit Job Flow" (but then the name 'manual'
would have to be changed to match).

The idea of a central document, that gets referred to from multiple
spots in the QAPI docs, rather than duplicating information throughout
the QAPI docs, is reasonable.

> I'm not too sure, but "Manual mode" leaves a lot to be desired. I keep
> calling it something like "2.12+ Job Management" but that's not really
> descriptive.
>
> That, and if someone ever backports the enhanced state machine to a
> 2.11 branch, it becomes a misnomer.
>
> I conceptualize the feature as the addition of a purposefully more
> "needy" and less automatic completion mechanism, hence the "manual".
>
> Anyway, I'd like to figure out a good "documentation name" for it so I
> can point all instances of the creation property (for drive-backup,
> drive-mirror, and everyone else) to a central location that explains
> the STM and what exactly the differences between manual=on/off are.
> I'd then like to expose this property via query and link the
> documentation there to this description, too.
>
> "Explicit" and "Manual" are the two best options coming to me as I
> type this email.
>
> It'd be nice-- under the same arguments that prompted 'dismiss'-- to
> say that if a client crashes it can reconnect and discover what kind
> of attention certain jobs will need by asking for the manual property
> back.
>
> --js

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property
On 02/27/2018 03:16 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Expose the "manual" property via QAPI for the backup-related jobs.
>> As of this commit, this allows the management API to request the
>> "concluded" and "dismiss" semantics for backup jobs.
>>
>> Signed-off-by: John Snow
>> ---
>>  blockdev.c           | 19 ---
>>  qapi/block-core.json | 32 ++--
>>  2 files changed, 42 insertions(+), 9 deletions(-)
>>
>> +++ b/qapi/block-core.json
>> @@ -1177,6 +1177,16 @@
>>  # @job-id: identifier for the newly-created block job. If
>>  #          omitted, the device name will be used. (Since 2.7)
>>  #
>> +# @manual: True to use an expanded, more explicit job control flow.
>> +#          Jobs may transition from a running state to a pending state,
>> +#          where they must be instructed to complete manually via
>> +#          block-job-finalize.
>> +#          Jobs belonging to a transaction must either all or all not use this
>> +#          setting. Once a transaction reaches a pending state, issuing the
>> +#          finalize command to any one job in the transaction is sufficient
>> +#          to finalize the entire transaction.
>
> The previous commit message talked about mixed-manual transactions, but
> this seems to imply it is not possible. I'm fine if we don't support
> mixed-manual transactions, but wonder if it means any changes to the
> series.
>
> Otherwise looks reasonable from the UI point of view.

More seriously, this documentation I wrote doesn't address the totality
of the expanded flow. I omitted dismiss here by accident as well. This
is at best a partial definition of the 'manual' property.

I'd like to use _this_ patch to ask the question: "What should the
proper noun for the QEMU 2.12+ Expanded Block Job Management Flow
Mechanism be?"

I'm not too sure, but "Manual mode" leaves a lot to be desired. I keep
calling it something like "2.12+ Job Management" but that's not really
descriptive.

I conceptualize the feature as the addition of a purposefully more
"needy" and less automatic completion mechanism, hence the "manual".

Anyway, I'd like to figure out a good "documentation name" for it so I
can point all instances of the creation property (for drive-backup,
drive-mirror, and everyone else) to a central location that explains
the STM and what exactly the differences between manual=on/off are.
I'd then like to expose this property via query and link the
documentation there to this description, too.

It'd be nice-- under the same arguments that prompted 'dismiss'-- to
say that if a client crashes it can reconnect and discover what kind
of attention certain jobs will need by asking for the manual property
back.

--js
Re: [Qemu-block] [Qemu-devel] [RFC v4 00/21] blockjobs: add explicit job management
On 02/24/2018 09:31 AM, no-re...@patchew.org wrote:
> Hi,
>
> This series seems to have some coding style problems. See output below for
> more information:
>
> Type: series
> Message-id: 20180223235142.21501-1-js...@redhat.com
> Subject: [Qemu-devel] [RFC v4 00/21] blockjobs: add explicit job management
>
> === TEST SCRIPT BEGIN ===
> #!/bin/bash
>
> BASE=base
> n=1
> total=$(git log --oneline $BASE.. | wc -l)
> failed=0
>
> git config --local diff.renamelimit 0
> git config --local diff.renames True
> git config --local diff.algorithm histogram
>
> commits="$(git log --format=%H --reverse $BASE..)"
> for c in $commits; do
>     echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
>     if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
>         failed=1
>         echo
>     fi
>     n=$((n+1))
> done
>
> exit $failed
> === TEST SCRIPT END ===
>
> Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
> Switched to a new branch 'test'
> 230e578fa2 blockjobs: add manual_mgmt option to transactions
> f278a5155a iotests: test manual job dismissal
> 8e473ab4a8 blockjobs: Expose manual property
> 7ad2d0164f blockjobs: add block-job-finalize
> 3857c91315 blockjobs: add PENDING status and event
> 18eb8a4130 blockjobs: add waiting status
> daf9613432 blockjobs: add prepare callback
> 78be501212 blockjobs: add block_job_txn_apply function
> 4b659abe69 blockjobs: add commit, abort, clean helpers
> 4023046d76 blockjobs: ensure abort is called for cancelled jobs
> e9300b122e blockjobs: add block_job_dismiss
> 4fc045eae4 blockjobs: add NULL state
> e6aa454753 blockjobs: add CONCLUDED state
> 78efa2f937 blockjobs: add ABORTING state
> 057ad2472f blockjobs: add block_job_verb permission table
> c62c5b75a3 iotests: add pause_wait
> 4aadb9c38c blockjobs: add state transition table
> afc594c4b0 blockjobs: add status enum
> 434d3811fa blockjobs: add manual property
> fc3e3eebc9 blockjobs: model single jobs as transactions
> 8d32662676 blockjobs: fix set-speed kick
>
> === OUTPUT BEGIN ===
> Checking PATCH 1/21: blockjobs: fix set-speed kick...
> Checking PATCH 2/21: blockjobs: model single jobs as transactions...
> Checking PATCH 3/21: blockjobs: add manual property...
> Checking PATCH 4/21: blockjobs: add status enum...
> Checking PATCH 5/21: blockjobs: add state transition table...
> ERROR: space prohibited before open square bracket '['
> #81: FILE: blockjob.c:48:
> +    /* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
>
> ERROR: space prohibited before open square bracket '['
> #82: FILE: blockjob.c:49:
> +    /* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
>
> ERROR: space prohibited before open square bracket '['
> #83: FILE: blockjob.c:50:
> +    /* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
>
> ERROR: space prohibited before open square bracket '['
> #84: FILE: blockjob.c:51:
> +    /* P: */ [BLOCK_JOB_STATUS_PAUSED]    = {0, 0, 1, 0, 0, 0},
>
> ERROR: space prohibited before open square bracket '['
> #85: FILE: blockjob.c:52:
> +    /* Y: */ [BLOCK_JOB_STATUS_READY]     = {0, 0, 0, 0, 0, 1},
>
> ERROR: space prohibited before open square bracket '['
> #86: FILE: blockjob.c:53:
> +    /* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
>
> total: 6 errors, 0 warnings, 90 lines checked
>
> Your patch has style problems, please review.  If any of these errors
> are false positives report them to the maintainer, see
> CHECKPATCH in MAINTAINERS.

-EWONTFIX unless someone screams louder than me.

> Checking PATCH 6/21: iotests: add pause_wait...
> Checking PATCH 7/21: blockjobs: add block_job_verb permission table...
> Checking PATCH 8/21: blockjobs: add ABORTING state...
> ERROR: space prohibited before open square bracket '['
> #61: FILE: blockjob.c:48:
> +    /* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0, 0},
>
> ERROR: space prohibited before open square bracket '['
> #62: FILE: blockjob.c:49:
> +    /* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0, 0},
>
> ERROR: space prohibited before open square bracket '['
> #63: FILE: blockjob.c:50:
> +    /* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0, 1},
>
> ERROR: space prohibited before open square bracket '['
> #64: FILE: blockjob.c:51:
> +    /* P: */ [BLOCK_JOB_STATUS_PAUSED]    = {0, 0, 1, 0, 0, 0, 0},
>
> ERROR: space prohibited before open square bracket '['
> #65: FILE: blockjob.c:52:
> +    /* Y: */ [BLOCK_JOB_STATUS_READY]     = {0, 0, 0, 0, 0, 1, 1},
>
> ERROR: space prohibited before open square bracket '['
> #66: FILE: blockjob.c:53:
> +    /* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0, 0},
>
> ERROR: space prohibited before open square bracket '['
> #67: FILE: blockjob.c:54:
> +    /* X: */ [BLOCK_JOB_STATUS_ABORTING]  = {0, 0, 0, 0, 0, 0, 0},
>
> total: 7 errors, 0 warnings, 62 lines checked

-EWONTFIX

> Your patch has style problems, please review.  If any of these errors
> are false positives report
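The rows checkpatch is complaining about form a 2D permission matrix: row is the current status, column is the proposed one. The same idea can be written as a hypothetical Python sketch whose edges mirror the six-state table quoted above (U→C, C→R, R→P/Y, P→R, Y→S, S→Y); this is a toy model, not the series' code:

```python
# Allowed state transitions: key = current state, value = permitted targets.
# Edges follow the six-state table from patch 5 as quoted by checkpatch.
TRANSITIONS = {
    "undefined": {"created"},
    "created":   {"running"},
    "running":   {"paused", "ready"},
    "paused":    {"running"},
    "ready":     {"standby"},
    "standby":   {"ready"},
}

def state_transition(current, target):
    """Return the new state, or raise if the edge is not in the table."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"invalid transition {current} -> {target}")
    return target
```

Encoding the table as data rather than scattered `if` checks is exactly what makes the later patches (adding ABORTING, CONCLUDED, etc.) cheap: each new state is one more row and column.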
Re: [Qemu-block] [Qemu-devel] [RFC v4 00/21] blockjobs: add explicit job management
On 02/25/2018 06:25 PM, no-re...@patchew.org wrote:
>   CC      scsi/qemu-pr-helper.o
> /var/tmp/patchew-tester-tmp-9ukz0kme/src/blockjob.c: In function ‘block_job_txn_apply.isra.8’:
> /var/tmp/patchew-tester-tmp-9ukz0kme/src/blockjob.c:511:5: error: ‘rc’ may be
> used uninitialized in this function [-Werror=maybe-uninitialized]
>      return rc;
>      ^
>   CC      qemu-bridge-helper.o
>   CC      blockdev.o
>   CC      blockdev-nbd.o
>   CC      bootdevice.o
>   CC      iothread.o
>   CC      qdev-monitor.o
>   CC      device-hotplug.o
>   CC      os-posix.o
>   CC      bt-host.o
>   CC      bt-vhci.o
>   CC      dma-helpers.o
>   CC      vl.o
>   CC      tpm.o
>   CC      device_tree.o
>   CC      qmp-marshal.o
>   CC      qmp.o
>   CC      hmp.o
>   CC      cpus-common.o
>   CC      audio/audio.o
> cc1: all warnings being treated as errors
> make: *** [blockjob.o] Error 1
> make: *** Waiting for unfinished jobs
> === OUTPUT END ===
>
> Test command exited with code: 2
>
> ---
> Email generated automatically by Patchew [http://patchew.org/].
> Please send your feedback to patchew-de...@freelists.org

No idea why this would only trigger on ppc/be, but I'll look into it.
Re: [Qemu-block] [Qemu-devel] [RFC v4 17/21] blockjobs: add PENDING status and event
On 02/27/2018 03:05 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> For jobs utilizing the new manual workflow, we intend to prohibit
>> them from modifying the block graph until the management layer provides
>> an explicit ACK via block-job-finalize to move the process forward.
>>
>> To distinguish this runstate from "ready" or "waiting," we add a new
>> "pending" event.
>>
>> For now, the transition from PENDING to CONCLUDED/ABORTING is automatic,
>> but a future commit will add the explicit block-job-finalize step.
>>
>> Transitions:
>> Waiting -> Pending:   Normal transition.
>> Pending -> Concluded: Normal transition.
>> Pending -> Aborting:  Late transactional failures and cancellations.
>>
>> Removed Transitions:
>> Waiting -> Concluded: Jobs must go to PENDING first.
>>
>> Verbs:
>> Cancel: Can be applied to a pending job.
>>
>> +##
>> +# @BLOCK_JOB_PENDING:
>> +#
>> +# Emitted when a block job is awaiting explicit authorization to finalize graph
>> +# changes via @block-job-finalize. If this job is part of a transaction, it will
>> +# not emit this event until the transaction has converged first.
>
> Same question of whether this new event is always emitted (and older
> clients presumably ignore it), or only emitted for clients that
> requested new-style state management.

Old style jobs will skip the broadcast of the event, but will still
transition to the state. However, since the transition is synchronous,
you likely won't see this state show up in a query for old style jobs.

That was the intent, anyway. I wanted to be nonintrusive, and felt that
this event was likely not useful in any way unless we were using the new
state management scheme. In the old style, this event will be fully
synchronous with COMPLETED or CANCELLED, for instance.

--js
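John's answer — the state transition happens for every job, but the event broadcast is gated on the manual property — can be modeled in a small hypothetical Python sketch (toy class, not the QEMU implementation):

```python
class BlockJob:
    """Toy model: the status always advances; only the event is gated."""
    def __init__(self, manual):
        self.manual = manual
        self.status = "waiting"
        self.events = []          # events a QMP client would receive

    def transition_to_pending(self):
        self.status = "pending"   # transition happens for old and new style
        if self.manual:           # broadcast only for new-style jobs
            self.events.append("BLOCK_JOB_PENDING")

old_style, new_style = BlockJob(manual=False), BlockJob(manual=True)
old_style.transition_to_pending()
new_style.transition_to_pending()
```

For the old-style job the state is technically reached but, as John notes, immediately passed through, so a client polling query-block-jobs is unlikely to ever observe it.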
Re: [Qemu-block] [Qemu-devel] [RFC v4 16/21] blockjobs: add waiting status
On 02/27/2018 03:00 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> For jobs that are stuck waiting on others in a transaction, it would
>> be nice to know that they are no longer "running" in that sense, but
>> instead are waiting on other jobs in the transaction.
>>
>> Jobs that are "waiting" in this sense cannot be meaningfully altered
>> any longer as they have left their running loop. The only meaningful
>> user verb for jobs in this state is "cancel," which will cancel the
>> whole transaction, too.
>>
>> Transitions:
>> Running -> Waiting:   Normal transition.
>> Ready -> Waiting:     Normal transition.
>> Waiting -> Aborting:  Transactional cancellation.
>> Waiting -> Concluded: Normal transition.
>>
>> Removed Transitions:
>> Running -> Concluded: Jobs must go to WAITING first.
>> Ready -> Concluded:   Jobs must go to WAITING fisrt.
>
> s/fisrt/first/
>
>> +++ b/blockjob.c
>> @@ -3934,6 +3938,29 @@
>>       'offset': 'int',
>>       'speed' : 'int' } }
>> +##
>> +# @BLOCK_JOB_WAITING:
>> +#
>> +# Emitted when a block job that is part of a transction has stopped work and is
>
> s/transction/transaction/
>
>> +# waiting for other jobs in the transaction to reach the same state.
>
> Is this event emitted only for 'new-style' transactions (old drivers
> will never see it, because they don't request new style), or always (old
> drivers will see, but presumably ignore, it)?

...! Actually, I meant to remove the WAITING *event* entirely, this is a
mistake. It's purely an informational state that clients likely cannot
make any real use of, because regardless of old or new style, jobs will
transition automatically to "PENDING."

That said, old or new, the state is visible from query-block-jobs.

--js
Re: [Qemu-block] [Qemu-devel] [RFC v4 15/21] blockjobs: add prepare callback
On 02/27/2018 02:56 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Some jobs upon finalization may need to perform some work that can
>> still fail. If these jobs are part of a transaction, it's important
>> that these callbacks fail the entire transaction.
>>
>> We allow for a new callback in addition to commit/abort/clean that
>> allows us the opportunity to have fairly late-breaking failures
>> in the transactional process.
>>
>> The expected flow is:
>>
>> - All jobs in a transaction converge to the WAITING state
>>   (added in a forthcoming commit)
>> - All jobs prepare to call either commit/abort
>> - If any job fails, is canceled, or fails preparation, all jobs
>>   call their .abort callback.
>> - All jobs enter the PENDING state, awaiting manual intervention
>>   (also added in a forthcoming commit)
>> - block-job-finalize is issued by the user/management layer
>> - All jobs call their commit callbacks.
>>
>> Signed-off-by: John Snow
>> ---
>>  blockjob.c                   | 34 +++---
>>  include/block/blockjob_int.h | 10 ++
>>  2 files changed, 41 insertions(+), 3 deletions(-)
>>
>> @@ -467,17 +480,22 @@ static void block_job_cancel_async(BlockJob *job)
>>      job->cancelled = true;
>>  }
>> -static void block_job_txn_apply(BlockJobTxn *txn, void fn(BlockJob *))
>> +static int block_job_txn_apply(BlockJobTxn *txn, int fn(BlockJob *))
>>  {
>>      AioContext *ctx;
>>      BlockJob *job, *next;
>> +    int rc;
>>      QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
>>          ctx = blk_get_aio_context(job->blk);
>>          aio_context_acquire(ctx);
>> -        fn(job);
>> +        rc = fn(job);
>>          aio_context_release(ctx);
>> +        if (rc) {
>> +            break;
>> +        }
>
> This short-circuits the application of the function to the rest of the
> group. Is that ever going to be a problem?

With what I've written, I don't think so -- but I can't guarantee
someone won't misunderstand the semantics of it and it will become a
problem. It is a potentially dangerous function in that way.

--js
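The short-circuit semantics Eric asks about are easy to pin down with a toy model: the first nonzero return code stops the sweep, so later jobs are never visited. A hypothetical Python sketch of the same shape (not the QEMU function; the real one also brackets each call with aio_context_acquire/release):

```python
def txn_apply(jobs, fn):
    """Apply fn to each job in order; stop at the first nonzero rc."""
    for job in jobs:
        rc = fn(job)
        if rc:
            return rc     # jobs after this one are left untouched
    return 0

# Demonstrate the short-circuit: "c" is never visited.
applied = []
def prepare(job):
    applied.append(job)
    return -1 if job == "bad" else 0

rc = txn_apply(["a", "bad", "c"], prepare)
```

This is exactly the property John calls "potentially dangerous": a caller must know that a failure leaves the remaining jobs in their prior state and handle the partial application (e.g. by applying an abort function to the whole group afterwards).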
Re: [Qemu-block] [Qemu-devel] [RFC v4 12/21] blockjobs: ensure abort is called for cancelled jobs
On 02/27/2018 02:49 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Presently, even if a job is canceled post-completion as a result of
>> a failing peer in a transaction, it will still call .commit because
>> nothing has updated or changed its return code.
>>
>> The reason why this does not cause problems currently is because
>> backup's implementation of .commit checks for cancellation itself.
>>
>> I'd like to simplify this contract:
>>
>> (1) Abort is called if the job/transaction fails
>> (2) Commit is called if the job/transaction succeeds
>>
>> To this end: A job's return code, if 0, will be forcibly set as
>> -ECANCELED if that job has already concluded. Remove the now
>> redundant check in the backup job implementation.
>>
>> We need to check for cancellation in both block_job_completed
>> AND block_job_completed_single, because jobs may be cancelled between
>> those two calls; for instance in transactions.
>>
>> The check in block_job_completed could be removed, but there's no
>> point in starting to attempt to succeed a transaction that we know
>> in advance will fail.
>>
>> This does NOT affect mirror jobs that are "canceled" during their
>> synchronous phase. The mirror job itself forcibly sets the canceled
>> property to false prior to ceding control, so such cases will invoke
>> the "commit" callback.
>>
>> Signed-off-by: John Snow
>> ---
>>  block/backup.c     |  2 +-
>>  block/trace-events |  1 +
>>  blockjob.c         | 19 +++
>>  3 files changed, 17 insertions(+), 5 deletions(-)
>
> More lines of code, but the contract does seem simpler and useful for
> the later patches.

Unfortunately yes, but it would be less overall if more than Backup made
use of commit/abort, which I anticipate in the future when I try to
start switching jobs over to using the .prepare callback.

It was just genuinely shocking to me that we'd call commit(), but backup
secretly knew we wanted abort. That type of logic belongs in the core
dispatch layer.

> Reviewed-by: Eric Blake

Thanks for the reviews!
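The simplified contract — commit on success, abort on failure, with a cancelled job's return code of 0 forced to -ECANCELED so the dispatcher picks abort — might be modeled like this hypothetical Python sketch (toy dicts, not the QEMU structures):

```python
ECANCELED = 125  # the Linux errno value; used here only as a marker

def completed_single(job):
    """Dispatch commit vs. abort from the return code alone."""
    if job["cancelled"] and job["rc"] == 0:
        # A job cancelled post-completion (e.g. by a failing transaction
        # peer) must not look successful: force a failing return code.
        job["rc"] = -ECANCELED
    return "commit" if job["rc"] == 0 else "abort"

ok        = {"cancelled": False, "rc": 0}
cancelled = {"cancelled": True,  "rc": 0}    # cancelled after completing
failed    = {"cancelled": False, "rc": -5}
```

With this in the dispatch layer, a job's .commit implementation no longer needs its own cancellation check, which is the redundancy the patch removes from backup.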
Re: [Qemu-block] [Qemu-devel] [RFC v4 20/21] iotests: test manual job dismissal
On 02/27/2018 03:21 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Signed-off-by: John Snow
>> ---
>>  tests/qemu-iotests/056     | 195 +
>>  tests/qemu-iotests/056.out |   4 +-
>>  2 files changed, 197 insertions(+), 2 deletions(-)
>
> I'm not sure if this covers everything in the series, but it looks like

It definitely doesn't!

> a reasonable expansion and hits a lot of the highlights. At any rate,
> it's always better to add tests, and the test passing is a good bet that
> the new code will be harder to regress.

More to follow, but I was afraid of wasting time if this series didn't
look on the whole serviceable. I'll probably focus efforts on expanding
blockjob-txn and blockjob unit tests.

> Reviewed-by: Eric Blake
Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property
On 02/27/2018 03:16 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Expose the "manual" property via QAPI for the backup-related jobs.
>> As of this commit, this allows the management API to request the
>> "concluded" and "dismiss" semantics for backup jobs.
>>
>> Signed-off-by: John Snow
>> ---
>>  blockdev.c           | 19 ---
>>  qapi/block-core.json | 32 ++--
>>  2 files changed, 42 insertions(+), 9 deletions(-)
>>
>> +++ b/qapi/block-core.json
>> @@ -1177,6 +1177,16 @@
>>  # @job-id: identifier for the newly-created block job. If
>>  #          omitted, the device name will be used. (Since 2.7)
>>  #
>> +# @manual: True to use an expanded, more explicit job control flow.
>> +#          Jobs may transition from a running state to a pending state,
>> +#          where they must be instructed to complete manually via
>> +#          block-job-finalize.
>> +#          Jobs belonging to a transaction must either all or all not use this
>> +#          setting. Once a transaction reaches a pending state, issuing the
>> +#          finalize command to any one job in the transaction is sufficient
>> +#          to finalize the entire transaction.
>
> The previous commit message talked about mixed-manual transactions, but
> this seems to imply it is not possible. I'm fine if we don't support
> mixed-manual transactions, but wonder if it means any changes to the
> series.
>
> Otherwise looks reasonable from the UI point of view.

Refactor hell. The intent (and my belief) is that as of right now you
CAN mix them. In earlier drafts, it was not always the case.
Re: [Qemu-block] [Qemu-devel] [RFC v4 21/21] blockjobs: add manual_mgmt option to transactions
On 02/23/2018 05:51 PM, John Snow wrote:
> This allows us to easily force the option for all jobs belonging to a
> transaction to ensure consistency with how all those jobs will be
> handled. This is purely a convenience.
>
> Signed-off-by: John Snow
> ---
> +++ b/qapi/transaction.json
> @@ -79,7 +79,8 @@
>  ##
>  { 'struct': 'TransactionProperties',
>    'data': {
> -       '*completion-mode': 'ActionCompletionMode'
> +       '*completion-mode': 'ActionCompletionMode',
> +       '*manual-mgmt': 'bool'

Missing QAPI documentation (what you have elsewhere in the C code can
probably be copied here, though).

The UI aspect makes sense (I can declare one manual at the transaction
level instead of multiple manual declarations per member level within
the transaction).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 20/21] iotests: test manual job dismissal
On 02/23/2018 05:51 PM, John Snow wrote:
> Signed-off-by: John Snow
> ---
>  tests/qemu-iotests/056     | 195 +
>  tests/qemu-iotests/056.out |   4 +-
>  2 files changed, 197 insertions(+), 2 deletions(-)

I'm not sure if this covers everything in the series, but it looks like
a reasonable expansion and hits a lot of the highlights. At any rate,
it's always better to add tests, and the test passing is a good bet that
the new code will be harder to regress.

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 19/21] blockjobs: Expose manual property
On 02/23/2018 05:51 PM, John Snow wrote:
> Expose the "manual" property via QAPI for the backup-related jobs.
> As of this commit, this allows the management API to request the
> "concluded" and "dismiss" semantics for backup jobs.
>
> Signed-off-by: John Snow
> ---
>  blockdev.c           | 19 ---
>  qapi/block-core.json | 32 ++--
>  2 files changed, 42 insertions(+), 9 deletions(-)
>
> +++ b/qapi/block-core.json
> @@ -1177,6 +1177,16 @@
>  # @job-id: identifier for the newly-created block job. If
>  #          omitted, the device name will be used. (Since 2.7)
>  #
> +# @manual: True to use an expanded, more explicit job control flow.
> +#          Jobs may transition from a running state to a pending state,
> +#          where they must be instructed to complete manually via
> +#          block-job-finalize.
> +#          Jobs belonging to a transaction must either all or all not use this
> +#          setting. Once a transaction reaches a pending state, issuing the
> +#          finalize command to any one job in the transaction is sufficient
> +#          to finalize the entire transaction.

The previous commit message talked about mixed-manual transactions, but
this seems to imply it is not possible. I'm fine if we don't support
mixed-manual transactions, but wonder if it means any changes to the
series.

Otherwise looks reasonable from the UI point of view.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 18/21] blockjobs: add block-job-finalize
On 02/23/2018 05:51 PM, John Snow wrote:
> Instead of automatically transitioning from PENDING to CONCLUDED, gate
> the .prepare() and .commit() phases behind an explicit acknowledgement
> provided by the QMP monitor if manual completion mode has been requested.
>
> This allows us to perform graph changes in prepare and/or commit so that
> graph changes do not occur autonomously without knowledge of the
> controlling management layer.
>
> Transactions that have reached the "PENDING" state together can all be
> moved to invoke their finalization methods by issuing block_job_finalize
> to any one job in the transaction.
>
> Jobs in a transaction with mixed job->manual settings will remain stuck
> in the "WAITING" state until block_job_finalize is authored on the
> job(s) that have reached the "PENDING" state. These jobs are not allowed
> to progress because other jobs in the transaction may still fail during
> their preparation phase during finalization, so these jobs must remain
> in the WAITING phase until success is guaranteed. These jobs will then
> automatically dismiss themselves, but jobs that had the manual property
> set will remain at CONCLUDED as normal.

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 17/21] blockjobs: add PENDING status and event
On 02/23/2018 05:51 PM, John Snow wrote:
> For jobs utilizing the new manual workflow, we intend to prohibit
> them from modifying the block graph until the management layer provides
> an explicit ACK via block-job-finalize to move the process forward.
>
> To distinguish this runstate from "ready" or "waiting," we add a new
> "pending" event.
>
> For now, the transition from PENDING to CONCLUDED/ABORTING is automatic,
> but a future commit will add the explicit block-job-finalize step.
>
> Transitions:
> Waiting -> Pending:   Normal transition.
> Pending -> Concluded: Normal transition.
> Pending -> Aborting:  Late transactional failures and cancellations.
>
> Removed Transitions:
> Waiting -> Concluded: Jobs must go to PENDING first.
>
> Verbs:
> Cancel: Can be applied to a pending job.
>
> +##
> +# @BLOCK_JOB_PENDING:
> +#
> +# Emitted when a block job is awaiting explicit authorization to finalize graph
> +# changes via @block-job-finalize. If this job is part of a transaction, it will
> +# not emit this event until the transaction has converged first.

Same question of whether this new event is always emitted (and older
clients presumably ignore it), or only emitted for clients that
requested new-style state management.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 16/21] blockjobs: add waiting status
On 02/23/2018 05:51 PM, John Snow wrote:
> For jobs that are stuck waiting on others in a transaction, it would
> be nice to know that they are no longer "running" in that sense, but
> instead are waiting on other jobs in the transaction.
>
> Jobs that are "waiting" in this sense cannot be meaningfully altered
> any longer as they have left their running loop. The only meaningful
> user verb for jobs in this state is "cancel," which will cancel the
> whole transaction, too.
>
> Transitions:
> Running -> Waiting:   Normal transition.
> Ready -> Waiting:     Normal transition.
> Waiting -> Aborting:  Transactional cancellation.
> Waiting -> Concluded: Normal transition.
>
> Removed Transitions:
> Running -> Concluded: Jobs must go to WAITING first.
> Ready -> Concluded:   Jobs must go to WAITING fisrt.

s/fisrt/first/

> +++ b/blockjob.c
> @@ -3934,6 +3938,29 @@
>       'offset': 'int',
>       'speed' : 'int' } }
> +##
> +# @BLOCK_JOB_WAITING:
> +#
> +# Emitted when a block job that is part of a transction has stopped work and is

s/transction/transaction/

> +# waiting for other jobs in the transaction to reach the same state.

Is this event emitted only for 'new-style' transactions (old drivers
will never see it, because they don't request new style), or always (old
drivers will see, but presumably ignore, it)?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 15/21] blockjobs: add prepare callback
On 02/23/2018 05:51 PM, John Snow wrote:

Some jobs upon finalization may need to perform some work that can still fail. If these jobs are part of a transaction, it's important that these callbacks fail the entire transaction.

We allow for a new callback in addition to commit/abort/clean that allows us the opportunity to have fairly late-breaking failures in the transactional process.

The expected flow is:

- All jobs in a transaction converge to the WAITING state (added in a forthcoming commit)
- All jobs prepare to call either commit/abort
- If any job fails, is canceled, or fails preparation, all jobs call their .abort callback.
- All jobs enter the PENDING state, awaiting manual intervention (also added in a forthcoming commit)
- block-job-finalize is issued by the user/management layer
- All jobs call their commit callbacks.

Signed-off-by: John Snow
---
 blockjob.c                   | 34 +++---
 include/block/blockjob_int.h | 10 ++
 2 files changed, 41 insertions(+), 3 deletions(-)

@@ -467,17 +480,22 @@ static void block_job_cancel_async(BlockJob *job)
     job->cancelled = true;
 }

-static void block_job_txn_apply(BlockJobTxn *txn, void fn(BlockJob *))
+static int block_job_txn_apply(BlockJobTxn *txn, int fn(BlockJob *))
 {
     AioContext *ctx;
     BlockJob *job, *next;
+    int rc;

     QLIST_FOREACH_SAFE(job, &txn->jobs, txn_list, next) {
         ctx = blk_get_aio_context(job->blk);
         aio_context_acquire(ctx);
-        fn(job);
+        rc = fn(job);
         aio_context_release(ctx);
+        if (rc) {
+            break;
+        }

This short-circuits the application of the function to the rest of the group. Is that ever going to be a problem?

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 14/21] blockjobs: add block_job_txn_apply function
On 02/23/2018 05:51 PM, John Snow wrote:

Simply apply a function transaction-wide. A few more uses of this in forthcoming patches.

Signed-off-by: John Snow
---
 blockjob.c | 24 +++-
 1 file changed, 15 insertions(+), 9 deletions(-)

@@ -565,13 +577,7 @@ static void block_job_completed_txn_success(BlockJob *job)
         }
     }
     /* We are the last completed job, commit the transaction. */
-    QLIST_FOREACH_SAFE(other_job, &txn->jobs, txn_list, next) {
-        ctx = blk_get_aio_context(other_job->blk);
-        aio_context_acquire(ctx);
-        assert(other_job->ret == 0);

We lost this assertion. Hopefully that doesn't matter in the long run.

-        block_job_completed_single(other_job);
-        aio_context_release(ctx);
-    }
+    block_job_txn_apply(txn, block_job_completed_single);
 }

 /* Assumes the block_job_mutex is held */

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 13/21] blockjobs: add commit, abort, clean helpers
On 02/23/2018 05:51 PM, John Snow wrote:

The completed_single function is getting a little mucked up with checking to see which callbacks exist, so let's factor them out.

Signed-off-by: John Snow
---
 blockjob.c | 35 ++-
 1 file changed, 26 insertions(+), 9 deletions(-)

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 12/21] blockjobs: ensure abort is called for cancelled jobs
On 02/23/2018 05:51 PM, John Snow wrote:

Presently, even if a job is canceled post-completion as a result of a failing peer in a transaction, it will still call .commit because nothing has updated or changed its return code. The reason why this does not cause problems currently is because backup's implementation of .commit checks for cancellation itself.

I'd like to simplify this contract:

(1) Abort is called if the job/transaction fails
(2) Commit is called if the job/transaction succeeds

To this end: A job's return code, if 0, will be forcibly set as -ECANCELED if that job has already concluded. Remove the now redundant check in the backup job implementation.

We need to check for cancellation in both block_job_completed AND block_job_completed_single, because jobs may be cancelled between those two calls; for instance in transactions. The check in block_job_completed could be removed, but there's no point in starting to attempt to succeed a transaction that we know in advance will fail.

This does NOT affect mirror jobs that are "canceled" during their synchronous phase. The mirror job itself forcibly sets the canceled property to false prior to ceding control, so such cases will invoke the "commit" callback.

Signed-off-by: John Snow
---
 block/backup.c     |  2 +-
 block/trace-events |  1 +
 blockjob.c         | 19 +++
 3 files changed, 17 insertions(+), 5 deletions(-)

More lines of code, but the contract does seem simpler and useful for the later patches.

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 11/21] blockjobs: add block_job_dismiss
On 02/23/2018 05:51 PM, John Snow wrote:

For jobs that have reached their CONCLUDED state, prior to having their last reference put down (meaning jobs that have completed successfully, unsuccessfully, or have been canceled), allow the user to dismiss the job's lingering status report via block-job-dismiss.

This gives management APIs the chance to conclusively determine if a job failed or succeeded, even if the event broadcast was missed.

Note that jobs do not yet linger in any such state, they are freed immediately upon reaching this previously-unnamed state. Such a state is added immediately in the next commit.

Verbs:
Dismiss: operates on CONCLUDED jobs only.

Signed-off-by: John Snow
---

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 09/21] blockjobs: add CONCLUDED state
On 02/27/2018 02:38 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> add a new state "CONCLUDED" that identifies a job that has ceased all
>> operations. The wording was chosen to avoid any phrasing that might
>> imply success, error, or cancellation. The task has simply ceased all
>> operation and can never again perform any work.
>>
>> ("finished", "done", and "completed" might all imply success.)
>>
>> Transitions:
>> Running  -> Concluded: normal completion
>> Ready    -> Concluded: normal completion
>> Aborting -> Concluded: error and cancellations
>>
>> Verbs:
>> None as of this commit. (a future commit adds 'dismiss')
>>
>> @@ -620,7 +623,9 @@ void block_job_user_resume(BlockJob *job, Error **errp)
>>   void block_job_cancel(BlockJob *job)
>>   {
>> -    if (block_job_started(job)) {
>> +    if (job->status == BLOCK_JOB_STATUS_CONCLUDED) {
>> +        return;
>> +    } else if (block_job_started(job)) {
>
> It's also possible to do:
>
> if () {
>     return ;
> }
> if (...)
>
> instead of chaining with an 'else if'. Matter of personal taste, so I
> won't make you change it.
>
> Reviewed-by: Eric Blake

I guess because in this case, what I did adds two SLOC instead of three. If the other code did not need to be guarded by an if(), I'd otherwise agree.
Re: [Qemu-block] [Qemu-devel] [RFC v4 10/21] blockjobs: add NULL state
On 02/23/2018 05:51 PM, John Snow wrote:

Add a new state that specifically demarcates when we begin to permanently demolish a job after it has performed all work. This makes the transition explicit in the STM table and highlights conditions under which a job may be demolished.

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 07/21] blockjobs: add block_job_verb permission table
On 02/27/2018 02:25 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> Which commands ("verbs") are appropriate for jobs in which state is
>> also somewhat burdensome to keep track of.
>>
>> As of this commit, it looks rather useless, but begins to look more
>> interesting the more states we add to the STM table.
>>
>> A recurring theme is that no verb will apply to an 'undefined' job.
>
> Back to the argument of whether we even want that state.

By design, it is intended to reject all commands. Though, what I can do is just remove the UNDEFINED state but have the "CREATED" state start at "1" still. In effect, this gives you the empty row and column for the 'UNDEFINED' state, except now it's *literally* undefined. However, the transition then needs to be set explicitly.

The way the code is written now, the status *only* ever changes from the transition function, which makes it easy to search for and reason about all possible transition points, which I consider a feature of the design.

>> ===
>> Changes
>> ===
>>
>> (1)
>>
>> To facilitate "nice" error checking, all five major block-job verb
>> interfaces in blockjob.c now support an errp parameter:
>>
>> - block_job_user_cancel is added as a new interface.
>> - block_job_user_pause gains an errp parameter
>> - block_job_user_resume gains an errp parameter
>> - block_job_set_speed already had an errp parameter.
>> - block_job_complete already had an errp parameter.
>>
>> (2)
>>
>> block-job-pause and block-job-resume will no longer no-op when trying
>> to pause an already paused job, or trying to resume a job that isn't
>> paused. These functions will now report that they did not perform the
>> action requested because it was not possible.
>>
>> iotests have been adjusted to address this new behavior.
>
> Seems reasonable. Hopefully shouldn't trip up libvirt too badly (if
> libvirt attempted a redundant job transition that used to silently
> succeed and now fails, the failure message should be pretty obvious that
> it was a no-op attempt).

Yeah, it's arguable whether or not this is an API change.

(A) I'm not prohibiting anything that would have worked before. I am just notifying the client that whatever they were trying to do had no effect.

(B) That notification is an error, though, so some code paths that may have relied on jamming pause signals into the pipe may be surprised at the new response.

If I can get away with it, I obviously prefer to warn the user that the pause/resume had no effect or couldn't be applied.

>> (3)
>>
>> block-job-complete doesn't worry about checking !block_job_started,
>> because the permission table guards against this.
>>
>> (4)
>>
>> test-bdrv-drain's job implementation needs to announce that it is
>> 'ready' now, in order to be completed.
>>
>> Signed-off-by: John Snow
>> ---
>
> Reviewed-by: Eric Blake
Re: [Qemu-block] [Qemu-devel] [RFC v4 09/21] blockjobs: add CONCLUDED state
On 02/23/2018 05:51 PM, John Snow wrote:

add a new state "CONCLUDED" that identifies a job that has ceased all operations. The wording was chosen to avoid any phrasing that might imply success, error, or cancellation. The task has simply ceased all operation and can never again perform any work.

("finished", "done", and "completed" might all imply success.)

Transitions:
Running  -> Concluded: normal completion
Ready    -> Concluded: normal completion
Aborting -> Concluded: error and cancellations

Verbs:
None as of this commit. (a future commit adds 'dismiss')

@@ -620,7 +623,9 @@ void block_job_user_resume(BlockJob *job, Error **errp)
 void block_job_cancel(BlockJob *job)
 {
-    if (block_job_started(job)) {
+    if (job->status == BLOCK_JOB_STATUS_CONCLUDED) {
+        return;
+    } else if (block_job_started(job)) {

It's also possible to do:

if () {
    return ;
}
if (...)

instead of chaining with an 'else if'. Matter of personal taste, so I won't make you change it.

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 08/21] blockjobs: add ABORTING state
On 02/23/2018 05:51 PM, John Snow wrote:

Add a new state ABORTING. This makes transitions from normative states to error states explicit in the STM, and serves as a disambiguation for which states may complete normally when normal end-states (CONCLUDED) are added in future commits.

Notably, Paused/Standby jobs do not transition directly to aborting, as they must wake up first and cooperate in their cancellation.

Transitions:
Running -> Aborting: can be cancelled or encounter an error
Ready   -> Aborting: can be cancelled or encounter an error

Verbs:
None. The job must finish cleaning itself up and report its final status.

                 +---------+
                 |UNDEFINED|
                 +--+------+
                    |
                 +--v----+
                 |CREATED|
                 +--+----+
                    |
                 +--v----+     +------+
       +---------+RUNNING<----->PAUSED|
       |         +--+----+     +------+
       |            |
       |         +--v--+       +-------+
       +---------+READY<------->STANDBY|
       |         +-----+       +-------+
       |
  +----v---+
  |ABORTING|
  +--------+

Signed-off-by: John Snow
---
 blockjob.c           | 31 ++-
 qapi/block-core.json |  7 ++-
 2 files changed, 24 insertions(+), 14 deletions(-)

+++ b/qapi/block-core.json
@@ -996,10 +996,15 @@
 # @standby: The job is ready, but paused. This is nearly identical to @paused.
 #           The job may return to @ready or otherwise be canceled.
 #
+# @aborting: The job is in the process of being aborted, and will finish with
+#            an error. The job will afterwards report that it is @concluded.

@concluded isn't defined yet, but I'm okay with the minor future reference, as it's less churn to get to the description that works at the end of the series.

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 07/21] blockjobs: add block_job_verb permission table
On 02/23/2018 05:51 PM, John Snow wrote:

Which commands ("verbs") are appropriate for jobs in which state is also somewhat burdensome to keep track of.

As of this commit, it looks rather useless, but begins to look more interesting the more states we add to the STM table.

A recurring theme is that no verb will apply to an 'undefined' job.

Back to the argument of whether we even want that state.

===
Changes
===

(1)

To facilitate "nice" error checking, all five major block-job verb interfaces in blockjob.c now support an errp parameter:

- block_job_user_cancel is added as a new interface.
- block_job_user_pause gains an errp parameter
- block_job_user_resume gains an errp parameter
- block_job_set_speed already had an errp parameter.
- block_job_complete already had an errp parameter.

(2)

block-job-pause and block-job-resume will no longer no-op when trying to pause an already paused job, or trying to resume a job that isn't paused. These functions will now report that they did not perform the action requested because it was not possible.

iotests have been adjusted to address this new behavior.

Seems reasonable. Hopefully shouldn't trip up libvirt too badly (if libvirt attempted a redundant job transition that used to silently succeed and now fails, the failure message should be pretty obvious that it was a no-op attempt).

(3)

block-job-complete doesn't worry about checking !block_job_started, because the permission table guards against this.

(4)

test-bdrv-drain's job implementation needs to announce that it is 'ready' now, in order to be completed.

Signed-off-by: John Snow
---

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table
On 02/27/2018 01:58 PM, Eric Blake wrote:
> On 02/23/2018 05:51 PM, John Snow wrote:
>> The state transition table has mostly been implied. We're about to make
>> it a bit more complex, so let's make the STM explicit instead.
>>
>> Perform state transitions with a function that for now just asserts the
>> transition is appropriate.
>>
>> Transitions:
>> Undefined -> Created: During job initialization.
>
> Unless we use Created as the default state 0 for g_new0().

I liked the idea of letting jobs be created in an "indeterminate" state until we actually initialize them to be of use -- that is, jobs that could be said to semantically understand and act on the "START" verb (which is, as of today, an internal command only.) The only meaningful action on a job of indeterminate state, then, would be to DEFINE that job. (I.e. what block_job_create does.)

What I'm getting at is that block_job_start() on a job that was just created will explode, and I'd like "created" to mean "This job can be started." It's not a distinction that matters in the codebase RIGHT NOW, but that's how I came to think of the STM.

We could likely optimize that transition out because we always create and immediately define it, but it felt ... nicer from an (internal) API point of view to have defined the construction and destruction transitions explicitly.

Anyway, it can be removed; it's not a hill worth dying on. I only insist that the bike shed not be olive green.

>> Created -> Running: Once the job is started.
>>                     Jobs cannot transition from "Created" to "Paused"
>>                     directly, but will instead synchronously transition
>>                     to running to paused immediately.
>> Running -> Paused:  Normal workflow for pauses.
>> Running -> Ready:   Normal workflow for jobs reaching their sync point.
>>                     (e.g. mirror)
>> Ready   -> Standby: Normal workflow for pausing ready jobs.
>> Paused  -> Running: Normal resume.
>> Standby -> Ready:   Resume of a Standby job.
>>
>>          +---------+
>>          |UNDEFINED|
>>          +--+------+
>>             |
>>          +--v----+
>>          |CREATED|
>>          +--+----+
>>             |
>>          +--v----+     +------+
>>          |RUNNING<----->PAUSED|
>>          +--+----+     +------+
>>             |
>>          +--v--+       +-------+
>>          |READY<------->STANDBY|
>>          +-----+       +-------+
>>
>> Notably, there is no state presently defined as of this commit that
>> deals with a job after the "running" or "ready" states, so this table
>> will be adjusted alongside the commits that introduce those states.
>
> The ascii-art tables help in this and other patches. Thanks for
> producing them.
>
>> Signed-off-by: John Snow
>> ---
>>  block/trace-events |  3 +++
>>  blockjob.c         | 42 --
>>  2 files changed, 39 insertions(+), 6 deletions(-)
>
>> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
>> +{
>> +    BlockJobStatus s0 = job->status;
>> +    if (s0 == s1) {
>> +        return;
>> +    }

If I remove this clause, I actually would have noticed that technically I attempt to transition from CREATED to CREATED. Maybe I ought to remove this...

>> +    assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
>
> Or, if you keep the zero state distinct from valid states, this could be
> 'assert(s1 > 0 && ...)'
>
>> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
>>       job->pause_count--;
>>       job->busy = true;
>>       job->paused = false;
>> -    job->status = BLOCK_JOB_STATUS_RUNNING;
>> +    block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
>>       bdrv_coroutine_enter(blk_bs(job->blk), job->co);
>>   }
>> @@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const
>> BlockJobDriver *driver,
>>       job->refcnt = 1;
>>       job->manual = (flags & BLOCK_JOB_MANUAL);
>>       job->status = BLOCK_JOB_STATUS_CREATED;
>> +    block_job_state_transition(job, BLOCK_JOB_STATUS_CREATED);
>
> Oops, missing a deletion of the job->status assignment line.

I am indeed.
Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table
On 02/27/2018 01:58 PM, Eric Blake wrote: > The ascii-art tables help in this and other patches. Thanks for > producing them. Funny story: the original version of the STM present in these patches was too complex and I was not able to draw a chart for them. Kevin's insistence that I do draw a chart led to considerable slimming of possible transitions. --js
Re: [Qemu-block] [Qemu-devel] [RFC v4 06/21] iotests: add pause_wait
On 02/23/2018 05:51 PM, John Snow wrote:

Split out the pause command into the actual pause and the wait. Not every usage presently needs to resubmit a pause request.

The intent with the next commit will be to explicitly disallow redundant or meaningless pause/resume requests, so the tests need to become more judicious to reflect that.

Signed-off-by: John Snow
---
 tests/qemu-iotests/030        |  6 ++
 tests/qemu-iotests/055        | 17 ++---
 tests/qemu-iotests/iotests.py | 12 
 3 files changed, 16 insertions(+), 19 deletions(-)

Reviewed-by: Eric Blake

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table
On 02/27/2018 12:08 PM, Kevin Wolf wrote:
> On 27.02.2018 at 17:45, John Snow wrote:
>>
>> On 02/27/2018 11:27 AM, Kevin Wolf wrote:
>>> On 24.02.2018 at 00:51, John Snow wrote:
>>>> The state transition table has mostly been implied. We're about to make
>>>> it a bit more complex, so let's make the STM explicit instead.
>>>>
>>>> Perform state transitions with a function that for now just asserts the
>>>> transition is appropriate.
>>>>
>>>> Transitions:
>>>> Undefined -> Created: During job initialization.
>>>> Created -> Running: Once the job is started.
>>>>                     Jobs cannot transition from "Created" to "Paused"
>>>>                     directly, but will instead synchronously transition
>>>>                     to running to paused immediately.
>>>> Running -> Paused:  Normal workflow for pauses.
>>>> Running -> Ready:   Normal workflow for jobs reaching their sync point.
>>>>                     (e.g. mirror)
>>>> Ready -> Standby:   Normal workflow for pausing ready jobs.
>>>> Paused -> Running:  Normal resume.
>>>> Standby -> Ready:   Resume of a Standby job.
>>>>
>>>>          +---------+
>>>>          |UNDEFINED|
>>>>          +--+------+
>>>>             |
>>>>          +--v----+
>>>>          |CREATED|
>>>>          +--+----+
>>>>             |
>>>>          +--v----+     +------+
>>>>          |RUNNING<----->PAUSED|
>>>>          +--+----+     +------+
>>>>             |
>>>>          +--v--+       +-------+
>>>>          |READY<------->STANDBY|
>>>>          +-----+       +-------+
>>>>
>>>> Notably, there is no state presently defined as of this commit that
>>>> deals with a job after the "running" or "ready" states, so this table
>>>> will be adjusted alongside the commits that introduce those states.
>>>>
>>>> Signed-off-by: John Snow
>>>> ---
>>>>  block/trace-events |  3 +++
>>>>  blockjob.c         | 42 --
>>>>  2 files changed, 39 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/block/trace-events b/block/trace-events
>>>> index 02dd80ff0c..b75a0c8409 100644
>>>> --- a/block/trace-events
>>>> +++ b/block/trace-events
>>>> @@ -4,6 +4,9 @@ bdrv_open_common(void *bs, const char *filename, int flags, const char *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
>>>>  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>>>>
>>>> +# blockjob.c
>>>> +block_job_state_transition(void *job, int ret, const char *legal, const char *s0, const char *s1) "job %p (ret: %d) attempting %s transition (%s-->%s)"
>>>> +
>>>>  # block/block-backend.c
>>>>  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>>>>  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>>>>
>>>> diff --git a/blockjob.c b/blockjob.c
>>>> index 1be9c20cff..d745b3bb69 100644
>>>> --- a/blockjob.c
>>>> +++ b/blockjob.c
>>>> @@ -28,6 +28,7 @@
>>>>  #include "block/block.h"
>>>>  #include "block/blockjob_int.h"
>>>>  #include "block/block_int.h"
>>>> +#include "block/trace.h"
>>>>  #include "sysemu/block-backend.h"
>>>>  #include "qapi/error.h"
>>>>  #include "qapi/qmp/qerror.h"
>>>> @@ -41,6 +42,34 @@
>>>>   * block_job_enter. */
>>>>  static QemuMutex block_job_mutex;
>>>>
>>>> +/* BlockJob State Transition Table */
>>>> +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
>>>> +                                          /* U, C, R, P, Y, S */
>>>> +    /* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
>>>
>>> Even at the end of the series, this is the only use of
>>> BLOCK_JOB_STATUS_UNDEFINED.
>>>
>>>> +    /* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
>>>> +    /* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
>>>> +    /* P: */ [BLOCK_JOB_STATUS_PAUSED]    = {0, 0, 1, 0, 0, 0},
>>>> +    /* Y: */ [BLOCK_JOB_STATUS_READY]     = {0, 0, 0, 0, 0, 1},
>>>> +    /* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
>>>> +};
>>>> +
>>>> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
>>>> +{
>>>> +    BlockJobStatus s0 = job->status;
>>>> +    if (s0 == s1) {
>>>> +        return;
>>>> +    }
>>>> +    assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
>>>> +    trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
>>>> +                                     "allowed" : "disallowed",
>>>> +                                     qapi_enum_lookup(&BlockJobStatus_lookup, s0),
>>>> +                                     qapi_enum_lookup(&BlockJobStatus_lookup, s1));
>>>> +    assert(BlockJobSTT[s0][s1]);
>>>> +    job->status = s1;
>>>> +}
>>>> +
>>>>  static void block_job_lock(void)
>>>>  {
>>>>      qemu_mutex_lock(&block_job_mutex);
>>>> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
>>>>      job->pause_count--;
Re: [Qemu-block] [Qemu-devel] [RFC v4 05/21] blockjobs: add state transition table
On 02/23/2018 05:51 PM, John Snow wrote:

The state transition table has mostly been implied. We're about to make it a bit more complex, so let's make the STM explicit instead.

Perform state transitions with a function that for now just asserts the transition is appropriate.

Transitions:
Undefined -> Created: During job initialization.

Unless we use Created as the default state 0 for g_new0().

Created -> Running: Once the job is started.
                    Jobs cannot transition from "Created" to "Paused"
                    directly, but will instead synchronously transition
                    to running to paused immediately.
Running -> Paused:  Normal workflow for pauses.
Running -> Ready:   Normal workflow for jobs reaching their sync point.
                    (e.g. mirror)
Ready   -> Standby: Normal workflow for pausing ready jobs.
Paused  -> Running: Normal resume.
Standby -> Ready:   Resume of a Standby job.

         +---------+
         |UNDEFINED|
         +--+------+
            |
         +--v----+
         |CREATED|
         +--+----+
            |
         +--v----+     +------+
         |RUNNING<----->PAUSED|
         +--+----+     +------+
            |
         +--v--+       +-------+
         |READY<------->STANDBY|
         +-----+       +-------+

Notably, there is no state presently defined as of this commit that deals with a job after the "running" or "ready" states, so this table will be adjusted alongside the commits that introduce those states.

The ascii-art tables help in this and other patches. Thanks for producing them.

Signed-off-by: John Snow
---
 block/trace-events |  3 +++
 blockjob.c         | 42 --
 2 files changed, 39 insertions(+), 6 deletions(-)

+static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
+{
+    BlockJobStatus s0 = job->status;
+    if (s0 == s1) {
+        return;
+    }
+    assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);

Or, if you keep the zero state distinct from valid states, this could be 'assert(s1 > 0 && ...)'

@@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
     job->pause_count--;
     job->busy = true;
     job->paused = false;
-    job->status = BLOCK_JOB_STATUS_RUNNING;
+    block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
     bdrv_coroutine_enter(blk_bs(job->blk), job->co);
 }

@@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
     job->refcnt = 1;
     job->manual = (flags & BLOCK_JOB_MANUAL);
     job->status = BLOCK_JOB_STATUS_CREATED;
+    block_job_state_transition(job, BLOCK_JOB_STATUS_CREATED);

Oops, missing a deletion of the job->status assignment line.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org
Re: [Qemu-block] [RFC v4 07/21] blockjobs: add block_job_verb permission table
On 24.02.2018 at 00:51, John Snow wrote:
> Which commands ("verbs") are appropriate for jobs in which state is
> also somewhat burdensome to keep track of.
>
> As of this commit, it looks rather useless, but begins to look more
> interesting the more states we add to the STM table.
>
> A recurring theme is that no verb will apply to an 'undefined' job.
>
> Further, it's not presently possible to restrict the "pause" or "resume"
> verbs any more than they are in this commit because of the asynchronous
> nature of how jobs enter the PAUSED state; justifications for some
> seemingly erroneous applications are given below.
>
> =
> Verbs
> =
>
> Cancel:    Any state except undefined.
> Pause:     Any state except undefined;
>            'created': Requests that the job pauses as it starts.
>            'running': Normal usage. (PAUSED)
>            'paused':  The job may be paused for internal reasons,
>                       but the user may wish to force an indefinite
>                       user-pause, so this is allowed.
>            'ready':   Normal usage. (STANDBY)
>            'standby': Same logic as above.
> Resume:    Any state except undefined;
>            'created': Will lift a user's pause-on-start request.
>            'running': Will lift a pause request before it takes effect.
>            'paused':  Normal usage.
>            'ready':   Will lift a pause request before it takes effect.
>            'standby': Normal usage.
> Set-speed: Any state except undefined, though ready may not be meaningful.
> Complete:  Only a 'ready' job may accept a complete request.
>
> ===
> Changes
> ===
>
> (1)
>
> To facilitate "nice" error checking, all five major block-job verb
> interfaces in blockjob.c now support an errp parameter:
>
> - block_job_user_cancel is added as a new interface.
> - block_job_user_pause gains an errp parameter
> - block_job_user_resume gains an errp parameter
> - block_job_set_speed already had an errp parameter.
> - block_job_complete already had an errp parameter.
>
> (2)
>
> block-job-pause and block-job-resume will no longer no-op when trying
> to pause an already paused job, or trying to resume a job that isn't
> paused. These functions will now report that they did not perform the
> action requested because it was not possible.
>
> iotests have been adjusted to address this new behavior.
>
> (3)
>
> block-job-complete doesn't worry about checking !block_job_started,
> because the permission table guards against this.
>
> (4)
>
> test-bdrv-drain's job implementation needs to announce that it is
> 'ready' now, in order to be completed.
>
> Signed-off-by: John Snow

Reviewed-by: Kevin Wolf
Re: [Qemu-block] [RFC v4 04/21] blockjobs: add status enum
On 24.02.2018 at 00:51, John Snow wrote:
> We're about to add several new states, and booleans are becoming
> unwieldy and difficult to reason about. It would help to have a
> more explicit bookkeeping of the state of blockjobs. To this end,
> add a new "status" field and add our existing states in a redundant
> manner alongside the bools they are replacing:
>
> UNDEFINED: Placeholder, default state. Not currently visible to QMP
>            unless changes occur in the future to allow creating jobs
>            without starting them via QMP.
> CREATED:   replaces !!job->co && paused && !busy
> RUNNING:   replaces effectively (!paused && busy)
> PAUSED:    Nearly redundant with info->paused, which shows pause_count.
>            This reports the actual status of the job, which almost always
>            matches the paused request status. It differs in that it is
>            strictly only true when the job has actually gone dormant.
> READY:     replaces job->ready.
> STANDBY:   Paused, but job->ready is true.
>
> New state additions in coming commits will not be quite so redundant:
>
> WAITING:   Waiting on transaction. This job has finished all the work
>            it can until the transaction converges, fails, or is canceled.
> PENDING:   Pending authorization from user. This job has finished all the
>            work it can until the job or transaction is finalized via
>            block_job_finalize. This implies the transaction has converged
>            and left the WAITING phase.
> ABORTING:  Job has encountered an error condition and is in the process
>            of aborting.
> CONCLUDED: Job has ceased all operations and has a return code available
>            for query and may be dismissed via block_job_dismiss.
> NULL:      Job has been dismissed and (should) be destroyed. Should never
>            be visible to QMP.
>
> Some of these states appear somewhat superfluous, but it helps define the
> expected flow of a job; so some of the states wind up being synchronous
> empty transitions. Importantly, jobs can be in only one of these states
> at any given time, which helps code and external users alike reason about
> the current condition of a job unambiguously.
>
> Signed-off-by: John Snow

Reviewed-by: Kevin Wolf
Re: [Qemu-block] [RFC v4 06/21] iotests: add pause_wait
On 24.02.2018 at 00:51, John Snow wrote:
> Split out the pause command into the actual pause and the wait.
> Not every usage presently needs to resubmit a pause request.
>
> The intent with the next commit will be to explicitly disallow
> redundant or meaningless pause/resume requests, so the tests
> need to become more judicious to reflect that.
>
> Signed-off-by: John Snow

Reviewed-by: Kevin Wolf
Re: [Qemu-block] [RFC v4 02/21] blockjobs: model single jobs as transactions
On 24.02.2018 at 00:51, John Snow wrote:
> model all independent jobs as single job transactions.
>
> It's one less case we have to worry about when we add more states to the
> transition machine. This way, we can just treat all job lifetimes exactly
> the same. This helps tighten assertions of the STM graph and removes some
> conditionals that would have been needed in the coming commits adding a
> more explicit job lifetime management API.
>
> Signed-off-by: John Snow

Reviewed-by: Kevin Wolf
Re: [Qemu-block] [RFC v4 01/21] blockjobs: fix set-speed kick
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> If speed is '0' it's not actually "less than" the previous speed.
> Kick the job in this case too.
>
> Signed-off-by: John Snow

Reviewed-by: Kevin Wolf
Re: [Qemu-block] [RFC v4 03/21] blockjobs: add manual property
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> This property will be used to opt-in to the new BlockJobs workflow
> that allows a tighter, more explicit control over transitions from
> one runstate to another.
>
> While we're here, fix up the documentation for block_job_create
> a little bit.
>
> Signed-off-by: John Snow

Reviewed-by: Kevin Wolf
Re: [Qemu-block] [RFC v4 05/21] blockjobs: add state transition table
Am 27.02.2018 um 17:45 hat John Snow geschrieben:
>
> On 02/27/2018 11:27 AM, Kevin Wolf wrote:
> > Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> >> The state transition table has mostly been implied. We're about to make
> >> it a bit more complex, so let's make the STM explicit instead.
> >>
> >> Perform state transitions with a function that for now just asserts the
> >> transition is appropriate.
> >>
> >> Transitions:
> >> Undefined -> Created:   During job initialization.
> >> Created   -> Running:   Once the job is started.
> >>                         Jobs cannot transition from "Created" to "Paused"
> >>                         directly, but will instead synchronously transition
> >>                         to running to paused immediately.
> >> Running   -> Paused:    Normal workflow for pauses.
> >> Running   -> Ready:     Normal workflow for jobs reaching their sync point.
> >>                         (e.g. mirror)
> >> Ready     -> Standby:   Normal workflow for pausing ready jobs.
> >> Paused    -> Running:   Normal resume.
> >> Standby   -> Ready:     Resume of a Standby job.
> >>
> >>          +---------+
> >>          |UNDEFINED|
> >>          +--+------+
> >>             |
> >>          +--v----+
> >>          |CREATED|
> >>          +--+----+
> >>             |
> >>          +--v----+     +------+
> >>          |RUNNING<----->PAUSED|
> >>          +--+----+     +------+
> >>             |
> >>          +--v--+       +-------+
> >>          |READY<------->STANDBY|
> >>          +-----+       +-------+
> >>
> >> Notably, there is no state presently defined as of this commit that
> >> deals with a job after the "running" or "ready" states, so this table
> >> will be adjusted alongside the commits that introduce those states.
> >>
> >> Signed-off-by: John Snow
> >> ---
> >>  block/trace-events |  3 +++
> >>  blockjob.c         | 42 ++++++++++++++++++++++++++++++++++++------
> >>  2 files changed, 39 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/block/trace-events b/block/trace-events
> >> index 02dd80ff0c..b75a0c8409 100644
> >> --- a/block/trace-events
> >> +++ b/block/trace-events
> >> @@ -4,6 +4,9 @@
> >>  bdrv_open_common(void *bs, const char *filename, int flags, const char *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
> >>  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
> >>
> >> +# blockjob.c
> >> +block_job_state_transition(void *job, int ret, const char *legal, const char *s0, const char *s1) "job %p (ret: %d) attempting %s transition (%s-->%s)"
> >> +
> >>  # block/block-backend.c
> >>  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
> >>  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
> >> diff --git a/blockjob.c b/blockjob.c
> >> index 1be9c20cff..d745b3bb69 100644
> >> --- a/blockjob.c
> >> +++ b/blockjob.c
> >> @@ -28,6 +28,7 @@
> >>  #include "block/block.h"
> >>  #include "block/blockjob_int.h"
> >>  #include "block/block_int.h"
> >> +#include "block/trace.h"
> >>  #include "sysemu/block-backend.h"
> >>  #include "qapi/error.h"
> >>  #include "qapi/qmp/qerror.h"
> >> @@ -41,6 +42,34 @@
> >>   * block_job_enter. */
> >>  static QemuMutex block_job_mutex;
> >>
> >> +/* BlockJob State Transition Table */
> >> +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
> >> +                                          /* U, C, R, P, Y, S */
> >> +    /* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
> >
> > Even at the end of the series, this is the only use of
> > BLOCK_JOB_STATUS_UNDEFINED.
> >
> >> +    /* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
> >> +    /* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
> >> +    /* P: */ [BLOCK_JOB_STATUS_PAUSED]    = {0, 0, 1, 0, 0, 0},
> >> +    /* Y: */ [BLOCK_JOB_STATUS_READY]     = {0, 0, 0, 0, 0, 1},
> >> +    /* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
> >> +};
> >> +
> >> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
> >> +{
> >> +    BlockJobStatus s0 = job->status;
> >> +    if (s0 == s1) {
> >> +        return;
> >> +    }
> >> +    assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
> >> +    trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
> >> +                                     "allowed" : "disallowed",
> >> +                                     qapi_enum_lookup(&BlockJobStatus_lookup,
> >> +                                                      s0),
> >> +                                     qapi_enum_lookup(&BlockJobStatus_lookup,
> >> +                                                      s1));
> >> +    assert(BlockJobSTT[s0][s1]);
> >> +    job->status = s1;
> >> +}
> >> +
> >>  static void block_job_lock(void)
> >>  {
> >>      qemu_mutex_lock(&block_job_mutex);
> >> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
> >>      job->pause_count--;
> >>      job->busy = true;
> >>      job->paused = false;
>
[Qemu-block] [PATCH v4 3/5] qcow2: Reduce REFT_OFFSET_MASK
Match our code to the spec change in the previous patch - there's no
reason for the refcount table to allow larger offsets than the L1/L2
tables. In practice, no image has more than 64PB of allocated clusters
anyway, as anything beyond that can't be expressed via L2 mappings to
host offsets.

Suggested-by: Alberto Garcia
Signed-off-by: Eric Blake

---
v4: new patch
---
 block/qcow2.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index 883802241fb..afc524ead54 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -442,7 +442,7 @@ typedef enum QCow2MetadataOverlap {

 #define L2E_OFFSET_MASK 0x00fffffffffffe00ULL
 #define L2E_COMPRESSED_OFFSET_SIZE_MASK 0x3fffffffffffffffULL

-#define REFT_OFFSET_MASK 0xfffffffffffffe00ULL
+#define REFT_OFFSET_MASK 0x00fffffffffffe00ULL

 static inline int64_t start_of_cluster(BDRVQcow2State *s, int64_t offset)
 {
-- 
2.14.3
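As a quick sanity check on the narrowed mask (a sketch, assuming the full-width constants that the archive flattened; `REFT_OFFSET_MASK_NEW` and `reft_offset_limit` are invented names): keeping only bits 9-55 means a refcount-table entry can never point past the 64 PB (2^56) ceiling that the L1/L2 entries already impose.

```c
#include <assert.h>
#include <stdint.h>

/* Bits 9-55 of a refcount table entry hold the refblock offset, so the
 * highest representable, 512-byte-aligned host offset sits just below
 * 2^56 bytes (64 PB). */
#define REFT_OFFSET_MASK_NEW 0x00fffffffffffe00ULL

static uint64_t reft_offset_limit(void)
{
    return REFT_OFFSET_MASK_NEW; /* highest representable refblock offset */
}
```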
[Qemu-block] [PATCH v4 5/5] qcow2: Avoid memory over-allocation on compressed images
When reading a compressed image, we were allocating s->cluster_data to 32*cluster_size + 512 (possibly over 64 megabytes, for an image with 2M clusters). Let's check out the history: Back when qcow2 was first written, we used s->cluster_data for everything, including copy_sectors() and encryption, where we want to operate on more than one cluster at once. Obviously, at that point, the buffer had to be aligned for other users, even though compression itself doesn't require any alignment (the fact that the compressed data generally starts mid-sector means that aligning our buffer buys us nothing - either the protocol already supports byte-based access into whatever offset we want, or we are already using a bounce buffer to read a full sector, and copying into our destination no longer requires alignment). But commit 1b9f1491 (v1.1!) changed things to allocate parallel buffers on demand rather than sharing a single buffer, for encryption and COW, leaving compression as the final client of s->cluster_data. That use was still preserved, because if a single compressed cluster is read more than once, we reuse the cache instead of decompressing it a second time (someday, we may come up with better caching to avoid wasting repeated decompressions while still being more parallel, but that is a task for another patch; the XXX comment in qcow2_co_preadv for QCOW2_CLUSTER_COMPRESSED is telling). Much later, in commit de82815d (v2.2), we noticed that a 64M allocation is prone to failure, so we switched over to a graceful memory allocation error message. Elsewhere in the code, we do g_malloc(2 * cluster_size) without ever checking for failure, but even 4M starts to be large enough that trying to be nice is worth the effort, so we want to keep that aspect. Then even later, in 3e4c7052 (2.11), we realized that allocating a large buffer up front for every qcow2 image is expensive, and switched to lazy allocation only for images that actually had compressed clusters. 
But in the process, we never even bothered to check whether what we were allocating still made sense in its new context! So, it's time to cut back on the waste. A compressed cluster written by qemu will NEVER occupy more than an uncompressed cluster, but based on mid-sector alignment, we may still need to read 1 cluster + 1 sector in order to recover enough bytes for the decompression. But third-party producers of qcow2 may not be as smart, and gzip DOES document that because the compression stream adds metadata, and because of the pigeonhole principle, there are worst case scenarios where attempts to compress will actually inflate an image, by up to 0.015% (or 62 sectors larger for an unfortunate 2M compression). In fact, the qcow2 spec permits an all-ones sector count, plus a full sector containing the initial offset, for a maximum read of 2 full clusters; and thanks to the way decompression works, even though such a value is probably too large for the actual compressed data, it really doesn't matter if we read too much (gzip ignores slop, once it has decoded a full cluster). So it's easier to just allocate cluster_data to be as large as we can ever possibly see; even if it still wastes up to 2M on any image created by qcow2, that's still an improvement of 60M less waste than pre-patch. 
Signed-off-by: Eric Blake

---
v4: fix off-by-one in assertion and commit message [Berto]
v3: tighten allocation [Berto]
v2: actually check allocation failure (previous version meant to use
g_malloc, but ended up posted with g_try_malloc without checking); add
assertions outside of conditional, improve commit message to better
match reality now that qcow2 spec bug has been fixed
---
 block/qcow2-cluster.c | 27 ++++++++++++++++-----------
 block/qcow2.c         |  2 +-
 2 files changed, 19 insertions(+), 10 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 85be7d5e340..1f7f00cbea2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1598,20 +1598,29 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
         sector_offset = coffset & 511;
         csize = nb_csectors * 512 - sector_offset;

-        /* Allocate buffers on first decompress operation, most images are
-         * uncompressed and the memory overhead can be avoided.  The buffers
-         * are freed in .bdrv_close().
+        /* Allocate buffers on the first decompress operation; most
+         * images are uncompressed and the memory overhead can be
+         * avoided.  The buffers are freed in .bdrv_close().  qemu
+         * never writes an inflated cluster, and gzip itself never
+         * inflates a problematic cluster by more than 0.015%, but the
+         * qcow2 format allows up to 1 byte short of 2 full clusters
+         * when including the sector containing offset.  gzip ignores
+         * trailing slop, so just allocate that much up front rather
+         * than reject third-party images with
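The buffer-size arithmetic in the commit message can be checked numerically. The helper names below are invented for illustration; the real code sizes `s->cluster_data` inline.

```c
#include <assert.h>
#include <stdint.h>

/* Pre-patch allocation: 32 * cluster_size + 512 bytes. */
static uint64_t old_allocation(unsigned cluster_bits)
{
    return 32 * (1ULL << cluster_bits) + 512;
}

/* Post-patch worst case: the spec permits just under 2 full clusters
 * for a single compressed read, so 2 clusters always suffices. */
static uint64_t new_allocation(unsigned cluster_bits)
{
    return 2 * (1ULL << cluster_bits);
}
```

For 2M clusters (cluster_bits = 21) the allocation drops from 64M + 512 down to 4M, which is roughly the "60M less waste" the commit message cites.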
Re: [Qemu-block] [RFC v4 05/21] blockjobs: add state transition table
Am 24.02.2018 um 00:51 hat John Snow geschrieben:
> The state transition table has mostly been implied. We're about to make
> it a bit more complex, so let's make the STM explicit instead.
>
> Perform state transitions with a function that for now just asserts the
> transition is appropriate.
>
> Transitions:
> Undefined -> Created:   During job initialization.
> Created   -> Running:   Once the job is started.
>                         Jobs cannot transition from "Created" to "Paused"
>                         directly, but will instead synchronously transition
>                         to running to paused immediately.
> Running   -> Paused:    Normal workflow for pauses.
> Running   -> Ready:     Normal workflow for jobs reaching their sync point.
>                         (e.g. mirror)
> Ready     -> Standby:   Normal workflow for pausing ready jobs.
> Paused    -> Running:   Normal resume.
> Standby   -> Ready:     Resume of a Standby job.
>
>          +---------+
>          |UNDEFINED|
>          +--+------+
>             |
>          +--v----+
>          |CREATED|
>          +--+----+
>             |
>          +--v----+     +------+
>          |RUNNING<----->PAUSED|
>          +--+----+     +------+
>             |
>          +--v--+       +-------+
>          |READY<------->STANDBY|
>          +-----+       +-------+
>
> Notably, there is no state presently defined as of this commit that
> deals with a job after the "running" or "ready" states, so this table
> will be adjusted alongside the commits that introduce those states.
>
> Signed-off-by: John Snow
> ---
>  block/trace-events |  3 +++
>  blockjob.c         | 42 ++++++++++++++++++++++++++++++++++++------
>  2 files changed, 39 insertions(+), 6 deletions(-)
>
> diff --git a/block/trace-events b/block/trace-events
> index 02dd80ff0c..b75a0c8409 100644
> --- a/block/trace-events
> +++ b/block/trace-events
> @@ -4,6 +4,9 @@
>  bdrv_open_common(void *bs, const char *filename, int flags, const char *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
>  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>
> +# blockjob.c
> +block_job_state_transition(void *job, int ret, const char *legal, const char *s0, const char *s1) "job %p (ret: %d) attempting %s transition (%s-->%s)"
> +
>  # block/block-backend.c
>  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
> diff --git a/blockjob.c b/blockjob.c
> index 1be9c20cff..d745b3bb69 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -28,6 +28,7 @@
>  #include "block/block.h"
>  #include "block/blockjob_int.h"
>  #include "block/block_int.h"
> +#include "block/trace.h"
>  #include "sysemu/block-backend.h"
>  #include "qapi/error.h"
>  #include "qapi/qmp/qerror.h"
> @@ -41,6 +42,34 @@
>   * block_job_enter. */
>  static QemuMutex block_job_mutex;
>
> +/* BlockJob State Transition Table */
> +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
> +                                          /* U, C, R, P, Y, S */
> +    /* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},

Even at the end of the series, this is the only use of
BLOCK_JOB_STATUS_UNDEFINED.

> +    /* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
> +    /* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
> +    /* P: */ [BLOCK_JOB_STATUS_PAUSED]    = {0, 0, 1, 0, 0, 0},
> +    /* Y: */ [BLOCK_JOB_STATUS_READY]     = {0, 0, 0, 0, 0, 1},
> +    /* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
> +};
> +
> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
> +{
> +    BlockJobStatus s0 = job->status;
> +    if (s0 == s1) {
> +        return;
> +    }
> +    assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
> +    trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
> +                                     "allowed" : "disallowed",
> +                                     qapi_enum_lookup(&BlockJobStatus_lookup,
> +                                                      s0),
> +                                     qapi_enum_lookup(&BlockJobStatus_lookup,
> +                                                      s1));
> +    assert(BlockJobSTT[s0][s1]);
> +    job->status = s1;
> +}
> +
>  static void block_job_lock(void)
>  {
>      qemu_mutex_lock(&block_job_mutex);
> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
>      job->pause_count--;
>      job->busy = true;
>      job->paused = false;
> -    job->status = BLOCK_JOB_STATUS_RUNNING;
> +    block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
>      bdrv_coroutine_enter(blk_bs(job->blk), job->co);
>  }
>
> @@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
>      job->refcnt        = 1;
>      job->manual        = (flags & BLOCK_JOB_MANUAL);
>      job->status        = BLOCK_JOB_STATUS_CREATED;
> +    block_job_state_transition(job,
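Distilled outside of QEMU, the quoted table and transition check look like this (plain ints instead of the QAPI-generated enum; a standalone sketch for illustration, not the patch itself):

```c
#include <assert.h>
#include <stdbool.h>

enum { U, C, R, P, Y, S, STATUS_MAX };

/* Same adjacency matrix as the quoted BlockJobSTT. */
static const bool stt[STATUS_MAX][STATUS_MAX] = {
    /*        U  C  R  P  Y  S */
    /* U */ { 0, 1, 0, 0, 0, 0 },
    /* C */ { 0, 0, 1, 0, 0, 0 },
    /* R */ { 0, 0, 0, 1, 1, 0 },
    /* P */ { 0, 0, 1, 0, 0, 0 },
    /* Y */ { 0, 0, 0, 0, 0, 1 },
    /* S */ { 0, 0, 0, 0, 1, 0 },
};

static bool transition_ok(int from, int to)
{
    /* Self-transitions return early in the patch, so treat them as OK. */
    return from == to || stt[from][to];
}
```

This makes the commit-message rules mechanically checkable: for example, Created cannot reach Paused directly, and Standby resumes to Ready rather than Running.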
Re: [Qemu-block] [Qemu-devel] [RFC v4 02/21] blockjobs: model single jobs as transactions
On 02/23/2018 05:51 PM, John Snow wrote:
> model all independent jobs as single job transactions.
>
> It's one less case we have to worry about when we add more states to the
> transition machine. This way, we can just treat all job lifetimes exactly
> the same. This helps tighten assertions of the STM graph and removes some
> conditionals that would have been needed in the coming commits adding a
> more explicit job lifetime management API.
>
> Signed-off-by: John Snow
> ---

Reviewed-by: Eric Blake

> @@ -729,6 +727,17 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
>              return NULL;
>          }
>      }
> +
> +    /* Single jobs are modeled as single-job transactions for sake of
> +     * consolidating the job management logic */
> +    if (!txn) {
> +        txn = block_job_txn_new();
> +        block_job_txn_add_job(txn, job);
> +        block_job_txn_unref(txn);
> +    } else {
> +        block_job_txn_add_job(txn, job);
> +    }
> +
>      return job;

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
[Qemu-block] [PATCH v4 1/5] qcow2: Prefer byte-based calls into bs->file
We had only three sector-based stragglers left; convert them to use our
preferred byte-based accesses.

Signed-off-by: Eric Blake
Reviewed-by: Alberto Garcia

---
v2: indentation fix
---
 block/qcow2-cluster.c  | 5 ++---
 block/qcow2-refcount.c | 6 +++---
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index e406b0f3b9e..85be7d5e340 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1615,13 +1615,12 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
         }

         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
-        ret = bdrv_read(bs->file, coffset >> 9, s->cluster_data,
-                        nb_csectors);
+        ret = bdrv_pread(bs->file, coffset, s->cluster_data, csize);
         if (ret < 0) {
             return ret;
         }
         if (decompress_buffer(s->cluster_cache, s->cluster_size,
-                              s->cluster_data + sector_offset, csize) < 0) {
+                              s->cluster_data, csize) < 0) {
             return -EIO;
         }
         s->cluster_cache_offset = coffset;
diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index d46b69d7f34..28afbb1b5ea 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -2310,8 +2310,8 @@ write_refblocks:
             on_disk_refblock = (void *)((char *) *refcount_table +
                                         refblock_index * s->cluster_size);

-            ret = bdrv_write(bs->file, refblock_offset / BDRV_SECTOR_SIZE,
-                             on_disk_refblock, s->cluster_sectors);
+            ret = bdrv_pwrite(bs->file, refblock_offset, on_disk_refblock,
+                              s->cluster_size);
             if (ret < 0) {
                 fprintf(stderr, "ERROR writing refblock: %s\n", strerror(-ret));
                 goto fail;
@@ -2533,7 +2533,7 @@ fail:
  * - 0 if writing to this offset will not affect the mentioned metadata
  * - a positive QCow2MetadataOverlap value indicating one overlapping section
  * - a negative value (-errno) indicating an error while performing a check,
- *   e.g. when bdrv_read failed on QCOW2_OL_INACTIVE_L2
+ *   e.g. when bdrv_pread failed on QCOW2_OL_INACTIVE_L2
  */
 int qcow2_check_metadata_overlap(BlockDriverState *bs, int ign, int64_t offset,
                                  int64_t size)
-- 
2.14.3
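Why the `+ sector_offset` adjustment in `decompress_buffer()` disappears: a sector-based read starts at `coffset` rounded down to a sector boundary, so the payload begins `sector_offset` bytes into the buffer, whereas `bdrv_pread()` starts exactly at `coffset`. A sketch of the arithmetic only (helper names invented):

```c
#include <assert.h>
#include <stdint.h>

#define BDRV_SECTOR_SIZE 512ULL

/* Where a sector-based read actually begins: rounded down to a sector. */
static uint64_t sector_read_start(uint64_t coffset)
{
    return (coffset >> 9) * BDRV_SECTOR_SIZE;
}

/* Offset of the interesting bytes within the sector-based buffer;
 * equivalent to the patch's 'coffset & 511'. */
static uint64_t payload_offset_in_buffer(uint64_t coffset)
{
    return coffset - sector_read_start(coffset);
}
```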
Re: [Qemu-block] [Qemu-devel] [RFC v4 03/21] blockjobs: add manual property
On 02/23/2018 05:51 PM, John Snow wrote:
> This property will be used to opt-in to the new BlockJobs workflow
> that allows a tighter, more explicit control over transitions from
> one runstate to another.
>
> While we're here, fix up the documentation for block_job_create
> a little bit.
>
> Signed-off-by: John Snow
> ---
>  blockjob.c                   |  1 +
>  include/block/blockjob.h     | 10 ++++++++++
>  include/block/blockjob_int.h |  4 +++-
>  3 files changed, 14 insertions(+), 1 deletion(-)

Reviewed-by: Eric Blake
Re: [Qemu-block] [Qemu-devel] [RFC v4 01/21] blockjobs: fix set-speed kick
On 02/23/2018 05:51 PM, John Snow wrote:
> If speed is '0' it's not actually "less than" the previous speed.
> Kick the job in this case too.
>
> Signed-off-by: John Snow
> ---
>  blockjob.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Eric Blake

> diff --git a/blockjob.c b/blockjob.c
> index 3f52f29f75..24833ef30f 100644
> --- a/blockjob.c
> +++ b/blockjob.c
> @@ -499,7 +499,7 @@ void block_job_set_speed(BlockJob *job, int64_t speed, Error **errp)
>      }
>
>      job->speed = speed;
> -    if (speed <= old_speed) {
> +    if (speed && speed <= old_speed) {
>          return;
>      }
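The behavioral change in one line: a speed of 0 means "no limit", so it must not be treated as "not faster than before". A sketch of the fixed condition with an invented helper name:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* true when block_job_set_speed() should kick the job awake, i.e. when
 * it should NOT take the early return in the patched condition. */
static bool should_kick(int64_t speed, int64_t old_speed)
{
    return !(speed && speed <= old_speed);
}
```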
[Qemu-block] [PATCH v4 2/5] qcow2: Document some maximum size constraints
Although off_t permits up to 63 bits (8EB) of file offsets, in
practice, we're going to hit other limits first. Document some of
those limits in the qcow2 spec, and how choice of cluster size can
influence some of the limits.

While at it, notice that since we cannot map any virtual cluster to
any address higher than 64 PB (56 bits) (due to the current L1/L2
field encoding stopping at bit 55), it makes little sense to require
the refcount table to access host offsets beyond that point. Mark the
upper bits of the refcount table entries as reserved to match the
L1/L2 table, with no ill effects, since it is unlikely that there are
any existing images larger than 64PB in the first place, and thus all
existing images already have those bits as 0. If 64PB proves to be too
small in the future, we could enlarge all three uses of bit 55 into
the reserved bits at that time.

However, there is one limit that reserved bits don't help with: for
compressed clusters, the L2 layout requires an ever-smaller maximum
host offset as cluster size gets larger, down to a 512 TB maximum with
2M clusters.

Signed-off-by: Eric Blake

---
v4: more wording tweaks
v3: new patch
---
 docs/interop/qcow2.txt | 38 +++++++++++++++++++++++++++++++++++---
 1 file changed, 35 insertions(+), 3 deletions(-)

diff --git a/docs/interop/qcow2.txt b/docs/interop/qcow2.txt
index feb711fb6a8..e32d391e66b 100644
--- a/docs/interop/qcow2.txt
+++ b/docs/interop/qcow2.txt
@@ -40,7 +40,17 @@ The first cluster of a qcow2 image contains the file header:
                     with larger cluster sizes.

          24 - 31:   size
-                    Virtual disk size in bytes
+                    Virtual disk size in bytes.
+
+                    Note: with a 2 MB cluster size, the maximum
+                    virtual size is 2 EB (61 bits) for a sparse file,
+                    but other sizing limitations in refcount and L1/L2
+                    tables mean that an image cannot have more than 64
+                    PB of populated clusters (and an image may hit
+                    other sizing limitations as well, such as
+                    underlying protocol limits). With a 512 byte
+                    cluster size, the maximum virtual size drops to
+                    128 GB (37 bits).

          32 - 35:   crypt_method
                     0 for no encryption
@@ -318,6 +328,13 @@ for each host cluster. A refcount of 0 means that the cluster is free, 1 means
 that it is used, and >= 2 means that it is used and any write access must
 perform a COW (copy on write) operation.

+The refcount table has implications on the maximum host file size; a
+larger cluster size is required for the refcount table to cover larger
+offsets. Furthermore, all qcow2 metadata must currently reside at
+offsets below 64 PB (56 bits) (this limit could be enlarged by putting
+reserved bits into use, but only if a similar limit on L1/L2 tables is
+revisited at the same time).
+
 The refcounts are managed in a two-level table. The first level is called
 refcount table and has a variable size (which is stored in the header). The
 refcount table can cover multiple clusters, however it needs to be contiguous

Refcount table entry:

    Bit  0 -  8:    Reserved (set to 0)

-         9 - 63:   Bits 9-63 of the offset into the image file at which the
+         9 - 55:   Bits 9-55 of the offset into the image file at which the
                    refcount block starts. Must be aligned to a cluster
                    boundary.

@@ -349,6 +366,8 @@ Refcount table entry:
                    been allocated. All refcounts managed by this refcount block
                    are 0.

+        56 - 63:   Reserved (set to 0)
+
 Refcount block entry (x = refcount_bits - 1):

    Bit  0 -  x:    Reference count of the cluster. If refcount_bits implies a
@@ -365,6 +384,17 @@ The L1 table has a variable size (stored in the header) and may use multiple
 clusters, however it must be contiguous in the image file. L2 tables are
 exactly one cluster in size.

+The L1 and L2 tables have implications on the maximum virtual file
+size; a larger cluster size is required for the guest to have access
+to more space. Furthermore, a virtual cluster must currently map to a
+host offset below 64 PB (56 bits) (this limit could be enlarged by
+putting reserved bits into use, but only if a similar limit on
+refcount tables is revisited at the same time). Additionally, with
+larger cluster sizes, compressed clusters have a smaller limit on host
+cluster mappings (a 2M cluster size requires compressed clusters to
+reside below 512 TB (49 bits), where enlarging this would require an
+incompatible layout change).
+
 Given a offset into the virtual disk, the offset into the image file can be
 obtained as follows:

@@ -427,7 +457,9 @@ Standard Cluster Descriptor:

 Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

    Bit  0 - x-1:   Host cluster
[Qemu-block] [PATCH v4 4/5] qcow2: Don't allow overflow during cluster allocation
Our code was already checking that we did not attempt to allocate more
clusters than what would fit in an INT64 (the physical maximum if we
can access a full off_t's worth of data). But this does not catch
smaller limits enforced by various spots in the qcow2 image
description: L1 and normal clusters of L2 are documented as having
bits 63-56 reserved for other purposes, capping our maximum offset at
64PB (bit 55 is the maximum bit set). And for compressed images with
2M clusters, the cap drops the maximum offset to bit 48, or a maximum
offset of 512TB. If we overflow that offset, we would write compressed
data into one place, but try to decompress from another, which won't
work.

I don't have 512TB handy to prove whether things break if we compress
so much data that we overflow that limit, and don't think that iotests
can (quickly) test it either. Test 138 comes close (it corrupts an
image into thinking something lives at 32PB, which is half the maximum
for L1 sizing - although it relies on 512-byte clusters). But that
test points out that we will generally hit other limits first (such as
running out of memory for the refcount table, or exceeding file system
limits like 16TB on ext4, etc), so this is more a theoretical safety
valve than something likely to be hit.

Signed-off-by: Eric Blake
Reviewed-by: Alberto Garcia

---
v3: use s->cluster_offset_mask instead of open-coding it [Berto], use
MIN() to clamp offset on small cluster size, add spec patch first to
justify clamping even on refcount allocations
---
 block/qcow2.h          |  6 ++++++
 block/qcow2-refcount.c | 21 +++++++++++++++------
 2 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/block/qcow2.h b/block/qcow2.h
index afc524ead54..785c8d6e222 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -41,6 +41,12 @@
 #define QCOW_MAX_CRYPT_CLUSTERS 32
 #define QCOW_MAX_SNAPSHOTS 65536

+/* Field widths in qcow2 mean normal cluster offsets cannot reach
+ * 64PB; depending on cluster size, compressed clusters can have a
+ * smaller limit (64PB for up to 16k clusters, then ramps down to
+ * 512TB for 2M clusters). */
+#define QCOW_MAX_CLUSTER_OFFSET ((1ULL << 56) - 1)
+
 /* 8 MB refcount table is enough for 2 PB images at 64k cluster size
  * (128 GB for 512 byte clusters, 2 EB for 2 MB clusters) */
 #define QCOW_MAX_REFTABLE_SIZE 0x800000

diff --git a/block/qcow2-refcount.c b/block/qcow2-refcount.c
index 28afbb1b5ea..c46a2257a9e 100644
--- a/block/qcow2-refcount.c
+++ b/block/qcow2-refcount.c
@@ -31,7 +31,8 @@
 #include "qemu/bswap.h"
 #include "qemu/cutils.h"

-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size);
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max);
 static int QEMU_WARN_UNUSED_RESULT update_refcount(BlockDriverState *bs,
                                                    int64_t offset, int64_t length, uint64_t addend,
                                                    bool decrease, enum qcow2_discard_type type);
@@ -362,7 +363,8 @@ static int alloc_refcount_block(BlockDriverState *bs,
     }

     /* Allocate the refcount block itself and mark it as used */
-    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size);
+    int64_t new_block = alloc_clusters_noref(bs, s->cluster_size,
+                                             QCOW_MAX_CLUSTER_OFFSET);
     if (new_block < 0) {
         return new_block;
     }
@@ -947,7 +949,8 @@ int qcow2_update_cluster_refcount(BlockDriverState *bs,
 /* return < 0 if error
  */
-static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size)
+static int64_t alloc_clusters_noref(BlockDriverState *bs, uint64_t size,
+                                    uint64_t max)
 {
     BDRVQcow2State *s = bs->opaque;
     uint64_t i, nb_clusters, refcount;
@@ -972,9 +975,9 @@ retry:
     }

     /* Make sure that all offsets in the "allocated" range are representable
-     * in an int64_t */
+     * in the requested max */
     if (s->free_cluster_index > 0 &&
-        s->free_cluster_index - 1 > (INT64_MAX >> s->cluster_bits))
+        s->free_cluster_index - 1 > (max >> s->cluster_bits))
     {
         return -EFBIG;
     }
@@ -994,7 +997,7 @@ int64_t qcow2_alloc_clusters(BlockDriverState *bs, uint64_t size)
     BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC);
     do {
-        offset = alloc_clusters_noref(bs, size);
+        offset = alloc_clusters_noref(bs, size, QCOW_MAX_CLUSTER_OFFSET);
         if (offset < 0) {
             return offset;
         }
@@ -1076,7 +1079,11 @@ int64_t qcow2_alloc_bytes(BlockDriverState *bs, int size)
     free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
     do {
         if (!offset || free_in_cluster < size) {
-            int64_t new_cluster = alloc_clusters_noref(bs, s->cluster_size);
+            int64_t new_cluster;
+
+            new_cluster = alloc_clusters_noref(bs, s->cluster_size,
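The core of the new check in `alloc_clusters_noref()`, restated standalone (simplified sketch; the real code returns -EFBIG and works on BDRVQcow2State fields):

```c
#include <assert.h>
#include <stdint.h>

/* Every offset in the about-to-be-allocated range must be
 * representable within 'max'; otherwise allocation must fail. */
static int offset_representable(uint64_t free_cluster_index,
                                unsigned cluster_bits, uint64_t max)
{
    if (free_cluster_index > 0 &&
        free_cluster_index - 1 > (max >> cluster_bits)) {
        return 0; /* -EFBIG in the real code */
    }
    return 1;
}
```

With max = (1ULL << 56) - 1 and 64k clusters (cluster_bits = 16), the highest usable cluster index is 2^40 - 1; one past that is rejected.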
Re: [Qemu-block] [Qemu-devel] [RFC v4 04/21] blockjobs: add status enum
On 02/23/2018 05:51 PM, John Snow wrote:
> We're about to add several new states, and booleans are becoming
> unwieldy and difficult to reason about. It would help to have a
> more explicit bookkeeping of the state of blockjobs. To this end,
> add a new "status" field and add our existing states in a redundant
> manner alongside the bools they are replacing:
>
> Some of these states appear somewhat superfluous, but it helps define
> the expected flow of a job; so some of the states wind up being
> synchronous empty transitions. Importantly, jobs can be in only one of
> these states at any given time, which helps code and external users
> alike reason about the current condition of a job unambiguously.
>
> Signed-off-by: John Snow
> ---

> +++ b/qapi/block-core.json
> @@ -955,6 +955,32 @@
>  { 'enum': 'BlockJobType',
>    'data': ['commit', 'stream', 'mirror', 'backup'] }
>
> +##
> +# @BlockJobStatus:
> +#
> +# Indicates the present state of a given blockjob in its lifetime.
> +#
> +# @undefined: Erroneous, default state. Should not ever be visible.
> +#
> +# @created: The job has been created, but not yet started.
> +#
> +# @running: The job is currently running.
> +#
> +# @paused: The job is running, but paused. The pause may be requested by
> +#          either the QMP user or by internal processes.
> +#
> +# @ready: The job is running, but is ready for the user to signal completion.
> +#         This is used for long-running jobs like mirror that are designed to
> +#         run indefinitely.
> +#
> +# @standby: The job is ready, but paused. This is nearly identical to @paused.
> +#           The job may return to @ready or otherwise be canceled.
> +#
> +# Since: 2.12
> +##
> +{ 'enum': 'BlockJobStatus',
> +  'data': ['undefined', 'created', 'running', 'paused', 'ready', 'standby'] }

Do we need 'undefined' in the list if it should never be visible? I'm
okay if you keep it because it makes other code easier, but as Kevin
points out in 5/21, if you aren't even using it, it might be worth
considering dropping the state altogether.

Otherwise, the patch looks fine to me.
Re: [Qemu-block] [RFC v4 05/21] blockjobs: add state transition table
On 02/27/2018 11:27 AM, Kevin Wolf wrote:
> Am 24.02.2018 um 00:51 hat John Snow geschrieben:
>> The state transition table has mostly been implied. We're about to make
>> it a bit more complex, so let's make the STM explicit instead.
>>
>> Perform state transitions with a function that for now just asserts the
>> transition is appropriate.
>>
>> Transitions:
>> Undefined -> Created:   During job initialization.
>> Created   -> Running:   Once the job is started.
>>                         Jobs cannot transition from "Created" to "Paused"
>>                         directly, but will instead synchronously transition
>>                         to running to paused immediately.
>> Running   -> Paused:    Normal workflow for pauses.
>> Running   -> Ready:     Normal workflow for jobs reaching their sync point.
>>                         (e.g. mirror)
>> Ready     -> Standby:   Normal workflow for pausing ready jobs.
>> Paused    -> Running:   Normal resume.
>> Standby   -> Ready:     Resume of a Standby job.
>>
>>          +---------+
>>          |UNDEFINED|
>>          +--+------+
>>             |
>>          +--v----+
>>          |CREATED|
>>          +--+----+
>>             |
>>          +--v----+     +------+
>>          |RUNNING<----->PAUSED|
>>          +--+----+     +------+
>>             |
>>          +--v--+       +-------+
>>          |READY<------->STANDBY|
>>          +-----+       +-------+
>>
>> Notably, there is no state presently defined as of this commit that
>> deals with a job after the "running" or "ready" states, so this table
>> will be adjusted alongside the commits that introduce those states.
>>
>> Signed-off-by: John Snow
>> ---
>>  block/trace-events |  3 +++
>>  blockjob.c         | 42 ++++++++++++++++++++++++++++++++++++------
>>  2 files changed, 39 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/trace-events b/block/trace-events
>> index 02dd80ff0c..b75a0c8409 100644
>> --- a/block/trace-events
>> +++ b/block/trace-events
>> @@ -4,6 +4,9 @@
>>  bdrv_open_common(void *bs, const char *filename, int flags, const char *format_name) "bs %p filename \"%s\" flags 0x%x format_name \"%s\""
>>  bdrv_lock_medium(void *bs, bool locked) "bs %p locked %d"
>>
>> +# blockjob.c
>> +block_job_state_transition(void *job, int ret, const char *legal, const char *s0, const char *s1) "job %p (ret: %d) attempting %s transition (%s-->%s)"
>> +
>>  # block/block-backend.c
>>  blk_co_preadv(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>>  blk_co_pwritev(void *blk, void *bs, int64_t offset, unsigned int bytes, int flags) "blk %p bs %p offset %"PRId64" bytes %u flags 0x%x"
>> diff --git a/blockjob.c b/blockjob.c
>> index 1be9c20cff..d745b3bb69 100644
>> --- a/blockjob.c
>> +++ b/blockjob.c
>> @@ -28,6 +28,7 @@
>>  #include "block/block.h"
>>  #include "block/blockjob_int.h"
>>  #include "block/block_int.h"
>> +#include "block/trace.h"
>>  #include "sysemu/block-backend.h"
>>  #include "qapi/error.h"
>>  #include "qapi/qmp/qerror.h"
>> @@ -41,6 +42,34 @@
>>   * block_job_enter. */
>>  static QemuMutex block_job_mutex;
>>
>> +/* BlockJob State Transition Table */
>> +bool BlockJobSTT[BLOCK_JOB_STATUS__MAX][BLOCK_JOB_STATUS__MAX] = {
>> +                                          /* U, C, R, P, Y, S */
>> +    /* U: */ [BLOCK_JOB_STATUS_UNDEFINED] = {0, 1, 0, 0, 0, 0},
>
> Even at the end of the series, this is the only use of
> BLOCK_JOB_STATUS_UNDEFINED.
>
>> +    /* C: */ [BLOCK_JOB_STATUS_CREATED]   = {0, 0, 1, 0, 0, 0},
>> +    /* R: */ [BLOCK_JOB_STATUS_RUNNING]   = {0, 0, 0, 1, 1, 0},
>> +    /* P: */ [BLOCK_JOB_STATUS_PAUSED]    = {0, 0, 1, 0, 0, 0},
>> +    /* Y: */ [BLOCK_JOB_STATUS_READY]     = {0, 0, 0, 0, 0, 1},
>> +    /* S: */ [BLOCK_JOB_STATUS_STANDBY]   = {0, 0, 0, 0, 1, 0},
>> +};
>> +
>> +static void block_job_state_transition(BlockJob *job, BlockJobStatus s1)
>> +{
>> +    BlockJobStatus s0 = job->status;
>> +    if (s0 == s1) {
>> +        return;
>> +    }
>> +    assert(s1 >= 0 && s1 <= BLOCK_JOB_STATUS__MAX);
>> +    trace_block_job_state_transition(job, job->ret, BlockJobSTT[s0][s1] ?
>> +                                     "allowed" : "disallowed",
>> +                                     qapi_enum_lookup(&BlockJobStatus_lookup,
>> +                                                      s0),
>> +                                     qapi_enum_lookup(&BlockJobStatus_lookup,
>> +                                                      s1));
>> +    assert(BlockJobSTT[s0][s1]);
>> +    job->status = s1;
>> +}
>> +
>>  static void block_job_lock(void)
>>  {
>>      qemu_mutex_lock(&block_job_mutex);
>> @@ -320,7 +349,7 @@ void block_job_start(BlockJob *job)
>>      job->pause_count--;
>>      job->busy = true;
>>      job->paused = false;
>> -    job->status = BLOCK_JOB_STATUS_RUNNING;
>> +    block_job_state_transition(job, BLOCK_JOB_STATUS_RUNNING);
>>      bdrv_coroutine_enter(blk_bs(job->blk), job->co);
>>  }
>>
>> @@ -704,6 +733,7 @@ void *block_job_create(const char *job_id, const BlockJobDriver *driver,
>>      job->refcnt
Re: [Qemu-block] [Qemu-devel] [RFC] qemu-img: Drop BLK_ZERO from convert
On Mon, Feb 26, 2018 at 06:03:13PM +0100, Max Reitz wrote:
> There are filesystems (among which is tmpfs) that have a hard time
> reporting allocation status. That is definitely a bug in them.
>
> However, there is no good reason why qemu-img convert should query the
> allocation status in the first place. It does zero detection by itself
> anyway, so we can detect unallocated areas ourselves.
>
> Furthermore, if a filesystem driver has any sense, reading unallocated
> data should take just as much time as lseek(SEEK_DATA) + memset(). So
> the only overhead we introduce by dropping the manual lseek() call is a
> memset() in the driver and a buffer_is_zero() in qemu-img, both of which
> should be relatively quick.

This makes sense. Which file systems did you test this patch on? XFS, ext4, and tmpfs would be a good minimal test set to prove the patch. Perhaps with two input files:

1. A file that is mostly filled with data.
2. A file that is only sparsely populated with data.

The time taken should be comparable with the time before this patch.
[Qemu-block] [RFC PATCH 1/2] migrate: Allow incoming migration without defer
Allow a QEMU instance which has been started and used without the "-incoming" flag to accept an incoming migration with the "migrate-incoming" QMP command. This allows the user to dump the VM state to an external file then revert to that state at a later time without restarting QEMU. --- migration/migration.c | 12 +--- vl.c | 2 ++ 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/migration/migration.c b/migration/migration.c index 0aa596f867..05595a4cec 100644 --- a/migration/migration.c +++ b/migration/migration.c @@ -1321,14 +1321,14 @@ void migrate_del_blocker(Error *reason) void qmp_migrate_incoming(const char *uri, Error **errp) { Error *local_err = NULL; -static bool once = true; -if (!deferred_incoming) { -error_setg(errp, "For use with '-incoming defer'"); +if (runstate_check(RUN_STATE_INMIGRATE)) { +error_setg(errp, "The incoming migration has already been started"); return; } -if (!once) { -error_setg(errp, "The incoming migration has already been started"); + +if (!deferred_incoming) { +vm_stop(RUN_STATE_INMIGRATE); } qemu_start_incoming_migration(uri, _err); @@ -1337,8 +1337,6 @@ void qmp_migrate_incoming(const char *uri, Error **errp) error_propagate(errp, local_err); return; } - -once = false; } bool migration_is_blocked(Error **errp) diff --git a/vl.c b/vl.c index 9e7235df6d..a91eda039e 100644 --- a/vl.c +++ b/vl.c @@ -634,6 +634,7 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_PAUSED, RUN_STATE_POSTMIGRATE }, { RUN_STATE_PAUSED, RUN_STATE_PRELAUNCH }, { RUN_STATE_PAUSED, RUN_STATE_COLO}, +{ RUN_STATE_PAUSED, RUN_STATE_INMIGRATE }, { RUN_STATE_POSTMIGRATE, RUN_STATE_RUNNING }, { RUN_STATE_POSTMIGRATE, RUN_STATE_FINISH_MIGRATE }, @@ -665,6 +666,7 @@ static const RunStateTransition runstate_transitions_def[] = { { RUN_STATE_RUNNING, RUN_STATE_WATCHDOG }, { RUN_STATE_RUNNING, RUN_STATE_GUEST_PANICKED }, { RUN_STATE_RUNNING, RUN_STATE_COLO}, +{ RUN_STATE_RUNNING, RUN_STATE_INMIGRATE }, { RUN_STATE_SAVE_VM, 
RUN_STATE_RUNNING }, -- 2.16.2
[Qemu-block] [RFC PATCH 2/2] migrate: Tests for migrating to an already used QEMU
Currently this appears to work for X86_64, but PPC64 fails due to some unexpected data on the serial port after migration. This probably requires quite a bit more work. --- tests/migration-test.c | 113 + 1 file changed, 85 insertions(+), 28 deletions(-) diff --git a/tests/migration-test.c b/tests/migration-test.c index 74f9361bdd..8217d2e83e 100644 --- a/tests/migration-test.c +++ b/tests/migration-test.c @@ -125,19 +125,17 @@ static void init_bootfile_ppc(const char *bootpath) * we get an 'A' followed by an endless string of 'B's * but on the destination we won't have the A. */ -static void wait_for_serial(const char *side) +static void wait_for_serial(const char *side, gboolean started) { char *serialpath = g_strdup_printf("%s/%s", tmpfs, side); FILE *serialfile = fopen(serialpath, "r"); const char *arch = qtest_get_arch(); -int started = (strcmp(side, "src_serial") == 0 && - strcmp(arch, "ppc64") == 0) ? 0 : 1; g_free(serialpath); do { int readvalue = fgetc(serialfile); -if (!started) { +if (!started && !strcmp(arch, "ppc64")) { /* SLOF prints its banner before starting test, * to ignore it, mark the start of the test with '_', * ignore all characters until this marker @@ -358,16 +356,21 @@ static void migrate_set_capability(QTestState *who, const char *capability, QDECREF(rsp); } -static void migrate(QTestState *who, const char *uri) +static void migrate(QTestState *who, const char *uri, gboolean incoming) { QDict *rsp; +const gchar *cmd_name = incoming ? 
"migrate-incoming" : "migrate"; gchar *cmd; -cmd = g_strdup_printf("{ 'execute': 'migrate'," +cmd = g_strdup_printf("{ 'execute': '%s'," "'arguments': { 'uri': '%s' } }", - uri); + cmd_name, uri); rsp = qtest_qmp(who, cmd); g_free(cmd); +while (qdict_haskey(rsp, "event")) { +QDECREF(rsp); +rsp = qtest_qmp_receive(who); +} g_assert(qdict_haskey(rsp, "return")); QDECREF(rsp); } @@ -382,15 +385,22 @@ static void migrate_start_postcopy(QTestState *who) } static void test_migrate_start(QTestState **from, QTestState **to, - const char *uri) + const char *uri, gboolean incoming) { -gchar *cmd_src, *cmd_dst; +gchar *cmd_src; +GString *cmd_dst = g_string_new(NULL); char *bootpath = g_strdup_printf("%s/bootsect", tmpfs); +char *bootpath_dst; const char *arch = qtest_get_arch(); const char *accel = "kvm:tcg"; got_stop = false; +if (incoming) +bootpath_dst = bootpath; +else +bootpath_dst = g_strdup_printf("%s/bootsect_dst", tmpfs); + if (strcmp(arch, "i386") == 0 || strcmp(arch, "x86_64") == 0) { init_bootfile_x86(bootpath); cmd_src = g_strdup_printf("-machine accel=%s -m 150M" @@ -398,12 +408,14 @@ static void test_migrate_start(QTestState **from, QTestState **to, " -serial file:%s/src_serial" " -drive file=%s,format=raw", accel, tmpfs, bootpath); -cmd_dst = g_strdup_printf("-machine accel=%s -m 150M" - " -name target,debug-threads=on" - " -serial file:%s/dest_serial" - " -drive file=%s,format=raw" - " -incoming %s", - accel, tmpfs, bootpath, uri); +if (!incoming) +init_bootfile_x86(bootpath_dst); + +g_string_printf(cmd_dst, "-machine accel=%s -m 150M" +" -name target,debug-threads=on" +" -serial file:%s/dest_serial" +" -drive file=%s,format=raw", +accel, tmpfs, bootpath_dst); } else if (strcmp(arch, "ppc64") == 0) { /* On ppc64, the test only works with kvm-hv, but not with kvm-pr */ @@ -416,22 +428,29 @@ static void test_migrate_start(QTestState **from, QTestState **to, " -serial file:%s/src_serial" " -drive file=%s,if=pflash,format=raw", accel, tmpfs, bootpath); -cmd_dst = 
g_strdup_printf("-machine accel=%s -m 256M" - " -name target,debug-threads=on" - " -serial file:%s/dest_serial" - " -incoming %s", - accel, tmpfs, uri); +g_string_printf(cmd_dst, "-machine accel=%s -m 256M" +" -name target,debug-threads=on" +" -serial file:%s/dest_serial", +accel,
[Qemu-block] [RFC PATCH 0/2] Increase usability of external snapshots
Hello,

Following on from the discussion about creating savevm/loadvm QMP equivalents, I decided to take the advice given that we should use external snapshots. However, reverting to a snapshot currently requires QEMU to be restarted with "-incoming". Restarting QEMU requires a fair bit of bookkeeping to be done by the management application and may take measurably more time. Also, to quote the previous discussion:

>>> > Then you could just use the regular migrate QMP commands for loading
>>> > and saving snapshots. Might need a little extra work on the incoming
>>> > side, since we need to be able to load snapshots, despite QEMU not
>>> > being started with '-incoming defer', but might still be doable ?
>>> > This would theoretically give us progress monitoring, cancellation,
>>> > etc for free.
>>>
>>> What actually stops this working other than the sanity check in
>>> migrate_incoming ?
>>
>> No idea really - not looked closely at the code implications.
>
> It would be a plus for migration code, right now there are _two_
> implementations, and savevm/loadvm one gets less love.
>
> And we will check "much more" the way to load migration in a
> non-pristine qemu, so
>
> Later, Juan.

This patch just changes a few lines so that the currently running VM is shut down and the QEMU instance is put into a state where it can accept an incoming migration. I haven't tested it yet with more complex scenarios, but it appears to work fine so far.

I have also started work on a new migration test, which is almost identical to the existing post-copy test, but migrates to a non-pristine QEMU. It works on X86, but not yet on PPC64 because the serial output is different. It needs quite a lot more work, but before I continue, I would like to know first whether this is something people fundamentally object to.
Richard Palethorpe (2): migrate: Allow incoming migration without defer migrate: Tests for migrating to an already used QEMU migration/migration.c | 12 +++--- tests/migration-test.c | 113 + vl.c | 2 + 3 files changed, 92 insertions(+), 35 deletions(-) -- 2.16.2
[Qemu-block] [PATCH v4 0/5] minor compression improvements
In v4: - patch 2 has even more wording tweaks [Kevin] - patch 3 is new [Berto] - patch 5 fixes an off-by-one [Berto] Eric Blake (5): qcow2: Prefer byte-based calls into bs->file qcow2: Document some maximum size constraints qcow2: Reduce REFT_OFFSET_MASK qcow2: Don't allow overflow during cluster allocation qcow2: Avoid memory over-allocation on compressed images docs/interop/qcow2.txt | 38 +++--- block/qcow2.h | 8 +++- block/qcow2-cluster.c | 32 block/qcow2-refcount.c | 27 +-- block/qcow2.c | 2 +- 5 files changed, 80 insertions(+), 27 deletions(-) -- 2.14.3
Re: [Qemu-block] [PATCH] vl: introduce vm_shutdown()
On Fri, Feb 23, 2018 at 04:20:44PM +0800, Fam Zheng wrote:
> On Tue, 02/20 13:10, Stefan Hajnoczi wrote:
> > 1. virtio_scsi_handle_cmd_vq() racing with iothread_stop_all() hits the
> >    virtio_scsi_ctx_check() assertion failure because the BDS AioContext
> >    has been modified by iothread_stop_all().
>
> Does this patch fix the issue completely? IIUC virtio_scsi_handle_cmd can
> already be entered at the time of main thread calling virtio_scsi_clear_aio(),
> so this race condition still exists:
>
> main thread / iothread
> -
> vm_shutdown
>   ...
>   virtio_bus_stop_ioeventfd
>     virtio_scsi_dataplane_stop
>       aio_poll()
>         ...
>
> virtio_scsi_data_plane_handle_cmd()
>   aio_context_acquire(s->ctx)
> virtio_scsi_acquire(s).enter
>   virtio_scsi_clear_aio()
> aio_context_release(s->ctx)
> virtio_scsi_acquire(s).return
>   virtio_scsi_handle_cmd_vq()
>     ...
>     virtqueue_pop()
>
> Is it possible that the above virtqueue_pop() still returns one element that
> was queued before vm_shutdown() was called?

No, it can't because virtio_scsi_clear_aio() invokes virtio_queue_host_notifier_aio_read(>host_notifier) to process the virtqueue. By the time we get back to iothread's virtio_scsi_data_plane_handle_cmd() the virtqueue is already empty.

Vcpus have been paused so no additional elements can slip into the virtqueue.
Re: [Qemu-block] [PATCH v3 2/4] qcow2: Document some maximum size constraints
> Note that L1 and L2 fields both stop at bit 55 currently, but do have
> room for expansion up to bit 61; so all three limits (if we include
> refcount table in the set capped at bit 55 for now) could be raised
> simultaneously if we find 64P too small in the future.

64 petabytes ought to be enough for anybody :-)

Berto
Re: [Qemu-block] [PATCH v3 3/7] qapi: Replace qobject_to_X(o) by qobject_to(o, X)
On 02/26/2018 06:01 AM, Max Reitz wrote: +++ b/block.c @@ -1457,7 +1457,7 @@ static QDict *parse_json_filename(const char *filename, Error **errp) return NULL; } - options = qobject_to_qdict(options_obj); + options = qobject_to(options_obj, QDict); Bikeshedding - would it read any easier as: options = qobject_to(QDict, options_obj); ? If so, your Coccinelle script can be touched up, and patch 2/7 swaps argument order around, so it would be tolerable but still slightly busywork to regenerate the series. But I'm not strongly attached to either order, and so I'm also willing to take this as-is (especially since that's less work), if no one else has a strong opinion that swapping order would aid legibility. Well, same for me. :-) In a template/generic language, we'd write the type first (e.g. qobject_cast(options_obj)). But maybe we'd write the object first, too (e.g. options_obj.cast()). And the current order of the arguments follows the order in the name ("qobject" options_obj "to" QDict). But maybe it's more natural to read it as "qobject to" QDict "applied to" options_obj. I don't know either. Okay, after looking for existing uses of type names in macro calls, I see: qemu/compiler.h: #ifndef container_of #define container_of(ptr, type, member) ({ \ const typeof(((type *) 0)->member) *__mptr = (ptr); \ (type *) ((char *) __mptr - offsetof(type, member));}) #endif /* Convert from a base type to a parent type, with compile time checking. */ #ifdef __GNUC__ #define DO_UPCAST(type, field, dev) ( __extension__ ( { \ char __attribute__((unused)) offset_must_be_zero[ \ -offsetof(type, field)]; \ container_of(dev, type, field);})) #else #define DO_UPCAST(type, field, dev) container_of(dev, type, field) #endif #define typeof_field(type, field) typeof(((type *)0)->field) qapi/clone-visitor.h: /* * Deep-clone QAPI object @src of the given @type, and return the result. 
 *
 * Not usable on QAPI scalars (integers, strings, enums), nor on a
 * QAPI object that references the 'any' type. Safe when @src is NULL.
 */
#define QAPI_CLONE(type, src) \

/*
 * Copy deep clones of @type members from @src to @dst.
 *
 * Not usable on QAPI scalars (integers, strings, enums), nor on a
 * QAPI object that references the 'any' type.
 */
#define QAPI_CLONE_MEMBERS(type, dst, src) \

2 out of 3 macros in compiler.h put the type name first, and container_of() puts it in the middle of 3. It's even weirder because DO_UPCAST(t, f, d) calls container_of(d, t, f), where the inconsistency makes it a mental struggle to figure out how to read the two macros side by side, compared to if we had just been consistent. Meanwhile, all of the macros in qapi put the type name first.

So at this point, I'm 70:30 in favor of doing the rename to have qobject_to(type, obj) for consistency with the majority of other macros that take a type name (type names are already unusual as arguments to macros, whether or not the macro is named with ALL_CAPS).

(Sorry, I know that means more busy work for you, if you agree with my reasoning)
Re: [Qemu-block] [PATCH v3 2/4] qcow2: Document some maximum size constraints
On 02/27/2018 05:47 AM, Kevin Wolf wrote:

Am 22.02.2018 um 16:59 hat Eric Blake geschrieben:

Although off_t permits up to 63 bits (8EB) of file offsets, in practice, we're going to hit other limits first. Document some of those limits in the qcow2 spec, and how choice of cluster size can influence some of the limits.

While at it, notice that since we cannot map any virtual cluster to any address higher than 64 PB (56 bits) (due to the L1/L2 field encoding), it makes little sense to require the refcount table to access host offsets beyond that point. Mark the upper bits of the refcount table entries as reserved, with no ill effects, since it is unlikely that there are any existing images larger than 64PB in the first place, and thus all existing images already have those bits as 0.

Signed-off-by: Eric Blake

I think it would be good to mention the exact reason for the 56 bits in the spec. Even this commit message is rather vague ('L1/L2 field encoding'), so if at some point someone wonders, if we couldn't simply extend the allowed range, they won't easily see that it's related to compressed clusters.

Note that L1 and L2 fields both stop at bit 55 currently, but do have room for expansion up to bit 61; so all three limits (if we include refcount table in the set capped at bit 55 for now) could be raised simultaneously if we find 64P too small in the future. Compressed clusters are also related, but there, the limit is even smaller - with 2M clusters, a compressed cluster must reside within the first 512T of host offsets, and there are no free bits available for allowing additional compressed cluster storage without making an incompatible change of how compressed clusters are represented.
Re: [Qemu-block] [Qemu-devel] [PATCH v2 14/16] block/mirror: Add active mirroring
On 02/27/2018 03:34 AM, Fam Zheng wrote:

On Mon, 01/22 23:08, Max Reitz wrote:

@@ -1151,7 +1285,48 @@ static int coroutine_fn bdrv_mirror_top_preadv(BlockDriverState *bs,
 static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs,
     uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags)
 {
-    return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags);
+    MirrorOp *op = NULL;
+    MirrorBDSOpaque *s = bs->opaque;
+    QEMUIOVector bounce_qiov;
+    void *bounce_buf;
+    int ret = 0;
+    bool copy_to_target;
+
+    copy_to_target = s->job->ret >= 0 &&
+                     s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING;
+
+    if (copy_to_target) {
+        /* The guest might concurrently modify the data to write; but
+         * the data on source and destination must match, so we have
+         * to use a bounce buffer if we are going to write to the
+         * target now. */
+        bounce_buf = qemu_blockalign(bs, bytes);
+        iov_to_buf_full(qiov->iov, qiov->niov, 0, bounce_buf, bytes);

Quorum doesn't use a bounce buffer, so I think we can get away without it too: a guest concurrently modifying the buffer isn't a concern in practice.

Arguably, that's a bug in quorum. We also use a bounce buffer for the same reason when encrypting. We really do need to make sure that bits landing in more than one storage location come from the same point in time.
Re: [Qemu-block] [PATCH v3 2/7] qapi: Add qobject_to()
On Sat 24 Feb 2018 04:40:28 PM CET, Max Reitz wrote:
> This is a dynamic casting macro that, given a QObject type, returns an
> object as that type or NULL if the object is of a different type (or
> NULL itself).
>
> The macro uses lower-case letters because:
> 1. There does not seem to be a hard rule on whether qemu macros have to
>    be upper-cased,
> 2. The current situation in qapi/qmp is inconsistent (compare e.g.
>    QINCREF() vs. qdict_put()),
> 3. qobject_to() will evaluate its @obj parameter only once, thus it is
>    generally not important to the caller whether it is a macro or not,
> 4. I prefer it aesthetically.
>
> Signed-off-by: Max Reitz

Reviewed-by: Alberto Garcia

Berto
Re: [Qemu-block] [PATCH v3 1/7] compiler: Add QEMU_BUILD_BUG_MSG() macro
On Sat 24 Feb 2018 04:40:27 PM CET, Max Reitz wrote:
> _Static_assert() allows us to specify messages, and that may come in
> handy. Even without _Static_assert(), encouraging developers to put a
> helpful message next to the QEMU_BUILD_BUG_* may make debugging easier
> whenever it breaks.
>
> Signed-off-by: Max Reitz

Reviewed-by: Alberto Garcia

Berto
Re: [Qemu-block] [RFC PATCH 1/2] qcow2: Allow checking and repairing corrupted internal snapshots
On Mon 26 Feb 2018 02:40:08 PM CET, Max Reitz wrote: >> +"L1 table is too large; snapshot table entry >> corrupted\n", >> +(fix & BDRV_FIX_ERRORS) ? "Deleting" : "ERROR", >> +sn->id_str, sn->name, sn->l1_size); >> +found_corruption = true; >> +} > > This code assumes the snapshot table itself has been valid. Why should > it be when it contains garbage entries? > >> + >> +if (found_corruption) { >> +result->corruptions++; >> +sn->l1_size = 0; /* Prevent this L1 table from being used */ >> +if (fix & BDRV_FIX_ERRORS) { >> +ret = qcow2_snapshot_delete(bs, sn->id_str, sn->name, >> NULL); > > So calling this is actually very dangerous. It modifies the snapshot > table which I wouldn't trust is actually just a snapshot table. It > could intersect any other structure in the qcow2 image. Yes, we do an > overlap check, but that only protects metadata, and I don't really > want to see an overlap check corruption when repairing the image; > especially since this means you cannot fix the corruption. I think you're right. One thing that could be done is check that the area where the snapshot table is supposed to be doesn't overlap with any other data structures. > I don't quite know myself what to do instead, but I guess my main > point would be: Before any (potentially) destructive changes are made, > the user should have the chance of still opening the image read-only > and copying all the data off somewhere else. Which of course again > means we shouldn't prevent the user from opening an image because a > snapshot is broken. Ok, you've convinced me. I'll see what I can come up with. I guess initially we can simply add checks for sn->l1_table_offset and sn->l1_size in the places where they're used. Berto
Re: [Qemu-block] [PATCH v3 2/4] qcow2: Document some maximum size constraints
Am 22.02.2018 um 16:59 hat Eric Blake geschrieben:
> Although off_t permits up to 63 bits (8EB) of file offsets, in
> practice, we're going to hit other limits first. Document some
> of those limits in the qcow2 spec, and how choice of cluster size
> can influence some of the limits.
>
> While at it, notice that since we cannot map any virtual cluster
> to any address higher than 64 PB (56 bits) (due to the L1/L2 field
> encoding), it makes little sense to require the refcount table to
> access host offsets beyond that point. Mark the upper bits of
> the refcount table entries as reserved, with no ill effects, since
> it is unlikely that there are any existing images larger than 64PB
> in the first place, and thus all existing images already have those
> bits as 0.
>
> Signed-off-by: Eric Blake

I think it would be good to mention the exact reason for the 56 bits in the spec. Even this commit message is rather vague ('L1/L2 field encoding'), so if at some point someone wonders, if we couldn't simply extend the allowed range, they won't easily see that it's related to compressed clusters.

Kevin
Re: [Qemu-block] [PATCH v2 00/16] block/mirror: Add active-sync mirroring
On Sat, 02/24 16:42, Max Reitz wrote:
> Pïng

Looks good in general. I've left some small questions though.

Fam
Re: [Qemu-block] [PATCH v2 15/16] block/mirror: Add copy mode QAPI interface
On Mon, 01/22 23:08, Max Reitz wrote: > This patch allows the user to specify whether to use active or only > passive mode for mirror block jobs. Currently, this setting will remain I think you want s/passive/background/ in the whole patch. > constant for the duration of the entire block job. > > Signed-off-by: Max Reitz> --- > qapi/block-core.json | 11 +-- > include/block/block_int.h | 4 +++- > block/mirror.c| 12 +++- > blockdev.c| 9 - > 4 files changed, 27 insertions(+), 9 deletions(-) > > diff --git a/qapi/block-core.json b/qapi/block-core.json > index ba1fd736f5..5fafa5fcac 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -1573,6 +1573,9 @@ > # written. Both will result in identical contents. > # Default is true. (Since 2.4) > # > +# @copy-mode: when to copy data to the destination; defaults to 'passive' > +# (Since: 2.12) > +# > # Since: 1.3 > ## > { 'struct': 'DriveMirror', > @@ -1582,7 +1585,7 @@ > '*speed': 'int', '*granularity': 'uint32', > '*buf-size': 'int', '*on-source-error': 'BlockdevOnError', > '*on-target-error': 'BlockdevOnError', > -'*unmap': 'bool' } } > +'*unmap': 'bool', '*copy-mode': 'MirrorCopyMode' } } > > ## > # @BlockDirtyBitmap: > @@ -1761,6 +1764,9 @@ > #above @device. If this option is not given, a node name > is > #autogenerated. (Since: 2.9) > # > +# @copy-mode: when to copy data to the destination; defaults to 'passive' > +# (Since: 2.12) > +# > # Returns: nothing on success. 
> # > # Since: 2.6 > @@ -1781,7 +1787,8 @@ > '*speed': 'int', '*granularity': 'uint32', > '*buf-size': 'int', '*on-source-error': 'BlockdevOnError', > '*on-target-error': 'BlockdevOnError', > -'*filter-node-name': 'str' } } > +'*filter-node-name': 'str', > +'*copy-mode': 'MirrorCopyMode' } } > > ## > # @block_set_io_throttle: > diff --git a/include/block/block_int.h b/include/block/block_int.h > index 03f3fdd129..1fda4d3d43 100644 > --- a/include/block/block_int.h > +++ b/include/block/block_int.h > @@ -948,6 +948,7 @@ void commit_active_start(const char *job_id, > BlockDriverState *bs, > * @filter_node_name: The node name that should be assigned to the filter > * driver that the mirror job inserts into the graph above @bs. NULL means > that > * a node name should be autogenerated. > + * @copy_mode: When to trigger writes to the target. > * @errp: Error object. > * > * Start a mirroring operation on @bs. Clusters that are allocated > @@ -961,7 +962,8 @@ void mirror_start(const char *job_id, BlockDriverState > *bs, >MirrorSyncMode mode, BlockMirrorBackingMode backing_mode, >BlockdevOnError on_source_error, >BlockdevOnError on_target_error, > - bool unmap, const char *filter_node_name, Error **errp); > + bool unmap, const char *filter_node_name, > + MirrorCopyMode copy_mode, Error **errp); > > /* > * backup_job_create: > diff --git a/block/mirror.c b/block/mirror.c > index 83082adb64..3b23886a5a 100644 > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -1409,7 +1409,7 @@ static void mirror_start_job(const char *job_id, > BlockDriverState *bs, > const BlockJobDriver *driver, > bool is_none_mode, BlockDriverState *base, > bool auto_complete, const char > *filter_node_name, > - bool is_mirror, > + bool is_mirror, MirrorCopyMode copy_mode, > Error **errp) > { > MirrorBlockJob *s; > @@ -1515,7 +1515,7 @@ static void mirror_start_job(const char *job_id, > BlockDriverState *bs, > s->on_target_error = on_target_error; > s->is_none_mode = is_none_mode; > s->backing_mode = 
backing_mode; > -s->copy_mode = MIRROR_COPY_MODE_BACKGROUND; > +s->copy_mode = copy_mode; > s->base = base; > s->granularity = granularity; > s->buf_size = ROUND_UP(buf_size, granularity); > @@ -1582,7 +1582,8 @@ void mirror_start(const char *job_id, BlockDriverState > *bs, >MirrorSyncMode mode, BlockMirrorBackingMode backing_mode, >BlockdevOnError on_source_error, >BlockdevOnError on_target_error, > - bool unmap, const char *filter_node_name, Error **errp) > + bool unmap, const char *filter_node_name, > + MirrorCopyMode copy_mode, Error **errp) > { > bool is_none_mode; > BlockDriverState *base; > @@ -1597,7 +1598,7 @@ void mirror_start(const char *job_id, BlockDriverState > *bs, > speed, granularity, buf_size,
Re: [Qemu-block] [PATCH v2 14/16] block/mirror: Add active mirroring
On Mon, 01/22 23:08, Max Reitz wrote: > @@ -1151,7 +1285,48 @@ static int coroutine_fn > bdrv_mirror_top_preadv(BlockDriverState *bs, > static int coroutine_fn bdrv_mirror_top_pwritev(BlockDriverState *bs, > uint64_t offset, uint64_t bytes, QEMUIOVector *qiov, int flags) > { > -return bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags); > +MirrorOp *op = NULL; > +MirrorBDSOpaque *s = bs->opaque; > +QEMUIOVector bounce_qiov; > +void *bounce_buf; > +int ret = 0; > +bool copy_to_target; > + > +copy_to_target = s->job->ret >= 0 && > + s->job->copy_mode == MIRROR_COPY_MODE_WRITE_BLOCKING; > + > +if (copy_to_target) { > +/* The guest might concurrently modify the data to write; but > + * the data on source and destination must match, so we have > + * to use a bounce buffer if we are going to write to the > + * target now. */ > +bounce_buf = qemu_blockalign(bs, bytes); > +iov_to_buf_full(qiov->iov, qiov->niov, 0, bounce_buf, bytes); Quorum doesn't use a bounce buffer, so I think we can get away without it too: a guest concurrently modifying the buffer isn't a concern in practice. > + > +qemu_iovec_init(_qiov, 1); > +qemu_iovec_add(_qiov, bounce_buf, bytes); > +qiov = _qiov; > + > +op = active_write_prepare(s->job, offset, bytes); > +} > + > +ret = bdrv_co_pwritev(bs->backing, offset, bytes, qiov, flags); > +if (ret < 0) { > +goto out; > +} > + > +if (copy_to_target) { > +do_sync_target_write(s->job, offset, bytes, qiov, flags); > +} > + > +out: > +if (copy_to_target) { > +active_write_settle(op); > + > +qemu_iovec_destroy(_qiov); > +qemu_vfree(bounce_buf); > +} > +return ret; > } > > static int coroutine_fn bdrv_mirror_top_flush(BlockDriverState *bs) Don't you need to update bdrv_mirror_top_pdiscard and bdrv_mirror_top_pwritev too? Fam
Re: [Qemu-block] [PATCH v2 12/16] block/mirror: Distinguish active from passive ops
On Mon, 01/22 23:08, Max Reitz wrote: > Currently, the mirror block job only knows passive operations. But once > we introduce active writes, we need to distinguish between the two; for > example, mirror_wait_for_free_in_flight_slot() should wait for a passive > operation because active writes will not use the same in-flight slots. > > Signed-off-by: Max Reitz> --- > block/mirror.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/block/mirror.c b/block/mirror.c > index 2363e79563..bb46f3c4e9 100644 > --- a/block/mirror.c > +++ b/block/mirror.c > @@ -89,6 +89,7 @@ struct MirrorOp { > int64_t *bytes_handled; > > bool is_pseudo_op; > +bool is_active_write; > CoQueue waiting_requests; > > QTAILQ_ENTRY(MirrorOp) next; > @@ -281,8 +282,10 @@ static inline void > mirror_wait_for_free_in_flight_slot(MirrorBlockJob *s) > * some other operation to start, which may in fact be the > * caller of this function. Since there is only one pseudo op > * at any given time, we will always find some real operation > - * to wait on. */ > -if (!op->is_pseudo_op) { > + * to wait on. > + * Also, only non-active operations use up in-flight slots, so > + * we can ignore active operations. */ > +if (!op->is_pseudo_op && !op->is_active_write) { > qemu_co_queue_wait(>waiting_requests, NULL); > return; > } > -- > 2.14.3 > I'd just squash this patch into 14 to avoid code churn. Fam
Re: [Qemu-block] [PATCH v2 11/16] block/dirty-bitmap: Add bdrv_dirty_iter_next_area
On Mon, 01/22 23:08, Max Reitz wrote:
> This new function allows to look for a consecutively dirty area in a
> dirty bitmap.
>
> Signed-off-by: Max Reitz
> ---
>  include/block/dirty-bitmap.h |  2 ++
>  block/dirty-bitmap.c         | 51 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 53 insertions(+)
>
> diff --git a/include/block/dirty-bitmap.h b/include/block/dirty-bitmap.h
> index a591c27213..35f3ccc44c 100644
> --- a/include/block/dirty-bitmap.h
> +++ b/include/block/dirty-bitmap.h
> @@ -79,6 +79,8 @@ void bdrv_set_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>  void bdrv_reset_dirty_bitmap_locked(BdrvDirtyBitmap *bitmap,
>                                      int64_t offset, int64_t bytes);
>  int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter);
> +bool bdrv_dirty_iter_next_area(BdrvDirtyBitmapIter *iter, uint64_t max_offset,
> +                               uint64_t *offset, int *bytes);
>  void bdrv_set_dirty_iter(BdrvDirtyBitmapIter *hbi, int64_t offset);
>  int64_t bdrv_get_dirty_count(BdrvDirtyBitmap *bitmap);
>  int64_t bdrv_get_meta_dirty_count(BdrvDirtyBitmap *bitmap);
> diff --git a/block/dirty-bitmap.c b/block/dirty-bitmap.c
> index 50564fa1e2..484b5dda43 100644
> --- a/block/dirty-bitmap.c
> +++ b/block/dirty-bitmap.c
> @@ -501,6 +501,57 @@ int64_t bdrv_dirty_iter_next(BdrvDirtyBitmapIter *iter)
>      return hbitmap_iter_next(&iter->hbi, true);
>  }
>
> +/**
> + * Return the next consecutively dirty area in the dirty bitmap
> + * belonging to the given iterator @iter.
> + *
> + * @max_offset: Maximum value that may be returned for
> + *              *offset + *bytes
> + * @offset: Will contain the start offset of the next dirty area
> + * @bytes: Will contain the length of the next dirty area
> + *
> + * Returns: True if a dirty area could be found before max_offset
> + *          (which means that *offset and *bytes then contain valid
> + *          values), false otherwise.

Also document the change to the iter cursor depending on the return value?

Fam
Re: [Qemu-block] [PATCH v2 10/16] hbitmap: Add @advance param to hbitmap_iter_next()
On Mon, 01/22 23:08, Max Reitz wrote:
> diff --git a/tests/test-hbitmap.c b/tests/test-hbitmap.c
> index 9091c639b3..2a2aa5bd43 100644
> --- a/tests/test-hbitmap.c
> +++ b/tests/test-hbitmap.c
> @@ -46,7 +46,7 @@ static void hbitmap_test_check(TestHBitmapData *data,
>
>      i = first;
>      for (;;) {
> -        next = hbitmap_iter_next(&hbi);
> +        next = hbitmap_iter_next(&hbi, true);
>          if (next < 0) {
>              next = data->size;
>          }
> @@ -435,25 +435,25 @@ static void test_hbitmap_iter_granularity(TestHBitmapData *data,
>      /* Note that hbitmap_test_check has to be invoked manually in this test. */
>      hbitmap_test_init(data, 131072 << 7, 7);
>      hbitmap_iter_init(&hbi, data->hb, 0);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>
>      hbitmap_test_set(data, ((L2 + L1 + 1) << 7) + 8, 8);
>      hbitmap_iter_init(&hbi, data->hb, 0);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>
>      hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>
>      hbitmap_test_set(data, (131072 << 7) - 8, 8);
>      hbitmap_iter_init(&hbi, data->hb, 0);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, (L2 + L1 + 1) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, (L2 + L1 + 1) << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
>
>      hbitmap_iter_init(&hbi, data->hb, (L2 + L1 + 2) << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), ==, 131071 << 7);
> -    g_assert_cmpint(hbitmap_iter_next(&hbi), <, 0);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), ==, 131071 << 7);
> +    g_assert_cmpint(hbitmap_iter_next(&hbi, true), <, 0);
> }

Please add tests for advance=false too.

Fam
Re: [Qemu-block] [PATCH v2 06/16] block/mirror: Use CoQueue to wait on in-flight ops
On Mon, 01/22 23:07, Max Reitz wrote:
>      qemu_iovec_destroy(&op->qiov);
> -    g_free(op);
>
> -    if (s->waiting_for_io) {
> -        qemu_coroutine_enter(s->common.co);
> -    }
> +    qemu_co_queue_restart_all(&op->waiting_requests);
> +    g_free(op);

OK, this answers my question on patch 5.

Fam